Recent comments in /f/MachineLearning
Zer0D0wn83 t1_jc062d9 wrote
Reply to comment by Taenk in [R] Introducing Ursa from Speechmatics | 25% improvement over Whisper by jplhughes
Seems super cheap to me tbf - no problem with paying for stuff like this.
Zer0D0wn83 t1_jc060s2 wrote
Reply to comment by rshah4 in [R] Introducing Ursa from Speechmatics | 25% improvement over Whisper by jplhughes
Seems like there's a very generous free tier, then super cheap after that.
boyetosekuji t1_jc04pvl wrote
What is the difference between $1.25/hr for Standard and $1.90/hr for Enhanced?
currentscurrents t1_jc03yjr wrote
Reply to comment by Dendriform1491 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
You could pack more data into the same bits with in-memory compression. You'd need hardware support for decompression inside the processor core.
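To make the idea concrete, here's a purely software sketch (my own illustration, not an existing hardware feature): keep low-entropy quantized weight blocks compressed in memory and decompress them on access. In the scenario above the decompression would happen in hardware inside the core rather than in Python.

```python
import zlib
import numpy as np

# Illustrative only: quantized weights stored one-per-byte carry at most
# 4 bits of entropy per stored byte, so a generic compressor roughly
# halves them. Hardware would do the decompress-on-access transparently.
rng = np.random.default_rng(0)
weights = rng.integers(0, 16, size=1_000_000, dtype=np.uint8)

compressed = zlib.compress(weights.tobytes(), level=6)
print(f"raw: {weights.nbytes} bytes, compressed: {len(compressed)} bytes")

# "Decompress on access" - done here in software as a stand-in.
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.uint8)
assert np.array_equal(restored, weights)
```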
Taenk t1_jc02fzb wrote
Excellent demo on your page. I just used it on a YT video featuring a non-native English speaker, and there was only a slight punctuation error, caused by an ambiguously long pause in the speech.
Is this a purely commercial product or will there be an open source release?
toothpastespiders t1_jc01mr9 wrote
Reply to comment by Amazing_Painter_7692 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
> BUT someone has already made a webUI like the automatic1111 one!
There's a subreddit for it over at /r/Oobabooga too that deserves more attention. I've only had a little time to play around with it but it's a pretty sleek system from what I've seen.
> it looked really complicated for me to set up with 4-bits weights
I'd like to say the warnings make it more intimidating than it really is - for me it was just copying and pasting four or five lines into a terminal. Then again, I also couldn't get it to work, so I might be doing something wrong. I'm guessing my weirdo GPU just wasn't accounted for somewhere. I'm going to bang my head against it when I've got time, because it's frustrating to have tons of VRAM to spare and not get the most out of it.
vintage2019 t1_jbzzadd wrote
Reply to comment by phys_user in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
Has anyone ranked models with that and published the results?
th3nan0byt3 t1_jbzw23a wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
only if you turn your pc case upside down
rshah4 t1_jbzvzsl wrote
Is it open source?
The_frozen_one t1_jbzv0gt wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
It takes about 7 seconds for 13B to generate a full response to a prompt with the default number of predicted tokens (128).
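Back-of-the-envelope, that works out to roughly 18 tokens per second (a rough figure; it ignores prompt processing time, so the per-token generation speed is a bit better than that):

```python
# Rough throughput from the numbers above.
predicted_tokens = 128
elapsed_seconds = 7
print(f"~{predicted_tokens / elapsed_seconds:.0f} tokens/sec")  # ~18
```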
megacewl t1_jbzts4h wrote
Reply to comment by ID4gotten in [P] vanilla-llama an hackable plain-pytorch implementation of LLaMA that can be run on any system (if you have enough resources) by poppear
I think so? Converting the model to 8-bit or 4-bit literally means it was shrunk and is now smaller (and, ironically, this barely changes the output quality at all), which is why it needs less VRAM to load.
There are tutorials for setting up llama.cpp with 4-bit converted LLaMA models, which may help you achieve your goal. llama.cpp is an implementation of LLaMA in C++ that runs on the CPU and system RAM. Someone got the 7B model running on a Raspberry Pi 4 with 4 GB, so it's worth checking out if you're low on VRAM.
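For a rough sense of why the lower-bit formats matter, here's a quick back-of-the-envelope on weight memory alone (my own numbers; real usage is higher once you add the KV cache, activations, and runtime overhead):

```python
# Approximate memory needed just to hold the weights at different precisions.
def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for params in (7, 13):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: ~{weight_gib(params, bits):.1f} GiB")
```

At 4 bits the 7B weights come out to roughly 3.3 GiB, which is why it can squeeze onto a 4 GB Raspberry Pi, and 13B lands around 6 GiB, which lines up with the <9 GiB VRAM figure in the post title.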
Mayfieldmobster t1_jbzspe0 wrote
Reply to [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256
There are tools that let you train very large models on much smaller hardware, like Colossal-AI.
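One of the standard tricks these frameworks lean on is activation (gradient) checkpointing, which trades recomputation for memory. A minimal PyTorch sketch of that general technique (not Colossal-AI's actual API):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy stack of blocks. With checkpointing, each block's intermediate
# activations are recomputed during backward instead of being kept
# around, cutting peak activation memory at the cost of extra compute.
class Block(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU())

    def forward(self, x):
        return self.net(x)

blocks = torch.nn.ModuleList(Block(1024) for _ in range(8))
x = torch.randn(32, 1024, requires_grad=True)

out = x
for block in blocks:
    out = checkpoint(block, out, use_reentrant=False)  # recompute in backward
out.sum().backward()
```

Frameworks like Colossal-AI combine this kind of thing with sharding and CPU offloading to go further, but the memory-versus-compute trade is the same idea.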
MinaKovacs t1_jbzso7m wrote
Reply to comment by MurlocXYZ in [D] Is anyone trying to just brute force intelligence with enormous model sizes and existing SOTA architectures? Are there technical limitations stopping us? by hebekec256
One of the few things we know for certain about the human brain is that it is nothing like a binary computer. Ask any neuroscientist and they will tell you we still have no idea how the brain works. The brain operates at a quantum level, manifested in mechanical, chemical, and electromagnetic characteristics all at the same time. It is not a ball of transistors.
remghoost7 t1_jbzro03 wrote
Reply to comment by The_frozen_one in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
Nice!
How's the generation speed...?
The_frozen_one t1_jbzqvwc wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
I'm running it using https://github.com/ggerganov/llama.cpp. The 4-bit version of 13b runs ok without GPU acceleration.
remghoost7 t1_jbzqf5m wrote
Reply to comment by Amazing_Painter_7692 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
Most excellent. Thank you so much! I will look into all of these.
Guess I know what I'm doing for the rest of the day. Time to make more coffee! haha.
You are my new favorite person this week.
Also, one final question, if you will: what's so unique about the 4-bit weights, and why would you prefer to run it that way? Is it just VRAM optimization requirements...? I'm decently versed in Stable Diffusion, but LLMs are fairly new territory for me.
My question seemed to have been answered here, and it is a VRAM limitation. Also, that last link seems to support 4-bit models as well. Doesn't seem too bad to set up.... Though I installed A1111 when it first came out, so I learned through the garbage of that. Lol. I was wrong. Oh so wrong. haha.
Yet again, thank you for your time and have a wonderful rest of your day. <3
Amazing_Painter_7692 OP t1_jbzov27 wrote
Reply to comment by stefanof93 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
https://github.com/qwopqwop200/GPTQ-for-LLaMa
Performance is quite good.
Amazing_Painter_7692 OP t1_jbzoq05 wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
There's an inference engine class if you want to build out your own API:
And there's a simple text inference script here:
Or in the original repo:
https://github.com/qwopqwop200/GPTQ-for-LLaMa
BUT someone has already made a webUI like the automatic1111 one!
https://github.com/oobabooga/text-generation-webui
Unfortunately it looked really complicated for me to set up with 4-bit weights, and I tend to do everything over a Linux terminal. :P
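If anyone does want to roll their own API around the inference engine class, here's a bare-bones sketch of the shape it could take (Flask, with a placeholder generate function standing in for the actual engine; this is not code from the repo):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    # Placeholder: call your loaded 4-bit model here, e.g. via the
    # inference engine class linked above.
    return f"(echo) {prompt}"

@app.route("/generate", methods=["POST"])
def generate_endpoint():
    payload = request.get_json(force=True)
    text = generate(payload["prompt"], payload.get("max_new_tokens", 128))
    return jsonify({"completion": text})

if __name__ == "__main__":
    app.run(port=5000)
```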
z_fi t1_jc06htn wrote
Reply to comment by Psychological-Ear896 in [D] I’m a Machine Learning Engineer for FAANG companies. What are some places looking for freelance / contract work for ML? by doctorjuice
The key difference is a non-compete or other clause that prevents you from working multiple contracts.
With a 1099 you are independent.