Recent comments in /f/MachineLearning

toothpastespiders t1_jc01mr9 wrote

> BUT someone has already made a webUI like the automatic1111 one!

There's also a subreddit for it over at /r/Oobabooga that deserves more attention. I've only had a little time to play around with it, but it's a pretty sleek system from what I've seen.

> it looked really complicated for me to set up with 4-bit weights

I'd like to say the warnings make it more intimidating than it really is. For me it was just copying and pasting four or five lines into a terminal. Then again, I also couldn't get it to work, so I might be doing something wrong. I'm guessing my weirdo GPU just isn't accounted for somewhere. I'm going to bang my head against it when I've got time, because it's frustrating having tons of VRAM to spare and not getting the most out of it.

6

megacewl t1_jbzts4h wrote

I think so? Converting the model to 8-bit or 4-bit weights literally shrinks it (and, surprisingly, this barely changes the output quality at all), which is why it requires less VRAM to load.
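If it helps to see why the shrinking works, here's a rough numpy sketch of naive round-to-nearest 4-bit quantization. To be clear, this is not what GPTQ actually does (GPTQ minimizes layer-wise error much more carefully); it's just an illustration of why the storage drops to roughly a quarter of fp16 while the reconstructed weights stay close to the originals:

```python
import numpy as np

# One fake fp32 weight matrix standing in for a transformer layer.
w = np.random.randn(4096, 4096).astype(np.float32)

def quantize_rtn(w, bits=4, group=128):
    """Naive round-to-nearest quantization with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    w_g = w.reshape(-1, group)
    scale = np.abs(w_g).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w_g / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

q, scale = quantize_rtn(w)
w_hat = (q * scale).reshape(w.shape)                # dequantized approximation

fp16_bytes = w.size * 2                             # 2 bytes per weight
int4_bytes = w.size // 2 + scale.size * 2           # 4 bits per weight + fp16 scales
print(f"fp16: {fp16_bytes / 1e6:.1f} MB   4-bit: {int4_bytes / 1e6:.1f} MB")
print(f"mean |w - w_hat|: {np.abs(w - w_hat).mean():.4f}")
```

The per-weight error stays small relative to the weight scale, which is roughly why the output quality barely moves.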

There are tutorials for setting up llama.cpp with 4-bit converted LLaMA models, which may be worth checking out to help you achieve your goal. llama.cpp is an implementation of LLaMA inference in C++ that runs on the CPU and system RAM. Someone got the 7B model running on a Raspberry Pi 4 with 4GB of RAM, so it's a good option if you're low on VRAM.
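As a back-of-the-envelope check on why that's feasible, the weights-only memory for a roughly 7B-parameter model works out like this (the KV cache and runtime overhead come on top, so treat these numbers as a floor):

```python
# Approximate weights-only memory for a ~7B-parameter model.
params = 7e9
for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1024**3:.1f} GB")
# -> fp16: ~13.0 GB, int8: ~6.5 GB, int4: ~3.3 GB
```

So 4-bit is what gets 7B anywhere near a 4GB board or a small GPU in the first place.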

2

MinaKovacs t1_jbzso7m wrote

One of the few things we know for certain about the human brain is that it is nothing like a binary computer. Ask any neuroscientist and they will tell you we still have no idea how the brain works. The brain operates at a quantum level, manifested in mechanical, chemical, and electromagnetic characteristics, all at the same time. It is not a ball of transistors.

0

remghoost7 t1_jbzqf5m wrote

Most excellent. Thank you so much! I will look into all of these.

Guess I know what I'm doing for the rest of the day. Time to make more coffee! haha.

You are my new favorite person this week.

Also, one final question, if you will. What's so unique about the 4-bit weights, and why would you prefer to run the model that way? Is it just about VRAM requirements...? I'm decently versed in Stable Diffusion, but LLMs are fairly new territory for me.

My question seems to have been answered here, and it is a VRAM limitation. Also, that last link seems to support 4-bit models as well. Doesn't seem too bad to set up... though I installed A1111 when it first came out, so I learned through the garbage of that. Lol. I was wrong. Oh so wrong. haha.

Yet again, thank you for your time and have a wonderful rest of your day. <3

4

Amazing_Painter_7692 OP t1_jbzoq05 wrote

There's an inference engine class if you want to build out your own API:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/engine.py#L56-L96
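In case it's useful, here's a very rough, hypothetical sketch of what "building out your own API" around an engine like that could look like. The generate() stub is made up for illustration and is not the actual interface in the linked engine.py:

```python
# Hypothetical sketch: a tiny HTTP endpoint wrapped around an inference engine.
# Replace generate() with real calls into the engine class from the linked repo.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    # Stand-in for the engine call; the real class's methods may look different.
    return prompt + " ... (model output would go here)"

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        text = generate(body.get("prompt", ""))
        payload = json.dumps({"text": text}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

Then you can POST {"prompt": "..."} to it from whatever frontend you like.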

And there's a simple text inference script here:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/llama_inference.py

Or in the original repo:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

BUT someone has already made a webUI like the automatic1111 one!

https://github.com/oobabooga/text-generation-webui

Unfortunately it looked really complicated for me to set up with 4-bit weights, and I tend to do everything over a Linux terminal. :P

15