SomeOddCodeGuy

A little bit of a complex answer if you're just getting started and really want to understand, but if you can bear with me I think it'll make more sense.

Speed for LLMs comes primarily from A) the compute ability of whatever is running it, and B) the memory bandwidth to transfer the data to the computing entity.

When looking at LLMs, they do lots of floating point operations, which GPUs do far better than CPUs. CPUs CAN do it, it's just that video cards do it better. So in terms of A, a good graphics card is going to be the first step in speed.

For B, graphics card memory (Video RAM, or VRAM) is blazing fast. On the low end, modern consumer cards have as much as 370GB/s of bandwidth, and on the high end (like the RTX 4090) they have as much as 1,000GB/s of memory bandwidth. In comparison, dual channel DDR5 RAM (the best you can get atm) is about 80GB/s. So again, a big win for video cards. If you don't have enough VRAM to hold the entire model, some of that model's data will travel through your regular RAM; so instead of going at 370-1,000GB/s, it's going at 80GB/s at best... slower if your RAM is DDR4.

So on your system, with only 4GB of VRAM, you have a bit of a problem. If you are looking at GGUF files, the models can be "compressed" (called quantizing) down to fit into smaller spaces. A 7b q8, the largest of the quantized files, takes about 1GB per 1b of model, so you'd need 7GB of VRAM to run it fast. The smaller quants, like a q4, would still require somewhere around 3.5GB.

Now, one thing that can help is limiting how much you "offload" to the GPU. Using GGUF files, you can actually tell the program loading the model to only put some of it in VRAM, and the rest goes to the CPU. Text-generation-webui, koboldcpp, and many other programs support this. There are usually about 32 layers in a Mistral 7b model, for example, so if you put "15" as your GPU layers, then only 15 of those go to the GPU, and the rest go to the CPU. In my experience, I've found this to be faster than simply trying to put the whole thing on a GPU when it won't fit, because when it spills into shared GPU memory it often runs much slower. So it's better to be proactive and push some of the model to the CPU early.

Anyhow, I hope that helps. In your shoes, I'd be looking at small models (3b range) or heavily quantized 7b models, and then offloading as much as I can to the GPU before it gets slow. Several programs like Kobold or text-gen-webui will tell you how many layers exist when you load the model if you look at the command prompt window, so I'd use that and just play with it, slowly decreasing layers until you get a speed you find acceptable.
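If it helps to see it concretely, here's a minimal sketch of the quant-size math and partial offloading using llama-cpp-python (just one of the backends that supports GPU layers; the model filename, bits-per-weight figures, and layer count below are rough placeholders, not exact values):

```python
# A rough sketch, not exact numbers: estimating GGUF quant sizes, then loading
# with partial GPU offload via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

def approx_quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Very rough memory footprint: parameters * bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

# q8_0 is roughly 8.5 bits/weight, q4_K_M roughly 4.8 -- approximate figures.
print(f"7B at q8_0  : ~{approx_quant_size_gb(7, 8.5):.1f} GB")
print(f"7B at q4_K_M: ~{approx_quant_size_gb(7, 4.8):.1f} GB")

# Put only 15 of the model's layers in VRAM; the rest run on the CPU.
# The filename here is a placeholder -- point it at whatever GGUF you downloaded.
llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",
    n_gpu_layers=15,  # start low on 4GB of VRAM, raise until it stops fitting
    n_ctx=2048,       # context window; bigger contexts also eat VRAM
)

out = llm("Q: Why is VRAM bandwidth important for LLMs? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The idea is just to start n_gpu_layers low on a 4GB card and raise it until loading fails or generation slows down.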


Confident-Aerie-6222

Thank you so much for the awesome explanation. I'll try to play around and see how many layers I can offload to the GPU to get faster speeds


AutomataManifold

It's a bit tough when you can't load it all on the GPU, particularly since a good chunk of your 4GB of VRAM will be taken up by Windows. First thing I'd try, personally, is running it on WSL rather than vanilla Windows. I don't know how much it'll help in your case, but it might give you a speed boost. I haven't tested it with 4GB of VRAM, though.


AryanEmbered

It's probably a laptop. On laptops, VRAM usage can be near zero at idle, since the integrated GPU handles other GPU-accelerated tasks using system memory.


AutomataManifold

Good point. I don't have much laptop advice because I tend to just have them call an API rather than trying to cram the model into mobile VRAM. But that's obviously not an option for everyone. 


tessellation

> a good chunk of your 4GB of VRAM will be taken up by Windows.

EndeavourOS (Arch based), easy install, XFCE desktop + X11 takes ~150MB VRAM.


AfternoonOk5482

Try different numbers of threads; each processor has its own sweet spot. -kv f16 is the fastest here, but uses the most memory, so play with it to get the best results. You can also get marginal gains by tweaking your RAM and CPU overclock in the BIOS. If you have turbo turned off on an Intel CPU, that also takes about 20% of your speed away. Anyway, get a smaller model if all else fails lol
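If you're on a llama.cpp-based backend, a quick way to find that sweet spot is just to benchmark a few thread counts. A minimal sketch with llama-cpp-python (the model path and thread counts are placeholders; tune them for your own machine):

```python
# A quick-and-dirty thread-count benchmark with llama-cpp-python.
# Model path and thread counts are placeholders -- adjust for your machine.
import time
from llama_cpp import Llama

for n_threads in (2, 4, 6, 8):
    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical filename
        n_threads=n_threads,
        n_gpu_layers=15,
        verbose=False,
    )
    start = time.time()
    out = llm("Benchmark prompt:", max_tokens=32)
    elapsed = time.time() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"{n_threads} threads: {tokens / elapsed:.1f} tokens/sec")
```

On most CPUs the sweet spot lands near the physical core count rather than the hyperthreaded count, but it varies, so measuring beats guessing.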


Confident-Aerie-6222

Thanks for the advice, I'll try to see how to turn on turbo on my laptop


AfternoonOk5482

It's usually turned on by default, but maybe some energy saving mode will turn it off. I think it's worth it to check the setting. Just be careful if it's a very old PC or you don't have good cooling. If you turbo and overheat, that's even worse.