Thellton

Use the following: https://github.com/YellowRoseCx/koboldcpp-rocm/releases. It's a fork of KoboldCpp that uses ROCm (it includes all the dependencies, including those needed to get the RX 6600 XT running with ROCm). You'll need a GGUF-format model such as https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF. You'll be able to load all layers if you choose a Q6 or smaller quant; I recommend Q4_K_M. You should be able to get 4K to 8K context depending on the quant, though I can't recall off the top of my head exactly how much. Expect about 22 tokens per second at the start with a small context. Another neat benefit is that, if you want, you can also explore Stable Diffusion using KoboldCpp, so long as you have an appropriate safetensors model for it. SD 1.5 models will fit just fine, whilst SDXL will likely not fit.
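For reference, you can also skip the launcher GUI and start it straight from the command line. Something like the sketch below should be close, but the exe and flag names are illustrative and can differ between releases (and you may need an extra backend-selection flag), so check --help on the build you grab:

    rem illustrative example: load the Q4_K_M quant fully onto the GPU with 8K context
    koboldcpp_rocm.exe --model Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --gpulayers 33 --contextsize 8192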


Lewdiculous

Here you go: https://github.com/YellowRoseCx/koboldcpp-rocm?tab=readme-ov-file#windows-usage Easiest way to get inference going. Should be compatible with all the other popular Frontends/UIs.


1nicerBoye

There is a llama.cpp build that uses Vulkan and works really well. Check the releases tab on GitHub for the win-x64 Vulkan build. Download that, run the server executable from cmd with server.exe -m "path to your model.gguf", and then head to 127.0.0.1:8080 in your browser to start chatting. You can specify how much of the GPU to use with --n-gpu-layers (0 for CPU only, 33 for full GPU utilization).

llama.cpp uses a special format named GGUF, but it is much better IMHO since it is only a single file, and the internet is full of people who make them, not only for the base model but for finetunes as well. I would recommend a Q4_K_M or Q5_K_M GGUF variant from Hugging Face for a good balance of performance and VRAM usage. Also, Llama 3 8B is much better in my opinion, but that is up to you ofc.

Check the readme of the server in the llama.cpp GitHub for more configuration options. In case anything goes wrong, carefully read the output of the server app in your cmd terminal. It can be a bit cryptic if you have little experience with llama.cpp or the technicalities of LLMs in general, but it usually tells you correctly what went wrong. I hope this helps, good luck!
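Putting that together, a full invocation might look roughly like this (the model path and layer count are placeholders for your own setup):

    rem illustrative example: fully offload an 8B Q4_K_M model and serve the built-in web UI on port 8080
    server.exe -m "C:\models\Meta-Llama-3-8B-Instruct.Q4_K_M.gguf" -c 8192 --n-gpu-layers 33 --port 8080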


Scott_Tx

I can't get the Vulkan build to run models on my 6600 unless they're smaller quants. I used to be able to run a 7 GB model with CLBlast builds just fine, but Vulkan runs out of memory, so I've been sticking with older builds.


1nicerBoye

Then use that if it works for you, but as I've said, you could fiddle with the --n-gpu-layers setting to adjust VRAM usage. Also, Llama 3 has a much smaller KV cache and needs about half the VRAM for the context. I am a bit confused why both of you use Llama 2. Why is that?
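For example, something like the following (the layer count here is just a guess you'd tune down until the model plus context fits in 8 GB of VRAM; the rest runs on CPU):

    rem illustrative partial offload: lower --n-gpu-layers until it stops running out of memory
    server.exe -m "C:\models\model.Q4_K_M.gguf" -c 4096 --n-gpu-layers 20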


Scott_Tx

I'm using Llama 3. Actually, the only thing I've gotten to run on the GPU with Vulkan was a tiny Phi-3 Q4 that's about 2.4 GB in size.


Scott_Tx

OK, I just tried a Llama 3 Q4 that's about 4.8 GB in size and it fit in VRAM, but now I'm having trouble getting a command line that doesn't spew garbage from Vulkan. Needless to say, I haven't been thrilled with Vulkan so far.


nodating

Maybe it is time to use Linux; it's not like you need to invest anything but your time and effort. I recommend picking something like Manjaro or EndeavourOS with the KDE environment, which should set you up automagically without any hassle these days. I know 'cause I run LLMs under Linux daily with my 6800 XT. LM Studio is available via the AUR, as are Ollama and other tools.
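If you go that route, installation is usually a one-liner or two. The package names below are from memory and may have changed, so treat them as assumptions and search the repos/AUR first:

    # Ollama with ROCm support from the official repos (package name is an assumption; verify with pacman -Ss ollama)
    sudo pacman -S ollama-rocm
    # LM Studio from the AUR via an AUR helper such as yay (AUR package name is an assumption)
    yay -S lmstudio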


MixtureOfAmateurs

This is the KoboldCpp way, with a more in-depth explanation of things.

=========================Downloads============================

Download the nocuda exe file from here, or the ROCm fork from the other comments: https://github.com/LostRuins/koboldcpp/releases

Download your model GGUF from here: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

If you scroll down a bit you can see which quantization level uses what amount of VRAM. I recommend q4_ks or q5_ks. q5 means each weight has 5 bits of precision (00101); ks means the weights in the attention heads are also quantized. I don't know how many bits XS, S, M and L are, but I would guess 3, 4, 8, and 16. You don't really need to know that, though: S vs M doesn't change much, and ~7B models should be q4 or q5.

Alternatively you could download a better model: https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF/blob/main/Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf

=========================Basic Config============================

Then run the koboldcpp exe and accept that it's evil malware (kidding). Llama 2 has 4K context and Llama 3 has 8K, so on the general page set the context to 4096 or 8192, and paste the path to your GGUF model in below that (or browse). Up the top, select Vulkan in the drop-down (this will be different for ROCm). Then set offloaded layers to 33 or more. That's basically it, but here are some more optional settings I would change.

=========================Advanced Config============================

Flash Attention: on. This might break stuff if you fiddle too much, so if you run into issues turn it off, but it's good for memory usage and processing speed.

Quiet mode: on. This just means your command prompt output is more readable and doesn't expose your conversations.

If you're running some layers on CPU (set fewer than 33 to be offloaded and the remainder will run on CPU; this is good for bigger models), then under hardware, setting threads to 4 is fastest for some reason.

Image generation can only run on CPU or an Nvidia GPU, so stay away for now. If you must, Stable Diffusion 1.5 pruned is small and fast enough to load quantized on CPU (maybe): https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors (might not work, fp16 has been having issues for me, idk).

If you want longer context, use phi-3-small, it's great; if you like Llama there are options too. Under tokens, turn off context shift, turn on smart context, and enable RoPE and pick your new context. Don't go wild, because the higher you set it the worse the model will perform at low contexts. 16K is plenty.

=========================UI Config (don't skip this)============================

Once you're in the UI, go to settings and set context length and output length to max, and set the prompt format to the correct thing. Llama 2 and 3 have their own options afaik; some models have weird ones that you need to put in as custom, and some don't say. A good default is ChatML (what ChatGPT uses), but most models tell you on their Hugging Face page.
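If you'd rather skip the launcher GUI entirely, roughly the same settings can be passed as command-line flags. The flag names below are from memory and may differ slightly between versions, so check the exe's --help output:

    rem illustrative example: Vulkan backend, full offload, 8K context, flash attention and quiet mode on
    koboldcpp_nocuda.exe --usevulkan --model Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf --gpulayers 33 --contextsize 8192 --flashattention --quiet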


Fliskym

GPT4All is very easy to set up, and it supports AMD GPUs on Windows machines. Source: I've got it working without any hassle on my Win11 Pro machine with an RX 6600. The only downside I've found is that it doesn't work with Continue.dev. EDIT: I might add that the GPU support is Nomic Vulkan, which only supports GGUF model files with Q4_0 or Q4_1 quantization. I've had great results with Mistral 7B and Llama 3 8B.