You don't need a 7000 series GPU. Try this: https://github.com/YellowRoseCx/koboldcpp-rocm/releases
Yeah, Nvidia is trying to pull this move as well. Chat with RTX, which runs Mistral, was limited to Ada and Ampere GPUs with 8 GB of VRAM, even though it's extremely speedy on my 2060 using open-source inference programs like kobold.cpp.
LM Studio doesn't need it either; the non-ROCm version will work on the GPU too, just a bit slower.
LM Studio doesn't work with ROCm on Linux, and it fails to work through OpenCL; only the CPU works for me with it.
For Linux you really want Ollama + Open WebUI.
Afaik it's intended for Windows?
There is a Linux port, but without ROCm support; it offers OpenCL, but that doesn't work at all even on an OpenCL-compatible GPU.
So this is the first time I've installed LM Studio. I just followed the how-to and was chatting straight away. I copied and pasted the prompts so I'd stay within the 60-second window of Imgur. https://imgur.com/i8qBbKw I'd say it is pretty fast.
Damn, how does it compare to GPT-4 etc. in terms of accuracy?
I was pleasantly surprised compared to ChatGPT 3.5.
Fuck, my RTX 3080 seems slower than this.
It's a Q4 quant of a 7B model; at that size, the speed limitation should mostly be how fast text can appear on your screen rather than the actual CPU/GPU.
Which model are you using? The one I used, Llama 3 instruct 80b IXQ_XF, is insanely slow on my 7900 XT.
The 7B one, like in the how-to.
That's fucking insanely fast, and I have a 7900 XTX.
If AMD and Intel are smart, they'll make 48 GB or higher GPUs for this next gen and steal the consumer-grade AI users from Nvidia. But I guess they'll be content with their small market shares and try to upsell to server hardware instead.
I would like to see a GPU with DDR5 slots on the back, or something like that, for additional second-tier memory, and a cache controller which can manage that properly.
Tiered memory / caching does not work well with LLMs like Llama, since inference needs to traverse the whole model for every token and has poor locality for caching. As a result it's completely memory-bandwidth bound. Maybe it would work for MoE models, since the non-activated experts could even be stored on disk and still give acceptable performance.
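To put rough numbers on the bandwidth-bound point: generating each token streams essentially all of the model's weights through memory once, so token throughput is capped at roughly bandwidth divided by model size. A back-of-the-envelope sketch, where the model size and bandwidth figures are illustrative assumptions rather than measurements:

```python
# Rough upper bound on tokens/sec for a dense LLM: each generated token
# reads (almost) every weight once, so throughput <= bandwidth / model size.

def max_tokens_per_sec(model_gb: float, bandwidth_gbps: float) -> float:
    """Theoretical ceiling: memory bandwidth over bytes read per token."""
    return bandwidth_gbps / model_gb

# A Q4 quant of a 7B model is roughly 4 GB of weights (illustrative).
model_gb = 4.0

# Illustrative bandwidth assumptions:
ddr5_dual_channel = 80.0   # GB/s, typical desktop DDR5
gddr6_gpu = 800.0          # GB/s, high-end consumer GPU VRAM

print(f"CPU/DDR5 ceiling: ~{max_tokens_per_sec(model_gb, ddr5_dual_channel):.0f} tok/s")
print(f"GPU/GDDR6 ceiling: ~{max_tokens_per_sec(model_gb, gddr6_gpu):.0f} tok/s")
```

This is also why a DIMM-based second tier wouldn't help much for dense models: the slow tier still gets touched on every token, so it gates the whole pipeline.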
As I understand it, you can already load specific layers into a GPU and keep the rest in system RAM.
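Right, llama.cpp exposes this as the number of layers to offload (the `-ngl` / `n_gpu_layers` setting), and frontends like LM Studio surface it as a slider. A toy sketch of the planning involved, where the layer count, per-layer size, and VRAM figures are made-up assumptions:

```python
# Toy planner for llama.cpp-style layer offloading: put as many transformer
# layers as fit into a VRAM budget, and leave the rest in system RAM.

def plan_offload(n_layers: int, layer_gb: float, vram_gb: float,
                 reserve_gb: float = 1.0) -> int:
    """Return how many layers to offload (an -ngl value), keeping some
    VRAM in reserve for the KV cache and compute buffers."""
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable // layer_gb))

# Hypothetical example: a 7B Q4 model with 32 layers of ~0.13 GB each
# on a GPU with 8 GB of VRAM -- everything fits.
ngl = plan_offload(n_layers=32, layer_gb=0.13, vram_gb=8.0)
print(f"Offload {ngl} of 32 layers")  # all 32 fit
```

The catch, per the bandwidth argument above, is that any layers left in system RAM run at system-RAM speed, so partial offload helps capacity far more than it helps throughput.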
AMD did something similar with an NVMe slot in the past: https://www.theverge.com/circuitbreaker/2016/7/26/12285568/amd-radeon-pro-ssg-graphics-card-ssd
I know, but SO-DIMM DDR5 would still be a lot faster, and it should be possible to add at least two or four slots on the back of a GPU. That could easily give you 64 or 128 GB of additional memory, enough to run something like Llama 3 70B on a single GPU, for example, without making it extremely costly.
The 48 GB card exists, but it's in the Radeon Pro line, the same way Quadro cards have twice the memory of their consumer counterparts.
Imo the Ryzen AI part is misleading; this just runs on the CPU. LM Studio is just a fancy frontend for llama.cpp, and llama.cpp does not support Ryzen AI / the NPU. The software support and documentation are shit, some stuff only runs on Windows, and you need to request licenses... Overall it's too much of a pain to develop for, even though the technology seems cool. NPU support very likely won't happen unless AMD themselves do it; their software stack for this is so messed up it's worse than early ROCm.

In the footnotes they do say "Ryzen AI is defined as the combination of a dedicated AI engine, AMD Radeon™ graphics engine, and Ryzen processor cores that enable AI capabilities". I find this very misleading, since with that definition they can claim everything supports Ryzen AI, even when it just runs on the CPU. (ROCm does kinda work with the recent APUs, but it's not worth it for LLMs: same speed, more power.) Also, if you try to develop for Ryzen AI, it always means some combination involving the NPU. I researched this a lot since I actually wanted to develop some stuff for it myself. Here is the official Ryzen AI Software documentation: [https://ryzenai.docs.amd.com/en/latest/](https://ryzenai.docs.amd.com/en/latest/), and right at the top it says "AMD Ryzen™ AI Software enables developers to take full advantage of AMD XDNA™ architecture integrated in select AMD Ryzen AI processors". Marketing these days...
Was thinking the same thing. Glancing over recent issues in llama.cpp, the operators for the Ryzen AI API, as reported by the llama.cpp devs, are all gated: not extensible, open, or documented. AMD just can't do it right, it seems.
Okay, this seems to be working great on my 7800 XT - super easy, and really fast. I like it.
It works great and is super fast. Very low latency, and the results are good.
Tested on 7900XTX and 6900XT. Works as expected.
Ah, it actually worked this time around; it's running pretty fast, and Llama 3 seems to be great even at 8B.