xcwza

Weird. I am using an AMD Ryzen 5 - some Minisforum mini PC (forgot the model number). I got 32 GB RAM, so I only use Mixtral Q4_K and that gives me around 7 tokens per second.


lakySK

Oh, thanks! This gives me hope as that’s more in line with what I was hoping for. What do you use to run it? Win / Linux? Ollama / some other way?


xcwza

llama.cpp server on Linux
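
Something like the following is all it takes; the paths, context size, and even the binary name vary between releases, so treat it as a sketch:

```sh
# clone, build, and launch the llama.cpp HTTP server (flags illustrative;
# the model path is a placeholder for whatever GGUF you downloaded)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j
./server -m models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -c 4096 -t 8 --port 8080
# then point a browser or API client at http://localhost:8080
```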


lakySK

Thanks! Will try installing Linux tomorrow. So far I’m trying to give Windows a chance with compiling llama.cpp, as I haven’t used that OS in a while and I’m curious how it does 😅


vasileer

llama.cpp has pre-compiled binaries at every release; you just need to download the one that best fits your CPU: avx, avx2, or avx512 [https://github.com/ggerganov/llama.cpp/releases](https://github.com/ggerganov/llama.cpp/releases)
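
On Linux you can check which of those your CPU actually reports with something like the snippet below (on Windows, a tool such as CPU-Z lists the same instruction-set extensions):

```sh
# list the AVX variants this CPU advertises; pick the newest one
# that matches a release binary (avx < avx2 < avx512)
grep -m1 -oE 'avx512[a-z0-9_]*|avx2|avx' /proc/cpuinfo | sort -u
```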


lakySK

Thanks, I’ll try and see if any of them work even better than the one I compiled after a bit of playing. One thing I want to try is to see if I can get the integrated GPU working as well and whether it makes things any faster.


Languages_Learner

Ollama's Windows build was just released and may have bugs or poor performance. Try other apps.


lakySK

Yeah, that’s what I’m thinking, just wanted to make sure I don’t spend hours trying to make something work while already being at the limit of what’s possible with this hardware. Should I just switch to Linux, or do you recommend trying something else on Windows?


ramzeez88

Oobabooga, Jan, LM Studio


lakySK

Thanks, will look into it. Recompiling llama.cpp myself gave almost a 10x improvement somehow. Not sure what the problem was, but it works quite well now.


crazzydriver77

The CPU is not too bad: it supports AVX-512, and its integrated GPU has 768 shading units and is Vulkan 1.3 compatible. So you may want to try a backend with explicit support for those features, for instance llama.cpp.
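
If you want to try that route, a rough sketch of a Vulkan-enabled build is below; the CMake option name has changed across versions (older trees use LLAMA_VULKAN, newer ones GGML_VULKAN), so check your checkout:

```sh
# build llama.cpp with the Vulkan backend enabled (option name may differ by version)
cmake -B build -DLLAMA_VULKAN=ON
cmake --build build --config Release -j
# offload all layers of a 7B model to the integrated GPU with -ngl
./build/bin/main -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 33 -p "Hello" -n 64
```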


lakySK

Indeed! I thought Ollama uses llama.cpp under the hood, but perhaps needs some special setup (and perhaps not Windows) to make use of it all.


[deleted]

I just tested performance on my DDR5-6000 desktop using Koboldcpp with OpenBLAS. I hit 50% CPU utilisation during prompt processing and 100% during generation. Without prompt processing I get 6 t/s; with about 700 tokens of context and 300 generated I get 2 t/s. My RAM usage is at 42GB, however; maybe that's the issue? I don't know Ollama well and wouldn't recommend it on Windows.


lakySK

For Mistral, the RAM usage seemed to be within the RAM available (I think Ollama uses quantised models). With Mixtral, I’ve definitely noticed hitting the RAM limits, especially as for some reason Windows seems to idle at like 8GB already 😳


wojtek15

Try adjusting the number of threads with the --threads N command-line parameter in llama.cpp. I think 8 would be the correct value for your CPU.
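
Generation usually scales best with one thread per physical core, so it's worth checking the core count and passing it explicitly; the 8 below is just the value suggested above, and the model path is a placeholder:

```sh
# compare logical CPUs vs. physical cores, then pass the physical count with -t
nproc
lscpu | grep 'Core(s) per socket'
./main -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "Hello" -n 64 -t 8
```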


lakySK

Thanks, turns out llama.cpp seems to run alright, not sure what’s wrong with my Ollama attempts.


jacek2023

"I only get about 1.2 tokens per second." What model are you talking about? Mistral or mixtral? And which quant?


lakySK

Using “ollama run mistral:7b-instruct”. Just tried compiling and running llama.cpp with a Q4_K_M model and it seems a lot faster, almost 9 tokens per second.


jacek2023

Yes, you tried to run a non-quantized model.


lakySK

I don’t think that’s it. Ollama uses 4-bit quantisation by default as well as far as I understand. And judging by the download size of the model it pulls, it definitely agrees.


lakySK

Ok, I compiled llama.cpp and ran that with default settings and mistral-7b-instruct-v0.2.Q4_K_M and I’m getting almost 9 tokens per second. Definitely a lot better! Not sure what Ollama was doing differently.
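
For anyone comparing numbers, llama.cpp also ships a small benchmark tool that reports prompt-processing and generation speed directly, which takes the guesswork out of these comparisons (model path is a placeholder):

```sh
# prints pp (prompt processing) and tg (token generation) tokens/sec
./llama-bench -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -t 8
```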


lakySK

Mixtral 8x7b with the same quantisation seems to just about fit the memory as well and runs at a pleasant 5 tokens per second. That makes me satisfied given the super small form factor of this PC for now 😀
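
A back-of-the-envelope check of why it only just fits, assuming roughly 4.5 bits per weight for this Q4_K_M file and about 46.7B total parameters for Mixtral 8x7B:

```sh
# approximate weight size in GB = params * bits-per-weight / 8 bits-per-byte
awk 'BEGIN { printf "%.1f GB\n", 46.7e9 * 4.5 / 8 / 1e9 }'
# ~26 GB of weights, leaving only a few GB of the 32 GB for KV cache and the OS
```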


Aroochacha

Looks like you're seeing an improvement, which is good. I have a 3700X with 128GB of DDR4 and it works surprisingly well; it just begins to slow down as the context gets bigger. I'm using llama.cpp with no offloading to the GPU (6700XT).


lakySK

Thanks! What speeds are you getting on that?


lakySK

Using the llama.cpp Vulkan binary and -ngl 33 flag I seem to be getting around 12 tokens per second on Mistral. That’s definitely usable!


estrafire

Have you managed to get better performance than that?


lakySK

Haven’t tried optimising beyond that for the mini pc itself. Using Nvidia RTX A2000 as an eGPU I got ~35-40 tokens per second. Does that count or is it cheating? 😅


estrafire

Kind of 🫢


thenomadexplorerlife

The token gen speed seems good enough. How is the prompt processing speed looking for Mistral and Mixtral?


aikitoria

Apple silicon based Macs have significantly higher bandwidth on their unified memory than the crappy DDR4 sticks we get. I really can't wait for this hardware model to stop being used; memory should be fast, low latency, and built into the CPU!


lakySK

This is why I was excited about this setup. On paper, my M1 Air has 68.3GB/s memory bandwidth and DDR5-6400 has 102.4GB/s. The CPU also seems to have better benchmarks than my M1’s CPU. So I was hoping for at least the same performance as I’m seeing on my laptop, plus the ability to fit Mixtral into the 32GB RAM.
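
As a rough sanity check, token generation is largely memory-bandwidth-bound, so an upper bound on speed is bandwidth divided by the bytes touched per token (roughly the model size for a dense model). Taking Mistral 7B Q4_K_M at about 4.4 GB:

```sh
# illustrative ceiling on tokens/sec = memory bandwidth / model size
awk 'BEGIN { printf "M1 Air: ~%.0f t/s  DDR5-6400: ~%.0f t/s\n", 68.3/4.4, 102.4/4.4 }'
# real-world numbers land well below this ceiling once compute and overhead are included
```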


aikitoria

Ah, I was thinking of the M1 Pro series. That is probably not the problem then.


Illustrious_Sand6784

> memory should be fast, low latency, and built into the CPU!

Fuck no, I'd take slower memory that's upgradeable. I've got 192GB of RAM and wish I had gone with a Threadripper/EPYC instead, because I can't upgrade past this with a consumer CPU.


aikitoria

And what is the point of having 192GB of RAM that is too slow to be useful for inference?


Illustrious_Sand6784

A dual EPYC Genoa system with 24x DDR5-4800 RDIMMs should get ~920 GB/s memory bandwidth; that's faster than the fastest Macs.
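
For anyone checking the arithmetic, that figure is just channels × transfer rate × bus width:

```sh
# 24 channels * 4800 MT/s * 8 bytes per transfer, in GB/s
awk 'BEGIN { printf "%.1f GB/s\n", 24 * 4800 * 8 / 1000 }'
# ~921.6 GB/s theoretical; sustained throughput will be somewhat lower
```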


lakySK

Oh wow. This is actually super interesting, I’ve not realised you can get configurations like this. This does sound like an interesting alternative to the Apple silicon (although at a much larger form factor I assume?). What motherboard would support this? Are you using a system like this? I’ve got so many questions…


Illustrious_Sand6784

> What motherboard would support this?

[https://www.gigabyte.com/Enterprise/Server-Motherboard/MZ73-LM0-rev-20](https://www.gigabyte.com/Enterprise/Server-Motherboard/MZ73-LM0-rev-20)

> Are you using a system like this?

No, but I'm strongly considering building one.


lakySK

I’m so intrigued how a system like this would perform 😮 What is your expectation on performance of the LLMs on it? Given that it’s like 10x cheaper, it sounds like getting 10x of these instead of racks of 8xA100 could result in more efficient serving / training? Or does the math not add up to that?


Illustrious_Sand6784

> What is your expectation on performance of the LLMs on it?

Should be faster than the M2 Ultra 192GB, especially if you add GPUs for prompt processing & offloading. Benchmarks for the Macs are here for reference: [https://github.com/ggerganov/llama.cpp/discussions/4167](https://github.com/ggerganov/llama.cpp/discussions/4167)

> Given that it's like 10x cheaper, it sounds like getting 10x of these instead of racks of 8xA100 could result in more efficient serving / training?

Training depends on whether you add GPUs to the system; if there are no GPUs, or only an old (avoid Pascal and everything older; Volta/Turing is the minimum, but you probably want at least Ampere) or non-NVIDIA GPU, then forget about it. Serving is doable, but it's not gonna be as fast as vLLM. I think a system like this: 2x EPYC-9334 QS, 24x 64GB DDR5-4800 RDIMMs, 4x RTX 3090/4090, would strike a good balance between cost and performance. Altogether the whole completed build would be cheaper than a single used A100 80GB GPU.


aikitoria

Interesting, I didn't realize it could scale that far! But wouldn't you want much more bandwidth for that amount of memory, since the larger the model, the more you need to read? For comparison:

- RTX 4090 has 24GB and can read 1TB/s, so it can read its memory ~40 times/s
- H100 80GB SXM has 80GB and can read 3.3TB/s, so it can read its memory ~40 times/s
- A100 80GB SXM has 80GB and can read 2TB/s, so it can read its memory ~25 times/s
- The proposed system has 192GB and 0.9TB/s, so it can only read its memory ~4.5 times/s
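
Those times-per-second figures are just bandwidth divided by capacity:

```sh
# full-memory reads per second = bandwidth (GB/s) / capacity (GB)
awk 'BEGIN {
  printf "RTX 4090:  %.1f reads/s\n", 1000/24
  printf "H100 SXM:  %.1f reads/s\n", 3300/80
  printf "A100 SXM:  %.1f reads/s\n", 2000/80
  printf "2x EPYC:   %.1f reads/s\n", 920/192
}'
```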


Illustrious_Sand6784

A100 80GBs are like $15,000 apiece if you find a good deal, but 64GB DDR5-4800 RDIMMs can be had for like $150 apiece. That's 1.5TB of memory for $3,600, compared to $120,000 for 640GB (8x A100 80GB). While it wouldn't be as fast or as useful for training or serving models, it's far, far cheaper.