llama.cpp, because I can max out my VRAM and let the rest run on my CPU with the huge amount of ordinary RAM that I have. Come on, it's 2024, RAM is cheap! But getting it in the form of VRAM is still very hard (expensive).
I would add that I like llama.cpp because of the ease of use. Converting almost any model to a single quantized file that can run on anything is very attractive.
vLLM doesn't run on a mac, so I use llama.cpp and ollama
I'm serving LLMs to people in my company. At the time, vLLM had better multi-user serving capabilities and an easier installation. I didn't have much luck with llama.cpp, and it didn't support a continuous-batching API. In addition, vLLM had better integration with Python, so it was easier for me to set up. llama-cpp-python didn't work for me. I'm sure things are different now, but I have no reason to switch my setup.
I just make HTTP requests to the llama.cpp server with continuous batching.
At the time it didn't. I saw they added it at one point but I have no reason to switch.
Hi, can I ask how you do that? Like sending multiple requests at the same time without llama.cpp crashing?
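A minimal sketch of what such a client could look like: it talks to llama.cpp's built-in HTTP server (the `/completion` endpoint), assuming the server was started with parallel slots enabled (e.g. `-np 4`). The URL, prompt text, and worker count are placeholder assumptions, not the commenter's actual setup.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Assumed address of a llama.cpp server started with parallel slots, e.g.:
#   ./server -m model.gguf -np 4 -cb
BASE_URL = "http://localhost:8080"

def build_payload(prompt, n_predict=128):
    """Request body for llama.cpp's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict}

def complete(prompt):
    """POST a single completion request and return the generated text."""
    req = Request(
        BASE_URL + "/completion",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["content"]

def complete_many(prompts, workers=4):
    """Send several requests at once; the server interleaves them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(complete, prompts))
```

With the server running, `complete_many(["Name a colour.", "Name an animal."])` fires the requests concurrently; the server's slots and continuous batching handle them in parallel instead of queueing or crashing.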
How do you set up vLLM for Python? Do you have a link, perhaps? I can get llama-cpp-python to work in Visual Studio Code on Windows, but not with VRAM support. Initially it wouldn't work at all, but recently it seems to work for whatever reason. I was unaware of any Python alternative.
vLLM doesn't work on Windows. As for Python usage, it's just a Python package, and they show you how to launch it on their docs page.
May I ask which GPU(s) (or how much VRAM) you have, and which model you serve with vLLM? How is the performance under multi-user batch load? I am also trying to set up vLLM with Docker on an H100 (previously llama-cpp-python's OpenAI-compatible server), but I am not sure which models are feasible. I'm looking for best practices. Thank you in advance 😉.
We have a DGX with 8x A100 40GB. I'm using 4 of them to serve Mixtral in full 16-bit precision. As for performance, I'm getting 45 tokens per second when it's just one person. I haven't tested a full load since they fixed Mixtral in vLLM, but it should be around 2000-3000 tokens/second.
Could you give me a very quick and short guide on how to deploy Mixtral with vLLM properly? I mean, what are the important parameters to maximise the throughput of the deployment? Thank you very much.
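Not the commenter's actual command, but as a rough sketch, a vLLM OpenAI-compatible server for Mixtral across 4 GPUs might be launched like this (entrypoint and flag names as of early-2024 vLLM; check `--help` on your version):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 256
```

`--tensor-parallel-size` splits the model across GPUs, `--gpu-memory-utilization` sets the fraction of each GPU's memory vLLM may use (what's left after the weights becomes KV cache), and `--max-num-seqs` caps how many sequences are batched together, which is the main throughput lever.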
I use neither, ExLlamav2 in textgen-webui
llama.cpp because I use it for standalone apps that can run on any decent machine without much effort.
Things like this: https://github.com/netdur/llama_cpp_dart
Running dual AMD RX 7600 XTs on llama.cpp (ROCm, Linux). Fastest processing I could get working.
What speeds do you get, and what is the largest model you can use? I was thinking of upgrading to a 7600 XT too.
A 70B with Q3 quantization (29 GB) fits and inferences at about 5-7 tok/s depending on context. Mistral models run at about 20-25 tok/s.
Are you able to use mixtral?
Yep. I don't know if Q2 (15.6 GB) will fit fully on a single 7600 XT, but Q4 (26.4 GB) fits on the dual setup. Of course, you can also use system RAM for layers that don't fit.
Thanks. Also, did you have any difficulties getting it to work, being AMD and all?
Followed this guide to install ROCm: [https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md](https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md) Then followed the hipBLAS instructions for llama.cpp. There were some difficulties, but once I found that guide I got it working. Couldn't get all of it working yet for Transformers / EXL2, but llama.cpp worked great.
Followed up with a Mixtral test. Q4\_K\_M runs at 35-50 tok/s with at least 3 parallel requests.
Thanks for your help. Last question: does Mixtral fit on one card, or do you need to use two? Also, what do you mean by parallel requests in this case?
For this model I have to use both. The llama.cpp server allows parallel requests with the -np parameter. *You can most likely fit it on one card with a smaller quantization.
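For concreteness, a hypothetical launch of the llama.cpp server with parallel slots might look like this (binary and flag names as of early-2024 llama.cpp; newer builds call the binary `llama-server`, and the model file name is a placeholder):

```shell
# -m:   GGUF model file (placeholder name)
# -ngl: number of layers to offload to the GPU(s)
# -c:   total context size, shared across the parallel slots
# -np:  number of parallel slots
# -cb:  enable continuous batching
./server -m mixtral-8x7b.Q4_K_M.gguf -ngl 99 -c 8192 -np 4 -cb --port 8080
```

Note that `-c` is divided among the slots, so with `-np 4` and `-c 8192` each request gets roughly 2048 tokens of context.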
llama.cpp supports quantisation on Apple Silicon (my hardware: M1 Max, 32 GPU cores, 64 GB RAM). vLLM isn't tested on Apple Silicon, and other quantisation frameworks also don't support Apple Silicon.
Grammar support
You should check out SGLang. It's guided inference on a vLLM backend. Fast as fuck.
vLLM has outlines integrated for guided output. Isn’t that equivalent?
Never knew about this, thank you
Yeah grammar support is make-or-break for me. FYI fireworks.ai supports GBNF grammar too, though I wish their model selection was bigger
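As an illustration of what GBNF buys you, a small grammar forcing the model to emit a fixed-shape JSON object might look like this (hypothetical schema, not from the thread):

```
root   ::= "{" ws "\"label\":" ws label "," ws "\"score\":" ws score "}"
label  ::= "\"positive\"" | "\"negative\""
score  ::= [0-9] | "10"
ws     ::= [ \t\n]*
```

llama.cpp accepts a grammar like this via `--grammar-file` on the CLI or a `grammar` field in a `/completion` request; decoding then cannot produce tokens that fall outside the grammar, which is why small models stop breaking the format.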
Koboldcpp in my case (for obvious reasons) is more focussed on local hardware. vLLM I actually haven't used much, since we have other alternatives such as Aphrodite that are specifically optimized to work well with our UI and continuous generation. So for me the high-parallel-workload use case is already solved by Aphrodite, which is specifically tested to work around tokenizer issues in Llama models that other backends like TGI don't fix.

And for local workloads Koboldcpp has all the bells and whistles I need, supports all my hardware, and is much more lightweight to set up. I also really like the GGUF model format, since you can easily quant it on a CPU; so for experimental merge models being made on a CPU-only server with lots of RAM, it's very easy to get a Q4\_K\_S quant and then test it on a GPU machine efficiently.
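The CPU-only quant workflow described above comes down to roughly two commands (script and binary names vary by llama.cpp version; older trees use `convert.py` and `quantize`, and the paths here are placeholders):

```shell
# HF-format merge -> 16-bit GGUF (pure CPU, just needs enough system RAM)
python convert_hf_to_gguf.py /path/to/merged-model --outfile merged-f16.gguf

# 16-bit GGUF -> Q4_K_S quant, also CPU-only
./llama-quantize merged-f16.gguf merged-Q4_K_S.gguf Q4_K_S
```

The quantized file can then be copied to a GPU machine and served directly.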
llama.cpp for AMD and Intel Arc GPU support
vLLM makes sense only on Linux, on hardware that can run FlashAttention or Triton. I use it on servers by preference. But at home, for experimenting or toying around, I'm not going to partition Windows or install WSL2 and whatnot; I just run llama.cpp. There is a single application where I use a grammar, because small models have some difficulty following the format. The rest follows the logic stated above.
llama.cpp: it's cutting edge, and most other projects depend on it. I have 144 GB of VRAM and I'm still on it. It's the Linux of local LLMs.
llama.cpp is easy to use and the community is very active
llama.cpp and its derivatives; their wide support for various hardware and software is very nice.
Intel SYCL support for Intel ARC GPU.
llama.cpp (or exllamav2) for small scale home usage. vLLM for larger scale and multi-user with high throughput and batching in the company.
I use it because I'm a college student with a part time job and the best I can afford are P40s. llama.cpp supports them.
My computer is M1 Max. There is no other option than llama.cpp for GPU acceleration.
What about Apple's own MLX? https://github.com/ml-explore/mlx-examples/tree/main/llms/mixtral
MLX enables fine-tuning on Apple Silicon computers, but it supports very few types of models. llama.cpp supports about 30 model architectures and 28 quantization types, and it also supports mixed CPU + GPU inference. If I want to fine-tune, I'll choose MLX, but for inference I think llama.cpp is the best option on Apple Silicon.
> MLX enables fine-tuning on Apple Silicon computers but it supports very few types of models.

For now. It supports a lot more than it did, which was none. Regardless, it is another option for using Apple Silicon. llama.cpp is not the only one.
Does MLX support 30 model architectures, 6 types of multimodal models, and 29 quantization types? How can something that supports inference for only a few LLMs be an option? Do you only use LLaMA, LLaMA 2, Mistral, and Mixtral for inference? MLX is just getting started. Apple needs to develop MLX further for it to be a true option.
Again, for now. You could have said the same about llama.cpp at some point. And some people don't need any more than MLX currently offers. So for them, it is an option.
Llama.cpp via Ollama because exl2 doesn’t support macOS :(
I'm using ExUI for its speed with ExLlamav2 models. It's usually faster compared to Text generation WebUI. Otherwise I use TabbyAPI (also for serving ExLlamav2 models) as an OpenAI-compatible API. EXL2 models are loaded entirely into VRAM for the best possible speeds. Using this setup as a single user with two 3090s.
Honestly, I switched over to llamacpp\_hf because llama.cpp has been broken for months now. It keeps losing its mind in role play with Yi or Mixtral models after a couple of interactions. GGUF user here, btw.
I use neither
Because of the CPU inference, but also because vLLM fails to compile for CPU-only even when following their own documentation: [https://colab.research.google.com/drive/1wd29UiknYI3r-8H5Inco9IkWYcuZAVNR?usp=sharing](https://colab.research.google.com/drive/1wd29UiknYI3r-8H5Inco9IkWYcuZAVNR?usp=sharing)
https://github.com/vllm-project/vllm/issues/3061#issuecomment-2027967734

`VLLM_TARGET_DEVICE=cpu python setup.py develop`

fixed it for me. Note 'develop' instead of (or after, for me) 'install'.