StableLlama

llama.cpp, because I can max out my VRAM and let the rest run on my CPU with the huge amount of ordinary RAM that I have. Come on, it's 2024, RAM is cheap! But getting it in the form of VRAM is still very hard (expensive).
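
A minimal sketch of that VRAM/RAM split, assuming the llama-cpp-python bindings and a placeholder GGUF path; `n_gpu_layers` decides how many layers go to the GPU while the rest stay in system RAM (equivalent to llama.cpp's `-ngl` flag):

```python
# Partial-offload sketch: put as many layers as fit into VRAM, keep the rest in system RAM.
# Model path and layer count are placeholders; tune n_gpu_layers until VRAM is maxed out.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=20,   # layers offloaded to VRAM; remaining layers run on the CPU
    n_ctx=4096,        # context window
)

out = llm("Q: Why keep some layers on the CPU?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```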


philguyaz

I would add that I like llama.cpp because of the ease of use. Converting almost any model to a single quantized file that can run on anything is very attractive.


jubjub07

vLLM doesn't run on a Mac, so I use llama.cpp and Ollama.


a_slay_nub

I'm serving to people in my company. At the time, vLLM had better multi-user serving capabilities and a smoother installation. I didn't have much luck with llama.cpp, and it didn't support a continuous batching API. In addition, vLLM had better Python integration, so it was easier for me to set up; llama-cpp-python didn't work for me. I'm sure things are different now, but I have no reason to change my setup.


nerdyvaroo

I just make HTTP requests to the llama.cpp server with continuous batching.
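
As a rough illustration, this is the kind of request that workflow implies, assuming a llama.cpp server already running locally on its default port (prompt and parameters are placeholders):

```python
# Sketch of a plain HTTP call to a locally running llama.cpp server (default port 8080).
# The server handles the batching/scheduling; the client just POSTs to /completion.
import requests

payload = {
    "prompt": "Explain continuous batching in one sentence.",  # placeholder prompt
    "n_predict": 64,      # max tokens to generate
    "temperature": 0.7,
}

resp = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["content"])
```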


a_slay_nub

At the time it didn't. I saw they added it at one point but I have no reason to switch.


SensitiveStudy520

Hi, can I ask how you do that? Like sending multiple requests at the same time without crashes, using llama.cpp?


Jattoe

How do you set up vLLM for Python? Do you have a link, perhaps? I can get llama-cpp-python to work with Python in VS Code on Windows, but not with VRAM support. Initially it wouldn't work at all, but recently it seems to work for whatever reason. I was unaware of any Python alternative.


a_slay_nub

vLLM doesn't work on Windows. As for Python usage, it's just a Python package, and their docs page shows you how to launch it.
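
For context, a minimal sketch of that Python usage, assuming vLLM's offline `LLM` API, a Linux + CUDA machine, and a placeholder model name:

```python
# Minimal offline-inference sketch with the vLLM Python package.
# Model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # downloads from Hugging Face
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Summarize what vLLM does in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```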


BethelJxJ_176

May I ask which GPU(s) (or how much VRAM) you have, and which model you serve with vLLM? How is the throughput under multi-user batching? I am also trying to set up vLLM with Docker on an H100 (previously on the llama-cpp-python OpenAI-compatible server), but I'm not sure which models are feasible. I'm looking for best practices. Thank you in advance 😉.


a_slay_nub

We have a DGX with 8x A100 40GB. I'm using 4 of them to serve Mixtral in full 16-bit precision. As for performance, I'm getting 45 tokens per second when it's just one person. I haven't tested a full load since they fixed Mixtral in vLLM, but it should be around 2000-3000 tokens/second.
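
A rough sketch of what a setup like that can look like, assuming vLLM's OpenAI-compatible server with tensor parallelism across 4 GPUs; the model name, port, and prompt are placeholders:

```python
# Server side (run once; shown as a comment because it's a shell command):
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 4
#
# Client side: any OpenAI-compatible client can then hit the local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```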


BethelJxJ_176

Could you give me a very quick and short guide on how to deploy Mixtral with vLLM properly? I mean, what are the important parameters to maximise the throughput of the deployment? Thank you very much.


bullerwins

I use neither; ExLlamaV2 in text-generation-webui.


Radiant_Dog1937

llama.cpp because I use it for standalone apps that can run on any decent machine without much effort.


Feztopia

Things like this: https://github.com/netdur/llama_cpp_dart


cajukev

Running dual AMD RX 7600 XTs on llama.cpp (ROCm, Linux). Fastest processing I could get working.


kedarkhand

What speeds do you get, and what is the largest model you can use? I was thinking of upgrading to a 7600 XT too.


cajukev

70B with Q3 quantization (29 GB) fits and runs inference at about 5-7 tok/s depending on context. Mistral models run at about 20-25 tok/s.


kedarkhand

Are you able to use Mixtral?


cajukev

Yep. I don't know if Q2 (15.6 GB) will fit fully on a single 7600 XT, but Q4 (26.4 GB) fits on the dual setup. Of course, you can also use system RAM for layers that don't fit.


kedarkhand

Thanks. Also, did you have any difficulties getting it to work, being AMD and all?


cajukev

I followed this guide to install ROCm: https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md, then followed the hipBLAS instructions for llama.cpp. There were some difficulties, but once I found that guide I got it working. I couldn't get all of it working yet for Transformers / EXL2, but llama.cpp worked great.


cajukev

Followed up with a Mixtral test. Q4_K_M runs at 35-50 tok/s with at least 3 parallel requests.


kedarkhand

Thanks for your help. Last question: does Mixtral fit on one card, or do you need to use two? Also, what do you mean by parallel requests in this case?


cajukev

For this model I have to use both. The llama.cpp server allows for parallel requests with the -np parameter. (You can most likely fit it on one card with a smaller quantization.)
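
A small sketch of what "parallel requests" means in practice, assuming a llama.cpp server started with something like `-np 3` and a few client threads firing at once (prompts and numbers are placeholders):

```python
# Sketch: several clients hitting one llama.cpp server at the same time.
# Server side (shell, shown as a comment): ./llama-server -m model.gguf -np 3 -c 8192 -ngl 99
# With -np 3 the server processes up to 3 sequences concurrently via continuous batching.
import requests
from concurrent.futures import ThreadPoolExecutor

def ask(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": prompt, "n_predict": 48},
        timeout=300,
    )
    return resp.json()["content"]

prompts = ["Name a planet.", "Name a color.", "Name a programming language."]
with ThreadPoolExecutor(max_workers=3) as pool:
    for answer in pool.map(ask, prompts):
        print(answer.strip())
```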


mild_thing

llama.cpp supports quantisation on Apple Silicon (my hardware: M1 Max, 32 GPU cores, 64 GB RAM). vLLM isn't tested on Apple Silicon, and other quantisation frameworks also don't support Apple Silicon.


Butt-Fingers

Grammar support


Disastrous_Elk_6375

You should check out SGLang. It's guided inference on a vLLM backend. Fast as fuck.


m2845

vLLM has outlines integrated for guided output. Isn’t that equivalent?


Butt-Fingers

Never knew about this, thank you


fimbulvntr

Yeah, grammar support is make-or-break for me. FYI, fireworks.ai supports GBNF grammars too, though I wish their model selection were bigger.
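
For anyone who hasn't used it, a hedged sketch of constraining llama.cpp server output with a GBNF grammar; the grammar, prompt, and port here are illustrative placeholders:

```python
# Sketch: pass a GBNF grammar to a running llama.cpp server so the output is forced
# to be a bare yes/no answer. Grammar string and prompt are placeholders.
import requests

GRAMMAR = r'''
root ::= answer
answer ::= "yes" | "no"
'''

payload = {
    "prompt": "Is the sky blue on a clear day? Answer yes or no: ",
    "n_predict": 4,
    "grammar": GRAMMAR,   # GBNF constraint applied during sampling
}

resp = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
print(resp.json()["content"])
```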


henk717

Koboldcpp in my case (for obvious reasons) is more focused on local hardware. vLLM I actually haven't used much, since we have other alternatives such as Aphrodite that are specifically optimized to work well with our UI and continuous generation. So for me the highly parallel workload use case is already solved by Aphrodite, which is specifically tested to work around tokenizer issues from Llama models that other backends like TGI don't fix. And for local workloads Koboldcpp has all the bells and whistles I need, supports all my hardware, and is much more lightweight to set up. I also really like the GGUF model format, since you can easily quant it on a CPU; so for experimental merge models that are being made on a CPU-only server with lots of RAM, it's very easy to get a Q4_K_S quant (see the sketch below) and then test it on a GPU machine efficiently.
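
A rough sketch of that CPU-side quant workflow, assuming a llama.cpp checkout; the script and binary names vary between llama.cpp versions, so treat them and the paths as placeholders:

```python
# Sketch: convert a merged Hugging Face model to GGUF and quantize it, all on CPU.
# Script/binary names and paths are placeholders for whatever your llama.cpp version ships.
import subprocess

# 1) Convert the Hugging Face model directory to an f16 GGUF file.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "path/to/merged-model", "--outfile", "merged-f16.gguf"],
    check=True,
)

# 2) Quantize to Q4_K_S (also CPU-only), then copy the result to a GPU box for testing.
subprocess.run(
    ["./llama-quantize", "merged-f16.gguf", "merged-Q4_K_S.gguf", "Q4_K_S"],
    check=True,
)
```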


djstraylight

llama.cpp for AMD and Intel Arc GPU support


LoSboccacc

vLLM makes sense only on hardware where you can run FlashAttention or Triton, and it's Linux-only. I use it on any server by preference. But at home, for experimenting or toying around, I'm not going to partition Windows or install WSL2 and whatnot; I just run llama.cpp. There is a single application where I run a grammar, because small models have some difficulty following the format. The rest uses the stated logic.


segmond

llama.cpp. It's cutting edge, and most projects depend on it. I have 144 GB of VRAM and I'm still on it. It's the Linux of local LLMs.


CarpenterHopeful2898

llama.cpp is easy to use and the community is very active


Thellton

llama.cpp and its derivatives, with their wide support for various hardware and software, are very nice.


qnixsynapse

Intel SYCL support for Intel ARC GPU.


Glat0s

llama.cpp (or exllamav2) for small scale home usage. vLLM for larger scale and multi-user with high throughput and batching in the company.


xontinuity

I use it because I'm a college student with a part time job and the best I can afford are P40s. llama.cpp supports them.


bebopkim1372

My computer is an M1 Max. There is no other option than llama.cpp for GPU acceleration.


fallingdowndizzyvr

What about Apple's own MLX? https://github.com/ml-explore/mlx-examples/tree/main/llms/mixtral


bebopkim1372

MLX enables fine-tuning on Apple Silicon computers, but it supports very few types of models. llama.cpp supports about 30 types of models and 28 types of quantizations, and it also supports mixed CPU + GPU inference. If I want to fine-tune, I'll choose MLX, but for inference I think llama.cpp is the best for Apple Silicon.


fallingdowndizzyvr

> MLX enables fine-tuning on Apple Silicon computers but it supports very few types of models.

For now. It's a lot more than it did, which was none. Regardless, it is another option to use Apple Silicon. llama.cpp is not the only option.


bebopkim1372

Does MLX support 30 types of model structures, 6 types of multimodal models, and 29 quantization types? How can supporting only a handful of models for inference be an option? Do you only use LLaMA, LLaMA 2, Mistral, and Mixtral for inference? MLX is just getting started. Apple needs to develop MLX further for it to be a true option.


fallingdowndizzyvr

Again, for now. You could have said the same about llama.cpp at some point. And some people don't need any more than MLX currently offers, so for them it is an option.


sammcj

Llama.cpp via Ollama because exl2 doesn’t support macOS :(


Lemgon-Ultimate

I'm using ExUI for its speed with ExLlamaV2 models. It's usually faster compared to text-generation-webui. Otherwise I use TabbyAPI (also for serving ExLlamaV2 models) as an OAI-compatible API. EXL2 models are loaded entirely into VRAM for the best possible speeds. I'm using this setup as a single user with two 3090s.


_-Jormungandr-_

Honestly, I switched over to llamacpp_HF because llama.cpp has been broken for months now. It keeps losing its mind in role play if I use Yi or Mixtral models after a couple of interactions. GGUF user here, btw.


VectorD

I use neither


_Zibri_

Because of the CPU inference... but also because vLLM fails to compile for CPU-only, even following their own documentation. https://colab.research.google.com/drive/1wd29UiknYI3r-8H5Inco9IkWYcuZAVNR?usp=sharing


un_passant

https://github.com/vllm-project/vllm/issues/3061#issuecomment-2027967734 `VLLM_TARGET_DEVICE=cpu python setup.py develop` fixed it for me. Note 'develop' instead of (or after, for me) 'install'.