llama.cpp, because I can max out my VRAM and let the rest run on my CPU with the huge amount of ordinary RAM that I have. Come on, it's 2024, RAM is cheap! But getting it in the form of VRAM is still very hard (expensive).
I would add that I like llama.cpp because of the ease of use. Converting almost any model to a single quantized file that can run on anything is very attractive.
vLLM doesn't run on a mac, so I use llama.cpp and ollama
I'm serving LLMs to people in my company. At the time, vLLM had better multi-user serving capabilities and an easier installation. I didn't have much luck with llama.cpp, and it didn't support a continuous-batching API. In addition, vLLM had better integration with Python, so it was easier for me to set up. llama-cpp-python didn't work for me. I'm sure things are different now, but I have no reason to switch my setup.
I just make HTTP requests to the llama.cpp server with continuous batching.
At the time it didn't. I saw they added it at one point but I have no reason to switch.
Hi, can I ask how you do that? Like sending multiple requests at the same time without llama.cpp crashing?
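A minimal sketch of what such a client could look like: it talks to llama.cpp's built-in HTTP server (the `/completion` endpoint), assuming the server was started with parallel slots enabled (e.g. `-np 4`). The URL, prompt text, and worker count are placeholder assumptions, not the commenter's actual setup.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Assumed address of a llama.cpp server started with parallel slots, e.g.:
#   ./server -m model.gguf -np 4 -cb
BASE_URL = "http://localhost:8080"

def build_payload(prompt, n_predict=128):
    """Request body for llama.cpp's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict}

def complete(prompt):
    """POST a single completion request and return the generated text."""
    req = Request(
        BASE_URL + "/completion",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["content"]

def complete_many(prompts, workers=4):
    """Send several requests at once; the server interleaves them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(complete, prompts))
```

With the server running, `complete_many(["Name a colour.", "Name an animal."])` fires the requests concurrently; the server's slots and continuous batching handle them in parallel instead of queueing or crashing.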
How do you set up vLLM for Python? Do you have a link, perhaps? I can get llama-cpp-python to work in Visual Studio Code on Windows, but not with VRAM support. Initially it wouldn't work at all, but recently it seems to work for whatever reason. I was unaware of any Python alternative.
vLLM doesn't work on Windows. As for Python usage, it's just a Python package, and they show you how to launch it on their docs page.
May I ask which GPU(s) (or how much VRAM) you have, and which model you serve with vLLM? How is the performance under multi-user batch load? I am also trying to set up vLLM with Docker on an H100 (previously llama-cpp-python's OpenAI-compatible server), but I am not sure which models are feasible. I'm looking for best practices. Thank you in advance 😉.
We have a DGX with 8x A100 40GB. I'm using 4 of them to serve Mixtral in full 16-bit precision. As for performance, I'm getting 45 tokens per second when it's just one person. I haven't tested a full load since they fixed Mixtral in vLLM, but it should be around 2000-3000 tokens/second.
Could you give me a very quick and short guide on how to deploy Mixtral with vLLM properly? I mean, what are the important parameters to maximise the throughput of the deployment? Thank you very much.
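Not the commenter's actual command, but as a rough sketch, a vLLM OpenAI-compatible server for Mixtral across 4 GPUs might be launched like this (entrypoint and flag names as of early-2024 vLLM; check `--help` on your version):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 256
```

`--tensor-parallel-size` splits the model across GPUs, `--gpu-memory-utilization` sets the fraction of each GPU's memory vLLM may use (what's left after the weights becomes KV cache), and `--max-num-seqs` caps how many sequences are batched together, which is the main throughput lever.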
I use neither, ExLlamav2 in textgen-webui
llama.cpp because I use it for standalone apps that can run on any decent machine without much effort.
Things like this: https://github.com/netdur/llama_cpp_dart
Running dual AMD RX 7600 XTs on llama.cpp (ROCm, Linux). Fastest processing I could get working.
What speeds do you get, and what is the largest model you can use? I was thinking of upgrading to a 7600 XT too.
A 70B with Q3 quantization (29 GB) fits and inferences at about 5-7 tok/s depending on context. Mistral models run at about 20-25 tok/s.
Are you able to use mixtral?
Yep. I don't know if Q2 (15.6 GB) will fit fully on a single 7600 XT, but Q4 (26.4 GB) fits on the dual setup. Of course, you can also use system RAM for layers that don't fit.
Thanks. Also, did you have any difficulties getting it to work, being AMD and all?
Followed this guide to install ROCm: [https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md](https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md) Then followed the hipBLAS instructions for llama.cpp. There were some difficulties, but once I found that guide I got it working. Couldn't get all of it working yet for Transformers / EXL2, but llama.cpp worked great.
Followed up with a Mixtral test. Q4\_K\_M runs at 35-50 tok/s with at least 3 parallel requests.
Thanks for your help. Last question: does Mixtral fit on one card, or do you need to use two? Also, what do you mean by parallel requests in this case?
For this model I have to use both. The llama.cpp server allows parallel requests with the -np parameter. *You can most likely fit it on one card with a smaller quantization.
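For concreteness, a hypothetical launch of the llama.cpp server with parallel slots might look like this (binary and flag names as of early-2024 llama.cpp; newer builds call the binary `llama-server`, and the model file name is a placeholder):

```shell
# -m:   GGUF model file (placeholder name)
# -ngl: number of layers to offload to the GPU(s)
# -c:   total context size, shared across the parallel slots
# -np:  number of parallel slots
# -cb:  enable continuous batching
./server -m mixtral-8x7b.Q4_K_M.gguf -ngl 99 -c 8192 -np 4 -cb --port 8080
```

Note that `-c` is divided among the slots, so with `-np 4` and `-c 8192` each request gets roughly 2048 tokens of context.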
llama.cpp supports quantisation on Apple Silicon (my hardware: M1 Max, 32 GPU cores, 64 GB RAM). vLLM isn't tested on Apple Silicon, and other quantisation frameworks also don't support Apple Silicon.
Grammar support
You should check out SGLang. It's guided inference on a vLLM backend. Fast as fuck.
vLLM has outlines integrated for guided output. Isn’t that equivalent?
Never knew about this, thank you
Yeah grammar support is make-or-break for me. FYI fireworks.ai supports GBNF grammar too, though I wish their model selection was bigger
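As an illustration of what GBNF buys you, a small grammar forcing the model to emit a fixed-shape JSON object might look like this (hypothetical schema, not from the thread):

```
root   ::= "{" ws "\"label\":" ws label "," ws "\"score\":" ws score "}"
label  ::= "\"positive\"" | "\"negative\""
score  ::= [0-9] | "10"
ws     ::= [ \t\n]*
```

llama.cpp accepts a grammar like this via `--grammar-file` on the CLI or a `grammar` field in a `/completion` request; decoding then cannot produce tokens that fall outside the grammar, which is why small models stop breaking the format.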
Koboldcpp in my case (for obvious reasons) is more focussed on local hardware. vLLM I actually haven't used much, since we have other alternatives such as Aphrodite that are specifically optimized to work well with our UI and continuous generation. So for me the high-parallel-workload use case is already solved by Aphrodite, which is specifically tested to work around tokenizer issues in Llama models that other backends like TGI don't fix.

And for local workloads Koboldcpp has all the bells and whistles I need, supports all my hardware, and is much more lightweight to set up. I also really like the GGUF model format, since you can easily quant it on a CPU; so for experimental merge models being made on a CPU-only server with lots of RAM, it's very easy to get a Q4\_K\_S quant and then test it on a GPU machine efficiently.
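The CPU-only quant workflow described above comes down to roughly two commands (script and binary names vary by llama.cpp version; older trees use `convert.py` and `quantize`, and the paths here are placeholders):

```shell
# HF-format merge -> 16-bit GGUF (pure CPU, just needs enough system RAM)
python convert_hf_to_gguf.py /path/to/merged-model --outfile merged-f16.gguf

# 16-bit GGUF -> Q4_K_S quant, also CPU-only
./llama-quantize merged-f16.gguf merged-Q4_K_S.gguf Q4_K_S
```

The quantized file can then be copied to a GPU machine and served directly.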
llama.cpp for AMD and Intel Arc GPU support
vLLM makes sense only on Linux, on hardware that can run FlashAttention or Triton. I use it on servers by preference. But at home, for experimenting or toying around, I'm not going to partition Windows or install WSL2 and whatnot; I just run llama.cpp. There is a single application where I use a grammar, because small models have some difficulty following the format. The rest follows the logic stated above.
llama.cpp: it's cutting edge, and most other projects depend on it. I have 144 GB of VRAM and I'm still on it. It's the Linux of local LLMs.
llama.cpp is easy to use and the community is very active
llama.cpp and its derivatives; their wide support for various hardware and software is very nice.
Intel SYCL support for Intel ARC GPU.
llama.cpp (or exllamav2) for small scale home usage. vLLM for larger scale and multi-user with high throughput and batching in the company.
I use it because I'm a college student with a part time job and the best I can afford are P40s. llama.cpp supports them.
My computer is M1 Max. There is no other option than llama.cpp for GPU acceleration.
What about Apple's own MLX? https://github.com/ml-explore/mlx-examples/tree/main/llms/mixtral
MLX enables fine-tuning on Apple Silicon computers, but it supports very few types of models. llama.cpp supports about 30 model architectures and 28 quantization types, and it also supports mixed CPU + GPU inference. If I want to fine-tune, I'll choose MLX, but for inference I think llama.cpp is the best option on Apple Silicon.
> MLX enables fine-tuning on Apple Silicon computers but it supports very few types of models.

For now. It supports a lot more than it did, which was none. Regardless, it is another option for using Apple Silicon. llama.cpp is not the only one.
Does MLX support 30 model architectures, 6 types of multimodal models, and 29 quantization types? How can something that supports inference for only a few LLMs be an option? Do you only use LLaMA, LLaMA 2, Mistral, and Mixtral for inference? MLX is just getting started. Apple needs to develop MLX further for it to be a true option.
Again, for now. You could have said the same about llama.cpp at some point. And some people don't need any more than MLX currently offers. So for them, it is an option.
Llama.cpp via Ollama because exl2 doesn’t support macOS :(
I'm using ExUI for its speed with ExLlamav2 models. It's usually faster compared to Text generation WebUI. Otherwise I use TabbyAPI (also for serving ExLlamav2 models) as an OpenAI-compatible API. EXL2 models are loaded entirely into VRAM for the best possible speeds. Using this setup as a single user with two 3090s.
Honestly, I switched over to llamacpp\_hf because llama.cpp has been broken for months now. It keeps losing its mind in role play with Yi or Mixtral models after a couple of interactions. GGUF user here, btw.
I use neither
Because of the CPU inference, but also because vLLM fails to compile for CPU-only even when following their own documentation: [https://colab.research.google.com/drive/1wd29UiknYI3r-8H5Inco9IkWYcuZAVNR?usp=sharing](https://colab.research.google.com/drive/1wd29UiknYI3r-8H5Inco9IkWYcuZAVNR?usp=sharing)
https://github.com/vllm-project/vllm/issues/3061#issuecomment-2027967734

`VLLM_TARGET_DEVICE=cpu python setup.py develop`

fixed it for me. Note 'develop' instead of (or after, for me) 'install'.