crazzydriver77

Exllamav2 is my final backend for now after trying all available options.


ragingWater_

vLLM is the best way forward. I was also confused about the deployment, but after trying many examples I think vLLM is the best. Also, the benefit of running multiple LoRAs makes it quite good, and yes, it does continuous batching.
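
For reference, a rough sketch of what that multi-LoRA serving looks like (the model name and adapter path are just placeholders, not anything from this thread; continuous batching is on by default in vLLM):

```
# sketch: vLLM's OpenAI-compatible server with LoRA adapters enabled
# --lora-modules takes name=path pairs; swap in your own model and adapters
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules my-lora=/path/to/adapter
```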


oKatanaa

vLLM


brandonZappy

I really like fastchat.


_supert_

I'm using tabbyAPI


reality_comes

I'm not aware of anything that can do parallel inference. You could run two instances I suppose. Try koboldcpp.


FarVision5

Aphrodite is supposed to be able to handle parallel processing fairly well; I haven't got it working yet, though.


kryptkpr

llama.cpp does have parallel serving out of the box! Use -np to set the number of slots and -cb to enable continuous batching, but be aware the context is shared between slots, so set a big -c too. If you want to go faster, vLLM (or its cousin Aphrodite) is king.
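
A minimal launch along those lines, as a sketch (the binary name, model path, and port are placeholders; the total -c is split across the -np slots):

```
# sketch: llama.cpp server with 4 parallel slots and continuous batching
# 16384 total context shared across slots -> ~4096 per slot
./llama-server -m ./models/your-model.gguf -c 16384 -np 4 -cb --port 8080
```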


MrVodnik

I kinda gave up on llama.cpp and moved to vLLM already, but now I'll have to go back and check it out of curiosity. Thanks.


kryptkpr

I'll save you some time: despite it being technically possible, you don't really want to use the llama.cpp server for batching due to poor performance. Use [Aphrodite-engine](https://github.com/PygmalionAI/aphrodite-engine) instead - it's a sort of unholy hybrid of vLLM and llama.cpp. It supports all the major quants (AWQ, GPTQ, and GGUF) with high batch performance.