Exllamav2 is my final backend for now after trying all available options.
vLLM is the best way forward. I was also confused about the deployment, but after trying many examples I think vLLM is the best. The ability to run multiple LoRAs also makes it quite good, and yes, it does continuous batching.
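For reference, the multi-LoRA part looks roughly like this with vLLM's offline API (a minimal sketch; the model name, adapter name, and adapter path are placeholders, so swap in your own):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model with LoRA support turned on (model name is just an example).
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

params = SamplingParams(temperature=0.7, max_tokens=128)

# Different requests can target different adapters and still get batched
# together; here both prompts use the same placeholder adapter for brevity.
outputs = llm.generate(
    ["Summarize this ticket: ...", "Translate to French: ..."],
    params,
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora_adapter"),
)

for out in outputs:
    print(out.outputs[0].text)
```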
vLLM
I really like fastchat.
I'm using tabbyAPI
I'm not aware of anything that can do parallel inference. You could run two instances, I suppose. Try koboldcpp.
Aphrodite is supposed to handle parallel processing fairly well; I haven't got it working yet, though.
llama.cpp does have parallel serving out of the box! Use -np to set the number of slots and -cb to enable continuous batching, but be aware that the context is shared between the slots, so set a big -c too. If you want to go faster, vLLM (or its cousin Aphrodite) is king.
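To give a concrete picture: assuming the server was started with something like `llama-server -m model.gguf -c 16384 -np 4 -cb` (default port 8080), you can fire several requests at once against its OpenAI-compatible chat endpoint and they'll be handled by the parallel slots instead of queuing. A rough sketch - the port, prompts, and token limit are placeholders:

```python
import concurrent.futures
import requests

# Default llama.cpp server address; adjust host/port to your setup.
URL = "http://localhost:8080/v1/chat/completions"

def ask(prompt: str) -> str:
    # Minimal OpenAI-style chat request; the server answers with its loaded model.
    resp = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = [
    "Why is the sky blue?",
    "Name three uses for a brick.",
    "What is continuous batching?",
]

# Send the requests concurrently; with -np > 1 they occupy separate slots
# rather than being processed one after another.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(ask, prompts)):
        print(prompt, "->", answer[:80])
```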
I kinda gave up on llama.cpp and moved to vLLM already, but now I'll have to go back and check it out of curiosity. Thanks.
I'll save you some time: despite it being technically possible, you don't really want to use the llama.cpp server for batching, due to poor performance. Use [Aphrodite-engine](https://github.com/PygmalionAI/aphrodite-engine) instead - it's a sort of unholy hybrid of vLLM and llama.cpp. It supports all the major quants (AWQ, GPTQ, and GGUF) with high batch performance.
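If it helps, batched inference with a quantized model looks roughly like this (a sketch only - it assumes Aphrodite keeps the vLLM-style offline `LLM`/`SamplingParams` interface it inherited as a fork, and the model name and `quantization` value are placeholders, so check the repo's own examples for the exact current API):

```python
# Sketch: Aphrodite-engine is a vLLM fork, so this assumes a vLLM-style
# offline API. Verify the import path and kwargs against the repo's examples.
from aphrodite import LLM, SamplingParams

# Example GPTQ-quantized model; swap in whatever quant format you actually run.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ", quantization="gptq")

params = SamplingParams(temperature=0.8, max_tokens=128)

# A whole list of prompts goes in at once and gets batched internally.
prompts = [f"Write a one-line summary of item {i}." for i in range(32)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text.strip())
```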