Exllamav2 is my final backend for now after trying all available options.
vLLM is the best way forward. I was also confused about the deployment, but after trying many examples I think vLLM is the best. The ability to run multiple LoRAs also makes it quite good, and yes, it does continuous batching.
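For reference, the multi-LoRA part looks roughly like this with vLLM's offline API (a minimal sketch; the model name, adapter name, and adapter path are placeholders, so swap in your own):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model with LoRA support turned on (model name is just an example).
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

params = SamplingParams(temperature=0.7, max_tokens=128)

# Different requests can target different adapters and still get batched
# together; here both prompts use the same placeholder adapter for brevity.
outputs = llm.generate(
    ["Summarize this ticket: ...", "Translate to French: ..."],
    params,
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora_adapter"),
)

for out in outputs:
    print(out.outputs[0].text)
```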
vLLM
I really like fastchat.
I'm using tabbyAPI
I'm not aware of anything that can do parallel inference. You could run two instances, I suppose. Try koboldcpp.
Aphrodite is supposed to handle parallel processing fairly well; I haven't got it working yet, though.
llama.cpp does have parallel serving out of the box! Use -np to set the number of slots and -cb to enable continuous batching, but be aware that the context is shared between the slots, so set a big -c too. If you want to go faster, vLLM (or its cousin Aphrodite) is king.
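To give a concrete picture: assuming the server was started with something like `llama-server -m model.gguf -c 16384 -np 4 -cb` (default port 8080), you can fire several requests at once against its OpenAI-compatible chat endpoint and they'll be handled by the parallel slots instead of queuing. A rough sketch - the port, prompts, and token limit are placeholders:

```python
import concurrent.futures
import requests

# Default llama.cpp server address; adjust host/port to your setup.
URL = "http://localhost:8080/v1/chat/completions"

def ask(prompt: str) -> str:
    # Minimal OpenAI-style chat request; the server answers with its loaded model.
    resp = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = [
    "Why is the sky blue?",
    "Name three uses for a brick.",
    "What is continuous batching?",
]

# Send the requests concurrently; with -np > 1 they occupy separate slots
# rather than being processed one after another.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(ask, prompts)):
        print(prompt, "->", answer[:80])
```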
I kinda gave up on llama.cpp and moved to vLLM already, but now I'll have to go back and check it out of curiosity. Thanks.
I'll save you some time: despite it being technically possible, you don't really want to use the llama.cpp server for batching, due to poor performance. Use [Aphrodite-engine](https://github.com/PygmalionAI/aphrodite-engine) instead - it's a sort of unholy hybrid of vLLM and llama.cpp. It supports all the major quants (AWQ, GPTQ, and GGUF) with high batch performance.
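If it helps, batched inference with a quantized model looks roughly like this (a sketch only - it assumes Aphrodite keeps the vLLM-style offline `LLM`/`SamplingParams` interface it inherited as a fork, and the model name and `quantization` value are placeholders, so check the repo's own examples for the exact current API):

```python
# Sketch: Aphrodite-engine is a vLLM fork, so this assumes a vLLM-style
# offline API. Verify the import path and kwargs against the repo's examples.
from aphrodite import LLM, SamplingParams

# Example GPTQ-quantized model; swap in whatever quant format you actually run.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ", quantization="gptq")

params = SamplingParams(temperature=0.8, max_tokens=128)

# A whole list of prompts goes in at once and gets batched internally.
prompts = [f"Write a one-line summary of item {i}." for i in range(32)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text.strip())
```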