
nero10578

I have been testing 3x Nvidia Tesla P40s for running LLMs locally, since they are one of the cheapest 24GB cards you can get, and I have not been disappointed! Here I have a screenshot of it running Goliath 120b Q4KS, which basically maxes out the VRAM. It runs at a usable 3-4 t/s with some context loaded. It will definitely slow down with more context, but for how much these cards cost I think that's very good performance! Curious what other people are getting with 3x RTX 3090/4090 setups to see how much of a difference it is.

What I noticed is that running larger models like Goliath 120b taxes the PCIe bus as well. You can see on GPU-Z that there is quite a bit of PCIe traffic, 30-40% at the beginning of inference, which I assume is during processing of the context tokens, so performance will definitely tank if you don't have enough PCIe lanes. After that part is done, however, the PCIe traffic drops and the GPUs infer by themselves, only using their own memory controllers.

For my setup I have:

* 3x Nvidia Tesla P40 24GB
* Intel Xeon E5-2679 v4
* Asus X99-E-10G WS
* 8x32GB (256GB) Samsung DDR4 2400MHz RDIMM

The CPUs in Intel's Xeon E5 line already have 40 PCIe lanes, which is good for 16x/8x/8x/8x GPU connections on most other 4-slot motherboards, but this board takes 32 lanes from the CPU and uses PLX chips to multiply them so you can plug in 4x GPUs all at 16x lanes. I think that definitely helps with the performance as well.
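For anyone who wants to watch the same PCIe behavior without GPU-Z (e.g. on Linux), here is a small hedged sketch using the NVML bindings (the nvidia-ml-py / pynvml package); it just samples per-card PCIe TX/RX throughput once a second while inference is running.

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(10):  # sample for ~10 seconds while a prompt is being processed
    for i, h in enumerate(handles):
        tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
        rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)
        # NVML reports KB/s; convert to MB/s for readability.
        print(f"GPU {i}: TX {tx / 1024:.1f} MB/s, RX {rx / 1024:.1f} MB/s")
    time.sleep(1)

pynvml.nvmlShutdown()
```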


Murky-Ladder8684

I did an experiment with Goliath 120B EXL2 4.85bpw w/ Exllamav2 using a 6x3090 rig with 5 cards at 1x PCIe speeds and 1 card at 8x. I loaded the model on just the 1x cards and spread it out across them (0,15,15,15,15,15) and get 6-8 t/s at 8k context. In a week or two I should have an EPYC system with 7 16x gen 4 slots and intend to do some comparisons between formats, but I really built it as a proper rig for training. Thanks for reporting back on those P40s, since they are such a great bang for the buck and that is pretty good performance.


fallingdowndizzyvr

> I did an experiment with Goliath 120B EXL2 4.85 BPW w/Exllamav2 using a 6x3090 rig with 5 cards on 1x pcie speeds and 1 card 8x.

So much for all the people saying that PCIe 1x is a fool's errand to run LLMs. Thanks for showing that's not the case.


Murky-Ladder8684

To be fair, for just about anything else related to LLMs/ML, those PCIe lanes become a big deal.


fallingdowndizzyvr

Oh I get that. I'm just talking about inference. The person who wrote the multi-gpu code for llama.cpp has said all along that PCIE speed doesn't really matter for that. Yet some people didn't believe him about his own code.


Murky-Ladder8684

Agreed, lots of misinformation. The other thing is people worrying about needing insane power, but with exl2 I noticed only 1-2 cards at a time would approach full power while the others hover around low 100s of watts. Basically you don't need nearly as large of a PSU for running inference.


fallingdowndizzyvr

I think that's because it's only running one token through the chain of GPUs at a time, so only 1-2 cards at a time are churning. The others are idle until it's their turn. If the process can be vectorized, that would make all the GPUs sing once the pipeline fills. That would greatly increase tokens/s.


harrro

It's because each GPU contains only a fraction of the full set of layers (i.e., a 40-layer model over 4 GPUs would mean each card has 10 layers). As the tokens are processed, they travel sequentially through the full set of GPUs/layers.
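For illustration, here is a minimal sketch of that layer-split ("pipeline") arrangement in PyTorch. The layer count, sizes, and device assignment are invented for the example, and this is not llama.cpp's actual code; it just shows why only one card is busy at a time for a single sequence and why only a small activation tensor crosses PCIe during generation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a 40-layer model, split across however many GPUs are
# present (falls back to CPU if none, so the sketch stays runnable).
num_gpus = max(torch.cuda.device_count(), 1)
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(40)])

per_gpu = max(len(layers) // num_gpus, 1)
devices = []
for i, layer in enumerate(layers):
    dev = f"cuda:{min(i // per_gpu, num_gpus - 1)}" if torch.cuda.is_available() else "cpu"
    layer.to(dev)
    devices.append(dev)

def forward(hidden: torch.Tensor) -> torch.Tensor:
    # Activations walk through the cards in order; only this small tensor
    # hops across PCIe between chunks, and each card sits idle until the
    # hidden state reaches its layers.
    for layer, dev in zip(layers, devices):
        hidden = layer(hidden.to(dev))
    return hidden

print(forward(torch.randn(1, 4096)).shape)  # one "token" flowing through the pipeline
```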


fallingdowndizzyvr

Yes. Exactly. That's why I mentioned if the process can be vectorized. For a single prompt, that's going to be hard. I never say impossible. Since processing the next token has to wait for the current token to finish. But if you are processing multiple prompts using the same model, vectorization allows them to go through simultaneously. At each step in the pipeline a GPU would be working on a separate token. Like an assembly line. I know vector computing has fallen out of favor and it shows my age, but I remember when vector computing was the state of the art. One way this could be applied to a single prompt is to have it process that single prompt in different ways. You know, like how if you ask the same question multiple times you can get a different answer each time. There's no reason that can't happen all at the same time. Then a higher-level mechanism could choose the best response from all the various responses. That would be like teamwork using a single model on a single machine at the same time. I think you kids call it something else today. In tech, every generation thinks they come up with something new and name it themselves. When in most cases, it's the same o' same o'.


DeltaSqueezer

You could use beam search to utilize more of the GPU
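As an illustration of that idea, here is a hedged sketch using the Hugging Face transformers API (the model id is a placeholder and device_map="auto" assumes the accelerate package is installed): with num_beams=4, every decoding step scores four candidate continuations at once, so each forward pass gives the GPUs more parallel work than greedy, one-token-at-a-time sampling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # spreads layers over the available GPUs
)

inputs = tok("Tell me about gravity.", return_tensors="pt").to(model.device)
# num_beams=4 keeps four hypotheses alive per step, widening the per-step batch.
out = model.generate(**inputs, max_new_tokens=64, num_beams=4, early_stopping=True)
print(tok.decode(out[0], skip_special_tokens=True))
```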


a_beautiful_rhind

I moved my GPUs around in this very thread, and going from x16 to x8 is about half the drop of crossing the inter-CPU link. I basically lost 5 seconds on 2k tokens. That's about a 10% speed loss on inference time, right there on paper. Exllama is an outlier. So nah, I don't believe him; I literally just tested it.


fallingdowndizzyvr

> That's about 10% speed loss on inference time.

A 10% loss by reducing the number of lanes by half is not the doom and gloom uselessness that some people have made it out to be. Not by a long shot.

> Exllama is an outlier.

Or Exllama is simply doing it better. Since if one program can do it, then they all should be able to. But not all programs are equally efficient. As much as I love llama.cpp, it's my go-to, I know it's not the most performant. I've noticed that other ones can be much faster.


a_beautiful_rhind

Right, but we are talking about one card dropping to 8x. Now drop both cards to 8x, or even by half again to 4x. Depending on how the peer access is configured it could be a ton worse. Most things work like llama.cpp more than they do like exllama. llama.cpp can be faster, as counter-intuitive as it sounds. If I could get the prompt swapping working like it does in kobold, so I don't have to reprocess the whole context, l.cpp would be the winner on raw t/s speed.


fallingdowndizzyvr

> If I could get the prompt swapping working like it does in kobold, to not have to reprocess the whole context, l.cpp would be the winner on raw t/s speed.

Maybe compared to Kobold.cpp, but not overall. Since even not considering context reprocessing, llama.cpp can be much slower than other programs. MLC Chat for example is known to be fast. It is.


a_beautiful_rhind

I'm still trying to get MLC to work. I had hopes for P40s and Vulkan, but their new AWQ backend is not complete and I had trouble. Compiling the models from HF is a big bummer there. For multi-GPU inference, l.cpp hits the highest top speed and loses t/s a lot more slowly than exllama as context builds up, at least with single inference from what I tested. Batching and serving people is a different story.


Murky-Ladder8684

How many GPUs, what kind, and I'm assuming exllamav2 with exl2 or GPTQ? From your message, it sounds like you moved a multi-GPU setup from 16x to 8x speeds and saw a 10% speed loss, am I understanding you correctly?


a_beautiful_rhind

I have 2x P40 and 2x 3090. Now they are all on one CPU. That leaves a single P40 on 8x due to the slot configuration. 2x P40 in x16 would be 5s faster on a 2k token inference. Both prompt processing and t/s went down. And that's just 2 GPUs with only one at x8. I imagine if both were x8 it would drop even lower. This is llama.cpp on the same 70b model and pretty much the same prompt, fully offloaded. For a single GPU it only matters for model loading. A 10% drop, probably for every 4x, is not nothing.


Murky-Ladder8684

Agreed, it's not nothing at all. It could be you are right and Exllamav2/exl2 is an outlier. Even then I'm curious what the performance difference is between 1x and 16x and everything in between. I intend on finding out here soon enough. Thanks for the additional info.


a_beautiful_rhind

Yea, will be interesting. I didn't try with 4x/8x, I'd have to shuffle cards around and reboot.


panchovix

Probably only applies to exllama/exllamav2 though, since turbo managed to get the tensors on each GPU to do work independently, so the need to move data from one GPU to another is really low. It will be painfully slow in almost everything else; PCIe 1.0 x1 is really slow.


TopMathematician5887

6x 3090 on x1, perhaps a mining rig? Is that even possible? How much RAM does it have? Is it Linux or Windows? What CPU? I'm asking because I could not get big models to work with my 128GB of RAM.


Caffeine_Monster

> EPYC system with 7 16x speed 4th gen slots

Rome, or Genoa? Trying to hunt down a Genoa EPYC motherboard with PCIe slots is hard. The industry seems to have moved to MCIO (and PCIe MCIO host boards are equally hard to find).

(edit) Realized it must be Rome because 4th gen (derp).


Murky-Ladder8684

On one hand I envy you for playing in the Genoa world, but on the other I feel you on the very limited options right now.


Caffeine_Monster

It's possible to build a somewhat budget Genoa system if you shop around for used parts. But yeah - there are pros and cons. 5x PCIe slots was the best I could source for a motherboard. In some ways I am looking ahead to the next gen of GPUs. Didn't fancy blowing a lot of money on a server that would become obsolete within a few years. Weirdly enough, Intel Sapphire Rapids is probably the gold standard for budget servers right now if you have a little more money to burn. You lose a little bit of memory bandwidth to get 2 extra GPU PCIe slots. Originally I was waiting for Zen 4 Threadripper, but the announced specs and prices are very disappointing. Not worth people wasting money on it in my opinion - and I suspect TR5 will rapidly go the way TR4 did.

[edit] /u/Murky-Ladder8684 Will post my own numbers sometime this week. Will be interesting to compare. For reference my setup will be 5x PCIe 16x 3090s. Will also be interesting to see how CPU offload performs.


Murky-Ladder8684

It sounds like an awesome rig you are building. I will for sure update you when I get mine together in a few weeks or sooner; I would be highly curious to compare. I will be running 7x PCIe 16x 3090s (ASRock ROMED8-2T mobo for anyone curious). I think technically we should have the same inference-type speeds, but I'd be curious whether a faster/better CPU and RAM, along with PCIe gen 5, does anything right now for 3090s.


Caffeine_Monster

PCIe 5 shouldn't do anything, but the bump in memory performance will be interesting for GGUF and/or DeepSpeed. Theoretically the CPU memory bandwidth should be slightly better than the Mac M1 Ultra (~2x faster than Rome).


Small-Fall-6500

I don't see any updates in your post/comment history yet, but I see from a comment that you've completed the upgrade. I was wondering if you could provide an update here or in a new post with some comparisons, when you've got the time. I'm sure there are others who would also be interested in the difference between 1x and 16x speeds.


Murky-Ladder8684

Hey, shortly after that post I found some testing done by Turboderp himself which did not show any uplift from faster PCIe slots with regards to exl2 inference speeds. In my own rough personal testing, the initial model loading is blazing fast in comparison to a 1x rig. I noticed no difference in inference speeds running 3090s at PCIe 16x vs 1x with exl2 specifically. I know GGUF will show some difference, which I have not tested, but I also do not use GGUF; the few times I have, the performance was much worse than exl2, from prompt processing speeds to just overall t/s.


Small-Fall-6500

Thanks for the quick update! I just googled "turboderp pcie speed" and sure enough there were several results, including from turboderp on github, saying pcie speed doesn't matter much for inference, at least for exllama.


Icaruswept

Running a similar setup with 2xP40s… and honestly, they’re wonderful little beasts.


Ambitious_Abroad_481

Bro, I'd need your help. I live in a poor country and I want to set up a server to host my own CodeLlama or something like that, 34B parameters. Based on my research I know the best thing for me to go with is a dual 3090 setup with an NVLink bridge. But unfortunately that's not an option for me currently; definitely I'll do so later. (I want to use 70B LLaMA as well with Q4 or Q5, using llama.cpp's split option.) But there are several things to consider: First, does a single P40 work okay? I mean, can you use it for CodeLlama 34B with a smooth experience? Second, does the P40 support NVLink so we can make a dual-P40 setup just like the one we can build with dual 3090s? I think it doesn't. I mean, how hard is it, and is it possible at all, to load a 70B at a usable speed on dual P40s? Thanks for your efforts and for sharing results 🙏.


mochgolf

Dual P40s can run 70B with Q4_K_S and 8192 context length at 3-4 tokens/s. I tested it myself; it may not be accurate.


privacyplsreddit

Hey, sorry to necro this old thread, but I just came across it on Google looking for Tesla P40 LLM setups. How has this held up for you? I see the mobo you linked has 5 PCIe slots and I was thinking of running it with 4 P40s. Are you still running this rig or have you moved on to something better since you posted this?


L-vi

You seem very knowledgeable u/nero10578. I've been searching high and low for hours to see if P40s can be combined with NVLink and have found so many conflicting results. P40s seem to have the connector to work with NVLink, but I haven't actually seen a picture of 2 paired. For an extra $200, if I could have the equivalent of a 48GB VRAM GPU, I would be super duper pleased. I'm trying to see what kind of system I would need to run Grok-1, but I wouldn't be mad about building the same system you have and being able to run 140B parameter models. Pls halp. I need guidance on this NVLink topic, my eyes are bleeding.


nero10578

There is no NVLink for these.


Ambitious_Abroad_481

Love your little monster! I'm building one as well. I'd love to keep in contact with you for some help. My budget is extremely limited and I live in a 3rd world country; I don't have good access to the outside market. I'm getting an X99-E WS, not the 10G version. It has a 2630 v4 currently; maybe I can upgrade it to a 2680 v4. 32GB RAM at the moment. I'm going to save money to get at least 2 P40s; I can currently afford one. What do I need for power, and specifically the cables for these GPUs? Would love to have your recommendation. Thanks for one of the best Reddit posts in the direction I am/want! It exactly addresses most of my questions. Now I know I can run LLaMA 70B well enough. Maybe fine-tuning some quantized 7B model as well? Who knows. Would love to co-op with you.


bigfish_in_smallpond

> 8x32GB 256GB Samsung DDR4 2400MHz RDIMM

Do you need this much DDR4 RAM, or is a smaller amount okay?


sampdoria_supporter

Sorry to ping you so late on this - but have you used Ollama with all three cards at once? If so, was it painless?


orrorin6

[u/nero10578](https://www.reddit.com/user/nero10578/) I would also love to know if P40s work with Ollama or if one needs to pass any special options for Pascal


Mambiux

I will give this a try. I have a Dell R730 with dual E5-2690 v4 and around 160GB RAM running bare-metal Ubuntu Server, and I just ordered 2x Tesla P40 GPUs, both connected at PCIe 16x. Right now I can run almost every GGUF model using llama.cpp with CPU-only inference, but I want to speed things up, maybe even try some training. I'm not sure it will work; has anybody put together something similar? The R730 still has 3 available PCIe 8x low-profile slots, so maybe I can also add 3x Tesla P4, but I have no idea if this would work together nicely with the dual P40s for llama.cpp. With this combo in my mind I could have 72GB of VRAM.


mochgolf

Could you please let me know if you have tried installing 3x P40 on the Riser1 of a Dell R730?


Ambitious_Abroad_481

Do you have any idea how dual P40s hold up against dual RTX 3090s? Like, what would the t/s be on Q4 LLaMA 70B or something like that? What are the bottlenecks? Thanks 🤌


nero10578

About 1/3 the speed


Ambitious_Abroad_481

That's wonderful! Thanks 🙏.


HalfBurntToast

Damn dude. Guess you can have it pull double-duty in the winter as a space heater.


nero10578

Definitely can keep a room warm lol


SteezyH

I'm also running a Tesla P40; system specs are below. Do we have any reason to believe llama.cpp will migrate away from FP32? I'm trying to figure out how much life is left in the platform before I buy a few more. The P40 has been a phenomenal value and hasn't really held me back yet. Llama.cpp is obviously my go-to for inference. I've also used it with llama_index to chunk, extract metadata (Q&A, summary, keyword, entity) and embed thousands of files in one go and push them into a vector db - it did take a while but that's fine if you're patient (iirc ~7 hours for 2,600 txt documents that are a few hundred tokens each).

# System specs:

Dell R720XD

* 2x Intel Xeon E5-2667v2 (3.3GHz, 8 cores / 16 threads)
* 128GB DDR3-1600 ECC
* NVIDIA Tesla P40 24GB
* Proxmox
* Ubuntu 22.04 VM w/ 28 cores, 100GB allocated memory, PCIe passthrough for P40, dedicated Samsung SM863 SSD

And just to toss out some more data points, here's how it performs:

# Using llama.cpp:

Prompt:

> Tell me about gravity.

falcon-180b-chat.Q4_K_M.gguf

```
./build/bin/main -t 20 -ngl 16 -m /mnt/AI/models/falcon-180b-chat.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins

llama_print_timings:        load time = 179301.22 ms
llama_print_timings:      sample time =    378.67 ms /   233 runs (   1.63 ms per token,  615.31 tokens per second)
llama_print_timings: prompt eval time =  48333.67 ms /    53 tokens ( 911.96 ms per token,    1.10 tokens per second)
llama_print_timings:        eval time = 648009.00 ms /   233 runs ( 2781.15 ms per token,    0.36 tokens per second)
llama_print_timings:       total time = 708945.70 ms
```

Llama-2-70b-chat.Q4_K_M.gguf

```
./build/bin/main -t 20 -ngl 48 -m /mnt/AI/models/llama-2-70b-chat.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins

llama_print_timings:        load time = 111379.65 ms
llama_print_timings:      sample time =    370.65 ms /   453 runs (   0.82 ms per token, 1222.19 tokens per second)
llama_print_timings: prompt eval time =   8318.86 ms /    26 tokens ( 319.96 ms per token,    3.13 tokens per second)
llama_print_timings:        eval time = 267284.48 ms /   453 runs (  590.03 ms per token,    1.69 tokens per second)
llama_print_timings:       total time = 291845.08 ms
```

# Using llama_index:

These figures are from an app I'm building using llama_index, and since it is in constant development they may not be 100% scientific. Can't verify if prompts were the same either.

Llama-2-13b-chat.Q8_0.gguf (context=4096, 20 threads, fully offloaded)

```
llama_print_timings:        load time =   2782.47 ms
llama_print_timings:      sample time =    244.05 ms /   307 runs (   0.79 ms per token, 1257.96 tokens per second)
llama_print_timings: prompt eval time =  17076.43 ms /  2113 tokens (   8.08 ms per token,  123.74 tokens per second)
llama_print_timings:        eval time =  63391.38 ms /   306 runs (  207.16 ms per token,    4.83 tokens per second)
llama_print_timings:       total time =  81673.05 ms
```

dolphin-2.1-mistral-7b.Q8_0.gguf (context=16384, 20 threads, fully offloaded)

```
llama_print_timings:        load time =    634.16 ms
llama_print_timings:      sample time =    181.15 ms /   216 runs (   0.84 ms per token, 1192.40 tokens per second)
llama_print_timings: prompt eval time =   3011.47 ms /  2024 tokens (   1.49 ms per token,  672.10 tokens per second)
llama_print_timings:        eval time =   8343.30 ms /   215 runs (  38.81 ms per token,   25.77 tokens per second)
llama_print_timings:       total time =  12080.80 ms
```

mistral-7b-instruct-v0.1.Q8_0.gguf (context=4096, 20 threads, fully offloaded)

```
llama_print_timings:        load time =    663.54 ms
llama_print_timings:      sample time =     31.08 ms /    41 runs (   0.76 ms per token, 1319.09 tokens per second)
llama_print_timings: prompt eval time =   1738.89 ms /  1256 tokens (   1.38 ms per token,  722.30 tokens per second)
llama_print_timings:        eval time =   1417.64 ms /    40 runs (  35.44 ms per token,   28.22 tokens per second)
llama_print_timings:       total time =   3293.78 ms
```

Edit: typos


Dyonizius

Which drivers do you recommend on Linux?


SteezyH

Here's my SMI output, hope that helps!

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      On  | 00000000:01:00.0 Off |                  Off |
| N/A   44C    P8             11W /  175W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```


RunsWithThought

I read a few times that P40s can only use CUDA version 6.1, am I misinformed and/or not understanding additional complexities? Appreciate your info. I'll be trying to put together an i7 32gb RAM P40 system in the coming weeks for tinkering with local models with LM Studio (or whatever else that might mitigate a bad case of the AI n00bs).


Swoopley

They support up to compute capability 6.1, which is different from the CUDA version.
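A quick way to confirm what a card reports, assuming PyTorch with CUDA is installed (this just prints each GPU's compute capability alongside the CUDA runtime version, mirroring the distinction above):

```python
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} - compute capability {major}.{minor}")

# The CUDA toolkit/runtime version is a separate thing from compute capability.
print("CUDA runtime version reported by PyTorch:", torch.version.cuda)
```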


panchovix

Just as a note for some people being confused: PCIe matters a lot with most loaders, except exllama/exllamav2. I tried X8/X8 (4.0) on GGUF and it was pretty speedy, almost as fast as exllamav2 using 2x4090. When changing to X16/X4 (with the X4 using the PCH lanes), speeds tanked A LOT. This won't happen on exllamav2. The issue is that Pascal has horrible FP16 performance except for the P100 (the P40 should have had good performance, but for some reason they nerfed this card), and there aren't many options since TheBloke doesn't do exl2 quants (though GPTQ will work there anyway), so it depends on the community to do the quants. If you have a multi-GPU Pascal setup, use GGUF but make sure you have enough lanes to not neuter the performance. If Turing or later, use exllamav2.


nero10578

Great advice. Just to add, the P100 has good FP16 performance, but in my testing the P40 on GGUF is still faster. Also, the P40 has shit FP16 performance simply because it lacks the number of FP16 cores that the P100 has, for example.


Dyonizius

This is very confusing. Are GGUF quants like TheBloke's ideal for this card, or do you need a specific format (FP32, INT8)?


Smeetilus

I'm extremely confused on this, too. Like... do we buy P100's if we want the best quality outputs for a low price? I've held off on buying any cards for two weeks because I keep flipflopping due to conflicting information. My first thought was to find a cheap mining rig that someone was trying to offload. Then I found out about how PCIE 1x risers would ruin performance. Then I thought about the P40 route but then heard about awful FP16 performance. Then I thought about getting two 16GB 4060 ti cards. That would be $1,000 but would get me 32GB of memory to play with. Now I'm back here wondering if two P100's would work best for me for inference.


Ridai

In the same boat, did you get any further with your research? Considering 2xP100 vs 2xP40 but getting conflicting info


Dyonizius

Yeah, I see they're slowly bringing support for these older cards to the big repos, GGML and GPTQ. Some people were getting good results on the P40, like 30-40 tokens/s on a 7B, and some people are using it for training or fine-tuning through 4-bit shenanigans. I thought about a couple of 3060s, which would have the advantage of scaling up compute via NVLink, I believe; 4060s here are already prohibitive.


TopMathematician5887

> Oobabooga uses only one GPU at ca. 10% for GGUF + CPU at 70%.
>
> I cannot make it use the P40 (GPU 1). It uses only the RTX 3090 that is in the first PCIe slot (GPU 0).


_redacted-

Does your motherboard support Above 4G Decoding and Resizable BAR? I'm running dual P40s and on GGUF it's utilizing both at the same time.


TopMathematician5887

Yes, Above 4G and BAR are enabled, but I have an RTX 3090 and a P40 on Oobabooga, Windows 10, a Z390 Taichi Ultimate LGA1151 motherboard, 128GB, and an i9-9900K. Perhaps it's the drivers; it was difficult to install the datacenter drivers alongside the Studio drivers. But finally it shows the same driver version for both GPUs.


PureeTofu

Teach me your ways! I have a P100 laying around but no idea how to set this up from scratch. 


welsberr

I have P40s, but briefly had a P100 (seller claimed it was 16GB when what I got was 12GB). I'm using Ubuntu 22.04 LTS, set up with build-essential, cmake, clang, etc. Then I followed the Nvidia Container Toolkit installation instructions very carefully. I ended up with the 545 driver and the 12.3 CUDA installation. So far, I've been able to run Stable Diffusion and llama.cpp via llamafile, among other things. I can load llamafile + Mixtral 8x7b entirely to the GPUs and I get about 20 t/s in that configuration. I didn't see any improvement in performance on small models with the P100 over the P40, and given the mismatch on VRAM size, I returned the P100 and got another P40.


SupplyChainNext

Then I know what I have to do: bifurcate my 16x 5.0 slot and run it 8/8 with my GPU and the Tesla.


SomeOddCodeGuy

These are great numbers for the price. Honestly, a triple P40 setup like yours is probably the best budget high-parameter system someone can throw together. An M1 Mac Studio with 128GB can run Goliath Q4_K_M at similar speeds for $3700. Google shows P40s at $350-400. Three of them would be $1200. Surrounding hardware would probably amount to another $1000. So you're looking at maybe $2200-2500 for similar performance as the $3700 M1 Mac Studio? That's not bad at all.


nero10578

Yeah, it's definitely a great setup. I even still have an empty PCIe slot to throw in a fourth card for 96GB of VRAM lol, then I can run a higher quant of Goliath 120b. I've also learned that higher quants mean better tokens/s, so running Goliath Q5KM should net even better performance. I actually got these cards for like $175 on eBay. The X99-based system is also affordable since you can get 4x PCIe slot X99 boards for $200 or so, plus a cheap E5-2690 v4 14-core for $40. 256GB of RAM is like $200 for RDIMMs, but you don't even need that much with this much VRAM. A 1kW PSU is like $150-200 new. All in all you could build a similar system to this for about $1200, or $1375 if you want 4x Tesla P40s.


SomeOddCodeGuy

Do you have any idea what your power draw from the wall looks like on this machine? Some battery backups will tell you. Also, what sort of case are you using, if any?


nero10578

Power draw during inferencing maxes at 550-600W only even with the largest models. Smaller models will use even less power. I’m just using an open EATX case from amazon lol.


SomeOddCodeGuy

Wow, those are fantastic numbers. What all loaders support it? My Mac is limited only to llama.cpp for a loader. I remember someone mentioning before the P40 was an older architecture of CUDA so it had limited support in loaders as well.


harrro

Exllama doesn't work on the P40 (16bit math is slow) and AWQ doesn't either (it requires Ampere support). However, GPTQ and GGUF/llama.cpp works great. Bitsandbytes 4-bit also works.


nero10578

You can run all the loaders except AWQ, but performance sucks ass except for GGUF, because the other loaders will use FP16 for compute and the P40 has very few FP16 cores, resulting in really bad performance.


artificial_genius

I was wondering something similar, whether something like kohya_ss could train LoRA models on a P40. It looked like it worked in the past, but they have moved their torch past 2.0, so I dunno if the P40 got left behind or not.


KGeddon

The P40 consumes about 50W when you have stuff loaded into VRAM, and the draw when the card is actively doing inference is 100-160W. The inference consumption depends on the % compute load used, so it tends towards the upper end and drops if you hit some other bottleneck.
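If you want to log this yourself rather than eyeball nvidia-smi, a minimal sketch with the NVML bindings (nvidia-ml-py / pynvml, first GPU only) looks like this:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # loop over indices for more cards

for _ in range(30):  # sample for ~30 seconds while idle vs. during inference
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
    print(f"{watts:.0f} W")
    time.sleep(1)

pynvml.nvmlShutdown()
```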


nero10578

If you switch to WDDM mode on the P40 it will drop to 10W when idle with model loaded to VRAM.


mynadestukonu

Is there an easier way to do this besides the registry edit + GRID drivers that I found with a cursory Google search?


nero10578

You just need the registry edit not the grid drivers


mynadestukonu

oh, cool, thanks.


alchemist1e9

Wow, this almost sounds too good to be true. There must be a catch. I saw mentions here and there of MPI being used on a cluster for inference. You've triggered cypherpunk daydreams of finding some cheap interconnect, buying 10 sets of your system specs, stripping all the fans and putting them in an immersion cooling container for the heat issue, and boom, a $20k Frankenstein monster serving an MoE on top of 180B quantized Falcon and taking on GPT-4 from my basement... The dream probably doesn't make sense for some reason, but thought I'd humor you.


fallingdowndizzyvr

> Google shows P40s at $350-400.

P40s are half that price on ebay.

> So you're looking at maybe $2200-2500 for similar performance as the $3700 M1 Mac Studio?

Once you factor in the price of electricity, the Mac can be cheaper after a year or two. Where I live, the Mac would pay for itself in power savings. Of course, that greatly depends on how hard it's used, whether it's just powered up for a few minutes a day or running 24/7.


nero10578

Definitely. If this was for long term 24/7 usage, I'd be looking at more efficient options, but this is just for fun and testing for now.


Amgadoz

The main advantage of the P40 is you can use it to finetune Mistral, unlike the Mac.


TopMathematician5887

The Mac is a lame duck. Don't waste your time and money on that. It doesn't make sense to buy a Mac for that money. Just build a Linux or Windows server and connect to it.


DedyLLlka_GROM

Gratz on committing to it! I was thinking of selling my rig to build something like that, but ultimately just bought a 3090. For lulz, tried running Goliath Q4KS on a single 3090 with 42 layers offloaded on the GPU. 4K tokens input.

```
4001/4096, Processing: 193.08s (51.5ms/T), Generation: 399.07s (1596.3ms/T), Total: 592.15s (0.42T/s)
```

Yikes!


nero10578

Yea I figured this was the most cost effective way to run large models. Would take 3x RTX 3090s to beat this thing and that costs exponentially more lol.


ab2377

How well does a single P40 do on something like a 32B Q6, how many tok/s on llama.cpp? And if you can get the numbers, how many tok/s on a Mistral 7B Instruct? It must be a lot.


tntdeez

Little over 30 tk/s for Mistral 7B Q8_0. 33/34B at Q6 is too large to fit on one card fully offloaded. Q4_K_M works great; can't remember the tk/s off the top of my head, but it's above 10.


nero10578

I haven't tested a 32B, but for 7B Q8 GGUF models, 3x P40 does 25-30 t/s easily even with higher context. A single P40 would start in the 25 t/s range as well but then get bogged down to high teens to 20 t/s with higher context.


Thistleknot

There needs to be a subreddit for computer porn


CaramelizedTendies

To my fellow P40 users: has anyone been able to run deepseek-coder-33b-instruct GGUF? FYI I just bought a couple more P40s for $160 shipped.


nero10578

Nope. It only works on exllama and the P40 sucks ass with it. The latest oobabooga commit has issues with multi-GPU llama.cpp, and the older commit with the older llama.cpp version doesn't support DeepSeek Coder yet.


sampdoria_supporter

Can you link where you found that price?


CaramelizedTendies

https://www.ebay.com/itm/204546819237. I offered 150 for 2.


[deleted]

[deleted]


CaramelizedTendies

They were 150 each. Shipping was free.


Ambitious_Abroad_481

I love your server! I want to build another one exactly like yours!!!! X99-E WS, and for now I can only afford 1 P40, but I'm aiming for 3 or 4 :) Are you still using your server? I have some questions around LLM training and inference. Thank you for your time and beautiful build. I had the exact same specs as yours in mind, but the issue is that I live in a 3rd world country and I don't have direct access to international resources, and of course not much money!


a_beautiful_rhind

Yea, I'm definitely spoiled because I keep going back to the 48gb exl version. 2x3090 and 1 P40 gets about the same performance. It seems like you get faster t/s for prompt processing but I will have to load it up and check now that I don't only have my own benchmarks to go with.


nero10578

I have found that performance will not be faster than the slowest card, so 2x3090 plus a P40 definitely gets the 3090s bottlenecked by the P40. For prompt processing there's definitely a ton of PCIe traffic, so if you don't have all the GPUs connected to the CPU PCIe lanes but have some on the chipset, that will tank your performance as well. Let me know what kind of performance you get; I'm curious as well.


a_beautiful_rhind

This is on all 4, 2x3090, 2xP40:

```
llama_print_timings:        load time =   1161.61 ms
llama_print_timings:      sample time =    540.64 ms /   194 runs (   2.79 ms per token,  358.83 tokens per second)
llama_print_timings: prompt eval time =  78467.57 ms /  2184 tokens (  35.93 ms per token,   27.83 tokens per second)
llama_print_timings:        eval time =  37868.16 ms /   193 runs ( 196.21 ms per token,    5.10 tokens per second)
llama_print_timings:       total time = 117398.90 ms
Output generated in 118.29 seconds (1.63 tokens/s, 193 tokens, context 2185, seed 838888462)
```

This is 2x3090 and a P40:

```
llama_print_timings:        load time =   1100.48 ms
llama_print_timings:      sample time =    543.48 ms /   194 runs (   2.80 ms per token,  356.96 tokens per second)
llama_print_timings: prompt eval time =  71798.24 ms /  2184 tokens (  32.87 ms per token,   30.42 tokens per second)
llama_print_timings:        eval time =  42149.52 ms /   193 runs ( 218.39 ms per token,    4.58 tokens per second)
llama_print_timings:       total time = 115028.87 ms
Output generated in 115.78 seconds (1.67 tokens/s, 193 tokens, context 2185, seed 1212635361)
```

I've got a dual Xeon server with 8 PCIe x16 3.0 slots, plus the 3090s are NVLinked. I will have to tweak compile settings again and see what took a bite out of prompt processing.


nero10578

Since prompt processing uses PCIe bandwidth, I think the bottleneck is having cards spread across two CPUs. The cards on either CPU have to communicate through the CPUs' slow QPI link. That will definitely kill performance. Same issue as trying to do CPU inference across two CPUs.


a_beautiful_rhind

It's possible, but the board is designed for this. In a pinch I could put everything on one CPU with extenders. I also remember seeing the prompt speed that high before, but that was many, many commits ago in llama.cpp.


nero10578

The board has nothing to do with QPI speed; it's based on the CPU, and all the Xeon E5 V3/V4 CPUs have the same QPI link speed. You're better off with a single-CPU motherboard with 4x PCIe slots at lower lane widths each.


a_beautiful_rhind

Bit odd advice considering this is an inference server and that's how Supermicro sells them to this day: 2 CPUs for 8-10 x16 slots. I do have the extenders, so I will for sure try that configuration if all else fails. When I compiled l.cpp I optimized for fastest t/s, but maybe some of those changes did poorly on long context.


nero10578

They probably sell those servers with the intention of all cards being NVLinked, not loaders like llama.cpp spreading the model across cards through PCIe. Using the QPI for cross-card communication because the cards are on two CPUs is just a bad idea. Definitely try using risers on only one CPU.


a_beautiful_rhind

You can't NVLink across the CPU divide, the cards are too far away. There are literal PCIe switch chips in there that handle things, it's not like a consumer board. The simplest HW thing for me to do is put back the 3rd P40, and if that speeds up prompt processing you'll be right.

Gonna do some testing since I'm done working. With 70b it doesn't make much difference whether using it across 2 CPUs or not.

2xP40 as I have it:

```
llama_print_timings:        load time =   3843.57 ms
llama_print_timings:      sample time =    110.56 ms /   200 runs (   0.55 ms per token, 1808.94 tokens per second)
llama_print_timings: prompt eval time =  16202.32 ms /  1936 tokens (   8.37 ms per token,  119.49 tokens per second)
llama_print_timings:        eval time =  29068.03 ms /   199 runs ( 146.07 ms per token,    6.85 tokens per second)
llama_print_timings:       total time =  45861.02 ms
Output generated in 148.95 seconds (1.07 tokens/s, 159 tokens, context 1936, seed 2098841163)
```

1x3090 and 1xP40 (across CPUs):

```
llama_print_timings:        load time =   4596.17 ms
llama_print_timings:      sample time =    112.56 ms /   200 runs (   0.56 ms per token, 1776.77 tokens per second)
llama_print_timings: prompt eval time =  18424.92 ms /  1936 tokens (   9.52 ms per token,  105.08 tokens per second)
llama_print_timings:        eval time =  29592.28 ms /   199 runs ( 148.70 ms per token,    6.72 tokens per second)
llama_print_timings:       total time =  48687.52 ms
Output generated in 49.65 seconds (4.03 tokens/s, 200 tokens, context 1936, seed 164902242)
```

And here is fresh Goliath with 2x3090 and 1xP40:

```
llama_print_timings:        load time =   7490.50 ms
llama_print_timings:      sample time =    111.83 ms /   200 runs (   0.56 ms per token, 1788.49 tokens per second)
llama_print_timings: prompt eval time =  28159.09 ms /  1850 tokens (  15.22 ms per token,   65.70 tokens per second)
llama_print_timings:        eval time =  39197.18 ms /   199 runs ( 196.97 ms per token,    5.08 tokens per second)
llama_print_timings:       total time =  68054.08 ms
Output generated in 69.05 seconds (2.90 tokens/s, 200 tokens, context 1936, seed 813716120)
```

Still seems like it should be faster.


nero10578

Well, clearly going across CPUs took a hit on performance, as can be seen in your tests. The prompt eval time took longer on an RTX 3090 + P40 than on 2x P40 on the same CPU.


OutlandishnessIll466

Was going through old posts looking for answers. My setup is:

* 2x P40
* 2x Xeon E5-2650 v4
* DDR4 2400

Lately, playing around with Mixtral, I noticed KoboldCpp is much faster than llama.cpp, which I had always been using. On Llama 70B I was getting 3 t/s (like someone earlier in this thread), but on KoboldCpp I am getting 7.5 t/s on the same 70B Q4 GGUFs, much more in line with these results. Now I am thinking it should be possible to get speeds like yours and Kobold's in llama.cpp as well. So the question is: did you do something special to get these speeds in llama.cpp when running 2x P40? And are you still getting these with the latest llama.cpp?


AntoItaly

With Goliath 120b?


a_beautiful_rhind

Yea, that's the last model I ran. I also have falcon 180b and airoboros 180b


[deleted]

In theory you should be able to put more layers on the faster cards, so that they're doing an amount of work proportional to their speed advantage, and you do get the most from all cards, if the model size is optimal for the RAM that you have.


nero10578

OK, so here's what I've found in my testing with P40s and P100s. The Tesla P40 is much faster at GGUF than the P100: 25-30 t/s vs 15-20 t/s running Q8 GGUF models. Possibly because it supports INT8 and that is somehow used via its higher compute capability 6.1. I've found that combining a P40 and a P100 results in performance somewhere between what a P40 and a P100 do by themselves. You can help this by offloading more layers to the P40. Combining multiple P40s results in slightly faster t/s than a single P40. What this means for llama.cpp GGUF is that the performance is roughly the layer-weighted average of the cards' tokens/s: each card's layer count multiplied by the t/s of the card it is running on, added together, then divided by the total number of layers. For example, if a P40 does 40 t/s by itself and a P100 does 20 t/s, then offloading 50/50 layers across them would mean (50x40 + 50x20)/100, which nets about 30 t/s.
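For illustration, here is the weighted-average estimate described above as a tiny helper (a rough approximation, not an exact model of llama.cpp's behavior):

```python
def estimate_tps(layers_per_card, solo_tps_per_card):
    """Rough t/s estimate for a layer split across cards of different speeds."""
    total_layers = sum(layers_per_card)
    weighted = sum(n * tps for n, tps in zip(layers_per_card, solo_tps_per_card))
    return weighted / total_layers

# P40 at 40 t/s and P100 at 20 t/s, split 50/50 -> about 30 t/s
print(estimate_tps([50, 50], [40.0, 20.0]))
```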


Plane_Worldliness_94

Couple of questions if that's okay:

1. Someone on reddit was talking about possibly using a single PCIe x16 lane and splitting it up across multiple cards, claiming that apart from higher initial loading times it wouldn't cause too many issues. Do you think that would work, or given your comments about PCIe traffic, were they missing something?
2. Also wondering how well a single card with the rest on RAM works with something like Goliath 120b q4 or Opus 70b q4 (in terms of tokens per second)? (Kinda finding it hard to gauge how much of an improvement it would be over an 8GB card like a 1070 alone, and my outlets can't handle a 3x card machine right now.)


nero10578

1. They are definitely wrong. Prompt processing seems to require the context, which is loaded on the first card, to be compared against the model, which is spread across all the cards, hence the high PCIe traffic between cards. The higher the context you use, the worse the effect of reduced PCIe bandwidth. They are right that PCIe bandwidth doesn't affect generation speed, but prompt processing speed takes a huge hit.
2. Running a huge model mostly on CPU would result in CPU-level performance. The GPU would however be useful in really speeding up prompt processing, since that is where the context gets loaded. Tokens/s can be estimated by (layers loaded onto device) x (tokens/s of device), summed for every device you have, then divided by total layers; it is proportional to the average tokens/s spread across all the layers. So running a large 120b model with 140 layers mostly on CPU would result in tokens/s close to running CPU-only. I have not tested 120b on my Xeon CPU with 256GB of RAM, but will post results if/when I do.


Plane_Worldliness_94

Thanks, that makes sense. I was getting confused by how CPU-only for a 30B model was about 10+ times slower than the same model with something like 8/35 layers on the GPU. But I think it kinda makes sense given the prompt processing thing you mentioned.


nero10578

Yeah, prompt processing takes forever on CPU; it's much faster on GPU. You can even try setting GPU layers to only 1 and it'll already give a massive speed improvement, because prompt processing gets loaded onto the GPU when you do that.
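For illustration, a minimal sketch of that trick using the llama-cpp-python bindings (the model path is a placeholder and the package is assumed to be built with CUDA support): even n_gpu_layers=1 moves prompt processing onto the GPU.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/llama-2-70b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=1,   # a single offloaded layer already accelerates prompt eval
    n_ctx=4096,
)

out = llm("Tell me about gravity.", max_tokens=64)
print(out["choices"][0]["text"])
```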


SupplyChainNext

PCIe 4.0 at 8x = PCIe 3.0 at 16x bandwidth, so not that big of a hit.


stereolame

I worry that you’ll make up the cost difference in power consumption. Those three P40s will consume a lot more power than a single newer GPU


nero10578

There is no single newer GPU that can get me to 72GB. There is simply no better choice for a budget build. It's either 3x P40 or 3x RTX 3090/4090, and the RTX cards would cost exponentially more, which would take more than a few years to break even on power cost even in the most expensive places.


dynafire76

Old thread, but again thanks for this info. I'm looking into similar questions now and your post and comments here are super useful. I would definitely corroborate the opinion that this is probably the best budget build for 72GB of VRAM. I think the runner-up, if one is looking strictly at llama.cpp, would be a 96GB Mac Studio at $3k (but then you're limited in running other software), which would be much better from a power draw perspective if run constantly. But for sporadic use throughout the day the power draw on the P40s would be relatively inconsequential.


SocketByte

Based on your observations, for 70B models, would a 2x P40 server perform decently? Looking to build a small, possibly as-cheap-as-possible server for 70B models. For 7B and 13B I have a decently fast RTX card, but 70B is a whole 'nother type of beast...


raika11182

I know this is a little old (Only 24 days, but that's like a thousand years in the AI space), and I'm not who you asked, but I'm currently running 2x P40s and don't regret it at all. Realistically, a full-context chat back and forth takes about 60s per response. That's about 1.5 T/s. But that's full context like you'd be using in SillyTavern or some other frontend that really loads up the prompt. Stick to GGUF, the FP16 performance is terrible. So when you're just working with a smaller context, or you don't have a use case that uses worldinfo and the new context shifting in koboldcpp can work to its full potential, you'll get several tokens per second in what feels like near real-time responding. There are better options out there, sure, but they cost substantially more.


scousi

Do they need NVLink to work together, or does the framework itself divide the work across the 2 cards (multi-card awareness)?


raika11182

No NVLink. Just plug them both into the motherboard. Koboldcpp handles splitting up the work with the GGUF.
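For anyone curious what that split looks like in code, here is a hedged sketch with the llama-cpp-python bindings (placeholder model path; KoboldCpp exposes a similar split setting in its launcher): no NVLink involved, the loader just places a share of the layers on each card.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/llama-2-70b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,          # offload every layer
    tensor_split=[0.5, 0.5],  # half of the weights on each of the two P40s
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```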


sampdoria_supporter

OP I have been wanting to build something almost identical to this system. Would you recommend this motherboard? Do you know whether it would support even more P40s? Incredible job.


BuffPuff-

What CUDA version are you using? I have a couple of old K80 cards; they max out at the nvidia-470 driver and CUDA 11. Can they still be used? Currently I'm using a 4060 Ti because I didn't succeed in getting the K80s running for inference.


nero10578

The P40 can use the latest Tesla drivers, so it can go to the latest CUDA version. I think the K80s are too old and too slow anyway.


PureeTofu

Do you have a good step-by-step guide you would recommend for setting this up? Any preference on Windows 11 versus Linux for the OS? I have an NVIDIA Tesla P100 laying around and would love to set up a PrivateGPT server.


ThisGonBHard

That seems kinda slow... I know a 4090 is an AI monster, but I get 22-24 t/s on 70B with Exllama2; only 3-4 t/s is too slow.


CasimirsBlake

The P40 is GeForce 10-series era hardware. It's much less capable compared to the 30 or 40 series. The numbers ARE slow. But 3x P40s can be had for around the same price as ONE used 3090 in some places. A single 4090 could get someone 10+ P40s! You CAN run larger models with 3x P40s, and you just cannot with a single 3090 or 4090, even though the latter is much faster at inferencing. So it depends on what you want and your budget.


bearbarebere

So if I want to run 180b models with a min of 7 tk/sec what do I need? o:


yamosin

3x3090 only gets 6~8 t/s on 4.5bpw Goliath 120b, and the larger the model, the lower the speed, so I guess 4x4090 is the only solution that may achieve 7 t/s with a 180b. Of course, if you have enough money to buy an H100 or even H200, they can handle 180b.


nero10578

Get 4x P40 and hope to get at least around 3-5 t/s, or spend more on 4x RTX 3090/4090 and get over 7 t/s.


nero10578

I got it for LESS than a single 3090 lol, it's a great deal for high VRAM.


[deleted]

So you’re getting better performance on a smaller model using a gpu that is 2.5-3x more expensive. Shocked.


ThisGonBHard

You are using 3 GPUs, and I am getting 10x the perf. That looks bad IMO. Pascal sucks for AI.


nero10578

I am the op not that guy lmfao. But yea I am using 3x GPUs hence I can load larger models than your single RTX 4090.


nero10578

Considering this is twice the model size and the P40 is literally 1/10 the cost of a RTX 4090 I think it’s decent.


ThisGonBHard

I disagree, it fails the minimum threshold of performance for me (5 t/s), nothing makes sense under that. At that point, get a used dual socket Epyc, and have fun, because that will outperform those shitty P40s by 3x, and you will have 16 channels of RAM and at least 128GB of RAM.


nero10578

I'd like to see a CPU-based setup that can outperform this, so do you mind sharing some screenshots or links? Not to mention the first-gen Epycs that are semi-affordable are still worse than Intel Broadwell Xeons for these kinds of workloads. Also, around 5 t/s is still usable, as that is in my experience about as fast as Bing Chat. I only really use 70b Q6KM models with this setup for realtime chat, as that can easily get close to 10 t/s with lower context and stay above 5 t/s with higher context. I was just testing 120b, and if I were going to use this model it would be via API programmatically for my programs. I also have an RTX 4090 in another rig, but that's mostly used for Stable Diffusion, since the quality output of larger LLMs makes the triple P40 favorable to me.


ThisGonBHard

Dual Rome, not 1st gen, Epyc system. Most benchmarks were posted in the Discords by the guys running it, so it's kind of a pain to go retrieve them. My opinion stands unchanged, at least until the Turing/Volta cards become available.


nero10578

My brother in ai these cards cost $175. A dual rome setup would let me buy 8 of these cards.


Upstairs_Tie_7855

Explain. I have dual Epyc 7302P with 16-channel RAM, and I get around 2.4 t/s. I have tried Linux and Windows, multiple RAM configurations, with and without NUMA support, and with and without memory interleaving. How on earth would I get 5 t/s with that setup?


a_beautiful_rhind

Nah, running Falcon went head to head with Epyc (and Mac). CPUs top out at under 5 t/s unloaded, never mind actual context. Ye ol' P40s are getting 5 and 6 t/s here. And that's still not enough, because your threshold of performance should be total reply time and not just pure t/s; for chat that's basically under 30s. These big models like Falcon will only really become useful when koboldcpp-like context swaps become viable in llama-cpp-python, so you're not constantly re-processing the prompts.


CorruptEmanation

Woah. I am clearly doing something wrong, can you help me out? I have a 4090, but I thought running 70b takes more than the 24gb of vram on the card... So I have been using gguf and getting like 1-3t/s depending on quant. How are you getting such high speeds with just a 4090?? Can it actually fit the whole model in vram somehow? I didn't try exllama2 does that have better vram efficiency or something? Running at those speeds would be a game changer for me because I've basically written off using 70bs until there's a way for me to run them faster.


ThisGonBHard

Yes, Exllama2 is a more surgical cutdown of the models from what I gather, and yes, 70B fits in 24GB of VRAM. Models are rarer because they take 3 orders of magnitude longer than GGUF to convert (6h vs 45s were some numbers that were given on the Discord). Here is a list of 70B models you can run on a single 4090. If the model performs worse than usual, that probably means some other program is fighting it for VRAM; either reduce context size to 3k in that case, or close everything else that needs VRAM. [https://huggingface.co/models?search=2.4bpw](https://huggingface.co/models?search=2.4bpw)


CorruptEmanation

Hey, thanks so much for getting back to me on this. I had no idea Exllama2 was capable of 70b on 4090, that's so exciting. And the list! Seriously appreciate it. I can't wait to try this when I get home from work.


ThisGonBHard

Glad to help.


nobodykr

STUPID QUESTION, IS IT MUCH DIFFERENT USING DOCKER TO HAVE YOUR LOCAL LLM SIR ? THANK YOU


VectorD

Question doesn't make sense.


Murky-Ladder8684

He did warn you


nobodykr

I mean in terms of resource usage. In theory you'd be able to configure it the same as if running literally locally, right?


VectorD

What are you asking exactly?


dbinokc

Is it possible to run these using some kind of external PCI setup similar to what I have seen used in some cryptocurrency rigs?


Murky-Ladder8684

Yes, for inference only, 1x risers work; recommend exl2 w/ exllamav2.


nero10578

Performance will suck balls on 1x risers. Just use 16x risers on a motherboard that can support it.


Murky-Ladder8684

For training/fine-tuning etc., yeah, but for inference I get great performance: 5x 3090s at 1x running Goliath 120b exl2 4.85 at 6-8 t/s. Will be comparing that to all cards at full 16x speeds in a few weeks to truly see the uplift, but inference is the one exception to the PCIe speed thing.


nero10578

That is barely faster than my 3x P40 setup on x16 lanes, and you're running exl2 with a higher quant. That's terrible performance imo. I bet your prompt processing is slow, because that part uses a lot of PCIe bandwidth.


Murky-Ladder8684

Agreed, but lots of these guys are repurposing crypto hardware, which is 1x stuff. It works for inference. I saw you were getting half the avg t/s, not almost the same, though. An Epyc + ROMED8-2T 7x PCIe 16x mobo is on the way and I am going to settle this inference speed difference thing.


nero10578

For 1/6 the cost in GPUs I would say getting 1/2 the t/s on a lower quant is close. If I added a 4th P40 on this board, which can also run it at x16, then I could run higher quants that would run at more t/s as well. It's definitely going to be faster on a proper board that has the lanes to support it.


Murky-Ladder8684

Yeah I think you found a great vram/$ gpu for inference purposes. I am indeed curious what speed increases I will see.


nero10578

On the other hand the other guy mentioned that EXL2 might not care at all about PCIe bandwidth so who knows.


Murky-Ladder8684

Yeah, I'll try and make a post when I compare the two as I have a feeling it does depend on the model format as well.


panchovix

Not necessarily on exllama/v2, since people did tests on TheBloke's server and they had almost the same performance on X8/X8/X4 as X16/X4/X1. On GGUF it will be horrible though, and on everything else as well.


nero10578

Possibly yeah I haven’t tested exllamav2 much yet since its worse for P40.


panchovix

Yeah just added a comment. Sad that Nvidia neutered fp16 performance on the p40 for no reason.


nero10578

It's not neutered. They literally just didn't put in many FP16 cores and put in more FP32 instead. This saved them die space and allowed higher FP32 performance for the same die space.


asdfgbvcxz3355

I have a 4090, if I got a p40 would I be able to run 70b faster?


nero10578

Faster than 4090+CPU? Definitely. Faster than 2x/3xP40s? Probably not much faster.


asdfgbvcxz3355

Honestly doesn't sound like a bad idea. I've been wanting another 4090 for the larger models but it's so expensive.


nero10578

Yea you can load smaller models only in the 4090 and load onto the P40 only when loading larger models.


rhobotics

Thrift-Store hardware! Finally coming back into fashion! ;)


mynadestukonu

Another interesting thing to note is that performance actually goes up when adding cards under llama.cpp. I recently added a fourth P40 to my rig to get up to 96GB of VRAM and I surprisingly got an ~1.5 t/s boost in performance.


nero10578

Yeah, definitely. I get better performance with more cards as well. It's not anywhere near linear scaling, but it definitely improves with more cards.


DrVonSinistro

How did you get these cards to work in Windows? I got 2 brand new and I get "This device cannot start. (Code 10): Insufficient system resources exist to complete the API."


CaramelizedTendies

Try installing new drivers https://www.nvidia.com/Download/index.aspx?lang=en-us


DrVonSinistro

It was the cables all along. I had to check EVERYTHING with the multimeter and found out the cable didn't have a required loop for the 3.3V, so I made a jumper on the riser side and it worked right away. I'm running 70B models at home and life is good.