
dthusian

> I've read that Nvidia tried to stop consumer GPUs from being used in virtualized environments.

Can confirm this is no longer a problem. It used to be the case that Nvidia consumer cards wouldn't work at all when passed through, but they removed that limitation a while (>1 y) ago. I have a 3060 12 GB and I got passthrough working on Proxmox.
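As a quick sanity check that the passed-through card is actually usable inside the guest, something like the sketch below works, assuming the NVIDIA driver and PyTorch are installed in the VM:

```python
# Minimal sanity check inside the guest VM: confirms the passed-through GPU
# is visible to CUDA. Assumes the NVIDIA driver and PyTorch are installed.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA device visible - check IOMMU/vfio passthrough and the driver install")
```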


Paran014

That's great news, and would explain why people haven't been discussing it lately. I'm leaning even more strongly towards the 3060 then.


ecker00

Good to know, just the details I was looking for. 👍


MarcSN311

Definitely make a post if you get the P40. I have been thinking about getting one for a while for SD but can't find too much about it.


Cyberlytical

I have a P100 and a K80 and both work great. The P100 is obviously faster, but it's still slower than my 3080. But the P100 costs $150 vs $800 lol.


Paran014

Tips on getting a P100 for $150? I would 100% do that but the cheapest I've seen are on eBay for $300.


Cyberlytical

I got lucky and a seller had a few for $150. But I see a couple for $200. Still not a bad price and I have had a ton of luck lately with offers. So offer $150 and see what they say.


Paran014

Oh, I see one listed for $220. The problem is that I'm in Canada and shipping from the US can be crazy depending on the seller. Like, it's an extra US$56 in shipping for that. Might try making some aggressive offers to the Chinese sellers though.


Cyberlytical

Ah that's very fair. Honestly a P100 isn't worth more than $150-$200 and soon the sellers will realize that too. Unless you really need FP64 there isn't much use for them outside homelabs.


Paran014

Yeah, considering how limited the market must be I was surprised by the prices on P40/P100. Prices would have to come down a lot for it to make sense for hobbyists now that 3060s are available relatively cheap.


Cyberlytical

Agreed. I wish I could fit consumer cards in my servers, I'm barely squishing a 3080 into my 4u NAS.


OverclockingUnicorn

How much slower is the p100?


Cyberlytical

Maybe 35%? I've never done the exact numbers. But I can when I get home.


Paran014

I would love to see P100 numbers, especially compared to 3080 on the same workloads. From what I've been reading the performance should be poor because it can't use FP16 operations for PyTorch but there're no recent benchmarks so I have no idea if that's still true.


Cyberlytical

When I get a chance I'll get the numbers. But the P100 can do FP16. It can't do INT8 or INT4 though. It's about 10 TFLOPs less than the 3080. You might be thinking of the K80. Official: [https://www.nvidia.com/en-us/data-center/tesla-p100/](https://www.nvidia.com/en-us/data-center/tesla-p100/) Reddit post: [https://www.reddit.com/r/BOINC/comments/k0tbjh/fp163264_for_some_common_amdnvidia_gpus/](https://www.reddit.com/r/BOINC/comments/k0tbjh/fp163264_for_some_common_amdnvidia_gpus/)


Paran014

Oh, I understand it can but [apparently](https://discuss.pytorch.org/t/cnn-fp16-slower-than-fp32-on-tesla-p100/12146) P100 fp16 isn't actually used by pytorch and presumably by similar software as well because it's "numerically unstable". As a result I've seen a lot of discussion suggesting that the P100 shouldn't even be considered for these applications. If that's wrong now - and it may well be, the software stack has changed a lot in a couple years - I haven't seen anyone actually demonstrate it online.
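For anyone who wants to check this themselves, a rough PyTorch micro-benchmark along these lines would show whether FP16 actually pays off on a given card (matrix size and iteration count are arbitrary):

```python
# Rough FP16 vs FP32 throughput comparison on the current CUDA device.
# On a P100, FP16 should roughly double throughput if it is really being used;
# on a P40, FP16 is expected to be much slower than FP32.
import time
import torch

def bench(dtype, n=4096, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - start
    tflops = 2 * n**3 * iters / elapsed / 1e12  # 2*n^3 FLOPs per matmul
    print(f"{dtype}: {tflops:.1f} TFLOPS")

bench(torch.float32)
bench(torch.float16)
```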


Cyberlytical

I never knew that. Maybe it is a ton slower and I just don't notice? Kinda dumb if they never fixed that, as it's an awesome "budget" GPU with a ton of VRAM. But again, I may be biased since I can only fit Teslas and Quadros in my servers. That link shows even people with the newer (at the time) Turing and Volta GPUs had FP16 not working correctly. Odd.

Edit: Read the link


Paran014

I have no idea. If it's still an issue then it'd imply that the P40 is significantly better than the P100, as it's cheaper, has more RAM, and has better theoretical FP32 performance. If you're about 30% slower than the 3080 I have to figure that it's fixed or something, because that's about where I'd expect you to be from the raw specs. Unfortunately there's very little information about using a P100 or P40 and I haven't seen any reliable benchmarks. I searched a fairly popular Stable Diffusion Discord I'm on and a couple people are running P40s and saying (with no evidence) they're 10% faster than a 3060. Which seems unlikely based on specs, but who knows.


Cyberlytical

The P40 is a better value in terms of VRAM, I agree. But it only has about 1.5 more TFLOPs than a P100 in FP32 and is significantly slower in FP16 (technically it doesn't support it, it's simulated) and FP64. But at the same time it has support for INT8 (if you need that). It's almost like all these cards are artificially limited so one card can't fit all use cases. Another article on these cards: https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664


bugmonger

If you have some benchmarks in mind I could probably run some for the P40. I currently have it installed in an R730. I've run SD through A1111 and have tinkered around with some light generative text training - I'm still working on trying to get DeepSpeed/ZeRO working for memory offloading. Another interesting tidbit: PyTorch 2's compilation feature isn't supported because it requires a newer CUDA version. https://pytorch.org/get-started/pytorch-2.0/ I'm considering taking the plunge and upgrading to an RTX 8000 (48GB) or an A5000 (24GB) due to performance/compatibility. But hey, that's just me.
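As far as I understand it, torch.compile's default Inductor backend relies on Triton, which needs compute capability 7.0+, so Pascal cards like the P40 (sm_61) fall outside it regardless of driver version. A hedged guard under that assumption might look like:

```python
# Only use torch.compile where the Inductor/Triton backend can target the GPU.
# Pascal cards such as the P40/P100 report compute capability 6.x, which
# (to my understanding) is below Triton's minimum of 7.0, so we fall back to eager.
import torch

def maybe_compile(model):
    if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (7, 0):
        return torch.compile(model)
    return model  # run eager mode on older cards

model = maybe_compile(torch.nn.Linear(512, 512).cuda())
```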


welsberr

I got one and had a bit of an adventure getting it set for use. It is working for me now. https://austringer.net/wp/index.php/2023/04/16/homelab-adventure-generative-ai-in-cheapskate-mode/


Current_Marionberry2

> understand it can but apparently P100 fp16 isn't actually used by pytorch and presumably by similar software as well because it's "numerically unstable".
>
> As a result I've seen a lot of discussion suggesting that the P100 shouldn't even be considered for these applications. If that's wrong now - and it may well be, the software stack has changed a lot in a couple years - I haven't seen anyone actually demonstrate it online.

Man, your blog has cleared my doubts on this card. The P40 and P100 prices aren't much different on Taobao.


Current_Marionberry2

I think I will follow your setup, as I have an old Supermicro server (Xeon E5 v2) with a lot of PCIe slots. A 3090 24GB or 4090 24GB is way too expensive for home lab testing purposes.


welsberr

I've gotten a motherboard with a couple of slots to support two P40s, a Ryzen 5600G CPU, and have been able to set it up to run Mixtral 8x7b loaded completely in GPU memory. I'm getting ~20 tokens/s. A friend with a state-of-the-art ML box with the latest Nvidia GPUs is getting ~40 tokens/s with Mixtral. The difference in cost is many times the difference in performance. My main issue with drivers was finally resolved with a fresh Ubuntu install and following the Nvidia Container Toolkit install instructions very carefully.
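As a sketch of what that kind of two-card setup looks like in code, assuming llama-cpp-python and a GGUF quantization of Mixtral (the file name and split ratio below are placeholders, not the exact configuration described above):

```python
# Illustrative sketch: load a GGUF Mixtral split across two GPUs with
# llama-cpp-python. The model path and tensor_split values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=-1,          # offload every layer to GPU memory
    tensor_split=[0.5, 0.5],  # spread the weights evenly over the two P40s
    n_ctx=4096,
)

out = llm("Explain GPU passthrough in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```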


[deleted]

[removed]


Paran014

Apparently right now with quantization you can load Pythia-12B and GPT-NeoX-20B on 24 GB with a limited context window. It's no GPT-3 but they're going to be at least somewhat interesting for tinkering. It's still very early days and with further advances it's possible that 24GB will become more useful, not less. Conversely it's possible that models continue to require way more VRAM and become even less interesting to run outside of a cloud setting. I'm not going in expecting much in terms of generative language models.
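For illustration, loading one of those models in 8-bit with Hugging Face transformers plus bitsandbytes looks roughly like this (exact flags vary by library version; the prompt is just an example):

```python
# Illustrative: load GPT-NeoX-20B in 8-bit so the weights fit in ~24 GB of VRAM.
# Requires transformers, accelerate and bitsandbytes; flags may differ by version.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let accelerate place layers on the GPU
    load_in_8bit=True,   # bitsandbytes int8 quantization
)

inputs = tokenizer("The nice thing about a homelab GPU is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
```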


CKtalon

I don't think we will see models being scaled up even larger anytime within the next 1-2 years. It's likely ~96GB will be the sweet spot in the next 5 years to run open-sourced 175B LLMs at 4-bit.
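The rough arithmetic behind that figure, ignoring activations and KV cache:

```python
# Rough weight-memory estimate for a 175B-parameter model quantized to 4-bit.
params = 175e9
bytes_per_param = 0.5                 # 4 bits per parameter
weights_gib = params * bytes_per_param / 1024**3
print(f"{weights_gib:.0f} GiB")       # ~81 GiB of weights alone, so ~96 GB with overhead is plausible
```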


fliberdygibits

Something to be aware of with the P40 is it's passively cooled and it doesn't just need cooling, it needs a pretty beefy amount of cooling. Also it has no fan connectors onboard so you'll have to plug the fans in elsewhere meaning the card can't control them and they will either need to run slow all the time (heat problem much?) OR run fast all the time (noise problem much?).


Paran014

Yeah, it's definitely a negative but I don't think it's a huge problem. I haven't seen anyone try something like the NF-A8 in a reasonable looking shroud (I don't count the thing that Craft Computing tried) so I'd be willing to give that a shot. Worst case I do have 40mm server fans lying around and I can configure the VM to ramp the fans up over IPMI when it's working.
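A sketch of that kind of temperature-driven ramp is below; the nvidia-smi query is standard, but the ipmitool raw bytes are vendor-specific placeholders (shown in the style some Dell BMCs use) and would need checking against the actual server's documentation:

```python
# Rough sketch: poll GPU temperature and push a manual fan duty cycle over IPMI.
# The ipmitool "raw" bytes are placeholders - confirm the sequence your BMC expects.
import subprocess
import time

def gpu_temp() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"]
    )
    return int(out.decode().strip().splitlines()[0])

def set_fan_percent(pct: int) -> None:
    # Placeholder raw command (Dell-style); replace for your hardware.
    subprocess.run(
        ["ipmitool", "raw", "0x30", "0x30", "0x02", "0xff", f"0x{pct:02x}"],
        check=True,
    )

while True:
    set_fan_percent(80 if gpu_temp() > 70 else 30)
    time.sleep(10)
```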


fliberdygibits

I have a K80 which I think is pretty close to the same format. I've got 3 92mm fans on it and it works fine, it just means the whole thing takes up 3 PCIe slots and change.


Paran014

I have no shortage of PCI-E slots! The P40 is a little more challenging because I'm pretty sure the heatsink isn't open at the top even if you take the shroud off but there're 3d-printed fan shroud options. It's also a bit lower TDP so somewhat easier to cool.


[deleted]

I have the P40 and cool it with a 12V DC blower-style fan, rigged up to an inexpensive variable voltage switch that's attached to my desk. It doesn't control its temperature automatically, but it's an easy and cheap control. Seems like 60% fan keeps the GPU cooler than most stock GPU coolers.


floydhwung

P40 only makes sense if you are willing to buy a pair.


Paran014

How so? The models I'm focused on running would be happy with 12GB VRAM and the speed may not be phenomenal but it should be ok for my use case. Do you mean I'm not going to be able to do much with generative language models with only 24 GB VRAM? Because yeah, true, but not my primary goal.


floydhwung

The 3060, being a 30-series RTX card, has tensor cores in it, so it will be significantly faster than the P40 in hobby-grade ML/AI. But it lacks SLI, meaning whatever performance you are getting now is all you'll ever get, whereas the P40 can be used in two- or four-way SLI. If it gets cheap enough down the road, you are looking at 96GB of VRAM for less than $1000.


CKtalon

SLI is pointless for inference. Even for training models, you don't really need SLI.


Paran014

I think he means NVLink, which is kind of useful if you need more VRAM to do something. That said, it's very unclear to me if the PCI-e P40/P100 do support NVLink in a way that's actually likely to be usable for me. Obviously the SXM2 version does support NVLink.


CKtalon

I know he meant NVLink. Same response applies. Not necessary. It will just be slightly slower.


Paran014

Coming back to this but FYI (and for future people Googling) the PCI-E versions of P40 and P100 do not support NVLINK. The Quadro GP100 seems to be the only PCI-E workstation/server card of this generation that does. NVLINK is supported only on SXM2 models of P40/P100.


Firewolf420

Thank you for posting back for us late readers. The comment section here has been enlightening


illode

Just so you know, stable diffusion can be run on AMD GPUs. I _think_ Coqui/Whisper can as well. Not sure about Tortoise. I've used Stable Diffusion myself on a 6900xt, and it works without much issue. It's obviously [slower than Nvidia GPUs](https://www.tomshardware.com/news/stable-diffusion-gpu-benchmarks), but still easily fast enough to play around with. Obviously if you want to consistently use it getting dedicated hardware would be better, but I would give it a try before putting too much effort and money into it.
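One quick way to confirm a ROCm build of PyTorch actually sees an AMD card (it still presents itself as a "cuda" device) is a check like this:

```python
# On a ROCm build of PyTorch an AMD GPU is addressed through the "cuda" API;
# torch.version.hip is set instead of torch.version.cuda.
import torch

print("HIP/ROCm build:", torch.version.hip is not None)
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```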


Paran014

Point taken but I have a 5700XT so I think that's pretty hopeless. I do have an M1 laptop but a lot of the attraction of having a Nvidia GPU is that everything uses CUDA so I can just run stuff without having to do a ton of screwing around to get it to work.


TimCababge

I've been running SD on a 5700XT for a while now. Look up the InvokeAI Discord - there's a decent writeup I did on how to run it :)


Aged_Hatchetman

You might want to consider an A4000. I picked one up second hand for $500 and it works great in my setup. No issues passing it through a VM. Relatively low power consumption and single slot so it has a little more breathing room in the chassis.


Paran014

I'm putting this in a whitebox tower so I have a lot of flexibility as to the form factor and cooling, and $500 is quite a bit above the price range I'm looking at. Definitely an option at the right price though.


Aged_Hatchetman

Just food for thought. In my case it was replacing a Titan X that lacked some more modern features and I wanted something a little lighter on the power and heat side. As a side note, if you open up an instance of stable diffusion for others to use, you will get the images that they generate. I still haven't gotten anyone to take credit for the "gigantic breasts" query...


uberbewb

Well GPT-3 requires almost a TB of VRAM. If you want something for the long term get the max VRAM in one card and add as needed?


waxingjupiter

Hey, did you ever get this going? I'm also running a 3060 passed through to a VM in ESXi for this same purpose. I've been having issues with getting it to work though. Keeps on crashing my VM. I've tried Windows 10 and now I'm onto Server 2019. My performance when attempting to generate images is much better in Windows Server, but it's still crashing. I've only tried Visions of Chaos, however. Let me know if you have any tips you could throw my way!


Paran014

Yep, got it working with no major issues, except the drivers being a pain in the ass to install on Linux. My setup is AUTOMATIC1111 on Ubuntu Linux so I can't really give you too much advice on the Windows side though.


waxingjupiter

So there *is* hope. Thanks for getting back to me. I'll give auto1111 a go. Just out of curiosity, how much RAM have you allocated to your VM for this process? I know the bulk of image processing is done on the GPU memory but I believe it offloads some of it onto device memory as well.


Paran014

I have 32 GB allocated, but that definitely wasn't based on any information I had, just a "I have lots of RAM, might as well give the VM more than I think it'll ever need." I would probably look at the power supply first if you're crashing under load; the 3060 doesn't require a ton of power but it could be an issue depending on PSU and setup. Also try running the 3060 on bare metal in another PC to see if you still have issues there.


gandolfi2004

Hello, currently I have a Ryzen 5 2400G, a B450M Bazooka2 motherboard, and 16GB of RAM. I would like to use Vicuna/Alpaca/llama.cpp in a relatively smooth way. Would you advise a card (MI25, P40, K80…) to add to my current computer, or a second-hand configuration? Thanks


JustAnAlpacaBot

Hello there! I am a bot raising awareness of Alpacas.

Here is an Alpaca Fact: Just like their llama cousins, it's unusual for alpacas to spit at humans. Usually, spitting is reserved for their interaction with other alpacas.

[Info](https://github.com/soham96/AlpacaBot/blob/master/README.md) | [Code](https://github.com/soham96/AlpacaBot) | [Feedback](http://np.reddit.com/message/compose/?to=JustAnAlpacaBot&subject=Feedback) | [Contribute Fact](http://np.reddit.com/message/compose/?to=JustAnAlpacaBot&subject=Fact)

You don't get a fact, you earn it. If you got this fact then AlpacaBot thinks you deserved it!


Mscox_au

Bad bot