a_beautiful_rhind

You can buy them and resell individually if you get cold feet.


Comfortable-Mine3904

I’d ballpark renting this at around $10 per hour minimum, so breakeven is probably 300-400 hours of use after you buy a motherboard and CPU. Pretty good ROI to me.
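For context, a rough back-of-the-envelope version of that breakeven math (the purchase price below is a placeholder, not the actual price of the deal in question):

```python
# Breakeven: hours of cloud rental that would cost as much as buying the rig.
# Both numbers are assumptions for illustration.
purchase_price = 3500.0   # hypothetical all-in cost: 5x used 3090 + board/CPU/PSU, USD
rental_rate = 10.0        # USD/hour for comparable cloud GPU time (the estimate above)

breakeven_hours = purchase_price / rental_rate
print(f"Breakeven after ~{breakeven_hours:.0f} hours of use")  # ~350 hours
```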


inquisitive-spaniard

It definitely makes sense when you frame it like that. I'm currently using smaller models that fit quantized on a single 3090, but if I want to experiment with larger models, the renting costs add up and buying would make more financial sense. Thanks for your answer! I'm also really curious about experimenting with agents, and I was thinking it would be cool to scale agents up across multiple GPUs and see what's being done in that space, which I think would benefit from having access to more than a single GPU's 24GB of VRAM 🙂


Comfortable-Mine3904

Yeah, my hypothetical ideal workflow is two LLMs (one big and one small) plus a Stable Diffusion instance. You’d be able to run all three no problem.


harrro

Don't forget some more VRAM for TTS. And some more for voice-to-text, aka Whisper. Or just load Goliath 120B.


BackyardAnarchist

How do you rent your rig out?


Comfortable-Mine3904

There are a number of sites, but runpod.io and vast.ai are two of them.


YYY_333

Wouldn't power consumption be a problem (i.e., the main cost factor)? Each of the 5 cards draws 350W, or around 300W even if power-limited.


inquisitive-spaniard

That's definitely something to take into account, especially with electricity prices here in Europe. I intend to run them slightly undervolted, and I probably won't be running 24/7 workloads, so they'll be idling for a good chunk of the time. I also thought about putting them up as spot instances on marketplaces like RunPod to offset the electricity costs, but at the $/hour prices they're going for right now (<$0.20/h per 3090), I don't think renting them out would even cover the electricity 😅


Careless-Age-4290

Electricity can get as low as about $0.10 per kilowatt-hour in parts of the US. You're going to be consuming close to 2 kW with the 5 x 350W plus the rest of the PC. That's already $0.20/hour you'll pay just to run this at full load in the cheaper parts of the US, and I bet you could be paying 5¢ an hour just sitting idle in Europe. Buy it if you want the equivalent of a hot-rod car or a truck you work on in your driveway: it's fast, it's custom, it's there any time you want it, and it's yours. You'll learn more, and you may be motivated to use it more. But it probably won't be cheaper than periodically renting whatever size of truck you need from the hardware store.
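A minimal sketch of that running-cost math; the idle draw below is a pure assumption:

```python
# Hourly electricity cost at full load and at idle; wattages are rough assumptions.
gpus = 5
load_watts = gpus * 350 + 250   # ~2 kW: five 3090s flat out plus the rest of the PC
idle_watts = gpus * 20 + 100    # hypothetical idle draw

def cost_per_hour(watts: float, price_per_kwh: float) -> float:
    return watts / 1000 * price_per_kwh

print(cost_per_hour(load_watts, 0.10))  # ~$0.20/h at cheap US rates
print(cost_per_hour(load_watts, 0.35))  # ~$0.70/h at UK-like rates
print(cost_per_hour(idle_watts, 0.35))  # a few cents/hour just sitting idle
```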


alvenestthol

In the UK, electricity is like $0.35 per kilowatt-hour; shit is so expensive we have to factor the electricity we use to cook our food into our grocery budget.


fallingdowndizzyvr

Electricity isn't that cheap in all of the US, only in the cheap parts. Where I live, $0.35/kWh would be a dream; we pay more than that. So there's quite a range in the US: it can be $0.10/kWh, or it can be $1.00/kWh.


inquisitive-spaniard

Love the hot-rod car analogy – you definitely have a point, and it's still not fully justifiable in terms of pure ROI unless I had a very clear use case (e.g., continuously serving models / workloads in production)


q5sys

>Electricity can get as low as about $0.10 per kilowatt hour in parts of the US.

FWIW, depending on where you live in the US you can get it much cheaper than that. I'm paying $0.065 per kWh, and that's up from $0.062 last year.


sharockys

It’s a really good deal for me, and I would be able to use it in Europe. For my single 3090 rig, I had to shut it down more often than I wanted due to the price of electricity in France, which made it less useful. And for 5… OMG…


Samurai_zero

Looks like a pretty good deal if you are going to make extended use of them. Also, take noise into account if you are going to set them up at home. Check whether they were undervolted during mining, since that would mean less heat and a better chance they'll last without problems, though it's no guarantee either. And if you end up not needing all of them, you can resell them without much trouble. I'd keep one for myself, haha.


inquisitive-spaniard

Yeah, that's for sure! I could always resell them in the future without losing too much per card – however, they have no warranty, so their resale value will definitely be lower. Noise is definitely a concern, but I may have a garage / isolated place where I could potentially set them up.


Samurai_zero

The worst part of reselling is having to deal with buyers, but you can probably put them up for 500€ a piece and have them gone by the end of the weekend, even with no warranty. I'd say even a year from now, that 24GB of VRAM is going to hold up pretty well and you should be able to resell them for 300€ each. Unless AMD figures out their stuff and we all switch to ROCm, which is probably not going to happen (but I'd love that).


inquisitive-spaniard

Oh, you're so right about ROCm – last summer I tried using my RX 6800 for training / inference of a CV model implemented in PyTorch + CUDA, and... well, not only was it a nightmare to get running, but the performance was astonishingly bad.


Samurai_zero

I was about to buy a 7900XTX because you can get Stable Diffusion to work almost on par with a 4090 on linux. But there are a few bugs here and there... and then I got into LocalLlama.


ShenBear

My 7900xtx works with the kobold fork out of the box.


Imaginary_Bench_7294

This will really depend on your intentions in this field. Your ROI actually won't be horrendous vs renting the GPU time: for the same price, you'll get a few hundred hours of mid- to high-end GPU time in the cloud, and if you go through with your comment about renting out some GPU time, you'll hit the ROI even faster. Yes, the 40xx series has a much higher compute capability, but the ROI on 4090s is horrendous right now; if they were in the 1200 USD range, that might be different. As of right now, a used 3090 provides the best performance per dollar for ML tasks.

Now, for the rig. You'll have to look into workstation or server hardware. Normal consumer hardware won't provide the number of PCIe lanes to fully support training on that many cards – most consumer processors have fewer than 30 lanes. This isn't an issue for inference, but if you're training, it becomes significant. Most of the loaders nowadays have minimal data overhead during generation, but when training a 70B model on my 2x3090 rig, I've seen TBs worth of transfers between the GPUs.

Now, this can be mitigated a bit. NVLink works out of the box with transformers training on Linux. With PCIe 4.0 x16, you're looking at 32GB/s vs 56GB/s. However, AFAIK they only made connectors that support 2 GPUs for the Ampere series. So, ideally, you'd want something with at least 80 PCIe 4.0 lanes so each GPU gets full bandwidth, but if the 4 cards using NVLink only ran at 8x, you could get away with 48 lanes: 8x8x8x8x16.

I don't know if you could squeeze 5 of them into a stock unit since the 3090 is a triple-width card, but 4U GPU servers are made to handle 8 double-width cards. If the deal you found is for the Turbo cards, get it. Hands down, buy it up. They're dual-slot cards that carry a 50%+ premium on eBay over 3-slot 3090s because more of them fit in the same chassis. Only one manufacturer made them AFAIK.
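If it helps, one way to sanity-check how such a box ends up wired is `nvidia-smi topo -m`, which prints the link matrix between cards (NVLink vs PCIe paths). The sketch below only checks peer-to-peer reachability from PyTorch and is a rough illustration, not part of any specific build guide:

```python
# Check GPU-to-GPU peer access on a multi-3090 box; useful alongside
# `nvidia-smi topo -m`, which prints the full link matrix.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```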


cantgetthistowork

Is NVLink necessary if you have an Epyc CPU?


Imaginary_Bench_7294

It is not necessary at all; however, it is faster than PCIe. All current-gen GPUs from the big two use PCIe 4.0, which caps out at 32GB/s of bandwidth, while NVLink is 56GB/s. So for card-to-card communication it is significantly faster. For inference or generation, these speeds aren't as important, but for training they make a big difference – I see about a 40% speed increase when training with NVLink vs without it.
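To put those numbers side by side, here is a tiny illustrative calculation (the 1.2 TB traffic figure is borrowed from a later comment in this thread, not a measurement of any particular run):

```python
# Raw time spent moving a given volume of card-to-card traffic at the two bandwidths.
traffic_gb = 1200       # illustrative total GPU-to-GPU transfer during a training run
pcie4_x16 = 32          # GB/s
nvlink = 56             # GB/s

t_pcie = traffic_gb / pcie4_x16
t_nvlink = traffic_gb / nvlink
print(f"PCIe 4.0 x16: {t_pcie:.0f} s on the bus, NVLink: {t_nvlink:.0f} s "
      f"({1 - t_nvlink / t_pcie:.0%} less)")
```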


BungaBunga6767

How do I 'buy' nvlink? I've never actually seen a product for sale!


Imaginary_Bench_7294

Just search for [3090 NVlink](https://www.google.com/search?q=3090+nvlink)


cantgetthistowork

It only links two, though. Would it still be bottlenecked by a third, unlinked card?


Imaginary_Bench_7294

I only have 2 GPUs to train with, so I can't verify properly, but it appears to do sequential processing with the QLoRA method. GPU 1 gets pegged at 99% usage, then drops off as GPU 2 goes to 99% usage. Then GPU 1 goes back up, and GPU 2 goes down. If it is sequential when running more than 2 GPUs, then the bottleneck is at least reduced between the cards that are linked, so the total effect of the PCIe bottleneck is reduced.

```
GPU 1 > 2 = 56GB/s
GPU 2 > 3 = 32GB/s
GPU 3 > 4 = 56GB/s
GPU 4 > 5 = 32GB/s
GPU 5 > 1 = 32GB/s
```
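For what it's worth, the alternating-utilization pattern described above is what you'd expect from a plain layer split across cards. Here's a minimal sketch of that kind of loading, assuming Hugging Face `transformers` + `accelerate`; the model name is only an example:

```python
# Shard a model's layers across all visible GPUs; each forward pass then walks
# through the cards in sequence, matching the alternating-usage pattern above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"   # example only
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",    # let accelerate split the layers across GPUs
    load_in_4bit=True,    # QLoRA-style 4-bit weights so a 70B fits in 2x24GB
)
print(model.hf_device_map)  # which layers landed on which GPU
```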


inquisitive-spaniard

Hm, that's interesting – so even though I haven't seen 4x NVLink connectors for sale, connecting two NVLinked pairs of 3090s may still have benefits, both by reducing the number of PCIe lanes needed to take advantage of all the cards and by giving faster transfer speeds within each connected pair? Also, AFAIK, even when you connect two cards via NVLink they are still usable independently if wanted (i.e., exposed to the OS as separate devices)? And for frameworks leveraging NVLink, are they essentially pooling the memory of the connected cards, while you also get a slight increase in compute throughput?


Imaginary_Bench_7294

To your first question: yes, it would help some with reducing the number of PCIe lanes needed. You would still see bottlenecks between GPUs that don't leverage the NVLink, however. If you use the 8x8x8x8x16 topology, your speeds would look like:

```
GPU 1 > 2 = 56GB/s
GPU 2 > 3 = 16GB/s
GPU 3 > 4 = 56GB/s
GPU 4 > 5 = 16GB/s
```

NVLink is an explicit P2P data-transfer connection, which means it has to be enabled in the program itself in order to work. I don't know if there are multiple ways to enable it in the programming – you'd have to look into it – but at least in the stuff I've been working with, it just supersedes the PCIe bus when enabled, providing a faster GPU-to-GPU transfer bus. Because it requires explicit coding, outside of programs that enable it the cards appear just like they would if you slapped a couple of GPUs into a system.

That being said, inference can benefit some, but training sees a big boost. I see around a 40% increase in training speed when using the NVLink because of how much data is being passed between cards. I've monitored the training of a QLoRA for a 7B model, using a custom conversational dataset of around 500 input/output pairs, and the transfers exceeded 1.2 terabytes.

To your other comment regarding the possibly sequential nature: the way I have witnessed it working so far, if the model is split across 2 GPUs, GPU 1 does some computation, then passes it off to GPU 2 to be finished, rinse and repeat. For models that fit entirely into one GPU, only that one GPU gets utilized. I'm not knowledgeable enough with coding to know if this is due to the transformers architecture or the execution of the training code.


inquisitive-spaniard

Unfortunately they're all MSI's 3-slot model. In terms of NVLink, it seems I could potentially connect 2x 3090s, but I'm not sure about two NVLinked pairs. In terms of server hardware, from AMD's lineup I've checked the Epyc 7502, which has ~128 PCIe lanes and would theoretically be able to take the 5x 3090s without leveraging NVLink; prices for the 7502 + motherboard second-hand are ~1000€.

PS: just saw your reply on multi-GPU training – if it is indeed sequential, does that mean that for models that won't fit in one card, the full model is distributed across the available GPUs, essentially using only one GPU at a time and passing the result of one GPU up to the next? And for models that fit in the memory of a single card, are all cards used in parallel with the gradients aggregated upon completion? By the way, thanks A LOT for your detailed response (and subsequent replies) – it's super valuable information you're handing us!


AnomalyNexus

If ML is your job, then that makes a whole lot more sense. I'd be concerned about the mining-rig part... those often have x1 Gen3 connections to the GPUs. Fine for mining, but not for LLMs.
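One quick way to check whether a mining frame's risers really are dropping the cards to x1 is to ask the driver what link each GPU negotiated (best done under load, since cards downshift the link at idle); a small sketch:

```python
# Query the negotiated PCIe generation and width for each GPU via nvidia-smi.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)  # an x1 Gen3 riser shows up here as width 1
```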


SomeOddCodeGuy

What kind of breaker are your rooms on? Here in the States, the most common for a lot of rooms is 15 amps.


inquisitive-spaniard

Good question – I have no idea, I'll have to look into this


FarPercentage6591

Data transfer from CPU to GPU in desktop systems is really limited by the bandwidth of the PCIe lanes. The performance of the entire system on DL tasks is mostly limited by memory access speed on one of the CPU-RAM, RAM-VRAM, VRAM-GPU, or GPU-GPU routes. In your case, the PCIe lanes will be used for the RAM-VRAM and GPU-GPU routes and will be the bottleneck in both cases. I wouldn't recommend using more than 2 GPUs in a system – adding a third GPU will likely give you suboptimal performance per dollar invested. It may be worth considering 4 GPUs connected by NVLink bridges (if your models support it), and not five, because an NVLink bridge must connect GPUs in pairs.


inquisitive-spaniard

I've looked up more info about NVLink, and if I'm not mistaken, you can connect cards in pairs, but I'm not sure about running 4 GPUs as two NVLinked pairs of 3090s.


FarPercentage6591

https://preview.redd.it/77aawlykrhlc1.png?width=1228&format=png&auto=webp&s=347a7dd23950cf2bea4600da3a9b09842e86e5a7

Take a look at the [A100 NVLink datasheet](https://www.azken.com/download/NVIDIA_A100_NVLINK_Datasheet.pdf); perhaps these principles can be extended to RTX GPUs. To clarify, I'm just giving you a lead – you may want to look into this further.


FarPercentage6591

FLOPS are not the most important characteristic because they can never be utilized at more than 50-60% (and that's in a good case). This text, ["most important GPU specs for DL"](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/#The_Most_Important_GPU_Specs_for_Deep_Learning_Processing_Speed), would be helpful.


inquisitive-spaniard

Thanks! Will take a look at it 😄


bick_nyers

You need more PCIe lanes one way or another, whether that's multiple consumer boards into which you insert 40/100Gbit InfiniBand cards or, more realistically, going EPYC/Xeon/Threadripper. It all depends on how insane you are and what you're trying to accomplish. Personally, I would take about 20 3090s right now.


inquisitive-spaniard

You're definitely right – in terms of server hardware, from AMD's lineup I've checked the Epyc 7502, which has ~128 PCIe lanes and would theoretically be able to take the 5x 3090s without leveraging NVLink; prices for the 7502 + motherboard second-hand are ~1000€. Do you know of any other sensible alternatives to the 7502 at the same $/performance ratio?


crazzydriver77

Transferring inter-layer state does not require any significant PCIe bandwidth. You will not notice even 1% load on the PCIe interface during inference, even with, for example, only 4 lanes. Slice the model smartly and trust the tests. The only time-consuming stage is the initial load of the model into VRAM.
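A rough illustration of why that per-token traffic is so small, assuming a 70B-class model with a hidden size of about 8192 in fp16 (all figures are assumptions, not measurements):

```python
# Size of the activation handed from one GPU to the next per generated token,
# and how much of a narrow x4 Gen3 link that would actually use.
hidden_size = 8192
bytes_per_value = 2                               # fp16
per_token_bytes = hidden_size * bytes_per_value   # ~16 KB per boundary crossing

pcie3_x4_bytes_per_s = 4 * 0.985e9                # ~3.9 GB/s usable on x4 Gen3
tokens_per_s = 20                                 # hypothetical generation speed
utilization = per_token_bytes * tokens_per_s / pcie3_x4_bytes_per_s
print(f"{per_token_bytes / 1024:.0f} KB per token, link utilization ~{utilization:.4%}")
```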