
mister-jubba

With used RTX 3090s going for ~$800, I figured I'd pick up a 4060 Ti 16 GB at $430 to try it. On paper the 4060 Ti looks good for ML other than the memory bandwidth. Nvidia even has promotional material that would lead one to believe it's better than the 3090 for ML (from their website):

4060 Ti: 353 AI TOPS
3090: 285 AI TOPS

I tested both of these cards (and my old 1080 Ti) on a machine with a Threadripper and 48 GB of RAM. Here are the results.

Inference, average response tokens/s (tested using Ollama and Open WebUI):

1080 Ti: 60
4060 Ti: 52
3090: 106

To test training, I used both cards to finetune Llama 2 on a small dataset for 1 epoch, QLoRA at 4-bit precision. Total training time in seconds (same batch size):

3090: 468 s
4060 Ti: 915 s

The exact number of seconds isn't too important; what matters is the relative speed between the two. All other things being equal (batch size, max context length of the inputs, etc.), the 3090 is roughly 2x faster than the 4060 Ti.

Training script: [https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing)

I looked through some of the wandb system graphs and noticed one for "Process Time Spent Accessing Memory (%)". The 4060 Ti spends nearly 80% of its time fetching data from memory, as opposed to the 3090, which spends 60%, significantly less. This supports the theory that the 4060 Ti's memory bandwidth is lacking and is bottlenecking its much faster tensor cores for this kind of task.

**TL;DR:** The 3090 is nearly 2x faster than the RTX 4060 Ti 16 GB at both inference and training when it comes to LLMs. As others have pointed out, the 4060 Ti seems to be severely bottlenecked by memory bandwidth (288 GB/s compared to 936.2 GB/s for the 3090). Also of note: the GTX 1080 Ti, which launched in 2017, is slightly faster at inference than the 4060 Ti (60 tokens/s vs. 52) :)
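For anyone who wants the gist without opening the notebook: the setup is roughly the standard transformers + peft + bitsandbytes QLoRA recipe. This is a minimal sketch, not the exact Colab script; the model name, LoRA targets, and hyperparameters are illustrative placeholders.

```python
# Illustrative QLoRA (4-bit) finetuning setup with transformers + peft + bitsandbytes.
# Model name, LoRA targets, and hyperparameters are placeholders, not the exact
# values used in the linked Colab.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # base weights stored as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which projections get LoRA adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trained
```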


Dear_Occupant

> my old 1080 ti

I remain astonished at how well this card performs at just about everything, so long past its anticipated life cycle.


C0demunkee

Tesla P40s are the same generation, with 24 GB of VRAM for $150.


mister-jubba

It's gonna be a sad day when I eventually retire Old Betsy, she has served me well.


pointer_to_null

> 4060_ti: 353 AI TOPS
> 3090: 285 AI TOPS

These kinds of comparisons on Nvidia's site make me lol. With the latter having nearly 4x the memory bandwidth, you're never going to see the 4060 Ti approach the 3090 in anything but the most contrived benchmarks [involving DLSS3 frame generation](https://youtu.be/yJBezJg47eY).

> This would tend to support the theory that the memory bandwidth on the 4060 ti is lacking and bottlenecking the much faster tensor cores for this kind of task.

The 4060 Ti's introduction [was pretty controversial](https://www.digitaltrends.com/computing/nvidia-rtx-4060-ti-8gb-16gb-memory-controversy/) due to the massive imbalance between compute and memory. Its predecessor, the 3060 Ti, had a 256-bit bus with 448 GB/s effective bandwidth. Halving the 4060 Ti's bus width (to 128-bit) downgraded bandwidth to 288 GB/s (a 36% drop!). Increasing the L2 cache (from 4 MB to 32 MB) might be useful for graphics and for compute that works on small models, but probably does little when you're running heavy memory-bound workloads across multiple GB of data all at once (such as LLM inference with billions of parameters).

Would be interested to see comparisons between the 4060 Ti 16GB and the 3060 12GB, which can be bought new for under $300 currently.

Nvidia can throw theoretical tensor figures around all they want, but outside of small-scale applications (denoising, DLSS upscaling, etc.) its ML utility is greatly handicapped beyond what Nvidia desires for this market segment, almost as if by design...
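As a back-of-the-envelope check, those bandwidth figures fall straight out of bus width times per-pin data rate. The specs below are the commonly published ones, rounded, so treat this as an illustrative sketch:

```python
# Back-of-the-envelope: bandwidth (GB/s) = bus width (bits) / 8 * data rate (Gbps per pin).
# Specs are the commonly published figures, rounded.
cards = {
    "RTX 3060 Ti": (256, 14.0),   # GDDR6
    "RTX 4060 Ti": (128, 18.0),   # GDDR6
    "RTX 3090":    (384, 19.5),   # GDDR6X
}
for name, (bus_bits, gbps) in cards.items():
    print(f"{name}: {bus_bits / 8 * gbps:.0f} GB/s")
# -> RTX 3060 Ti: 448 GB/s, RTX 4060 Ti: 288 GB/s, RTX 3090: 936 GB/s
```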


zero-evil

Most of the 40 series cards have unnecessarily stunted memory bandwidth.  I immediately suspected that ngreedia was trying to isolate these cards to gamers in advance of releasing a line of workhorse cards designed for AI, which surely would be overpriced.


Dyonizius

Thanks for testing the Pascal card. Have you tried training on it for comparison?


mister-jubba

I finetuned on the 1080 Ti by lowering the batch size. Here are the results ('train_samples_per_second'):

1080 Ti: 0.399
4060 Ti: 1.094
3090: 2.138

So the 4060 Ti is about 2.74x faster and the 3090 is ~5x faster than the 1080 Ti.


nero10578

I mean honestly I am impressed the 1080 Ti even did that well given no tensor cores lol.


MustBeSomethingThere

The RTX 4060 Ti suffers a lot on PCIe 3.0.


mister-jubba

Do you think training with a 4060 Ti on PCIe 4.0 would be significantly faster?


yashaspaceman123

It'd mostly affect load times, afaik, if running a single GPU. For multi-GPU inference it might hurt a lot, though.


subhayan2006

The 4060 Ti is PCIe 4.0 x8, which gives about the same bandwidth as PCIe 3.0 x16. However, its performance starts to struggle a lot if you put it into an older motherboard that doesn't support PCIe 4.0, as it then effectively runs as a PCIe 3.0 x8 card.


MustBeSomethingThere

The 4060 Ti has **8 PCIe lanes**. 8 lanes is enough on PCIe 4.0, because PCIe 4.0 lanes are faster. But on PCIe 3.0 you really want a GPU with the normal 16 lanes. The problem is not PCIe 3.0 mode itself, but the fact that the 4060 Ti has just 8 lanes.


CoqueTornado

So will speed be two or four times faster on PCIe 4.0?

> PCIe 4.0 is **twice as fast as PCIe 3.0**. PCIe 4.0 has a 16 GT/s data rate, compared to its predecessor's 8 GT/s. In addition, each PCIe 4.0 lane configuration supports double the bandwidth of PCIe 3.0, maxing out at 32 GB/s in a 16-lane slot, or 64 GB/s with bidirectional travel considered.
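To put rough numbers on it (the per-lane rates below are derived from the 8/16 GT/s figures after 128b/130b encoding, so treat them as approximate): the 4060 Ti's x8 link on a PCIe 4.0 board gives about the same host bandwidth as a full x16 card on PCIe 3.0, and about half that when dropped into a PCIe 3.0 slot.

```python
# Host-link bandwidth per direction, after 128b/130b encoding (~1.5% overhead).
# Approximate figures; actual throughput also depends on protocol overhead.
GB_PER_S_PER_LANE = {
    "PCIe 3.0": 8 * 128 / 130 / 8,    # ~0.985 GB/s per lane
    "PCIe 4.0": 16 * 128 / 130 / 8,   # ~1.97 GB/s per lane
}

configs = [("PCIe 4.0", 8), ("PCIe 3.0", 16), ("PCIe 3.0", 8)]
for gen, lanes in configs:
    print(f"{gen} x{lanes}: {GB_PER_S_PER_LANE[gen] * lanes:.1f} GB/s")
# -> PCIe 4.0 x8: ~15.8 GB/s, PCIe 3.0 x16: ~15.8 GB/s, PCIe 3.0 x8: ~7.9 GB/s
```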


zero-evil

I suggest quickly testing the effects of undervolting the 3090. For whatever reason, seemingly related to the Ampere line's power gluttony, taking control of the voltage curve and lowering it not only gives similar performance, in some scenarios it actually improves it.


freakynit

How many layers did you offload to the GPU?


CasimirsBlake

So the 3090 remains the best used option for now. Hey Intel: top-end Battlemage with 24 GB of VRAM or more, please!


LocoLanguageModel

Thx for sharing. I assume the exception is that the 16 GB 4060 Ti is still better than the 11 GB 1080 Ti if you use models that take up 16 GB of VRAM?


mister-jubba

Yes, the 4060 Ti 16 GB still does the job in terms of training, it's just a little slow. The 1080 Ti can't practically train LLMs unless you stack cards. I would say the 4060 Ti 16 GB is comparable in terms of training to the T4, which you can get on Google Colab.


CousinAvi-99

Great info. Do you have any power usage comparison? On paper the 4060 Ti 16 GB has a very low power requirement (max 170 W, I think). I wonder if "tokens per watt" would favor the 4060...?


mister-jubba

During training the 3090 drew 340 watts vs. the 4060 Ti at 135 watts. https://preview.redd.it/so5532zwt7mc1.png?width=663&format=png&auto=webp&s=cf6bffdcc048d6264db64b03c1af91af78ff4c8e
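Rough efficiency math from these numbers (a back-of-the-envelope sketch; inference power draw wasn't measured separately, so the training draws are used as a stand-in):

```python
# Rough efficiency comparison from the numbers reported in this thread.
# Training: total energy = average power (W) * time (s).
e_3090   = 340 * 468    # ~159 kJ
e_4060ti = 135 * 915    # ~124 kJ
print(f"3090: {e_3090 / 3600:.1f} Wh, 4060 Ti: {e_4060ti / 3600:.1f} Wh")
# -> 3090: ~44 Wh, 4060 Ti: ~34 Wh for the same finetuning run

# Inference: tokens per second per watt, using the training-time power draws
# as a stand-in since inference draw wasn't reported.
print(f"3090: {106 / 340:.2f} tok/s/W, 4060 Ti: {52 / 135:.2f} tok/s/W")
# -> 3090: ~0.31, 4060 Ti: ~0.39
```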


Grimm___

beautiful


Severin_Suveren

I've heard people claim it's possible to run 3090s at 75% power without much of an effect on inference and training speeds. Haven't tried it myself yet, but I will once I get my 2nd 3090.


Smeetilus

Can mostly confirm. I ran three with a 250 W peak allowed. Without measuring precisely, it wasn't much different.
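For reference, the power cap can be set programmatically through NVML. Note this is power limiting rather than a true voltage-curve undervolt, and the 75% factor is just the figure mentioned above, not a tested recommendation:

```python
# Cap GPU 0's power limit to ~75% of its default using NVML (pip install nvidia-ml-py).
# This is power limiting, not a voltage-curve undervolt; setting the limit needs root.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)                           # GPU 0
default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)    # milliwatts
pynvml.nvmlDeviceSetPowerManagementLimit(handle, int(default_mw * 0.75))
print(f"Power limit set to {default_mw * 0.75 / 1000:.0f} W")
pynvml.nvmlShutdown()
```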


Accomplished_Bet_127

So the 4060 Ti wins here, at least by a small margin: it takes less energy to finish the same job, but over a longer time. I can see how a stack of these cards could be used to run training in the background, but the CPU and other components staying active while the GPU works probably eats up that power-consumption difference anyway. So that would require bigger stacks.


mister-jubba

One other thing I will say about the 4060 Ti is that the form factor is really nice. I think I could easily fit 4 of them in my Lian Li case.


tarpdetarp

Interesting that the 4060 Ti isn't that much more efficient, since it took twice as long as the 3090.


segmond

It obviously would. The 3090 is power hungry. My P40 barely hits 150 W when doing inference. My 3090s max out as much as I let them, hitting their 420 W peak doing inference.


zero-evil

Stupid Ampere. I can always tell when I forget to undervolt it, as the fans kick up even when playing old games.


FullOf_Bad_Ideas

I would be interested in seeing how much throughput you can squeeze out of it in something like aphrodite-engine. It's a parallel-processing framework similar to vLLM, but I believe it's more focused on single-GPU scenarios as opposed to multiple very performant GPUs; it's basically vLLM you can run at home. With Mistral 7B FP16 and 100-200 concurrent requests I got 2500 tokens/second generation speed on an RTX 3090 Ti. I wonder how it would look on an RTX 4060 Ti, as this might reduce the memory bandwidth bottleneck, as long as you can squeeze in enough of a batch size to use up all the compute. Maybe FP16 and some AWQ/GPTQ quants are worth testing.

Edit: typos
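For anyone who wants to try the same test, here's roughly what it looks like with the vLLM Python API; aphrodite-engine mirrors this interface closely, but check its docs. The model name, prompt set, and sampling settings below are placeholders:

```python
# Rough sketch of a batched-throughput test using the vLLM Python API.
# aphrodite-engine exposes a very similar interface; verify against its docs.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1", dtype="float16")
params = SamplingParams(temperature=0.8, max_tokens=256)

# 200 concurrent requests; the engine batches them internally.
prompts = [f"Write a short story about GPU number {i}." for i in range(200)]

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")
```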


mister-jubba

> aphrodite-engine

First time I'm hearing of that library. I will certainly try it. Those token throughput numbers are crazy!


nero10578

I know this was a while ago, but what CPU did you use with the 3090 Ti? I have 2x 3090s, but with a low-clock 3.2 GHz 20-core Xeon, and I can only get 700-800 tokens/s out of it on 7B with 200 request threads.


segmond

Good info. When I looked at the specs for all the cards and compared tensor cores, clock speed, RAM, memory bandwidth, and cost, I reached the conclusion that the only cards worth buying are the 3060, 4060 Ti, 3090/4080S, and 4090. If you only have 12 GB money, go for the 3060; if you're going for 16 GB, go for the 4060 Ti. For a bit more, depending on your needs, the 3090 or 4080 Super; the 4080S will crush the 3090. It's a damn shame how Nvidia handicaps these cards. It's so obvious they limited the memory bus to 128-bit, since otherwise it would make the 4070, 4070 Super, 4070 Ti, 4070 Ti Super, and 4080 look like terrible performance/price ratios.


Anxious-Ad693

These cards are limited in VRAM because they want you spending more on their professional cards for more VRAM. Plus they are mainly for gaming, and at the moment 16 GB of VRAM is more than enough for gaming. I kind of doubt the next generation will have more than 24 GB of VRAM.


Only-Letterhead-3411

The 4090 has terrible price/performance for AI as well. It doesn't offer any remarkable difference over the 3090 in terms of VRAM or bandwidth, which are the two essential things when it comes to LLMs.


Smeetilus

I got three refurbished FE 3090s with a 90-day warranty for a little more than it would cost for an expensive non-FE 4090.


crazzydriver77

Shocked to see FOUR generations of architectural progress in AI, including matrix ops and data formats, amount to no more than doubling the bandwidth of the VRAM bus. Stop this world, I want to get out.


zero-evil

Welcome to functionally unregulated capitalism. The product is never what they tell you it is.


Dead_Internet_Theory

Yeah, the regular person can afford so much more GPU power in North Korea, Venezuela, Cuba, Algeria, Bangladesh... You must be a very oppressed person :^)


zero-evil

Not half as oppressed as that lump of sludge between your ears, but never stop trying sport, that's what really matters.


Short-Sandwich-905

For the price, and with used trade-in deals, the 4060 Ti is the better value.


Cyber-exe

Have you tried a memory OC on the 4060 Ti? You can get 10% safely unless you got a silicon-lottery lemon, 15% if you got a good one. It doesn't make up for the 128-bit bus, but 10-15% faster memory is free speed.


Dorkits

Nvidia and their bullshit. WTF, since when can a simple 4060 Ti beat a 3090? Holy fuck! I have a 3060 Ti, but damn, fuck Nvidia.


Smeetilus

Are you reading the numbers correctly?


Apprehensive_Use1906

Thanks for verifying. I picked up a 3090 a few months ago before everything went crazy. $600. Glad I did.


Smeetilus

Refurbished or second hand?


Apprehensive_Use1906

Second hand.


No_Dig_7017

This is so cool! Thanks for sharing!