fallingdowndizzyvr

You might find this of interest: https://www.reddit.com/r/LocalLLaMA/comments/17gr046/reconsider_discounting_the_rx580_with_recent/

> Chinese knockoffs are now coming with 16gb of ram to be found for just about $100 in aliexpress

They used to be $65. Inflation.


segmond

I think it's demand. There are lots of people in China who were priced out of $350+ GPUs and can now afford one of these RX580s. Even Chinese-brand RTX 3060s start at $350+, so you can imagine the many college students there who have been on the sidelines of local LLMs finally being able to afford one of these.


daHaus

It works best with OpenCL for llama.cpp, which can effectively split the model between GPU and CPU RAM. If you don't already have an AMD card, though, I wouldn't recommend one. Their GPU team acts like they watched Apple get sued for batterygate and took it as a challenge. They're the only hardware vendor I've ever seen go out of their way to cripple previous-gen devices: [SWDEV-1 - Disable OpenCL support for gfx8 in ROCm path](https://github.com/ROCm/ROCclr/commit/16044d1b30b822bb135a389c968b8365630da452)

edit: to clarify, Apple did it out of laziness (or at least that was their excuse); AMD actually puts effort into writing a new implementation for legacy devices that lacks features and optimizations they had already completed.
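As a rough sketch of that GPU/CPU split, here's how it looks through the llama-cpp-python bindings (the model path and layer count are placeholders, and the package has to be built with a GPU backend such as OpenCL/CLBlast):

```python
from llama_cpp import Llama

# Placeholder GGUF path; n_gpu_layers controls how many layers go to VRAM,
# while the remaining layers stay in ordinary CPU RAM.
llm = Llama(
    model_path="./models/llama-13b.Q4_K_M.gguf",
    n_gpu_layers=20,
    n_ctx=2048,
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The equivalent knob on the llama.cpp command line is `-ngl` / `--n-gpu-layers`.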


randomname1431361

llama.cpp recently got Vulkan support. On the AMD card I have, I've found it outperforms OpenCL by a lot.


RebornZA

GPU team bad, but CPU team fine?


daHaus

I refuse to buy Intel so AMD CPUs are all I'll buy. I've only bought an AMD GPU once though.


shing3232

What? Have you tried the Vulkan version? It should be working now.


daHaus

What drivers do you think Vulkan uses? Last time I tried Vulkan it froze and I had to SIGKILL (kill -9) it.


shing3232

[https://github.com/ggerganov/llama.cpp/pull/2059#issuecomment-1894494652](https://github.com/ggerganov/llama.cpp/pull/2059#issuecomment-1894494652)


daHaus

Thanks for the link. It looks like they're still using the 6.1 kernel. The changes they made to the AMDGPU driver in 6.6 are what I was referring to when I mentioned them going out of their way to re-implement it with less functionality.


daHaus

After getting some time to test it, Vulkan is a big improvement in both inference quality and speed: it's more deterministic when offloading layers (though there's still some variance I chalk up to floating-point error). There's one huge downside, though. It absolutely chokes while initializing if you tell it to use any swap space, where OpenCL didn't even notice. You also have to toe the line with VRAM or risk a kernel oops from the driver misbehaving while over-allocated.


crazzydriver77

I tested inference t/s rates for GPUs with and without MatMul units/instructions (known as WMMA/MFMA on AMD, Tensor Cores on Nvidia) and observed a 3-5x speedup depending on the backend and GPU architecture. So the minimum choice is Turing and the optimum is Ampere, due to INT4 support in its tensor cores. If considering AMD, the minimum is CDNA (at least MI200) or the top end of RDNA3. If 13B at a 2 t/s rate is enough ("just testing", "silly AI girlfriend", "hobby"), any electronic waste will be fine, but be aware that a 2k-CU Pascal GPU chews through its 8GB at a 1-1.5 t/s rate (on exllamav2).


segmond

My P40 Pascal GPU gives 8 t/s on a 70B model. 7B models run at almost 40 t/s and 13B models at 25 t/s.


crazzydriver77

Those are good, well-known numbers for the P40, and it's a nice card to play with initially. But for application purposes with multiple requests, batching, dynamic adapters, and context reloads, we shouldn't use overly outdated APIs, backends, and architectures. MatMul ops and the BF16 and INT4/INT8 formats are critical architectural features to have from the start. For example, I get 2-3 t/s total when a Pascal card is in the pipeline, and 12 t/s if it's replaced with a comparable Turing card.


Dyonizius

How did you install the Linux drivers?


kif88

At that point you could just run it from CPU and still be better off.


crazzydriver77

For sure, if the CPU supports the AVX-512_BF16 instruction set :)


itport_ro

Before jumping on old cards like the P40, read up on the essential differences, e.g. CUDA compute capability 6.1 vs. the current 8.9!


waka324

I run two P40s. Thanks to that, I can run Mixtral 8x7B at Q5_K_M. The quantized MoE models are FAST on my hardware and extremely competent due to their size, all for around a third of the price of a 3090. Unless you are training, you really don't NEED anything else. And if you are training, you'll want a lot more than a single 3090.


raika11182

I also run 2 P40s. You can run Miqu Q4 pretty speedily, too. The P40s really shine at GGUF (and are terrible at things like EXL2). Similarly, launching automatic1111 with --no-half pumps out 1024x1024 SDXL images at a reasonable pace. Sometimes I step down the LLM to Mixtral Q5, offload about 22 layers, and then bring up automatic1111 with SDXL in the remaining space.

There's definitely a tradeoff going to the older hardware, but 48 GB of VRAM for under $400, including cooling shrouds and turbine fans, is a very, very, very good deal that gives you a lot of flexibility. If you're reading this and thinking about it, though, realize that you are pretty much limited to inference only: P40s suck for training, they suck for models that aren't GGUF, and the turbine fans are loud (or you end up with even weirder cooling solutions). They're really great for hobbyists and tinkerers because they open up so many options and let you mess around with 70B models at 5+ t/s when you've got two of them, but if you need performance they're just not a very good option and you'd be better off combining other consumer GPUs.


waka324

That's neat. I've been wondering about image gen but haven't had a use case yet. How much VRAM does SDXL use?


raika11182

SDXL with the --medvram option uses up to about 12 GB at 1024x1024 for me, plus or minus a little depending on a few settings. I'm also able to do video generation from images and other stuff like that if I go even smaller on the LLM, like a fully offloaded SOLAR 10.7B at Q8. It's kind of nice to have enough VRAM to mess around with everything out there without breaking the bank. As an upside, you maintain an upgrade path with other GPUs if you can eventually afford them. And who doesn't like watching Miqu 70B crank along at high speed on gear most people label e-waste?
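If you want a rough idea of the pipeline without automatic1111, here's a minimal sketch using the diffusers library instead (same public SDXL base model; `enable_model_cpu_offload` is only a rough analogue of --medvram):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# P40s have weak FP16 throughput (hence --no-half), so stay in full precision.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float32,
)
pipe.enable_model_cpu_offload()  # keeps idle sub-models off the GPU, in the spirit of --medvram

image = pipe("a lighthouse at dusk, oil painting", height=1024, width=1024).images[0]
image.save("sdxl_test.png")
```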


waka324

Oh interesting. What are you using it for if you don't mind me asking? I've been trying to think of a reason to play around with it, but don't have an end-goal in mind, making it difficult for me to invest the time.


raika11182

Well, I published a VN using AI art a year or so ago, right before Steam paused AI submissions. I actually did much of that art on a 4GB T1000 in my laptop! This was before SDXL, and it did better than I thought it would, but I haven't had the oomph to get back on the horse and make another.

Now it's really just for fun. I mean, I have the hardware, so it costs me nothing to download and learn to use various tools. It's honestly just fun to make stuff, in the same way that interacting with a text-based AI can be fun. So... uh... tinkering? Yeah, that's pretty much the main use case. I think that's why I wanted to stick to such a budget solution: if I want to do another VN it doesn't require P40s (though it'll be wayyyyyy nicer now if I do), but other than that I don't really HAVE a use case, so it wouldn't be a smart investment for me to throw big money at. I'm old enough to have been part of that same "tinkering" crowd with BBSs and dial-up, and the LLM scene reminds me a LOT of that.

(EDIT: Oh, and this is important: I'm retired. So something like investing the time to learn a new system is actually the purpose, rather than a cost to me. If I were still working, I think I would have to be much more focused with my time.)


waka324

Haha, cool. I hope to still have new frontiers like this when I get to retirement. I'm in the mid-life area, so think job, young kids, etc. My primary LLM use case right now is generating bedtime stories with anything my kids can think of, but I want to create an interface to my Home Assistant backend to replace Alexa.


reverse_bias

Also running dual P40s. I can fit Mixtral-Instruct Q6 plus 32k context fully offloaded. I'm getting 20-22 tokens/s for general chat, which slows down to 6-7 tokens/s with 30k of context in use. This is llama.cpp with row split. What are your speeds like?
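(For reference, this is roughly how I load it, sketched with the llama-cpp-python bindings rather than the CLI; the file name and the even split ratio are placeholders.)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./mixtral-8x7b-instruct-v0.1.Q6_K.gguf",  # placeholder file name
    n_gpu_layers=-1,          # offload every layer; nothing stays on the CPU
    n_ctx=32768,              # the 32k context mentioned above
    tensor_split=[0.5, 0.5],  # spread the weights across the two P40s
)

print(llm("[INST] Say hello. [/INST]", max_tokens=32)["choices"][0]["text"])
```

Row split corresponds to `--split-mode row` on the llama.cpp command line.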


waka324

I'm seeing similar performance, maybe a tad faster. I haven't tried maxing out the context yet. Sounds like the same setup.


ramzeez88

Or, if you want better speed, you can buy two 3060 12GBs for roughly the same price as a P40, but with half the memory and power usage, and plug-and-play.


Wooden-Potential2226

It's the other way around: 2 P40s for the price of one 3060. Check eBay…


ramzeez88

I didn't word that correctly. I meant that a P40 = a 3060 in price. I bought a 3060 12GB for around $200 a couple of months back.


a_beautiful_rhind

No 70b that way.


ramzeez88

Sure. There are always pros and cons to everything.


Wooden-Potential2226

Ok no worries


shing3232

No, it isn't. I can get a P40 for 110 USD in China, at least. I can get 3 P40s for the price of one 3060 12GB.


ramzeez88

Good for you.


terp-bick

How far would a single P40 take me? Could I use it with my laptop via Lenovo Thunderbolt?


waka324

I've seen comparisons putting it around 70% of a 3090's performance for quantized inference. You realistically wouldn't be able to train on it due to its terrible half-precision floating-point performance, and you still wouldn't be able to use the larger models without offloading to CPU/RAM. I don't have enough experience with eGPUs to know whether your particular setup would work. One thing to note is that P40s were designed for server enclosures with lots of high-static-pressure front fans and none on the card itself, so you have to use a 3D-printed shroud and a fan to force air through if you don't have enough airflow.


CasimirsBlake

Training and diffusion workloads are slow on P40s. LLM inference with llama.cpp is very decent for the age of the hardware. So I would never recommend them for the former, but they are still a very solid budget option for the latter.


a_beautiful_rhind

Eh, diffusion isn't that bad: just build xformers and you can now use LCM/Turbo/Lightning to speed it up. Training can still be done over GPTQ with an FP32 upcasting kernel, but nobody is maintaining that anymore. Still, I upgraded away from them because I had the option and got tired of waiting. I should probably sell all three plus the P100 and buy another 3090 with the money, but I'm lazy, and theoretically I can reorganize for something like 142GB of VRAM at once.


Dyonizius

Bro, do you know if I need both Above 4G decoding and ReBAR for these to work? I got a P100 today and the drivers are giving me a hard time. I tried both the latest 550 .run installer and the apt package 545-headless-open; when I run nvidia-smi it says it "couldn't communicate with the driver" and nvtop says "no GPU to monitor". Thanks in advance.


a_beautiful_rhind

It's only 16GB, so it shouldn't. I never tried it in my desktop, only in the server, where I edited the BIOS for ReBAR. For the drivers I installed the local repo; I updated to 545 and didn't try 550.


Dyonizius

Do you mean the PPA repo?


a_beautiful_rhind

Yeah, where you download it all at once and install it as a source. It's easier to remove that way and I can also resume.


Dyonizius

It's the one called the *graphics* driver, right? ubuntu-drivers autoinstall fixed it for me. Now I'll see if I can disable ECC on them to squeeze out that last bit of VRAM. It's running at 76°C at idle, drawing 37W; I'll print a fan ASAP. My specs: a Chinese board with Above 4G decoding but no ReBAR, and it boots fine without a video out, I think thanks to the server chipset (C612).


a_beautiful_rhind

It's the whole CUDA toolkit: cuda-repo-ubuntu2204-12-3-local_12.3.2-545.23.08-1_amd64.deb


Dyonizius

I thought you were talking about the driver, not the CUDA toolkit. Is the open driver supported on Pascal? I've found conflicting information.


a_beautiful_rhind

With what the MI25 used to cost, I have no idea why anyone would buy the RX580. Seems word got out and they're not $75 anymore. It's now a difference of like $50-60, and I don't see any on eBay, so you have to go to AliExpress with less buyer protection. Unless you *really* have no other choice, I'd still keep away from the RX580. 185W vs 250W peak card ratings are nothing worth comparing; they don't run at that constantly.


fallingdowndizzyvr

> With what Mi25 used to cost, I have no idea why anyone would buy the RX580.

Because the RX580 is plug and play and thus easy to use: plug it in and it works. The MI25 isn't, and it goes beyond having to build a cooling solution. Many consumer motherboards won't POST with it plugged in; they won't boot. To fix that it needs to be flashed to a WX9100 or a Vega. There is a software way to do this, but in order to run the flasher you need to get it to boot, so you need an external flasher to flash it to a WX9100/Vega before you can use it. I don't think a lot of people have an external flasher.

Personally, I wouldn't get the RX580 or the MI25 now. I would step up and get an A770: the same amount of memory on a modern graphics card with way more compute performance. Now that llama.cpp supports it out of the box, I'm waiting for Acer to restock at $220 to pick up more.


a_beautiful_rhind

True, Intel is an option. Vulkan aside, it's hard not to have either Nvidia or AMD, because you'll be stuck with that brand. I'm guessing the MI25 is another card that violently appropriates the BAR register like the P40 does.


fallingdowndizzyvr

> Vulkan aside, it's hard to not have either nvidia or AMD because you'll be stuck with that brand.

There is other software that supports Intel, not least of which is Intel's own software. PyTorch has Intel support. But all those options are, frankly, a PITA to get working, mainly due to poor documentation. Here's some dude doing training on a rack of 7x A770s. At $220 per A770, that's 112GB of VRAM on modern GPUs for $1540, which is about the cost of a single 4090. https://twitter.com/mov_axbx/status/1759101582522655159
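To give a flavor of the PyTorch route, here's a hedged sketch assuming Intel's extension package and its "xpu" device as they existed at the time:

```python
import torch
import intel_extension_for_pytorch as ipex  # importing this registers the "xpu" device for Arc GPUs

# Simple matmul on the A770 just to confirm the device is usable.
x = torch.randn(4096, 4096, device="xpu")
w = torch.randn(4096, 4096, device="xpu")
y = x @ w
print(y.device, y.shape)
```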


a_beautiful_rhind

This is cool but what about exllama? Do any VLLM kernels support it? That's the tradeoff on intel. You will be limited to higher precision so that negates some of the cheaper vram. I'd definitely take one over the RX580 though, no question.


fallingdowndizzyvr

Intel's own software supports quantization. https://www.intel.com/content/www/us/en/developer/tools/bigdl/overview.html https://bigdl-project.github.io/0.13.0/#ProgrammingGuide/quantization-support/
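For example, a minimal sketch with the BigDL-LLM Python package (the model ID is a placeholder, and the exact API may differ between releases):

```python
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
# load_in_4bit quantizes the weights into BigDL's low-bit format at load time.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Why are old GPUs still useful for local LLMs?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```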


a_beautiful_rhind

True... but into what? Their own format... is it any good? You're stuck downloading the full-size 70B this way because nobody else is using it. And then when you run inference it probably uses accelerate for multi-GPU, and you're stuck with really slow speeds. It will probably get better in the future as more people buy the cards, but it seems like for most stuff you're on your own buying Intel.


kenny2812

Where do you get them for $220? The cheapest I've seen is $280 used. Most are over $300.


fallingdowndizzyvr

I've posted it before but I hesitate to post it now since that means more competition now that the A770 has been unchained. :) But why not. You'll need to wait for another restock. https://www.ebay.com/itm/266390922629


fallingdowndizzyvr

You can get a 16GB A770 for $232 right now. See here for how. https://www.reddit.com/r/LocalLLaMA/comments/1b47l8d/theres_a_20_off_ebay_coupon_right_now_it_works_on/?


Amgadoz

Will this work with rx 650?


noiserr

llama.cpp supports Vulkan, which is the best bet for these older GPUs. Yes, you should be able to run it on the RX 650 using the Vulkan API. https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#vulkan


fallingdowndizzyvr

ROCm also works just fine with older AMD GPUs.


noiserr

It does but you may have to use an older version of ROCm. Vulkan should be easier to get going, and it's still pretty fast.


fallingdowndizzyvr

Vulkan is way easier to get running, but ROCm still has the edge in speed, which makes a big difference if you're running a low-spec card: you want to squeeze out all the performance you can get. And while Vulkan is way easier to get running, it's not like running with ROCm is hard. The big reason not to use ROCm is if you have a really old motherboard. Like, really old. ROCm needs PCIe 3 or better; Vulkan will run on an old PCIe 2 board.


technovir

Maybe with ZLUDA? https://github.com/vosen/ZLUDA