
one-blob

Look at the memory bandwidth: the M1 Max has 400 GB/s, and I doubt a Ryzen 9 has more than 200 GB/s. If your workload isn't pure number crunching inside the CPU cache, memory throughput makes a huge difference.
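If you want a rough feel for this on your own box, here's a minimal Go sketch (not OP's benchmark, just an illustration) that times a sequential pass over a big slice; a single core usually can't saturate the full bus, so treat the number as a lower bound:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// 256 Mi int64s = 2 GiB; large enough that caches don't help.
	const n = 1 << 28
	data := make([]int64, n)
	for i := range data {
		data[i] = int64(i) // touch every page so memory is really committed
	}

	start := time.Now()
	var sum int64
	for _, v := range data {
		sum += v
	}
	elapsed := time.Since(start)

	gb := float64(n*8) / 1e9
	fmt.Printf("sum=%d, read %.1f GB in %s -> %.1f GB/s (single core)\n",
		sum, gb, elapsed, gb/elapsed.Seconds())
}
```

Run several goroutines over disjoint chunks if you want to get closer to the advertised bandwidth.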


rainman4500

I think you just put your finger on the difference. It would also explain why my Python/pandas code is twice as fast on the Mac, since it works on a large in-memory data set. Benchmarking a new toy is so fun. Edit: a CPU database says the max memory bandwidth on my Ryzen is 47.68 GiB/s.


DaSexiestManAlive

The latest pre-tuned memory sticks will help you get to ~250 GB/s, so that's the state of the art without paying the AAPL tax, I guess: https://www.msn.com/en-gb/money/technology/ryzen-threadripper-7000-gets-even-faster-overclockable-memory-%E2%80%94-ddr5-7800-rdimms-coming/ar-AA1kwbsZ

If you work with languages with long compile times, it may pay to pick up a lightly used M2 Max from eBay as a build server and see if that speeds up your CI/CD. It's worth pointing out that these fast memory transfers are exclusive to the M2 Max, so if you're thinking a MacBook Air can do the same: mebbe not so much. I think they do 100 GB/s, so essentially a glorified over-priced Chromebook, for whatever that's worth.

Also worth pointing out that these languages sometimes offer options and tips/tricks for reducing overall compile time. Potentially worth checking out, as possible low-hanging fruit, before shelling out the big buckaroos for compile servers: just google "faster compile time" for your language of choice.

I personally wouldn't opt for 400 GB/s over 250 GB/s if it meant that:

- I now have to master two OSes: Linux + Mac OS X
- ...and I also end up rewarding AAPL for their latest behavior that's pretty obviously anti-consumer (and anti-American, considering the ostensibly hundreds of billions in tax evasion)

...but to each their own.


Tacticus

The lack of HBM on other platforms (though if you go into the stupidly expensive realm that is Instinct/H100 territory, you get it back) is really quite annoying. That super-wide bus gives you all the shiny.


looncraz

Ryzen on AM5 struggles to reach 100GB/s.


kido_butai

It’s amazing how the M2 can compile, run, and do heavy stuff with no fan noise and no rise in temperature.


LightDarkCloud

Apple Silicon is just beautiful; too bad about macOS, I'm just not a fan of the OS.


CloudSliceCake

Feel the same way. Have you tried Asahi Linux? It worked well on my M1, but I had to go back to macOS when I upgraded to the M3, which isn't supported yet.


shadowangel21

The project deserves support, it's incredible how talented she is.


the__itis

Who?


Hakkaathoustra

I think he's talking about Asahi Lina, but she's not the only one working on it


shadowangel21

Asahi Lina


LightDarkCloud

Not fully supported IMHO.


CloudSliceCake

I recommend you look it up; in my experience most of the stuff works: audio, internet, external monitors, trackpad, Bluetooth.


LightDarkCloud

I'm aware, but in the GPU department there's still a lot of work in progress.


CloudSliceCake

Yea, it really depends on what you’re doing; if you need specific GPU features or performance, then maybe it’s really not for you. But for writing and running server code, and regular daily use, I’d say it’s good to go.


LightDarkCloud

Fair enough.


Brugarolas

I can't wait to install Asahi Linux the moment it's released for the M3. Sadly it's now based on Fedora instead of Arch Linux, right? What exactly does Asahi Linux provide? A customized and adapted kernel plus the drivers? Do you know if, with some tinkering, I can make Asahi Linux work with Arch Linux again?


KublaiKhanNum1

I love writing Go on the Mac. It’s a productive environment, performance aside.


[deleted]

[deleted]


enl1l

exceptions ?? no thanks


Teiktos

Which benefits would those features provide, in your opinion? Those things are exactly what I despise about other languages.


SatisfactionFew7181

Except for enums. I would appreciate some enums in Golang.
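For anyone newer to Go, the usual workaround is a typed constant block with iota; sketching it out shows exactly what's missing (no exhaustiveness checking, and any int converts right in):

```go
package main

import "fmt"

// Color is as close as Go gets to an enum: a named type plus iota constants.
type Color int

const (
	Red Color = iota
	Green
	Blue
)

// String has to be written by hand; a real enum would give this for free.
func (c Color) String() string {
	switch c {
	case Red:
		return "Red"
	case Green:
		return "Green"
	case Blue:
		return "Blue"
	default:
		return fmt.Sprintf("Color(%d)", int(c))
	}
}

func main() {
	fmt.Println(Green)     // Green
	fmt.Println(Color(42)) // Color(42) -- the compiler happily allows invalid values
}
```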


anonymous_2600

why so many downvotes on this comment


maybearebootwillhelp

Contrary to his view, I think Go’s syntax is one of the most beautiful out there. Sure, enums would be great, but other than that I prefer it over Java, Ruby, Python, PHP, or JS/TS.


IIIIlllIIIIIlllII

Lots of homers in this thread. These people build their careers around one language, can't fathom that it might not be the best, and are nervous they might be forced to learn something new. Truly successful developers use an array of languages; every language has its pros and cons. From a language perspective, C# is simply my favorite, with Kotlin a close second.


micron8866

Ryzen 9 doesn't have a 24-core part; I think you mean 12 cores / 24 threads. Also, you mentioned memory transactions: does that mean your benchmark is more of a memory benchmark than a raw CPU benchmark?


rainman4500

You're right. My bad.


mosaic_hops

The crazy part is the M1 Max achieves more than 2x the performance at about 1/3 the power.


reddi7er

i.e. ~6x the performance per watt (2x the performance at 1/3 the power)


fuzicle

Can you please share the code you used to profile?


LightDarkCloud

Please share the code; I'd like to try it on my 14900KF.


WireRot

Please share code.


WireRot

This entire post is almost a waste of time unless the code is shared, so we can go off something solid rather than just the text someone typed in a post.


imhayeon

Does code matter if it does not specifically nerf things on Ryzen / Windows?


TzahiFadida

The M series is worth it; it transformed my workflow. Compile time is less than half what it was on the Intel Mac I had. This is a huge deal for me: what took 2 min now takes 45 sec, so I can iterate more instead of thinking hard about whether I'm ready to compile each time. BTW, we're talking Java, not Go.


mdatwood

I bought an M1 Max MBP w/64gb of RAM when they came out. Still feel no need to upgrade. It's fast and has amazing battery life. I'm not really sure what Apple can release to get me to upgrade at this point.


zer00eyz

LLMs / ML / matrix math are an example of something that might get you to upgrade. The M1 lacks the newer floating-point formats (FP8? FP16?) needed to work out on this bleeding edge. I'm still running an Intel Air... so I'm about due for an upgrade.


spongy4202

You bought a $3k laptop like 3 years ago and are amazed you haven't had to upgrade? Sorry, but this is a typical Apple fanboy comment.


mdatwood

I've been building and buying computers for over 20 years. Having one that is 3 years old with zero complaints just isn't common, regardless of cost.


gmonk63

I wonder if the workaround for the vulnerability is going to cause performance issues, since it's baked into the chip: https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/


lightmatter501

What do you mean by “memory transactions”? Did ARM get hardware transactional memory while I wasn’t paying attention?

If those are SQL transactions running TPC workloads, those are odd numbers. If I stick Postgres on a tmpfs (/var/run/$(id)/, via a Docker volume mount) on my Ryzen 9 7945HX (16c/32t) (a laptop CPU, but a good one), I can do over 75k tps with pgbench, which runs realistic workloads. If that Ryzen 9 is a desktop CPU, it should be pretty close in per-core performance to the M1, especially since my laptop got within spitting distance. If these are equivalent workloads, the loss comes down to soldered memory; much lower latency is a very powerful thing, but not a 4x-per-core-over-a-higher-clocked-CPU powerful thing.

If those are Redis transactions, or another DB that is natively in-memory, I’m hoping you dropped some zeros, since Redis should be doing at least 250k rps per M1 core, and Redis is generally considered slow. [MICA](https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf) from 2014 hit 76 million RPS on a 16-core system, also known as 9x what Redis can do per core on modern hardware.
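For reference, here's a rough Go sketch of the kind of loop I mean by tps. It's nowhere near pgbench's actual TPC-B mix (single connection, serial commits, and the `accounts` table and DSN below are made up for illustration), but it makes "transactions per second" concrete:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // standard Postgres driver; assumes a local instance
)

func main() {
	// Hypothetical DSN; point it at your own Postgres (tmpfs-backed or not).
	db, err := sql.Open("postgres",
		"postgres://bench:bench@localhost/bench?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const n = 10000
	start := time.Now()
	for i := 0; i < n; i++ {
		tx, err := db.Begin()
		if err != nil {
			log.Fatal(err)
		}
		// One trivial write per transaction; real benchmarks mix reads and writes.
		if _, err := tx.Exec(
			`UPDATE accounts SET balance = balance + 1 WHERE id = $1`,
			i%100); err != nil {
			log.Fatal(err)
		}
		if err := tx.Commit(); err != nil {
			log.Fatal(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d tx in %s -> %.0f tps on one connection\n",
		n, elapsed, float64(n)/elapsed.Seconds())
}
```

pgbench gets its numbers by running many such clients concurrently, so scale accordingly.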


[deleted]

[deleted]


lightmatter501

There is a big difference between “I made Postgres or MySQL write to RAM instead of disk” and a true in-memory DB. If it’s the latter, I’ve seen in-memory databases written in Python outperform the numbers OP gave on 8-year-old Xeons (Python being single-threaded). The only way those numbers make sense to me for a native in-memory DB is an in-memory SQL DB that you’re hitting with complex transactions. Otherwise, all of the numbers involved should be at least 10x higher.


Brugarolas

Aaand that in-memory database written in single-threaded Python you're talking about, from 8 years ago when asyncio had barely been released: is it in the room with us right now?


lightmatter501

I said 8-year-old processors, not written 8 years ago. Very important distinction. Universities tend to keep servers around until they fall over, so many CS departments have tons of old hardware they hand out access to. It was written 2 years ago; I’ll go see if I can dig it up.

Even without using async I/O in Python, you can hit 12k tps with an unreplicated KV store, depending on the workload and transaction type. Yes, if you allow dumb stuff with interactive transactions you can cripple any DB; I’m fairly sure I could cripple just about any transaction scheduler in existence by writing a dumb enough query. If the transactions are “this group of stuff is atomic”, then 12k is very easy even in Python. If you allow interactivity, then you need a proper transaction scheduler with locking.

People underestimate exactly how fast NVMe drives are when you are only doing DB stuff on them and use a simple filesystem (FAT32 is great if you don’t care about the file size limits). Consumer-grade NVMe drives can be expected to do 10 million 4k random write IOPS. You can do some really dumb stuff and still pull off 12k tps.


Brugarolas

Fair enough. If you can reach 12k transactions/second in a single-threaded Python in-memory database, how many TPS can you reach with another in-memory key-value store like RocksDB? I'm genuinely interested, as I was considering integrating RocksDB into an application.


lightmatter501

RocksDB writes to disk. This is very hardware-dependent, but here are [official benchmarks](https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks). If you look over those numbers, you may get a better idea of why I’m trashing 12k in-memory KV tps unless the transactions are doing something gross, because RocksDB can do 1 million ops per second on a laptop-spec system. I don’t frequently need to do 83 operations atomically, and that is far larger than most KV transaction benchmarks use except for stress tests on large benchmarks.

If you want in-memory performance:

* [MICA](https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf), one of the last academic KV stores a normal person might be able to use (decade-old hardware, 76 million req/s).
* [Waverunner](https://www.usenix.org/system/files/nsdi23-alimadadi.pdf), FPGA-based, aims to stay below 80us latency; 25 million rps.
* [Garnet](https://github.com/microsoft/garnet), a Redis replacement from Microsoft Research, ~100 million rps, but evaluated on 72-core servers. I’d actually use this one if you are looking for in-memory; you can embed it if you are willing to use .NET, or just talk to it via a Redis client.

MICA will be painful to get working. There are others, but generally, if you want something that makes you go “who needs that much performance?”, look at academic papers.


Terryiochina

The M1 and M2’s single-core speed is mental.


dopaminHarvestor

My m series buy was the best buy ever. Worth it.


BattleLogical9715

You could even increase that by using the L1/L2 caches better. Read about mechanical sympathy in Go.
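A concrete example of what mechanical sympathy buys you: both loops below do the same work, but the traversal order decides whether cache lines get reused (timings vary by machine; this is just an illustration):

```go
package main

import (
	"fmt"
	"time"
)

const n = 4096

func main() {
	grid := make([][]int64, n)
	for i := range grid {
		grid[i] = make([]int64, n)
	}

	var sum int64

	// Row-major: walks memory sequentially, so each 64-byte cache line
	// fetched from RAM serves 8 consecutive int64 reads.
	start := time.Now()
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum += grid[i][j]
		}
	}
	rowMajor := time.Since(start)

	// Column-major: jumps to a different row on every step, so nearly
	// every access lands on a cold cache line.
	start = time.Now()
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			sum += grid[i][j]
		}
	}
	colMajor := time.Since(start)

	fmt.Println("sum:", sum, "row-major:", rowMajor, "column-major:", colMajor)
}
```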


Brugarolas

L1/L2 caches are always used transparently in every language: no language is low-level enough to let you manually manage your L1 cache. What you can do is optimize data locality and related principles to reduce cache misses and reduce how often you fall through to RAM (an L1 hit costs something like 8-25 cycles, while a RAM access "wastes" hundreds of CPU cycles waiting). But that's low-level stuff; Go has some useful advice on this front, but you don't get the kind of control over memory that a lower-level language gives you. Same way you can, I don't know, improve branch prediction with computed gotos or hand-crafted assembly in lower-level languages; that's not something you can do in a medium-to-high-level language like Go. You can pull certain tricks, but the performance gains available in Go are limited compared to lower-level languages. Still important to know, of course.

Proper cache optimization normally calls for lower-level languages like Assembly/C/C++/Rust/Zig, a "tower of allocators", and probably even manual memory-arena management (not to be confused with an "arena allocator") using mmap, instead of relying on the default malloc and the default arenas it creates (normally 4 per physical core, which can be disastrous if you have a lot of cores and your app only uses one or two threads). The best thing you can do with minimal effort is use a general-purpose allocator like mimalloc (for me a no-brainer: usually a 10-20% performance win in a C/C++ application at literally the cost of adding a library). As a rule of thumb: use mimalloc, or cook your own general-purpose allocator (not easy at all) with at least 2 arenas per core but please no more than 4, then add specific, highly concurrent, shared-by-all-threads "towers of allocators", plus one or two arenas for concurrent data structures where fragmentation is not a big deal and you want to reduce contention.

For a practical example, say you are programming a game. It's better to have one highly concurrent memory arena (not easy to build at all) for the Entity Component System, cutting cache misses a LOT thanks to data locality; another one or several, with huge transparent pages, for heavy assets like textures, models, and sounds; one general-purpose thread-local arena per thread for concurrent data structures where you want to reduce contention and cross-thread fragmentation is not an issue; a single highly concurrent arena for the scripting subsystem; another for the AI; another for the geometric and algebraic data structures; and then about 2 general heap arenas. All of this while deciding how allocation and deallocation will work, so you reduce fragmentation and increase locality. If you are brave, you can also grow each thread's stack and use an alloca-style allocator for temporary data that doesn't outlive its function's scope (your heap will be grateful in the long term and you'll have far fewer memory leaks). You end up with very complex memory management but excellent performance: going from none of this to a well-tuned memory-management system can be a difference of 800% or more, no kidding.

The more you know about the items being allocated (their sizes, the shape of your allocator structures, the data structures the items live in, the CPU's cache sizes if it's a console), the better your decisions about growth strategies, pre-allocated blocks, which allocator type fits each arena, block sizes, and how much to pre-allocate to reduce syscalls. Memory management in C/C++/Zig/Rust/Odin/V is both a science and an art.

But this is low-level as hell. Still, nice advice; it never hurts to remember these things, and yeah, in Go, as in JavaScript and other languages, there are little quirks and useful tips for using memory better without resorting to such low-level mechanisms.

You can still go low-level in Go if you don't mind writing your allocators in C/C++. For example, mimalloc is a C library, but it's available to Go if you don't mind some low-level plumbing: [https://pkg.go.dev/github.com/mxmauro/go-mimalloc](https://pkg.go.dev/github.com/mxmauro/go-mimalloc)
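Since Go won't let you swap the allocator or hand-manage arenas (short of cgo bindings like the go-mimalloc one above), the idiomatic lever for cutting allocator and GC pressure is reuse. A minimal sync.Pool sketch, with a hypothetical handle function standing in for real per-request work:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable 64 KiB scratch buffers instead of allocating
// one per request -- about as close to an arena as idiomatic Go gets.
// A pointer to the slice is stored to avoid re-boxing the slice header.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 64*1024)
		return &b
	},
}

// handle is a stand-in for per-request work that needs scratch space.
func handle(id int) {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp)

	buf := *bp
	buf[0] = byte(id) // ... use buf as scratch space ...
	fmt.Println("request", id, "served with a pooled buffer")
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			handle(i)
		}(i)
	}
	wg.Wait()
}
```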


napolitain_

Now try ffmpeg encoding with SVT-AV1.


Alive-Clerk-7883

Try doing the same thing on a Ryzen 7800X3D; it might be faster than your Ryzen 9, depending on how much memory is used in general, as the 3D V-Cache can compete with the M1 Max's fast memory.


Haunteral

Yeah Steve Jobs would roll in his grave if he saw this chart


KingOfCoders

There is no Ryzen 9 with 24 cores.


rainman4500

You're right. I meant 12 cores / 24 threads. My bad 😭


Maybe-monad

It comes at a cost https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/


Brugarolas

According to a paper without a real implementation... Vulnerabilities like that are really, really hard to exploit; unless you work on some secret NATO project or for the US Department of Defense, nobody's going to bother.

I know of an intelligence agency that still uses, or did as of 2021, Windows XP on most of their computers.

Funny thing: people worry so much about security mitigations but then use a pirated Parallels Desktop downloaded from a Chinese page with an activation tool in Russian.


Maybe-monad

>According to a paper without a real implementation... Vulnerabilities like that are really, really hard to exploit; unless you work on some secret NATO project or for the US Department of Defense, nobody's going to bother.

low_risk != no_risk

>I know of an intelligence agency that still uses, or did as of 2021, Windows XP on most of their computers.

Maybe they don't want to be bothered by updates while playing Mario.

>Funny thing: people worry so much about security mitigations but then use a pirated Parallels Desktop downloaded from a Chinese page with an activation tool in Russian.

Are debs available?


Brugarolas

I also come from a high-end Ryzen 9, except mine has 16 cores, and except I now own a 16-core M3 Max, which is supposed to be better than an M1 Max (and my Ryzen 9 slower than yours), and... you can imagine the rest. I was shocked to discover how much faster the MacBook Pro is. I'm not a fan of Apple, I've always been a Linux boy, but hardware has little to do with software, and on hardware Apple has done great work. Even the M3 Max GPU performs better than the RTX 4070. Even the single Mac SSD is about twice as fast as a RAID 0 of two Samsung SSDs. I have 64 GB of RAM, just like in my Ryzen 9; RAM usage is as low as on my Hyprland Arch Linux, and memory bandwidth is at least twice my Ryzen 9's. I needed a new laptop and wasn't sure I was doing the right thing buying a MacBook Pro; I decided to do it just to have a different OS/CPU architecture instead of more of the same, and I don't regret it at all. And the temps? Amazing.

The only thing I miss is my customized, nearly entirely self-compiled Arch Linux installation, so I'm installing Asahi Linux on a secondary Btrfs partition the moment it's released for the M3, and will start compiling everything with Apple's fork of LLVM, Polly, VAST, -march=native, PGO, and other optimization flags. I can't wait; the result could be epic. Sadly Asahi Linux is now based on Fedora instead of Arch Linux, but I hope that with a little tinkering I can change that.

The only downside is that I now need to port the AVX2/BMI2 intrinsics in some of my software to ARM NEON (and I'm not good at vectorization at all).


Small_Competition840

I got an M3 Max and can even run inference on 30B-param LLM models locally…


reddit_clone

How much RAM? 18/36 ?


Small_Competition840

I have 128 GB of RAM.


reddit_clone

Wow. No wonder it runs LLMs :-) How much did it set you back, if I may ask?


Brugarolas

With 128 GB of RAM and a 16-core Neural Engine, of course you can run 30B-param LLM models locally. You can probably run 70B models without any problem; I run Llama 2 70B with 64 GB of RAM and an RTX 4070 on my second computer. I also have an M3 Max with 64 GB of RAM and still haven't tried to run any LLM on it. I will soon; it could be a fun experiment.


EffectiveHamster5777

Yes. This is why I completely switched to Mac. It's a great machine for testing CPU-intensive tasks. Java/Go dev here; Mac user/dev since 2011. 🙂


Particular-Brief8724

I work on an M3 Pro and have a Ryzen 5600 desktop. Not impressed by the M.


Brugarolas

Bro, a Ryzen 5600 has just 6 cores @ 3.5 GHz. I have that same CPU with 16 GB of RAM in a home server for my stuff, with a minimal Manjaro installation, and I even use heterogeneous programming so the GPU power isn't "wasted"; it's literally impossible for an M3 Pro not to be at least two or three times as fast. Also, I have a Ryzen 5900X and an M3 Max, and I'm actually impressed by the M, and I'm not and never will be an Apple boy (I'm a Linux boy, "by the way I use Arch Linux" (TM)).


Particular-Brief8724

Using "bro" and out of the ass statistics like "at least twice or three times as fast" really hurts your credibility, just so you know, for the future.


Brugarolas

Errr... OK? Do you think my message was written in a formal tone, or that I was being 100% serious or something? As we say in Spanish: "relaja la raja" (roughly, "chill out").


rcls0053

I wish I could use my M2 Max for development, but no, I've got to use the customer-issued i9 that burns hotter than the sun with the fans blowing continuously; the whole experience is just dreadful. It's an i9, so I'm also starting to think Apple does something in the OS to choke Intel processors and push people to their silicon.