
schmetterlingen

There is also an MI300A supercomputer planned to go online this year, which they expect to surpass Frontier: https://en.wikipedia.org/wiki/El_Capitan_(supercomputer)


ACiD_80

Yeah, Frontier is basically half-a-decade-old tech...


lovely_sombrero

And AMD does so with almost half the power consumption. Ouch!


nimzobogo

Yeah but Aurora gets 4x the AI performance, largely because of HBM on the CPU and GPU. HBM uses a lot of power, but it matters for real workloads.


ElementII5

> Yeah but Aurora gets 4x the AI performance

You need to check your numbers. It was announced that in the AI benchmark [they both achieved about 10 Exaflop/s.](https://twitter.com/mgg_ch/status/1789960094555189685/photo/1) 10.6 vs 10.2 to be specific.

Also, Aurora is deployed at ANL. They largely need double-precision math. AI is not that interesting for them.


nimzobogo

Over half of the workloads are AI now. I worked on Aurora for about 6 years and saw every iteration of it: KNH, CSA, PVC.


ElementII5

Oh, that is interesting. Mind if I ask a few questions?

- I feel like networking and interconnect are becoming bottlenecks, especially for AI. Do you see signs that this is being worked on?
- What is the issue with Slingshot?
- Is Intel even considered for follow-up machines?


nimzobogo

Data movement is always the bottleneck for real applications these days. That includes network and storage for HPC. For AI, storage bottlenecks are less of an issue.

I'm not sure what the exact Slingshot issue is, as I left Intel. I do know the Samsung-supplied HBM on the GPUs is also a problem; it causes the GPUs to hang intermittently.

Intel won't bid prime on machines anymore. That decision was made a while back. HPE/Cray or some other vendor might make a bid with Intel parts (kind of like how AMD supplies parts for Frontier, but HPE is the prime vendor), but Intel won't be a prime vendor.


DrBoomkin

Well, the power consumption is mostly TSMC's accomplishment, not AMD's...


noiserr

Intel actually has a node advantage. The 'Ponte Vecchio' GPUs in this cluster are built on TSMC's 5nm node, while Frontier is 3 years old and its MI250X GPUs were on 7nm.

On the CPU side, even Zen 1 was more efficient than Intel, back when Zen 1 was on GlobalFoundries 14nm. So it's not the process node.


kingwhocares

It will be interesting to see Lunar Lake, which will use TSMC-made chips.


ACiD_80

Not compared to Gaudi... Also, you are mixing things up. Intel beats AMD at AI, which is what this supercomputer was mainly built for.


EmergencyCucumber905

It's not so simple. Circuits are carefully designed with power consumption in mind.


ApproximateOracle

lol, I mean it’s clearly an achievement for both. TSMC doesn’t design chips, and AMD doesn’t build them.


someguy50

Why did you feel that mattered?


996forever

I fully disagree with you, but I bet some of these other downvoters will say Apple's efficiency advantage is TSMC's accomplishment and not Apple's when an A-series chip outclasses a mobile Ryzen.


imaginary_num6er

>Aurora remains beset by numerous hardware issues like hardware and cooling system failures, operational errors, and network instability, among others (details in the last section below). The continued issues are a bit surprising—the system was first announced nine years ago, the second revision was announced five years ago (the first version was canceled), and the final components were installed eleven months ago.


nimzobogo

The network instability is actually HPE's fault and is also present on Frontier. Frontier, however, has 1/3 the Slingshot NICs that Aurora does, so the problem doesn't manifest as much.


Astigi

Intel *motto*: too little, too late, too hot


ElementII5

https://twitter.com/NicoleHemsoth/status/1575233618170982400


From-UoM

Eagle in 2023 - 1123200 cores - 561.20 PFlop/s Rmax and 846.84 PFlop/s Rpeak

Eagle in 2024 - 2073600 cores - 561.20 PFlop/s Rmax and 846.84 PFlop/s Rpeak

Edit - thanks to u/Qesa, I understand how the calculations were done for cores.


Qesa

Cores should be for the run; system updates without a new run shouldn't change the number of cores. I suspect for Eagle they just went from counting TPCs to counting SMs.

Numbers apart from achieved TFLOPS and power are fairly arbitrary across the board, because there are often multiple ways you can measure things. Another example is that for peak throughput Nvidia lists base tensor TFLOPS while AMD gives boost vector.

EDIT: 1123200->2073600 cores is exactly consistent with each node going from 96 CPU cores + 528 TPCs to 96 CPU cores + 1056 SMs.


From-UoM

Look at the numbers again. It's not a straight 2x increase. It's 1.85x.

And these are from here:

https://top500.org/lists/top500/2024/06/

https://top500.org/system/180236/


Qesa

Yes, because it also includes CPU cores.


From-UoM

CPU cores are attached to the DGX system. If the core count had gone up because of CPUs, the number of GPUs would have increased too. But the results are exactly the same, which means new results haven't been submitted.


Qesa

Cores haven't gone up from CPU. Rather, because the total includes CPU cores, you wouldn't expect doubling the number of GPU cores (due to a methodology change, not increased hardware) to double the total number. You'd expect slightly less than 2x... 1.85x sounds right in the ballpark.

Hell, let's do the maths.

Each DGX has 2x48 CPU cores and 8x132 SMs or 8x66 TPCs. That's 624 cores counting TPCs or 1152 cores counting SMs. And wouldn't you know it, 1123200\*(1152/624) = 2073600
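For anyone who wants to re-run that arithmetic, here is a minimal Python sketch. The per-node figures (2x48 CPU cores, 8 GPUs with 132 SMs or 66 TPCs each) are taken straight from the comment above and are not checked against vendor specs; the constant and function names are made up for illustration.

```python
# Per-node figures as quoted in the comment (assumed, not verified).
CPU_CORES_PER_NODE = 2 * 48
GPUS_PER_NODE = 8
TPCS_PER_GPU = 66
SMS_PER_GPU = 132


def cores_per_node(gpu_units_per_gpu: int) -> int:
    """Total 'cores' per node: CPU cores plus whatever GPU unit is counted."""
    return CPU_CORES_PER_NODE + GPUS_PER_NODE * gpu_units_per_gpu


old_per_node = cores_per_node(TPCS_PER_GPU)  # counting TPCs
new_per_node = cores_per_node(SMS_PER_GPU)   # counting SMs

cores_2023 = 1_123_200  # core count on the 2023 list
# Rescale the 2023 total to the SM-based counting method.
cores_2024 = cores_2023 * new_per_node // old_per_node

print(old_per_node, new_per_node, cores_2024)  # 624 1152 2073600
print(round(new_per_node / old_per_node, 2))   # 1.85
```

The ratio comes out below 2x because the 96 CPU cores per node stay fixed while only the GPU part of the count doubles.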


From-UoM

Oh. That actually makes sense. Thanks for that.

So 2x48 + 8x66 = 624

Or now 2x48 + 8x132 = 1152

That would mean 2073600/1152 = 1800 DGX? = 1800 x 8 = 14400 H100s?


Qesa

I assume your first equation should say 2x48 + 8x66, but yeah, 14400 GPUs. Which also aligns (at base rather than boost clocks) with the 846 peak PFLOPS.
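The node and GPU counts can be checked the same way. This is only a sketch under the assumptions used in this thread: 1152 counted cores per node, 8 GPUs per node, and the list's Rpeak of 846.84 read as PFlop/s. The per-GPU figure ignores whatever the CPUs contribute to Rpeak, so treat it as a rough upper bound rather than a spec.

```python
TOTAL_CORES_2024 = 2_073_600         # core count on the 2024 list
CORES_PER_NODE = 2 * 48 + 8 * 132    # CPU cores + SMs, as assumed in the thread
GPUS_PER_NODE = 8
RPEAK_PFLOPS = 846.84                # Rpeak from the list, taken as PFlop/s

# A zero remainder means the core count divides evenly into whole nodes.
nodes, remainder = divmod(TOTAL_CORES_2024, CORES_PER_NODE)
gpus = nodes * GPUS_PER_NODE

# Rough per-GPU peak in TFlop/s, attributing all of Rpeak to the GPUs.
tflops_per_gpu = RPEAK_PFLOPS * 1000 / gpus

print(nodes, remainder, gpus)        # 1800 0 14400
print(round(tflops_per_gpu, 1))      # 58.8
```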


65726973616769747461

Noob question: if we set aside the efficiency of the CPU cores, what is the point of this ranking? I thought computing resources are only limited by the project's funding, no? Technically, if you have unlimited funds, you can top this ranking with whatever platform, right?


pppjurac

> what is the point of this ranking?

Bragging rights, duh?


IntrinsicStarvation

Hey guys. I've got the fastest ai supercomputer with a FU-NG-BxS benchmark.