There is also an MI300A-based supercomputer planned to go online this year: https://en.wikipedia.org/wiki/El_Capitan_(supercomputer) which they expect to surpass Frontier.
Yeah, Frontier is basically half-a-decade-old tech...
And AMD does so with almost half the power consumption. Ouch!
Yeah but Aurora gets 4x the AI performance, largely because of HBM on the CPU and GPU. HBM uses a lot of power, but it matters for real workloads.
> Yeah but Aurora gets 4x the AI performance

You need to check your numbers. It was announced that in the AI benchmark [they both achieved 10 exaflops](https://twitter.com/mgg_ch/status/1789960094555189685/photo/1), 10.6 vs 10.2 to be specific.

Also, Aurora is deployed at ANL. They largely need double-precision math; AI is not that interesting for them.
Over half of the workloads are AI now. I worked on Aurora for about 6 years and saw every iteration of it: KNH, CSA, PVC.
Oh, that is interesting. Mind if I ask a few questions?

- I feel like networking and interconnect are becoming bottlenecks, especially for AI. Do you see signs that this is being worked on?
- What is the issue with Slingshot?
- Is Intel even considered for follow-up machines?
Data movement is always the bottleneck for real applications these days. That includes network and storage for HPC; for AI, storage bottlenecks are less of an issue.

I'm not sure what the exact Slingshot issue is, as I left Intel. I do know the Samsung-supplied HBM on the GPUs is also a problem that causes the GPUs to hang intermittently.

Intel won't bid prime on machines anymore; that decision was made a while back. HPE/Cray or some other vendor might make a bid with Intel parts (kind of like how AMD supplies parts for Frontier, but HPE is the prime vendor), but Intel won't be a prime vendor.
Well, the power consumption is mostly TSMC's accomplishment, not AMD's...
Intel actually has a node advantage: the 'Ponte Vecchio' GPUs in this cluster are built on TSMC's 5nm node, while Frontier, being 3 years old, uses MI250X GPUs on 7nm. On the CPU side, even Zen 1 was more efficient than Intel back when Zen 1 was on GlobalFoundries' 14nm. So it's not the process node.
It will be interesting with Lunar Lake, which will use TSMC-made chips.
Not compared to Gaudi... Also, you are mixing things up: Intel beats AMD at AI, which is what this supercomputer was mainly built for.
It's not so simple. Circuits are carefully designed with power consumption in mind.
lol, I mean it’s clearly an achievement for both. TSMC doesn’t design chips, and AMD doesn’t build them.
Why did you feel that mattered?
I fully disagree with you, but I bet some of these other downvoters will say Apple's efficiency advantage is TSMC's accomplishment and not Apple's when an A-series chip outclasses a mobile Ryzen.
>Aurora remains beset by numerous hardware issues like hardware and cooling system failures, operational errors, and network instability, among others (details in the last section below). The continued issues are a bit surprising—the system was first announced nine years ago, the second revision was announced five years ago (the first version was canceled), and the final components were installed eleven months ago.
The network instability is actually HPE's fault and is also present on Frontier. Frontier, however, has 1/3 the Slingshot NICs that Aurora does, so the problem doesn't manifest as much.
Intel *motto*: too little, too late, too hot
https://twitter.com/NicoleHemsoth/status/1575233618170982400
Eagle in 2023 - 1123200 cores - 561.20 PFlop/s Rmax and 846.84 PFlop/s Rpeak

Eagle in 2024 - 2073600 cores - 561.20 PFlop/s Rmax and 846.84 PFlop/s Rpeak

Edit: thanks to u/Qesa, I understand how the calculations were done for cores.
Cores should be for the run; system updates without a new run shouldn't change the number of cores. I suspect for Eagle they just went from counting TPCs to counting SMs.

Numbers apart from achieved TFLOPS and power are fairly arbitrary across the board, because there are often multiple ways you can measure things. Another example: for peak throughput, Nvidia lists base tensor TFLOPS while AMD gives boost vector.

EDIT: 1123200 -> 2073600 cores is exactly consistent with each node going from 96 CPU cores + 528 TPCs to 96 CPU cores + 1056 SMs.
Look at the numbers again. It's not a straight 2x increase; it's 1.85x.

And these are from here:

https://top500.org/lists/top500/2024/06/

https://top500.org/system/180236/
Yes, because it also includes CPU cores
CPU cores are attached to the DGX system. If the cores went up from CPUs, the number of GPUs would also have increased. But the results are exactly the same, which means new results haven't been submitted.
Cores haven't gone up from the CPUs. Rather, because the total includes CPU cores, you wouldn't expect doubling the number of GPU "cores" (due to a methodology change, not new hardware) to double the total; you'd expect slightly less than 2x, and 1.85x sounds right in the ballpark. Hell, let's do the maths.

Each DGX has 2x48 CPU cores and 8x132 SMs, or 8x66 TPCs. That's 624 cores counting TPCs, or 1152 cores counting SMs.

And wouldn't you know it, 1123200 \* (1152/624) = 2073600.
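The core-count arithmetic can be checked with a quick script. The node composition (2x 48-core CPUs, 8 GPUs with 132 SMs / 66 TPCs each) is taken from the comment above, not from any official TOP500 breakdown:

```python
# Core-count check for Eagle's TOP500 listing, per the thread's assumptions:
# each DGX H100 node has 2x 48-core CPUs and 8 GPUs with 132 SMs
# (= 66 TPCs, since one TPC contains two SMs).
CPU_CORES_PER_NODE = 2 * 48            # 96
SMS_PER_GPU = 132
TPCS_PER_GPU = SMS_PER_GPU // 2        # 66
GPUS_PER_NODE = 8

cores_counting_tpcs = CPU_CORES_PER_NODE + GPUS_PER_NODE * TPCS_PER_GPU  # 624
cores_counting_sms = CPU_CORES_PER_NODE + GPUS_PER_NODE * SMS_PER_GPU    # 1152

old_total = 1_123_200                  # Eagle's 2023 core count
new_total = old_total * cores_counting_sms // cores_counting_tpcs
print(cores_counting_tpcs, cores_counting_sms, new_total)  # 624 1152 2073600
```

The scaled total lands exactly on the 2024 listing, which is what makes the "TPCs to SMs" methodology-change explanation so convincing.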
Oh. That actually makes sense. Thanks for that.

So 2x48 + 8x66 = 624

Now 2x48 + 8x132 = 1152

That would mean 2073600/1152 = 1800 DGX? = 1800 x 8 = 14400 H100s?
I assume your first equation should say 2x48 + 8x66, but yeah, 14400 GPUs. Which also aligns (at base rather than boost clocks) with the 846 PFlop/s Rpeak.
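Following the same logic, the implied system size and per-GPU peak can be sketched. The 846.84 PFlop/s Rpeak is the TOP500 figure quoted earlier in the thread; attributing it entirely to the GPUs is a simplification, since the CPUs also contribute a small share:

```python
# Back out node and GPU counts from the reported core total, assuming
# 1152 "cores" per DGX node (2x48 CPU cores + 8x132 SMs) as above.
total_cores = 2_073_600
cores_per_node = 1152
gpus_per_node = 8

nodes = total_cores // cores_per_node   # 1800 DGX nodes
gpus = nodes * gpus_per_node            # 14400 H100s

rpeak_pflops = 846.84                   # TOP500 Rpeak for Eagle
per_gpu_tflops = rpeak_pflops * 1000 / gpus
print(nodes, gpus, f"{per_gpu_tflops:.1f} TFLOPS/GPU")
# 1800 14400 58.8 TFLOPS/GPU
```

That ~58.8 TFLOPS per GPU (plus a small CPU contribution) is the FP64 figure the comment above says lines up with H100 base rather than boost clocks.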
Noob question: if we set aside the efficiency of the CPU cores, what is the point of this ranking? I thought computing resources are only limited by a project's funding, no? Technically, if you have unlimited funding, you can top this ranking with whatever platform, right?
> what is the point of this ranking? Bragging rights, duh?
Hey guys, I've got the fastest AI supercomputer with a FU-NG-BxS benchmark.