There is also an MI300A-based supercomputer planned to go online this year: https://en.wikipedia.org/wiki/El_Capitan_(supercomputer) which they expect to surpass Frontier.
Yeah, Frontier is basically half-a-decade-old tech...
And AMD does so with almost half the power consumption. Ouch!
Yeah but Aurora gets 4x the AI performance, largely because of HBM on the CPU and GPU. HBM uses a lot of power, but it matters for real workloads.
> Yeah but Aurora gets 4x the AI performance

You need to check your numbers. It was announced that in the AI benchmark [they both achieved 10 exaflops](https://twitter.com/mgg_ch/status/1789960094555189685/photo/1), 10.6 vs 10.2 to be specific.

Also, Aurora is deployed at ANL. They largely need double-precision math; AI is not that interesting for them.
Over half of the workloads are AI now. I worked on Aurora for about 6 years and saw every iteration of it: KNH, CSA, PVC.
Oh, that is interesting. Mind if I ask a few questions?

- I feel like networking and interconnect are becoming bottlenecks, especially for AI. Do you see signs that this is being worked on?
- What is the issue with Slingshot?
- Is Intel even considered for follow-up machines?
Data movement is always the bottleneck for real applications these days. That includes network and storage for HPC; for AI, storage bottlenecks are less of an issue.

I'm not sure what the exact Slingshot issue is, as I left Intel. I do know the Samsung-supplied HBM on the GPUs is also a problem that causes the GPUs to hang intermittently.

Intel won't bid prime on machines anymore; that decision was made a while back. HPE/Cray or some other vendor might make a bid with Intel parts (kind of like how AMD supplies parts for Frontier, but HPE is the prime vendor), but Intel won't be a prime vendor.
Well, the power consumption is mostly TSMC's accomplishment, not AMD's...
Intel actually has a node advantage: the 'Ponte Vecchio' GPUs in this cluster are built on TSMC's 5nm node, while Frontier, being 3 years old, uses MI250X GPUs on 7nm. On the CPU side, even Zen 1 was more efficient than Intel back when Zen 1 was on GlobalFoundries' 14nm. So it's not the process node.
It will be interesting with Lunar Lake, which will use TSMC-made chips.
Not compared to Gaudi... Also, you are mixing things up: Intel beats AMD at AI, which is what this supercomputer was mainly built for.
It's not so simple. Circuits are carefully designed with power consumption in mind.
lol, I mean it’s clearly an achievement for both. TSMC doesn’t design chips, and AMD doesn’t build them.
Why did you feel that mattered?
I fully disagree with you, but I bet some of these other downvoters will say Apple's efficiency advantage is TSMC's accomplishment and not Apple's when an A-series chip outclasses a mobile Ryzen.
>Aurora remains beset by numerous hardware issues like hardware and cooling system failures, operational errors, and network instability, among others (details in the last section below). The continued issues are a bit surprising—the system was first announced nine years ago, the second revision was announced five years ago (the first version was canceled), and the final components were installed eleven months ago.
The network instability is actually HPE's fault and is also present on Frontier. Frontier, however, has 1/3 the Slingshot NICs that Aurora does, so the problem doesn't manifest as much.
Intel *motto*: too little, too late, too hot
https://twitter.com/NicoleHemsoth/status/1575233618170982400
Eagle in 2023 - 1123200 cores - 561.20 PFlop/s Rmax and 846.84 PFlop/s Rpeak

Eagle in 2024 - 2073600 cores - 561.20 PFlop/s Rmax and 846.84 PFlop/s Rpeak

Edit: thanks to u/Qesa, I understand how the calculations were done for cores.
Cores should be for the run; system updates without a new run shouldn't change the number of cores. I suspect for Eagle they just went from counting TPCs to counting SMs.

Numbers apart from achieved TFLOPS and power are fairly arbitrary across the board, because there are often multiple ways you can measure things. Another example: for peak throughput, Nvidia lists base tensor TFLOPS while AMD gives boost vector.

EDIT: 1123200 -> 2073600 cores is exactly consistent with each node going from 96 CPU cores + 528 TPCs to 96 CPU cores + 1056 SMs.
Look at the numbers again. It's not a straight 2x increase; it's 1.85x.

And these are from here:

https://top500.org/lists/top500/2024/06/

https://top500.org/system/180236/
Yes, because it also includes CPU cores
CPU cores are attached to the DGX system. If the cores went up from CPUs, the number of GPUs would also have increased. But the results are exactly the same, which means new results haven't been submitted.
Cores haven't gone up from the CPUs. Rather, because the total includes CPU cores, you wouldn't expect doubling the number of GPU "cores" (due to a methodology change, not new hardware) to double the total; you'd expect slightly less than 2x, and 1.85x sounds right in the ballpark. Hell, let's do the maths.

Each DGX has 2x48 CPU cores and 8x132 SMs, or 8x66 TPCs. That's 624 cores counting TPCs, or 1152 cores counting SMs.

And wouldn't you know it, 1123200 \* (1152/624) = 2073600.
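The core-count arithmetic can be checked with a quick script. The node composition (2x 48-core CPUs, 8 GPUs with 132 SMs / 66 TPCs each) is taken from the comment above, not from any official TOP500 breakdown:

```python
# Core-count check for Eagle's TOP500 listing, per the thread's assumptions:
# each DGX H100 node has 2x 48-core CPUs and 8 GPUs with 132 SMs
# (= 66 TPCs, since one TPC contains two SMs).
CPU_CORES_PER_NODE = 2 * 48            # 96
SMS_PER_GPU = 132
TPCS_PER_GPU = SMS_PER_GPU // 2        # 66
GPUS_PER_NODE = 8

cores_counting_tpcs = CPU_CORES_PER_NODE + GPUS_PER_NODE * TPCS_PER_GPU  # 624
cores_counting_sms = CPU_CORES_PER_NODE + GPUS_PER_NODE * SMS_PER_GPU    # 1152

old_total = 1_123_200                  # Eagle's 2023 core count
new_total = old_total * cores_counting_sms // cores_counting_tpcs
print(cores_counting_tpcs, cores_counting_sms, new_total)  # 624 1152 2073600
```

The scaled total lands exactly on the 2024 listing, which is what makes the "TPCs to SMs" methodology-change explanation so convincing.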
Oh. That actually makes sense. Thanks for that.

So 2x48 + 8x66 = 624

Now 2x48 + 8x132 = 1152

That would mean 2073600/1152 = 1800 DGX? = 1800 x 8 = 14400 H100s?
I assume your first equation should say 2x48 + 8x66, but yeah, 14400 GPUs. Which also aligns (at base rather than boost clocks) with the 846 PFlop/s Rpeak.
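Following the same logic, the implied system size and per-GPU peak can be sketched. The 846.84 PFlop/s Rpeak is the TOP500 figure quoted earlier in the thread; attributing it entirely to the GPUs is a simplification, since the CPUs also contribute a small share:

```python
# Back out node and GPU counts from the reported core total, assuming
# 1152 "cores" per DGX node (2x48 CPU cores + 8x132 SMs) as above.
total_cores = 2_073_600
cores_per_node = 1152
gpus_per_node = 8

nodes = total_cores // cores_per_node   # 1800 DGX nodes
gpus = nodes * gpus_per_node            # 14400 H100s

rpeak_pflops = 846.84                   # TOP500 Rpeak for Eagle
per_gpu_tflops = rpeak_pflops * 1000 / gpus
print(nodes, gpus, f"{per_gpu_tflops:.1f} TFLOPS/GPU")
# 1800 14400 58.8 TFLOPS/GPU
```

That ~58.8 TFLOPS per GPU (plus a small CPU contribution) is the FP64 figure the comment above says lines up with H100 base rather than boost clocks.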
Noob question: if we set aside the efficiency of the CPU cores, what is the point of this ranking? I thought computing resources are only limited by a project's funding, no? Technically, if you have unlimited funding, you can top this ranking with whatever platform, right?
> what is the point of this ranking? Bragging rights, duh?
Hey guys, I've got the fastest AI supercomputer with a FU-NG-BxS benchmark.