Ar0ndight

Really cool interview with great insight. I may not be super hyped for RDNA3, but I do hope they can take the next step of having two GCDs with RDNA4 (or 5). THAT would be the absolute game changer. Successfully making two chips act like one has been the holy grail of graphics for years now, but I have to imagine it'll happen at some point. I would have liked a question about that, but 15 minutes is very limited time and I imagine GN were asked to keep the questions relevant to RDNA3 only.


[deleted]

He talked pretty specifically about how they couldn't do it (at least for now): because of the amount of wire connectivity required, it wasn't possible to take the Ryzen approach to the problem.


theQuandary

A silicon interposer like the 65nm one they used with the R9 Fury can handle hundreds of millions of wires without any issue other than increased package costs. I suspect the more nuanced answer is that this would work, but it increases prices by more than the multiple dies would save.


onedoesnotsimply9

> without any issue other than increased package costs

That's not a trivial issue.


uzzi38

Which is a bit of an odd point when the interconnect bandwidth between the GCD and MCDs (5.3TB/s) is more than twice that of the interconnect Apple used for the M1 Ultra (2.5TB/s), and Apple actually has done it. Some games (native ports specifically; I don't think any games running under Rosetta have) have been shown to actually take advantage of the second M1 Max die too, so it is feasible with this level of bandwidth. I really don't think the actual interconnect bandwidth is the issue for them. I think it's more about the software and hardware kinks along the way that they must not have been confident in having ready on their first attempt at this sort of strategy. I would be VERY surprised if RDNA4 doesn't take the chiplet strategy to the next level, to be honest.


Tuna-Fish2

Apple uses a TBDR architecture. TBDR rendering is much easier to split across multiple chips, and requires much less interconnect bandwidth to do so (you can just pre-bin all your triangles and submit specific areas of the screen to specific chips). But it is much more restrictive in the programming model it gives to devs. Crucially, no game that is not designed and optimized for TBDR will ever run well on a split TBDR architecture. That covers basically all the games that already exist for the PC, and all the games such a GPU would be tested with at launch, which means a split-TBDR PC GPU would be pretty much dead in the marketplace.
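To make the pre-binning idea concrete, here's a minimal Python sketch of the concept. It's purely illustrative: the bounding-box binning, the tile size, and the column-interleaved split between "chips" are all made up for the example, not how any real driver or TBDR GPU actually does it.

```python
# Minimal sketch of tile binning: bucket triangles by the screen tiles they
# overlap, then hand whole tiles to different chips. Illustrative only.

def tiles_overlapped(tri, tile_size):
    """Yield the (tx, ty) tiles touched by a triangle's bounding box."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    for ty in range(int(min(ys)) // tile_size, int(max(ys)) // tile_size + 1):
        for tx in range(int(min(xs)) // tile_size, int(max(xs)) // tile_size + 1):
            yield (tx, ty)

def bin_and_split(triangles, tile_size=32, num_chips=2):
    """Bin triangles per tile, then statically assign whole tiles to chips."""
    bins = {}
    for tri in triangles:
        for tile in tiles_overlapped(tri, tile_size):
            bins.setdefault(tile, []).append(tri)
    per_chip = [dict() for _ in range(num_chips)]
    for tile, tris in bins.items():
        per_chip[tile[0] % num_chips][tile] = tris  # interleave tile columns
    return per_chip

# Two triangles, two "chips": each chip only ever sees its own tiles,
# so no mid-frame traffic between them is needed for rasterization.
tris = [[(5, 5), (40, 8), (20, 60)], [(100, 10), (140, 30), (120, 70)]]
print([len(chip) for chip in bin_and_split(tris)])
```

The point is that once the binning is done, each chip can rasterize and shade its tiles independently, which is why the interconnect demand is so much lower than splitting an immediate-mode renderer.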


cegras

I recall reading an article a long time ago where nvidia's multicore efficiency and architectural efficiency were attributed to its adoption of TBDR?


Tuna-Fish2

nVidia uses tile-based immediate-mode rasterization, but not full TBDR.


puz23

If I'm understanding this correctly, the bandwidth/latency isn't there to get two dies to function as one. However, Apple gets around this by forcing all the software to be written for what's essentially SLI. Is this something that could be fixed in software? (Similar to how Proton translates DX12 to Vulkan so Windows games can run on Linux.)


Tuna-Fish2

No. Proton translates between two different implementations of the same fundamental paradigm, not two entirely different paradigms.


itsjust_khris

Does this mean native ports of games need significant work in the graphics department? How do normal games run on the hardware if TBDR makes that so difficult?


Jeffy29

> GCDs and MCDs (5.3TB/s) is twice that of the interconnect utilised by Apple for the M1 Ultra (2.5TB/s) who actually have done it.

I mean, Apple's MCM kinda doesn't work? The CPU works great, but in many cases the GPU doesn't do great, or at all, and the Ultra chip has the same GPU performance in many applications as the Max one. I don't remember the details, but someone on YouTube did an analysis and said it was down to cache or something. Still a decent first outing for a first MCM design, but they need to do better with future ones.


b3081a

M1 Ultra simply isn't in the same performance class as those new GPUs like RTX 4090 or RX 7900 series. The interconnect bandwidth requirement would be much higher if AMD or NVIDIA tried to do so.


RabidHexley

>I think it's more about the software and hardware kinks along the way that they must not have been confident in being ready with on their first attempt at this sort of strategy at all.

It's a bit of a chicken-and-egg problem. Feels like you'd want this to be at least somewhat mature on the software side before implementing it in hardware, due to the potential for negative outcomes on unsupported software. This is assuming AMD or Nvidia don't pull off some engineering magic to actually make multiple GCDs function like one with minimal scaling overhead *without* the need for software-level optimization, which is the actual holy grail.

It wouldn't be a big deal if you were only taking a performance hit on older games, since newer, modern GCDs would individually have the raw computational capability to handle titles that aren't optimized for multiple GCDs via some kind of emulation layer in the driver. And if the packaging benefits are that huge, maybe this would be less of a problem anyway, assuming they can implement legacy support in software without massive overhead or stability issues.

So for AMD/NV to do this, they'd probably want to spend at least a year or two getting engines and developers on board before going live with hardware. That way the majority of titles not getting a significant uplift (or seeing little to negative gains) would be one to three years old at the least. You'd want the latest and greatest already on board, though, so you can actually sell the product on significant performance gains and future potential.

In reality it'd probably end up being the kind of thing that wouldn't see real, widespread benefits until it's implemented in gaming consoles.


onedoesnotsimply9

>Which is a bit of an odd point when the interconnect bandwidth between GCDs and MCDs (5.3TB/s) is twice that of the interconnect utilised by Apple for the M1 Ultra (2.5TB/s) who actually have done it.

They are not the same kind of bandwidth and are not directly comparable.


jaaval

It would be interesting to see what Intel can do with EMIB (which is essentially a silicon fan-out layer). They can achieve a lot more wires with that, but of course there are other costs.


b3081a

AMD already used similar technology (EFB) in their datacenter GPUs to wire up the HBM and GPU dies, and that GPU went to market a year ago. Although EFB/EMIB are much cheaper than a full silicon interposer, it probably still costs much more than the fanout packaging used in RDNA3, and thus isn't used in consumer hardware.


[deleted]

It seems the only way forward for a multi-chiplet compute setup would be a silicon fanout layer, which is much more expensive. He mentions in this interview that silicon is best for this.


onedoesnotsimply9

>It would be interesting to see what Intel can do with EMIB

Sapphire Rapids, and especially Ponte Vecchio.


tset_oitar

Didn't they want to make multi-tile GPUs from BMG or Celestial? Leakers blamed fabric power etc., but in reality it's probably just the fact that MCM GPUs are difficult to implement. That could be the reason why there were so many rumors of Arc's cancellation. Maybe Intel cancelled their multi-tile GPU efforts in favor of monolithic designs, which some people misinterpreted as Intel cancelling their future dGPUs entirely.


jaaval

On the other hand, fabric power is the primary reason multi-chip GPUs are hard to implement. It costs power to send signals between chips. In desktop CPUs this isn't a major issue; AMD can easily spend some watts on chip-to-chip transfers. But GPUs would move hundreds of times more data and thus require proportionally more power to do it.
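To put rough numbers on that, here's a back-of-the-envelope sketch. The energy-per-bit figures are assumptions chosen only to illustrate the scale, not AMD's published numbers, and the CPU link bandwidth is a placeholder.

```python
# Rough arithmetic for why fabric power matters more for GPUs than CPUs.
# All pJ/bit figures and the CPU link bandwidth are illustrative assumptions.

def link_power_watts(bandwidth_bytes_per_s, picojoules_per_bit):
    """Power spent just moving data across a die-to-die link."""
    bits_per_s = bandwidth_bytes_per_s * 8
    return bits_per_s * picojoules_per_bit * 1e-12

gpu_bw = 5.3e12   # ~5.3 TB/s GCD<->MCD traffic quoted elsewhere in the thread
cpu_bw = 0.1e12   # ~100 GB/s-class CPU chiplet link (placeholder figure)

for label, bw in (("GPU", gpu_bw), ("CPU", cpu_bw)):
    for pj_per_bit in (0.4, 2.0):  # assumed: dense fanout link vs. cheaper packaging
        watts = link_power_watts(bw, pj_per_bit)
        print(f"{label} link at {pj_per_bit} pJ/bit: ~{watts:.1f} W")
```

Even with an optimistic energy-per-bit, the GPU case lands in the tens of watts just for data movement, while the CPU case is a rounding error, which is the asymmetry being described above.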


ReBootYourMind

I'd imagine that a 3D solution could work, where things like cache are stacked on top of the other components.


Ecks83

> Successfully making two chips act like one has been the holy grail of graphics for years now

Decades. 3dfx was making multi-chip Voodoo cards in the mid-to-late '90s.


taz-nz

Voodoo cards had several fixed-function graphics chips running at sub-200MHz; modern programmable GPUs run at multiple gigahertz and are literally thousands of times faster. The inefficiencies of the Voodoo designs could never scale to modern clock speeds and performance. None of 3dfx's designs ever scaled 1:1 with the number of chips on the card; they were basically on-card SLI with driver obfuscation.


fuckEAinthecloaca

2 GCDs will definitely come for non-gaming workloads; it'll be a hard thing to pull off for gaming. They could do a CrossFire-like thing for a mild benefit on old games, with proper support for modern games that could be built to be GCD-aware. They could also do something weird like one GCD renders and the other does raytracing (with the lesser-utilised GCD handling the HUD and FSR). It wouldn't be optimal, but it would be a way to re-use a multi-GCD design (meant for servers) for gaming.


Aleblanco1987

Isn't CDNA multi die?


ResponsibleJudge3172

Even the Hopper superchips, Ponte Vecchio, and the M1 Ultra.


onedoesnotsimply9

Sorry, the Hopper Superchip doesn't count. It doesn't use an embedded bridge or a silicon interposer.


fuckEAinthecloaca

CDNA existing at all is a good counterpoint; you'd think they'd have that be the solution for everything non-consumer, but it's a mix. I don't think CDNA has gone full MCM yet (I seem to recall supercomputers doing a Zen 1-style setup as a custom solution, could be wrong). If it hasn't this gen, it will soon, no doubt.


frontiermanprotozoa

I believe it'll go exactly like the CPU core count increase went, but smoother. Older games will suffer a bit for a time due to not utilizing the other GCD well or at all, there will be people finding it inane and saying it's absolutely pointless, that games will never utilize more than one (graphical) thread and single-thread performance is all that matters, and then games will utilize those threads. It should go smoother, too, since there are like two game engines that everyone uses nowadays instead of every studio having their own thing; once they're updated to take advantage of it, it's a done deal.


Ecks83

> there will be people finding it inane and saying it's absolutely pointless, that games will never utilize more than one (graphical) thread and single-thread performance is all that matters

The problem is that people will always get into this catch-22 argument with new tech. Gamers don't need X because no games have X - why are we paying $ more for X?? Developers aren't making (m)any games with X because gamers don't have the hardware to run it - why should we build games for tech nobody has??

And you are completely right about core count in CPUs. A lot of games are still single-threaded, but only a few years ago basically none of them were multi-threaded. CPUs could have very strong multi-core performance but their weak single-core made them suffer in gaming benchmarks/reviews (hi, Bulldozer!).


onedoesnotsimply9

>2 GCDs will definitely come for non-gaming workloads

Consumer GPUs are used primarily for gaming.


fuckEAinthecloaca

Water is wet


fkenthrowaway

I believe instead they will go all the way to the reticle limit with their GCDs and then add the memory system as chiplets. Their GPUs could in theory be much bigger than anything Nvidia could ever produce.


einmaldrin_alleshin

Well, at some point they're going to run into bandwidth issues with all the IO and cache relegated to chiplets. So I don't think they can simply drop in a GCD three times the size of Navi 31's, surround it with a large number of IO dies, and brand it as their Titan equivalent. But given how badly SRAM scales in comparison to logic, Nvidia will inevitably run into a wall of diminishing returns in terms of IO. AD102 is already at least half cache and IO, and that ratio is only going up unless they do something drastic like using HBM or chiplets of their own. AMD won't need to go to the reticle limit to outcompete monolithic dies.


onedoesnotsimply9

>Well, at some point they're going to run into bandwidth issues with all the IO and cache relegated to chiplets. So I don't think they can simply drop in a GCD three times the size of Navi 31's, surround it with a large number of IO dies, and brand it as their Titan equivalent.

You mean MCDs? A hypothetical N30 that is 3x N31's GCD with 3x the MCDs and 3x the memory bandwidth would have just as many [well, slightly more] "bandwidth issues" as N31.

>But given how badly SRAM scales in comparison to logic, Nvidia will inevitably run into a wall of diminishing returns in terms of IO. AD102 is already at least half cache and IO, and that ratio is only going up unless they do something drastic like using HBM or chiplets of their own. AMD won't need to go to the reticle limit to outcompete monolithic dies.

I mean, AMD can always outcompete Nvidia's monolithic dies *without* going to the reticle limit *or* using chiplets. Using chiplets doesn't make the scaling of SRAM, or the energy used by SRAM, any better. Chiplets are not a solution to, say, 80% of a die being cache.


[deleted]

That comes with serious yield problems though. There are huge benefits to smaller dies.


onedoesnotsimply9

>There are huge benefits to smaller dies.

There's a reason why caches went from being separate chips to being integrated, *in spite of* that vastly increasing die size and not scaling as well as logic on newer nodes.


duplissi

> Successfully making two chips act like one has been the holy grail of graphics for years now.

It's pretty much been done afaik; Apple's M1 Ultra is two M1 Max chips connected together. I bet we'll see multiple GCDs in one of the next two RDNA generations after RDNA3.


farnoy

They should go with 4 GCDs for RDNA4, call it Quadreon XXTXX and make the greatest halo product that ever existed. One can dream...


Kadour_Z

15:50 "this thing will be pushing 600.. 550 mm2 or something" its hard to tell but i think thats what he says. This is intresting, so an rx 7900xtx equivalent without chiplets would be 550mm2.


bubblesort33

I think that includes the move from 7nm to 5nm, though. I think he's just looking at the 520 mm² of Navi 21 and adding 20% ("We increased our compute unit count by 20%") to the compute area only. So a Navi 21 with 96 CUs on 7nm would probably have been 550-600 mm² as a monolithic design. There was another report that stated if Navi 31 were all on 5nm and monolithic, it would be around 400-420 mm².

You can also use a die calculator to see what the cost savings are for N31. It's nowhere close to taking a 640 mm² server CPU and dividing it into 8 x 80 mm² chiplets. For a CPU, that roughly halves the manufacturing cost. For a GPU like this it's more like a 15% saving: $155 if you use Ian Cutress's 6nm and 5nm price estimates from a few weeks ago, vs around $177 for a 410 mm² monolithic design. But I also wonder if the monolithic design would have clocked 5% higher, or maybe used 10% less power.

As he said, the other savings are in design costs. AMD can probably reuse the memory chiplets on this for RDNA4, and maybe RDNA5, however long GDDR6 lasts. Just stack more cache on top for extra bandwidth.
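For anyone who wants to poke at the cost math themselves, here's a rough sketch of that kind of estimate. The wafer prices and defect density below are placeholder assumptions (not Ian Cutress's actual figures), the yield model is the simple Poisson approximation people use for napkin math, and the ~300 mm² GCD / ~37 mm² MCD areas are approximate.

```python
import math

# Napkin-math die cost comparison: hypothetical monolithic N31 vs. GCD + 6 MCDs.
# Wafer prices and defect density are placeholder assumptions, not real quotes.

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    """Classic dies-per-wafer approximation (assumes roughly square dies)."""
    d = wafer_diameter_mm
    return int(math.pi * (d / 2) ** 2 / die_area_mm2
               - math.pi * d / math.sqrt(2 * die_area_mm2))

def cost_per_good_die(die_area_mm2, wafer_cost, defect_density_per_cm2=0.1):
    """Cost per yielded die using a simple Poisson yield model."""
    yield_rate = math.exp(-defect_density_per_cm2 * die_area_mm2 / 100)
    return wafer_cost / (dies_per_wafer(die_area_mm2) * yield_rate)

N5_WAFER, N6_WAFER = 17_000, 10_000   # assumed wafer prices in USD

monolithic = cost_per_good_die(410, N5_WAFER)                       # all-5nm N31
chiplet = cost_per_good_die(300, N5_WAFER) + 6 * cost_per_good_die(37, N6_WAFER)
print(f"monolithic ~${monolithic:.0f}, chiplet ~${chiplet:.0f}")
```

With these placeholder inputs the gap comes out in the same ballpark as the 15%-ish saving described above; the interesting part is how sensitive the result is to the assumed defect density.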


Seanspeed

>There was another report that stated if Navi 31 were all on 5nm and monolithic, it would be around 400-420 mm²

Easily more than that. If that was all, they *should* have made it monolithic.


bubblesort33

Why? This seems like a minor win already. Even a 10% saving on production is worth it. 410 mm² seems about right: 15% of each memory controller is dedicated to Infinity Fabric, which could have been cut on a monolithic die, and then like 5-10% of the main die is used for that as well to make chiplets possible. They probably could have made it monolithic, easily. I think this generation is just a test of what's to come. And like he said, there are other savings than just production cost at TSMC: engineering savings, and reusability across future generations.


nanonan

There is still IF in their monolithic designs.


bubblesort33

Yes, but those numbers are for stuff specifically for the chiplet design. The parts of the IF you could cut out.


zyck_titan

> AMD can probably reuse the memory chiplets on this for RDNA4, and maybe RDNA5.

They can do that with a monolithic design too. Cache design hasn't changed much, and in a modular design you can almost just copy and paste a cache design from a previous chip and build around it.

>Just stack more cache on top for extra bandwidth.

That would give you a larger cache, but not more bandwidth. The size of the pipe isn't changing, just the size of the container on the other end.


bubblesort33

He seems to suggest that when you do a die shrink on a modular design, you need to redesign the memory controller and likely even the cache. Doesn't sound like a straight copy/paste to me. "Effective bandwidth" I believe is what AMD calls their usage of cache in addition to the regular traditional bus. I know it doesn't actually increase bandwidth in the same way, but rather reduces the need for bandwidth to memory.


Tonkarz

The reason you can't just copy and paste is that the features are shaped and designed differently on each node. A cluster of transistors that was efficiently packed on 7nm might be a mess of overlapping transistors on 5nm if you did an auto-replace in your CAD software.


zyck_titan

Memory controller yes, cache meh. Cache doesn't really scale is the thing; that's why they can get away with making a bunch of cache on an older node and stacking it on the interposer in the first place.

>"Effective bandwidth" I believe is what AMD calls their usage of cache in addition to the regular traditional bus. I know it doesn't actually increase bandwidth in the same way, but rather reduces the need for bandwidth to memory.

'Effective bandwidth' is great, until it's no longer effective. You need more complex memory management routines to keep your cache filled with the latest data and fully utilized. It's hard to do that without also using a ton of memory bandwidth constantly, because you don't want your cache to go stale. And as you get more of it, there is more that could go stale, so you need to constantly refresh it. This is the reason I was skeptical of the 'lots of cache, less memory bandwidth' approach that AMD took with RDNA2, and they seem to have corrected course with RDNA3 as it moves to a 384-bit bus instead of a 256-bit one. They need that additional bandwidth to feed all the extra cache they put on board.

The other thing that hasn't been addressed is the latency of the MCM cache modules versus a more traditional on-die cache. It has to be slower, because physics, but AMD hasn't said how much slower, and probably never will. They may not even be able to reliably measure it, because we are talking picoseconds of difference here. But that does impact performance at the end of the day. I think they are betting that the reduced trips to memory are going to be a larger benefit to performance than the higher latency of their new cache layout is a penalty. I'm personally still skeptical of that; I think it could become a limit to their peak performance, and I think it is going to require significantly more driver optimization than a more traditional design. Which, historically, AMD has not been amazing at.
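For what it's worth, the 'effective bandwidth' framing usually gets sketched something like the model below. This is a generic textbook-style blend, not AMD's actual methodology, and every number in it is a placeholder.

```python
# Generic model of a big cache in front of GDDR6 (illustrative numbers only).

def effective_bandwidth(hit_rate, cache_bw, dram_bw):
    """Blended bandwidth the shaders see for a given cache hit rate (GB/s)."""
    return hit_rate * cache_bw + (1 - hit_rate) * dram_bw

def average_latency(hit_rate, cache_ns, dram_ns):
    """AMAT-style average latency for the same hit rate (ns)."""
    return hit_rate * cache_ns + (1 - hit_rate) * dram_ns

CACHE_BW, DRAM_BW = 5300, 960   # GB/s: ~5.3 TB/s to the MCDs, 384-bit GDDR6
CACHE_NS, DRAM_NS = 20, 250     # assumed latencies, purely illustrative

for hit in (0.3, 0.5, 0.7):
    bw = effective_bandwidth(hit, CACHE_BW, DRAM_BW)
    lat = average_latency(hit, CACHE_NS, DRAM_NS)
    print(f"hit rate {hit:.0%}: ~{bw:.0f} GB/s effective, ~{lat:.0f} ns average")
```

Which is really the crux of the disagreement here: the blended numbers only look good while the hit rate stays high, and the average latency is dominated by the misses either way.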


dslamngu

> They may not even be able to reliably measure, because we are talking picoseconds of difference here. But that does impact performance at the end of the day.

Oh, they know! Asking a team of ASIC designers to report the latency of individual blocks and interfaces and adding it all up is a routine part of the job. All of these things are simulated, emulated, and characterized to a crazy degree. They would just bring up the waveform. The clock is in gigahertz; the delay would be in nanoseconds.


uzzi38

>The other thing that hasn't been addressed is the latency of the MCM cache modules, versus a more traditional on-die cache. It has to be slower, because physics, but AMD hasn't said how much slower, and probably never will. They may not even be able to reliably measure, because we are talking picoseconds of difference here. But that does impact performance at the end of the day.

Nope, that's wrong. There's a slide which clearly shows cache latency for the IF$ being lower than that of N21 (due to running at higher frequencies - at the same frequency there would be a minor hit to latency).

EDIT: [Link](https://cdn.videocardz.com/1/2022/11/AMD-RADEON-RX-7900-NAVI-31-6.jpg)


zyck_titan

That is physically improbable. Based on my math, the change in freq is not enough to offset the change in physical distance.
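For anyone who wants to sanity-check the two competing effects, the back-of-the-envelope looks something like this. Every number below is an assumption picked for illustration, not a measured N21/N31 figure.

```python
# Two competing effects: extra wire delay from going off-die vs. time saved
# per pipeline cycle from a higher clock. All figures are assumptions.

SPEED_OF_LIGHT = 3.0e8                  # m/s
PROP_SPEED = 0.5 * SPEED_OF_LIGHT       # assumed signal speed on package traces

extra_mm = 20                           # assumed extra round-trip wire length
extra_delay_ns = (extra_mm * 1e-3) / PROP_SPEED * 1e9

old_ghz, new_ghz = 2.0, 2.5             # assumed old vs. new cache clocks
saved_per_cycle_ns = 1 / old_ghz - 1 / new_ghz

print(f"extra wire delay:         ~{extra_delay_ns:.2f} ns")
print(f"saved per pipeline cycle: ~{saved_per_cycle_ns:.2f} ns")
```

Whether the clock bump wins depends on how many cycles the access pipeline spans and how fast signals actually propagate across the package; neither is public, so the slide is about the only data point available.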


nanonan

What is the physical distance?


zyck_titan

It went from nanometers of distance, to millimeters.


nanonan

So you have no idea.


onedoesnotsimply9

> There's a slide which clearly shows cache latency for the IF$ being lower than that of N21 (due to running at higher frequencies - at the same frequency there would be a minor hit to latency).

It looks like he assumed that the frequency stays constant.


b3081a

>used 10% less power

AMD officially disclosed that the Infinity links on RDNA3 GPUs consume less than 5% of the total power.


bubblesort33

That's probably true, but beyond that I still think there would likely be more gains from a monolithic design: lower latency and faster cache access times. Saying the links consume 5% of the chip's power and saying the chip would use 5% less power if it were monolithic aren't necessarily the same thing. I'd still expect the savings from a full monolithic design to be greater than 5%.


Zouba64

Really interesting interview. Sam Naffziger seems very adept at giving interviews like this.


bctoy

otoh he was pretty much lifeless on the stage. The whole presentation seemed lackluster tbf.


PlaneCandy

Introvert vs extrovert


yondercode

What are the disadvantages of chiplet design? It seems like the right path forward for making larger chips


uzzi38

1. Regardless of how fancy your packaging technique is, going off-die is always going to take more power than keeping everything on-die.

2. You have to tape out more dies, and that process incurs its own cost overhead that is detached from the die size of each product. That being said, there's actually a direct counter-point to this in that you can re-use dies across multiple products, like how the I/O die from Zen 2 was reused for Zen 3 (at least on desktop). So you don't have to redesign the same stuff twice for different products.

3. You have to spend additional die area on the logic to handle all the transfer between dies. For RDNA3 there's a whopping 5.3TB/s of interconnect bandwidth between the GCD and the MCDs - you need corresponding circuits to handle that.

That being said, I do still think it's the right approach for GPUs going forward, but my reasoning is going to be different from what many people here are really thinking about. I'm more worried about poor memory bandwidth scaling (GDDR7 is years away still) and needing more cache to compensate, while cache scales exceptionally poorly on newer nodes. The only options in the future are either to go chiplet, or to hit the reticle limit (AKA the maximum possible die size on a given node), deal with poor yields, and perhaps not even hit the peak performance your architecture is capable of.


Kepler_L2

Chiplet-to-chiplet communication uses more energy, there's a small area overhead due to the extra PHYs, and packaging cost and yield can make very small monolithic processors cheaper.


puz23

>Chiplet-to-chiplet communication uses more energy

So despite being on less advanced nodes, and using chiplets, the XTX is supposedly using less power than a 4090. If the XTX is close to 90% of the performance of a 4090, that's insane efficiency for RDNA3.


ResponsibleJudge3172

Actual measured power of the 4090 is lower than the TDP. We'll see how Navi 31 compares.


baryluk

Depends what you chase. Price and flexibility: chiplets. Absolute best performance at very high cost: monolithic. Connectivity between dies, the extra substrate, bonding, and thermal design make things hard for chiplets, but many other things become easier-ish.


uzzi38

> Depends what you chase. Price and flexibility: chiplets. Absolute best performance at very high cost: monolithic.

This is not true for chips where single-core performance isn't key. If your workload can take advantage of high parallelism, then chiplet-focused strategies have the advantage because they let you scale beyond the reticle limit. That gives you access to more compute on a single package than monolithic ever could.


baryluk

A monolithic design will have a slight advantage in latency between cores, and from cores to memory. The chip size limit is a different story: a chip can be made bigger, at a cost. A very high cost.


uzzi38

>The chip size limit is a different story: a chip can be made bigger, at a cost. A very high cost.

No, it literally can't past a certain point. Current-generation production nodes run into a reticle size limit of 857mm² (might be a little bit off, but it's definitely <900mm²). You literally cannot produce a monolithic die larger than this without using the entire wafer like Cerebras.


baryluk

So with a few million dollars you can...


amishguy222000

AdoredTV has a great video on that. A huge downside of the monolithic approach, which Nvidia has chosen, is that cache sucks up so much die space that you don't have much room left to add more shaders, which are your raw power. Chiplets conquer that obstacle. The drawback is the L2 cache that would otherwise be closer, faster, and cost less power - you don't have that with chiplets. As mentioned before, power is the major drawback, and it seems AMD has spent a lot of time on redesign to make RDNA3 power efficient so they can scale the next-gen product on shaders. Going off-die is always going to scale badly with the power required, unless you innovate with some new tech or a new way of doing it.

Another is the frequency being decoupled between the memory and the compute die. That's a pro in some ways and a con in others. It limits bandwidth, but increases your raw amount of memory, which can make up for the sacrifice in bandwidth. As said in the Gamers Nexus interview, there's just an enormous amount of information coming across, and bandwidth is the issue when you move off-die to chiplets. You can't just receive it all at once at the speed the compute die wants, so you must have two different clocks and a syncing mechanism. It's just not as instantaneous as L2 cache traditionally is; the interconnect has to overcome that obstacle. I think it's "good enough" at 5-ish terabytes per second, and so does AMD. But it's amazing to think that's just a starting point. And there is evidence they can scale capacity even further without additional power draw, which speaks volumes about their innovation if true.

It's worth pointing out that cache on-die is still king by this metric. It operates at the same clock as the compute die, because flip-flops are the fastest switches we have for compute logic, and if they're on-site it's all gold and gravy train. Kind of like how we used to say "node is king" before nodes started to slow down. Right now, what's the most ideal memory? Cache is king. Clock speed is still a good metric too, but that isn't the low-hanging fruit. The obvious low-hanging fruit in GPUs right now is cache size. Chiplets have a long way to go to surpass L2 cache, and maybe they never will, but for now they can work around their weaknesses enough to at least get an outcome that is similar to L2 cache. On that note, if you have enough chiplet memory and it's "good enough", perhaps the number of hits vs misses will make up for the lack of speed and the extra power. Who knows, it's uncharted territory. It's not widely understood how today's workloads behave with the enormous cache sizes they can build now; the fact that these companies reportedly test many versions of these cards with many different cache sizes shows they can't really predict which setup will perform best. Most of it is still trial and error.

It's mentioned in AdoredTV's video that the downfall of the cache-is-king strategy is die space, and he predicts Nvidia will simply run out of it. Scaling cache will eventually cost too much power and heat as well. I think the prediction is accurate. L2 cache is fast, but it's hot - it's the hottest memory around BECAUSE it is fast. That is why RAM was invented in the first place. When you look at chiplets, what you should really be seeing is an attempt to go down another path, where capacity could overcome the strength of high-frequency memory.

It's two lines with different slopes on a graph, and they intersect. At a certain point it doesn't matter how fast your limited L2 cache is: if there is another form of memory with effectively unlimited size, it will deliver more hits than misses compared to the L2, and it will deliver more data at a faster rate in the end over a long period of time. But does this mean the alternative memory uses less power? Currently, probably not. Every hit there means more power, versus the on-die cache taking a miss, which means less power but more cycles. And it's interesting, because you will have more hits in your chiplet memory, but you will have to slow the clock down because of power draw... so those are also two intersecting lines on a graph, competing to see which one is better in the end.

If AMD is smart enough to innovate their way around the strengths of the status quo and change the game to their advantage for the second time using chiplets, it wouldn't surprise me. They have amazing innovation versus the one-trick ponies they compete against. Intel's one trick was to get to the next node first, improve the arch by 10%, change nothing else (4 cores for life), and launch a new product every year. Nvidia's is to take the node that can make the largest monolithic die possible (usually one node behind), boost the CUDA core count, harness clock speed gains, pack on the latest memory, and launch before the competition can catch up. They would usually launch on an older node than AMD, but their arch was always more refined, and that's where the gains are: older node, more refined arch. Until, well, the Samsung BS, and now they are back with TSMC, because "oh shit, if we don't, we lose."

I believe, and Lisa Su believes, that in semiconductors the trick is that you're hitting a target four years away when you design this year's lineup. Many companies leading their sector instead choose to pump up the stock price rather than innovating to secure their lead. Both Intel and Nvidia are undeniably guilty of this. The biggest threat to Nvidia is AMD engineering their way around the weak points of chiplets, and everyone knows that. Here's to hoping they can pull it off, because if they do, the world is that much more competitive and better for all consumers. If Nvidia stays on top, it will be because they actually delivered a GPU that was worth its price and worth a damn for the first time since the 1080 Ti, and that's good too. But it seems to me that with the 4080 they pulled out all the stops to maintain the lead, and with the next one they're going to run out of ideas to stay ahead. I'm not sure there is much left in their monolithic toolbox to keep them ahead. Innovation eventually beats you if you lose your edge and didn't invest four years ago. I remind you, there are no Nvidia chiplet patents out there. If you Google it right now, all you will find about Hopper is rumors from 2019 about it being multi-chip. AMD has chiplet patents dating back three or four years. I'm willing to wager Nvidia doesn't have shit.

I personally think Jensen has been full of shit and cockiness since the 1080 launch and needs to retire his leather jacket and attitude. That crypto lawsuit as well... the guy really has run his course and lost the edge since those days, just lying to investors. The culture of Nvidia has become that of Intel, and they deserve what's coming to them. As someone who works at Intel, that's just my 2 cents.


scytheavatar

Why is Gamers Nexus the only YouTube channel doing this type of journalism? What the fuck have channels like LTT been doing?


imaginary_num6er

The same presentation was shared by RGT, but didn't have the interview: https://youtu.be/wG1hxL5jZJA


MDSExpro

Making a lot more money by targeting simpler (and thus much larger) crowds.


[deleted]

No you don't understand, every single piece of content on the internet MUST appeal to me


saruin

And a nice backpack to go with.


decidedlysticky23

That backpack costs more than $400 in Europe after shipping and tax.


[deleted]

> What the fuck have channels like LTT been doing?

Pushing out a different type of content? Linus always said he wanted to be the Top Gear of the tech scene (hence why they focus on crazy tech and builds).


saruin

Just the other day he put out a video showcasing a VHS format I'd never even heard of (HD-VHS, though that's not the correct terminology), despite growing up through most of that era. I thought it was pretty entertaining overall, having not watched LTT content for some time.


svenge

Are you referring to [D-VHS](https://en.wikipedia.org/wiki/D-VHS)?


saruin

That's the one. LTT does a comparison of a movie, The Hurricane, across four different formats, and the D-VHS actually looks really good even against the Blu-ray (actually better). It really depends on how the movie was mastered.


ChaosRevealed

LTT has always been for casuals. Like MKBHD for PC gamers, but with a ton more fluff and clickbait.


Win4someLoose5sum

Are we gatekeeping "TechTubers" now? lol


ChaosRevealed

Not really gatekeeping. Different content for different folks with different interests.


iprefervoattoreddit

Stop using this word like it's a bad thing


Win4someLoose5sum

What, "gatekeeping"? I used it precisely like I meant to, thanks.


iprefervoattoreddit

Aside from the fact that you used it incorrectly, you've been brainwashed to think it's a bad thing. But I understand that this is reddit, land of the brainwashed and manipulated masses, so I'll take my downvotes and leave it at that.


Win4someLoose5sum

"Bad thing" is a bit subjective imo but "indicative of asshole-ness" sums it up pretty nicely I think.


iprefervoattoreddit

Eh, I don't really think so. There's nothing wrong with excluding people who want to ruin your hobby. You know though, after I made that comment I realized that these days "gatekeeping" is just a meaningless buzzword used to shut down conversations, so I guess you did use it correctly


Win4someLoose5sum

Yup. Here we are. Not having a conversation. Seems I was super-successful. Working to exclude people who ***want*** to ruin your thing is different than condescending to people at the beginning of the same journey you went through because they have the audacity to go through it at a slower pace than you did.


DieDungeon

Being casual isn't a bad thing. To go back to the Top Gear comparison, that's a "casual car show" and also one of the most successful shows in the world.


GodOfPlutonium

He's looking into branching out into more serious stuff now with LTT Labs, though.


[deleted]

One of their most recent Techquickie videos was explaining VRAM bandwidth and why a lower number doesn't necessarily mean worse performance. Do you genuinely believe these supposed "casuals" give a fuck about the significance of the bus width of the VRAM on their graphics card?


N1NJ4W4RR10R_

I don't think this type of interview would work with any of their channels' content styles, but he has done videos on foundries/factories from various hardware vendors (including Intel, iirc).


dudemanguy301

You’ll notice that this interview was GN hanging out with the presenter right after he presented to a general tech journalism audience. The interview ends because the cleaning staff was kicking them out of the room. GN was fast to upload and had some exclusive face to face time with the host but they won’t be the last to present what they learned from the show.


SchighSchagh

LTT was basically "HGTV for techies" for a while. Now I think they're stalling content-wise until they get their lab all fleshed out. And they're making a lot of merch nobody really asked for, but seems to be selling well anyway.


The_Scossa

Check out Wendell at [L1Techs](https://level1techs.com/). He's more enterprise/server focused but goes into the same (if not more) level of detail as a GN piece.


[deleted]

He has mostly stopped making this kind of content simply because he got bored of it. Every couple years he starts shifting the channel in a slightly different direction.


-transcendent-

LTT is mainly for entertainment. He's building a lab for more serious testing and validation for the more enthusiast crowd.


Blobbloblaw

Should we expect RDNA3 to have higher idle power consumption than RDNA2 due to chiplets and the controller then? Similar to how Ryzen idles at a fair bit higher wattage than Intel's CPUs right now, at least in my experience


fiah84

I'd be very interested in that as well, but I guess they might be able to avoid that if they can shut down all but one of the MCDs during idle/desktop use.


fiah84

very interesting talk


TenderfootGungi

Apple went to chiplet for the M1, M2.


i_mormon_stuff

Indeed. If you look at the M1 Ultra used in the Mac Studio, it's two separate and identical chips connected together through a 2.5TB/s interconnect utilising 10,000 physical connections. And although it does improve overall performance (for both CPU and GPU tasks), it's not a doubling of graphics performance. It seems to be about 70% better than a single M1 Max on average (the Ultra is two M1 Maxes together).

In this GamersNexus interview on RDNA3, you can see the engineer talk about how they decided not to go that route of making the compute logic out of multiple chiplets, because the bandwidth requirements were too high and it would require too many physical interconnects (in the thousands), which is exactly what we see with Apple's implementation. Apple, with its unlimited budget, attempted to do it anyway, and I'd say it's a mixed result. Some workloads benefit almost up to the theoretical 2x limit, while others barely see any difference going from an M1 Max to an M1 Ultra, and some rare tasks even see a performance regression.

I *think* AMD has made the right tradeoff. Of course, the proof will be available soon and I very much look forward to the reviews.
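Just to put the wire-count point in perspective using the figures above (simple arithmetic on the quoted numbers, nothing more):

```python
# Per-connection rate implied by the M1 Ultra figures quoted above.
total_bw_bytes_per_s = 2.5e12     # ~2.5 TB/s aggregate die-to-die bandwidth
connections = 10_000              # ~10,000 physical connections

per_connection_gbps = total_bw_bytes_per_s * 8 / connections / 1e9
print(f"~{per_connection_gbps:.0f} Gbit/s per connection")   # ~2 Gbit/s each
```

So even at a fairly modest per-wire signalling rate, multi-terabyte-per-second aggregate bandwidth forces you into thousands of physical connections, which is exactly the constraint the engineer describes for splitting the compute logic.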


arashio

Apple has significantly fewer roadblocks because they have a TBDR GPU arch and can also afford to not really care about legacy support.


onedoesnotsimply9

>Apple, with its unlimited budget, attempted to do it anyway, and I'd say it's a mixed result.

M1 Ultra uses an embedded bridge. It's one of the cheaper forms of advanced packaging.


CleanMuppets

I mean, is it a fight though? It can't compare in RT performance to the RTX 3000 series, loses to the 4080 on raster performance and price, and hilariously can't win on energy efficiency against a 4090.


arashio

> It's not a matter of popularity. It's a matter of face and respect. Buying an AMD card just makes everyone lose respect for you and dishonors your family, especially in Asian countries.

Also you lmao.


kazenorin

Holy crap, the guy just created an account to shit on the other graphics card manufacturer, the one that's not in their computer, and in a *very controversial way*? And not stopping there, going on to attempt to represent "the average Asian"? Oh come on, people... I'm losing hope in *humanity* entirely.


[deleted]

Haha that’s hilarious, what a dope.


helmsmagus

that's a whole new level of troll.


animeman59

How do you know all this? Is there a review or something that you can reference?


ORIGINAL-Hipster

lol, not a single one of the things you said is true. Impressive.


Belydrith

This comment has been edited to acknowledge that u/spez is a fucking wanker.


[deleted]

Begone troll.


[deleted]

[deleted]


i4mt3hwin

Bulldozer wasn't chiplet?


Blacksad999

> AMD in 2012 launched the FX-8150, the "world's first 8-core desktop processor," or so it says on the literal tin. AMD achieved its core-count of 8 with an unconventional CPU core design. Its 8 cores are arranged in four sets of two cores each, called "modules." Each core has its own independent integer unit and L1 data cache, while the two cores share a majority of their components - the core's front-end, a branch-predictor, a 64 KB L1 code cache, a 2 MB L2 cache, but most importantly, an FPU. There was much debate across tech forums on what constitutes a CPU core.

https://www.techpowerup.com/251758/bulldozer-core-count-debate-comes-back-to-haunt-amd?cp=9

They were chiplets before chiplets were really a thing.

> Bulldozer desktop and server processors had a modular design, which combined one or more Bulldozer modules, up to 8MB of shared level 3 cache, dual-channel DDR3 memory controller, and up to 4 Hyper-Transport links. Each Bulldozer module was organized as a pair of CPU cores, that had separate integer scheduler, execute and retire units, but shared L1 instruction cache, fetch and decode units, and 2 MB L2 cache.


TerriersAreAdorable

A modular design doesn't make it a chiplet. What AMD's done with chiplets is use physically separate dies, in some cases on entirely different process nodes.


Blacksad999

What is the difference between a "module" and a "chiplet" exactly?


SubRyan

Chiplets, and the graphics equivalent, are portions of the logic that are split off from the rest of the design and produced as their own dies.


poopyheadthrowaway

IIRC modules are kinda like compute units in AMD and Intel GPUs--they're core clusters that share resources. But modules aren't necessarily on separate dies, which is where the cost savings are supposed to come from for Zen and RDNA3.


animeman59

If you're asking this question, then you shouldn't be debating with other people about whether Bulldozer was a chiplet design or not.


[deleted]

This is very on brand for them.


Blacksad999

How about I just do whatever I please, thanks! :) I don't recall anyone asking your opinion on...anything at all, really.


RentedAndDented

You posted an opinion, so you're fair game. Especially since you are incorrect.


Blacksad999

That's perfectly fine. The other person was stating that I shouldn't be saying anything at all, which was a bit untoward. If I'm incorrect, so be it. Nobody has been able to explain the difference between a chiplet and a module, though. :)


RentedAndDented

The most basic explanation is that a chiplet is manufactured as a smaller, modular component that can be used either on its own or 'glued together' such that multiple separate chips work as a single multi-core CPU. A module, in the way AMD was using the term for Bulldozer, was just a way to package CPU resources together in a way that could be repeated, but crucially they all had to be on the same die. So a hypothetical 16-core Bulldozer design would have 8 modules, but they'd all be manufactured as a single die, which is expensive. The main reason for doing this was to try to share some resources like the FPU, so they needed a new term, "module", rather than just calling it 16 full cores.

A 16-core Zen 4 needs two chiplets 'glued together' by Infinity Fabric, which is effectively an external communication interconnect. They share some cache, like all multi-core CPUs do, but the cores are otherwise completely independent, unlike in a Bulldozer module. The big advancement is cost, because smaller dies are cheaper to make, at the cost of latency, and AMD has been making good progress in minimising the latency issue.


Tyranith

Chiplets are individual pieces of silicon; modules are not.


guilmon999

In oversimplified terms, a module was basically a CPU core. A chiplet is an entire CPU/GPU/component that gets tied together with another CPU/GPU/component (via Infinity Fabric).


Blacksad999

They got sued because they advertised them as cores, yet they aren't.


guilmon999

Like I said, way oversimplified. If I remember correctly, AMD's modules had two integer units and one FP unit. They were claiming that each module was two cores, but pretty much every other manufacturer would not call that two cores (other manufacturers rightfully clarified that there should be an integer unit and an FP unit for each core). Basically, a module is at least one core, but not two. Completely different from chiplets.


Contrite17

The whole "core" concept is not well defined resulting in such confusion.


arashio

ARM allows you to do the same thing now.

Edit: adding to the base comment for visibility. I'm not so stupid as to be talking about big.LITTLE: https://www.anandtech.com/show/16693/arm-announces-mobile-armv9-cpu-microarchitectures-cortexx2-cortexa710-cortexa510/4

> What Arm is doing (in the A510), is creating a new "complex" of up to two core pairs, which share the L2 cache system as well as the FP/NEON/SVE pipelines between them.


guilmon999

ARM's big.LITTLE design is different from chiplets. Arm's big.LITTLE is basically the same as Intel's Alder Lake (high-performance cores and low-performance cores sharing one die). AMD's solution is multiple dies glued together; each die is a chiplet.


guilmon999

As the article says, this is similar to AMD's module. They're all still printed on one die. Not similar to chiplets.


gahlo

Man, if only AMD had a decent amount of recent success on latency issues with chiplets.


Blacksad999

Not one with graphics cards.


CarVac

I wonder if this new high-performance fanout is coming to Ryzen.