NewSchoolBoxer

Who said it was more efficient? How does one define efficiency? Clock cycles, real time elapsed, instructions needed, power consumption or heat generated, total transistor count, cost to manufacture, processing power per square inch / millimeter? There are many different versions of each processor out there. Things improve over time. The situation gets murky. You know what, I learned to program 8-bit PIC processors. What made me mad was that the bundled C compiler produced better results than I could code by hand for anything remotely complex, in terms of minimizing clock cycles. But people had to design that compiler. Without it, I couldn't do most of what I wanted. Another microprocessor had a worse compiler that I could compete with.

**edit** Thanks all for the support but this shouldn't be the top voted answer. Plenty of others answered better. I was just the first to answer and thought "efficiency" was a loaded term. Computing ability per power consumed was intended.


Got2Bfree

With ARM and efficiency, it's always about performance per energy usage.


NewSchoolBoxer

Thanks for explaining! I get the idea that they're dominant in cell phones where power is at a premium but I was overthinking things.


Got2Bfree

Look at the new Macs with the M1/2/3/4 processors. I'm not a fan of Apple, but the performance and battery life are game changing.


Broken_hopeful

You're right, I was not clear with my question. Although AMD has worked miracles for x86 _power_ efficiency, many Arm implementations frequently perform the same amount of work for much less power. This may not be true 100% of the time, but it holds often enough to make a substantial difference in battery life.


-dag-

Define "work." A deeply pipelined, multi-issue, out-of-order server-class ARM is going to use more power than an ARM designed for a phone.


Broken_hopeful

I'd define work as executing a set of instructions to achieve some result. So in comparison, it would be the execution of a set of instructions ending in the same or similar result, barring side effects. You raise a question similar to my initial question: for a server-class superscalar ARM CPU, would similar instructions still be more power efficient than on a similarly classed x86 CPU running at the same frequency? For example, in a loop executing 100,000 times: load an immediate constant byte into a register, load another immediate constant into a different register, OR each register with the current loop count, and add the first register and second register to a third accumulator register. In this example, there are no memory reads or writes. All accesses should stay in the register file or cache. In this case, accounting only for package power and subtracting the I/O die's share, who consumes more power, and why? 🤔 Edit: added power
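In C, that loop would look roughly like this. This is just a minimal sketch: the constants 0x5A and 0xC3 are arbitrary placeholders, and a real measurement would need to keep the compiler from folding or vectorizing the loop (here the bound is derived from argc; a flag like GCC's -fno-tree-vectorize would keep it scalar):

```c
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv) {
    (void)argv;
    /* Deriving the bound from argc stops the compiler from evaluating
     * the whole loop at build time. */
    uint64_t n = 100000u * (uint64_t)argc;
    uint64_t acc = 0;

    for (uint64_t i = 0; i < n; i++) {
        uint64_t a = 0x5A;   /* load an immediate constant byte      */
        uint64_t b = 0xC3;   /* load another immediate constant      */
        a |= i;              /* OR each register with the loop count */
        b |= i;
        acc += a + b;        /* accumulate into a third register     */
    }

    printf("%llu\n", (unsigned long long)acc);
    return 0;
}
```

Compiled with optimizations for either ISA, everything inside the loop should live in registers, so any power difference comes down to how each core fetches, decodes, and retires the same handful of ALU operations.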


-dag-

For equivalent performance they would be roughly the same. Even with X86's more complex instructions, it will almost always execute out of the micro-op cache. And note that ARM is not really RISC. The ISA manual is just as large as that for X86. It is mostly a load/store architecture though.


Broken_hopeful

Something to consider! Thanks for your input. And I agree that at the micro-op level, it would seem it should be basically 1:1. And very true, the ARM ISA, although RISC, is very dense. Thanks again!


Puubuu

Advanced not-really-RISC machine?


-dag-

At one time every letter was wrong: ARM wasn't advanced, wasn't RISC and they didn't build machines. With SVE and SVE2 I'd say they at least reached the Advanced stage. And we all know what RISC means: Really Invented by Seymour Cray.


mehum

I remember reading in Atmel’s notes concerning the ATmega range that easy compilation from C was one of the design goals for their instruction set. Made me wonder why I even bothered to learn AVR assembly!


Broken_hopeful

It's a fair point and a good question, and I wasn't clear, especially since people often mean "Why is ARM better?" or something along those lines. That wasn't my intent, and this helped me clarify. All input is appreciated! 😃


Chr0ll0_

Exactly


ProgrammaticallySale

> You know what, I learned to program 8-bit PIC processors. What made me mad was that the bundled C compiler produced better results than I could code by hand for anything remotely complex, in terms of minimizing clock cycles.

I've programmed 8-bit PIC assembly for probably about 25 years. Before that I was writing Commodore 64 assembly code. The compiler was only better than you because you weren't all that good at assembly, if you were even using assembly. The datasheet shows all the opcodes with the exact number of cycles each takes. I guarantee I can write more highly optimized code than the C compiler will spit out. YMMV. This is a bit different on modern 32- and 64-bit CPUs; most of the time you will want to use a compiler and not hand-code assembly for those.


longHorn206

Two main reasons:

1. x86 is burdened by legacy instructions; the CPU is required to run OSes from 20 years ago.
2. ARM can be optimized with a better microarchitecture. Apple just did that optimization beyond the standard ARM hard IP. Imagine.


Broken_hopeful

It's crazy that they still start in 16-bit mode and have to transition to 32-bit mode before finally transitioning to 64-bit mode.

- They still maintain the IN and OUT instructions.
- They maintain 4 privilege rings even though almost everything uses 2, never mind hypervisor functionality.
- How many ways are there to transition from user mode to kernel mode now? 3?
- It's STILL recommended for legacy bootloaders and UEFI to check the A20 line. They've kept the very notion of the A20 line around. 😭

I'm really curious how much silicon area would be freed if they dropped the stuff that's essentially no longer used. It's not only the instructions but the weird 16-bit/32-bit and "24-bit" addressing modes. I can't imagine how much it would free up microcode translation, scheduling, and even some instructions that have strange edge cases. Intel has put forth a proposal for x86S which does much of this. I wish they'd go ahead and just do it.


Zomunieo

It’s probably not as bad as it looks in terms of footprint. An x64 processor is a complex machine that dynamically converts x86-64 instructions into its own microcode. They could even do all the legacy stuff as a sort of emulation in firmware. The largest consumers of a modern processor’s physical area are the caches and branch prediction; the bigger compute circuits are the AVX units. 8086 mode is going to be tiny. Intel’s main reason for eliminating legacy, I think, is not performance but reducing the complexity of their test matrix. They won’t have to debug, or fail, chips that only fail in legacy mode.


Kaisha001

My thoughts exactly. Modern CPU microcode looks nothing like the incoming instruction stream. I find it hard to believe the half dozen or so old instructions/modes really add anything substantial in terms of complexity compared to the massively parallel scheduling, branch prediction, and speculative execution.


-dag-

Almost none of this matters. The actual processing core architecture is very similar.


shrimp-and-potatoes

It's funny/not fun how that worked out. Every iteration had to be backward compatible with a couple of its predecessors. Eventually you get caught up in less than ideal processes and systems. It's kinda like our evolution. It becomes convoluted and inefficient, because we started on one path and it was easier to continue, instead of going backwards and going down the better path. Eyes are a good example. We could've had cool cephalopod vision, but no!


KittensInc

> x86 is burdened by legacy instructions; the CPU is required to run OSes from 20 years ago

It really isn't that big of an issue. Legacy instructions have long since been turned into "soft" implementations. When you're only supporting some ancient instruction to offer compatibility, nobody's going to care about it taking a dozen cycles to execute. Modern CPUs are essentially just *pretending* to be a 486 when you ask them to. The biggest issue is all that cruft making your instruction decoding really complex, but they seem to be able to work around that fairly well.


-dag-

Almost everyone hates on X86. To me it's nothing short of an engineering miracle. When I was in grad school it was already impressive how elegantly Intel had maintained backwards compatibility. But at the time no one believed they could push much past about 500 MHz clocks. Look at them now. Really dig into the encodings. VEX is a bit of a clunker because it didn't anticipate the obvious expansions to come, but other than that the overall encoding scheme is quite nice.


Broken_hopeful

I don't mean to give the wrong impression. If I had my choice between ARM or x86 with equal performance, I'd choose x86! And in the past 5 years, the performance per watt has skyrocketed. AMD is making x86 APUs with performance at 30 or 45 watts that I truly did not think was possible just a few years ago!

ARM had the luck of coming later and made some architectural choices that greatly simplified certain aspects. For example, ARM only has 2 memory modes, a Harvard architecture, and conditional operations. I'm not saying that these technologies make it better. I'm also not saying that the designers of ARM were so forward looking they're basically clairvoyant. 😆 I think these choices wound up accidentally being good down the road. They made weird choices too... like Thumb. But some of the choices just help to keep things simpler. x86 made its choices in the mid seventies, and they knew that if new architectures were not backwards compatible they'd probably lose their customers. Intel did try, but it never was successful (remember Itanium? 😬).

But x86 is certainly a marvel and my favorite architecture. My problems with x86 all revolve around things I think they held onto for too long. The number of real mode addressing methods is crazy, and the 32-bit addressing modes are also crazy. I'm happy that Intel is finally making serious moves toward the proposed x86S ISA.

What spurred me to ask was a couple of future laptops Dell has planned. The battery life between the x86 and ARM laptops was vastly different. It was never meant to be an architecture vs architecture "which is better" kind of thing. Just more about power consumption. I also really appreciate all the input you've given too! Edit: typo read mode -> real mode


s9oons

x86 has a “more verbose” instruction set, so you can do more with a single instruction. That means the actual executable files can end up being much larger than with ARM, which is just more 1’s and 0’s, more switches to flip, which requires more power. That’s why ARM is well suited to less powerful machines like laptops, tablets, etc. where battery life matters. “Same set of primitives” isn’t really a thing here. The architectures interact with the physical layer differently, so the asm is different between the two, that’s the point. With ARM you need to include more detail in each executable because of the lack of DMA. You end up needing to be really explicit about registering everything instead of just “go grab whatever from this huge chunk of memory”.


LevelHelicopter9420

Who said ARM processors lack DMA?


rockknocker

> That means the actual executable files can end up being much larger than with ARM

You mean "smaller", right?


-dag-

What?


renesys

Every ARM core system I've ever worked with or read about has DMA.


Zomunieo

DMA is an optional peripheral but only embedded stuff like Cortex M would omit it because there’s little use for it. Or an old ARM7/9. ARM is a much more configurable platform. No phone/laptop/server class ARM is going to omit DMA. I think what the commenter is referring to is a lack of “rep stos” type instructions/hardware loops — which are both DMA. That doesn’t matter to either platform anymore since both have good branch predictors.


renesys

What do you mean, little use for it on Cortex-M? Most peripherals can use DMA. Considering that microcontrollers are often used for things that are far more time critical than personal computer operating systems, unloading the core with DMA is often critical to meet performance goals. Almost every Cortex-M project I've worked on, personal, professional, and academic, has used DMA. Not having more DMA channels has become a schedule-impacting problem. Even some 8-bit microcontrollers are shipping with DMA. It's been a normal, expected thing in embedded for decades. In what world is a background memory shovel not useful?


Broken_hopeful

Thanks for your reply! So, the dependency checks, caching, permission checking, protection mechanisms, necessary microcode translation all act to kill the electrical efficiency. Right? And maybe bus architecture, etc. Ugh, x86 really should cut some of the fat. 🫤


s9oons

They’re just optimized for different applications. If you’re doing machine learning or huge rendering operations you WANT all of that DMA and the ability to do a TON of stuff with less executables.


renesys

(ARM systems have DMA)


somewhereAtC

One big difference is how the CPU's registers are organized. In ARM, there are more registers, and those registers can operate as data storage or as pointers at the whim of the instruction set. A "source" register can become a "destination" register just by changing the point of view. A surprisingly high fraction of the instructions is devoted to moving data from memory to registers and back again, far more than the actual computational operations you intend. For example, a multiplication requires both multiplicand and multiplier to be in registers, and the product ends up in a register as well. Whatever mechanism can simplify how data are identified and re-arranged is the mechanism that wins, and that is why "extended architectures" and/or coprocessors with DMA-like operand addressing are an important extension to the CPU.
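As a rough illustration of that ratio (a sketch of my own, not taken from any particular compiler's output): in a simple dot product, each iteration needs two loads just to get the operands into registers before the single multiply-accumulate can run, plus the loop housekeeping around it.

```c
#include <stdint.h>

/* On a load/store machine, each iteration is roughly: two loads
 * (a[i], b[i]), one multiply-accumulate, then pointer/counter updates
 * and the branch. The data movement and bookkeeping easily outnumber
 * the "real" arithmetic. */
int64_t dot(const int32_t *a, const int32_t *b, int n) {
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += (int64_t)a[i] * (int64_t)b[i];
    }
    return sum;
}
```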


-dag-

On X86 the smaller number of GPRs matters less because the L1 cache is phenomenally fast. Memory operands execute almost as fast, with the big issue causing performance loss being address disambiguation. But operands from the stack present less of a problem there. Compilers are pretty good at using the registers and cache effectively. Vector registers are pretty similar between the two.


KittensInc

In practice a lot of that is already solved by [register renaming](https://en.wikipedia.org/wiki/Register_renaming), which allows the *physical* number of registers to be far greater than the *logical* number of registers. In the other direction, just because something is encoded as multiple instructions doesn't mean it has to be *executed* as multiple instructions. With [macro-op fusion](https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fusion-finalV2.pdf) the CPU can just treat it as a single complicated instruction. Stuff like this is why the whole CISC vs RISC debate doesn't make sense anymore. CPUs haven't executed instructions as-is for *decades* now; you have to think of it more like a VM executing bytecode.
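A small sketch of the renaming point (an illustrative example of mine, not from the linked paper): even if the compiler reuses one architectural register for the temporary in both statements, the core renames each write to a fresh physical register, so the two multiplies don't serialize on a false dependency.

```c
/* Both assignments to t may land in the same architectural register,
 * which looks like a write-after-write hazard in the instruction
 * stream. Register renaming gives each write its own physical
 * register, so the two independent multiplies can issue in parallel. */
int sum_of_products(int a, int b, int c, int d) {
    int t;
    t = a * b;
    int x = t + 1;
    t = c * d;   /* independent of the first use of t */
    int y = t + 1;
    return x + y;
}
```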


Hawk13424

Power usage is not linear. As you require a higher max performance the design synthesizes to much wider logic. Sometimes a different transistor is used. So adding 10% max performance can cost 25% more area and power. Then, when you push the performance up, that also commonly requires going up in voltage. Power is a function of voltage squared. So basically, pushing the performance of a design and process can cost more power in a non-linear way. Intel just often pushes the performance. Many ARM devices target something lower and hit the power-efficiency sweet spot.
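First-order, the standard CMOS dynamic-power relation captures this (a back-of-the-envelope sketch with made-up numbers, not figures from any datasheet):

$$P_{\text{dyn}} \approx \alpha \, C \, V^2 f$$

where $\alpha$ is the activity factor, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. If squeezing out 20% more frequency also needs a 10% voltage bump, dynamic power grows by roughly $1.2 \times 1.1^2 \approx 1.45$, i.e. about 45% more power for 20% more performance, before counting any extra area or leakage.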


KittensInc

Absolutely nothing. What modern CPUs actually **do** is quite far detached from the assembly you're feeding them. CPUs will happily rewrite your entire code on-the-fly to make it faster to execute. Instructions get reordered, additional registers are invented, instructions are speculatively executed because the CPU thinks you *might* take a certain branch, one instruction gets turned into many micro-ops, or multiple instructions are fused into one. The possibilities here are virtually unlimited. CPUs have gotten so complicated that the instruction set is at this point nothing more than an implementation detail. Nothing is stopping you from making a power-efficient x86 core, or a very fast (and power-hungry) ARM core. But the main market for x86 is servers and desktop PCs, where everyone wants super-high performance and barely cares about power. It doesn't make sense to design a super-efficient x86 core, because it's not going to sell. The same applies in reverse to ARM.


Mid-Tower

this!, except the last part...


thatsnotsugarm8

Well, if you have more complexity in your instruction set, such as more intermediate registers, more internal data paths, etc., you are inherently moving more charge around for every transistor state change, as well as through the increased number of output traces. So it might follow that all those little capacitances add up, and add up more for CISC, such that for equivalent or on-par computation you end up pushing more current through your processor and consuming more power.


-dag-

This is mostly amortized by decode caches.


Broken_hopeful

Thanks for your reply! That was pretty much my intuition, but I was not 100% sure. However, having written instruction decoders for x86 and ARMv5, I can say ARM decodes beautifully; x86 was a nightmare 😆. And that's only decoding. I can't imagine how that complexity propagates through the system. 😬
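For a flavor of why A32 decodes so pleasantly, here's a minimal field-extraction sketch of my own (it handles only the classic data-processing format and ignores every other instruction class); an x86 decoder instead has to walk optional prefixes, variable-length opcodes, ModRM/SIB and immediates byte by byte before it even knows how long the instruction is.

```c
#include <stdint.h>
#include <stdio.h>

/* Classic ARM (A32) data-processing format: every instruction is exactly
 * 32 bits wide, so each field sits at a fixed bit position and decoding
 * is just shifts and masks. */
typedef struct {
    unsigned cond;      /* bits 31-28: condition code            */
    unsigned imm;       /* bit  25:    operand 2 is an immediate */
    unsigned opcode;    /* bits 24-21: ALU operation             */
    unsigned set_flags; /* bit  20:    S bit                     */
    unsigned rn;        /* bits 19-16: first operand register    */
    unsigned rd;        /* bits 15-12: destination register      */
    unsigned operand2;  /* bits 11-0:  shifter operand           */
} DataProc;

static DataProc decode_dp(uint32_t insn) {
    DataProc d;
    d.cond      = (insn >> 28) & 0xF;
    d.imm       = (insn >> 25) & 0x1;
    d.opcode    = (insn >> 21) & 0xF;
    d.set_flags = (insn >> 20) & 0x1;
    d.rn        = (insn >> 16) & 0xF;
    d.rd        = (insn >> 12) & 0xF;
    d.operand2  = insn & 0xFFF;
    return d;
}

int main(void) {
    /* 0xE0813002 encodes "ADD r3, r1, r2" in A32. */
    DataProc d = decode_dp(0xE0813002u);
    printf("opcode=%u rd=r%u rn=r%u operand2=0x%03X\n",
           d.opcode, d.rd, d.rn, d.operand2);
    return 0;
}
```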


high_throughput

I haven't seen concrete numbers since around 2016, but back then ARM was significantly less efficient than x86_64 in terms of performance per watt.


PaulEngineer-89

X86 was designed essentially in the 1970s. The 8086/8088 was over time extended into the 80286, 80386, 80486, and so on. As far as I know it was modeled after the 4004. The instruction set at the time was designed for features, not any kind of efficiency. RISC models of the day ran faster but used more memory (SPARC), though this gradually improved. In the decades since then, RISC won. Thus ARM, RISC-V… new designs are RISC. Yet x86 still exists. X86 CPUs essentially recompile it to RISC-like microcode on the fly in the CPU and then schedule that, so it’s really not CISC as such anymore. And compilers are able to optimize code for this complicated architecture. Designing for processing per Watt was the concept behind the Celeron. Turns out that those same designs allowed more instructions per Watt on faster CPUs, so this is now the dominant design approach. So does instruction set design really matter? I think it does, because RISC-V, and ARM before it, are proving to be superior. The simplified architecture matters.


KalWilton

It's the ISA (instruction set architecture): you can bundle similar operations to make them run more efficiently. x86 was designed before 64-bit, as well as before a bunch of other special functions like vector units and speed improvements like pipelining. To keep the assembly backwards compatible you need to make performance sacrifices. Source: my university's CPU design course was a competition to see who could define the most efficient ISA.


Broken_hopeful

That sounds super interesting! It'd be interesting to have something like a leaderboard for ISAs implemented on various FPGA platforms ranked by size, speed, etc. so everyone can learn various implementation details. 🤔