
Why does Intel hide internal RISC core in their processors?

Starting with the Pentium Pro (P6 microarchitecture), Intel redesigned its microprocessors to use an internal RISC core beneath the old CISC instruction set. Since the Pentium Pro, all CISC instructions are divided into smaller parts (uops) and then executed by the RISC core.

At the beginning it was clear to me why Intel decided to hide the new internal architecture and force programmers to use the "CISC shell": thanks to this decision Intel could fully redesign the microprocessor architecture without breaking compatibility, which is reasonable.

However, I don't understand one thing: why does Intel still keep the internal RISC instruction set hidden after so many years? Why don't they let programmers use the RISC instructions the same way they use the old x86 CISC instruction set?

If Intel keeps backward compatibility for so long (we still have virtual 8086 mode next to 64-bit mode), why don't they allow us to compile programs that bypass the CISC instructions and use the RISC core directly? This would open a natural way to slowly abandon the x86 instruction set, which is deprecated nowadays (and that is the main reason Intel decided to use a RISC core inside, right?).

Looking at the new Intel 'Core i' series, I see that they only extend the CISC instruction set, adding AVX, SSE4, and others.

note that there are certain x86 CPUs where the internal RISC instruction set is exposed

jalf

No, the x86 instruction set is certainly not deprecated. It is as popular as ever. The reason Intel uses a set of RISC-like micro-instructions internally is because they can be processed more efficiently.

So an x86 CPU works by having a pretty heavy-duty decoder in the frontend, which accepts x86 instructions and converts them to an optimized internal format that the backend can process.

As for exposing this format to "external" programs, there are two points:

it is not a stable format. Intel can change it between CPU models to best fit the specific architecture. This allows them to maximize efficiency, and this advantage would be lost if they had to settle on a fixed, stable instruction format for internal use as well as external use.

there's just nothing to be gained by doing it. With today's huge, complex CPUs, the decoder is a relatively small part of the chip. Having to decode x86 instructions makes it more complex, but the rest of the CPU is unaffected, so overall there's very little to be gained, especially because the x86 frontend would still have to be there to execute "legacy" code. So you wouldn't even save the transistors currently used for the x86 frontend.

This isn't quite a perfect arrangement, but the cost is fairly small, and it's a much better choice than designing the CPU to support two completely different instruction sets. (In that case, they'd probably end up inventing a third set of micro-ops for internal use, just because those can be tweaked freely to best fit the CPU's internal architecture)


Good points. RISC is a good core architecture, where GOOD means runs fast and can be implemented correctly, while the x86 ISA, with its CISC architectural history, is by now merely an instruction-set layout with a huge history and a fabulous wealth of binary software available for it, as well as being efficient for storage and processing. It's not a CISC shell, it's the industry's de facto standard ISA.
@Warren: on the last part, I actually don't think so. A well-designed CISC instruction set is more efficient in terms of storage, yes, but from the few tests I've seen, the "average" x86 instruction is something like 4.3 bytes wide, which is more than it'd typically be in a RISC architecture. x86 loses a lot of storage efficiency because it's been so haphazardly designed and extended over the years. But as you say, its main strength is the history and huge amount of existing binary code.
I didn't say it was "well designed CISC", just "huge history". The GOOD parts are the RISC chip design parts.
@jalf - From inspecting actual binaries, instruction size in x86 is about 3 bytes each on average. There are much longer instructions of course, but the smaller ones tend to dominate in actual use.
Average instruction length is not a good measure of code density: the most common type of x86 instruction in typical code is the load or store (just moving data to where it can be processed, and back to memory); RISC processors, and about half of CISC designs, have lots of registers and so do not need to do this as much. Also consider how much one instruction can do (ARM instructions can do around 3 things).
Jorge Aldo

The real answer is simple.

The major factor behind the implementation of RISC processors was to reduce complexity and gain speed. The downside of RISC is reduced instruction density: the same code expressed in a RISC-like format needs more instructions than the equivalent CISC code.

This side effect doesn't mean much if your CPU runs at the same speed as memory, or at least if they both run at reasonably similar speeds.

Currently, memory speed compared to CPU speed shows a big difference in clocks. Current CPUs are sometimes five times or more faster than main memory.

This state of the technology favours denser code, which is something CISC provides.
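As a rough illustration of the density point (a sketch of mine, not from this answer; exact output depends on the compiler and flags), a read-modify-write of memory is a single instruction on x86 but a load/add/store triple on a load-store RISC:

    /* x86-64 (gcc -O2) typically emits one memory-destination add:
     *     addl $1, (%rdi)
     * A load-store RISC such as AArch64 typically needs three instructions:
     *     ldr w1, [x0]
     *     add w1, w1, #1
     *     str w1, [x0]
     */
    void bump(int *counter) {
        *counter += 1;
    }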

You can argue that caches could speed up RISC CPUs. But the same can be said about CISC CPUs.

You get a bigger speed improvement by using CISC and caches than RISC and caches, because a cache of the same size has more effect on the high-density code that CISC provides.

Another side effect is that RISC is harder on compiler implementation. It's easier to optimize compilers for CISC CPUs, etc.

Intel knows what they are doing.

This is so true that ARM has a higher code density mode called Thumb.


Also, an internal RISC core reduces the transistor count of a CISC CPU. Instead of hard-wiring every CISC instruction, you can use microcode to execute them. This leads to reusing RISC microcode instructions for different CISC instructions, hence using less die area.
Intel notes that usually an instruction gets decoded into multiple μOps. But there are quite a few cases where multiple instructions back-to-back get fused into a single μOp. One example they give is compare followed by branch, which gets fused into a single μOp.
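To make that fusion pattern concrete (a sketch of my own, not from the comment above; the exact instructions depend on the compiler), a simple search loop compiles to compare-and-branch pairs, which is exactly the back-to-back shape that can be fused into one uop:

    /* Each iteration ends with a cmp immediately followed by a conditional
     * jump (e.g. cmp/je for the match test, cmp/jl for the loop bound),
     * the pattern Intel decoders can macro-fuse. */
    int find(const int *a, int n, int key) {
        for (int i = 0; i < n; i++) {
            if (a[i] == key)
                return i;
        }
        return -1;
    }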
Mike Thomsen

If Intel keeps backward compatibility for so long (we still have virtual 8086 mode next to 64-bit mode), why don't they allow us to compile programs that bypass the CISC instructions and use the RISC core directly? This would open a natural way to slowly abandon the x86 instruction set, which is deprecated nowadays (and that is the main reason Intel decided to use a RISC core inside, right?).

You need to look at the business angle of this. Intel has actually tried to move away from x86, but it's the goose that lays golden eggs for the company. XScale and Itanium never came even close to the level of success that their core x86 business has.

What you're basically asking is for Intel to slit its wrists in exchange for warm fuzzies from developers. Undermining x86 is not in their interest. Anything that frees more developers from having to target x86 undermines x86. That, in turn, undermines Intel.


Yes, when Intel tried to do this (Itanium), the marketplace merely responded with a shrug.
It should be noted that there were a variety of factors in Itanium's failure, not just that it was a new architecture. For example, off-loading CPU scheduling to the compiler never actually achieved its goal. If Itanium had been 10x or 100x faster than x86 CPUs, it would have sold like hot cakes. But it wasn't faster.
Peter Cordes

Via C3 processors do allow something like this, after enabling it via an MSR and executing an undocumented 0F 3F instruction to activate the https://en.wikipedia.org/wiki/Alternate_Instruction_Set, which doesn't enforce the usual privileged (ring 0) vs. unprivileged (ring 3) protections. (Unfortunately, Via Samuel II shipped with the MSR setting that allows this defaulting to enabled, and they didn't document it, so OSes didn't know they should turn that capability off. Other Via CPUs default to disabled.)

See Christopher Domas's talk from DEF CON 26:
GOD MODE UNLOCKED Hardware Backdoors in redacted x86.
He also developed an assembler for that AIS (Alternate Instruction Set):
https://github.com/xoreaxeaxeax/rosenbridge, along with tools for activating it (or closing the vulnerability!)

After running 0F 3F (which jumps to EAX), AIS instructions are encoded with a 3-byte prefix in front of a 4-byte RISC instruction. (Not distinct from existing x86 instruction encodings, e.g. it takes over LEA and Bound, but you can otherwise mix Via RISC and x86 instructions.)

The AIS (Alternate Instruction Set) uses RISC-like fixed-width 32-bit instructions; thus we already know that not all possible uops can be encoded as RISC instructions. The machine decodes x86 instructions like 6-byte add eax, 0x12345678 (with a 32-bit immediate) to a single uop. But a 32-bit instruction word doesn't have room for a 32-bit constant and an opcode and destination register. So it's an alternate RISC-like ISA that's limited to a subset of things the back-end can execute and that their RISC decoder can decode from a 32-bit instruction.

(related: Could a processor be made that supports multiple ISAs? (ex: ARM + x86) discusses some challenges of doing this as more than a gimmick, like having a full ARM mode with actual expectations of performance, and all the addressing modes and instructions ARM requires.)

The uops wouldn't be as nice as an actual ARM or PowerPC instruction set

@jalf's answer covers most of the reasons, but there's one interesting detail it doesn't mention: The internal RISC-like core isn't designed to run an instruction set quite like ARM/PPC/MIPS. The x86-tax isn't only paid in the power-hungry decoders, but to some degree throughout the core. i.e. it's not just the x86 instruction encoding; it's every instruction with weird semantics.

(Unless those clunky semantics are handled with multiple uops, in which case you can just use the one useful uop. e.g. for shl reg, cl with raw uops you could just leave out the inconvenient requirement to leave FLAGS unmodified when the shift-count is 0, which is why shl reg,cl is 3 uops on Intel SnB-family, so using raw uops would be great. Without raw uops, you need BMI2 shlx for single-uop shifts (which don't touch FLAGS at all).)
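To make that concrete, here is a small sketch (mine, not part of the original answer; exact codegen varies by compiler version) of a variable-count shift where the instruction choice depends on whether BMI2 is enabled:

    /* Without BMI2, compilers use the classic
     *     sal/shl %cl, %eax
     * which is 3 uops on Intel SnB-family because it must leave FLAGS
     * untouched when the count is 0. Built with -O2 -mbmi2 they can use
     *     shlx %esi, %edi, %eax
     * which has no FLAGS output and is a single uop. */
    unsigned shift_left(unsigned x, unsigned n) {
        return x << n;
    }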

Let's pretend that Intel did create an operating mode where the instruction stream was something other than x86, with instructions that mapped more directly to uops. Let's also pretend that each CPU model has its own ISA for this mode, so they're still free to change the internals when they like, and expose them with a minimal amount of transistors for instruction-decode of this alternate format.

Presumably you'd still only have the same number of registers, mapped to the x86 architectural state, so x86 OSes can save/restore it on context switches without using the CPU-specific instruction set. But if we throw out that practical limitation, yes we could have a few more registers because we can use the hidden temp registers normally reserved for microcode1.

If we just have alternate decoders with no changes to later pipeline stages (execution units), this ISA would still have many x86 eccentricities. It would not be a very nice RISC architecture. No single instruction would be very complex, but some of the other craziness of x86 would still be there.

For example: int->FP conversion like cvtsi2sd xmm0, eax merges into the low element of an XMM register, thus has a (false) dependency on the old register value. Even the AVX version just takes a separate arg for the register to merge into, instead of zero-extending into an XMM/YMM register. This is certainly not what you usually want, so GCC usually does an extra pxor xmm0, xmm0 to break the dependency on whatever was previously using XMM0. Similarly sqrtss xmm1, xmm2 merges into xmm1.
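A small sketch of what that looks like in practice (my example, not from the answer; exact output varies by compiler and flags):

    /* int -> double conversion. Because cvtsi2sd merges into the low
     * element of the destination XMM register, gcc -O2 typically zeroes
     * the register first to break the false dependency:
     *     pxor     %xmm0, %xmm0
     *     cvtsi2sd %edi, %xmm0
     */
    double to_double(int x) {
        return (double)x;
    }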

Again, nobody wants this (or in the rare case they do, could emulate it), but SSE1 was designed back in the Pentium III days when Intel's CPUs handled an XMM register as two 64-bit halves. Zero-extending into the full XMM register would have cost an extra uop on every scalar-float instruction in that core, but packed-float SIMD instructions were already 2 uops each. But this was very short-sighted; it wasn't long before P4 had full-width XMM registers. (Although when they returned to P6 cores after abandoning P4, Pentium-M and Core (not Core2) still had half-width XMM hardware.) Still, Intel's short-term gain for P-III is ongoing long-term pain for compilers, and future CPUs that have to run code with either extra instructions or possible false dependencies.

If you're going to make a whole new decoder for a RISC ISA, you can have it pick and choose parts of x86 instructions to be exposed as RISC instructions. This mitigates the x86-specialization of the core somewhat.

The instruction encoding would probably not be fixed-size, since single uops can hold a lot of data. Much more data than makes sense if all insns are the same size. A single micro-fused uop can add a 32bit immediate and a memory operand that uses an addressing mode with 2 registers and a 32bit displacement. (In SnB and later, only single-register addressing modes can micro-fuse with ALU ops).

uops are very large, and not very similar to fixed-width ARM instructions. A fixed-width 32-bit instruction set can only load 16-bit immediates at a time, so loading a 32-bit address requires a load-immediate-low / load-immediate-high pair. x86 doesn't have to do that, which helps it not be terrible with only 15 GP registers limiting the ability to keep constants around in registers. (15 is a big help over 7 registers, but doubling again to 31 helps a lot less, I think some simulation found. RSP is usually not general purpose, so it's more like 15 GP registers and a stack.)
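For instance (a sketch of mine, not from the answer), materialising an arbitrary 32-bit constant:

    /* x86-64 encodes the constant in one variable-length instruction:
     *     movl $0x12345678, %eax       (5 bytes)
     * A fixed-width 32-bit RISC needs a low-half / high-half pair, e.g.
     * on 32-bit ARM:
     *     movw r0, #0x5678
     *     movt r0, #0x1234
     */
    unsigned big_constant(void) {
        return 0x12345678u;
    }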

TL;DR summary:

Anyway, this answer boils down to "the x86 instruction set is probably the best way to program a CPU that has to be able to run x86 instructions quickly", but hopefully sheds some light on the reasons.

Internal uop formats in the front-end vs. back-end

See also Micro fusion and addressing modes for one case of differences in what the front-end vs. back-end uop formats can represent on Intel CPUs.

Footnote 1: There are some "hidden" registers for use as temporaries by microcode. These registers are renamed just like the x86 architectural registers, so multi-uop instructions can execute out-of-order.

e.g. xchg eax, ecx on Intel CPUs decodes as 3 uops (why?), and our best guess is that these are MOV-like uops that do tmp = ecx; ecx = eax; eax = tmp;. In that order, because I measure the latency of the dst->src direction at ~1 cycle, vs. 2 for the other way. And these move uops aren't like regular mov instructions; they don't seem to be candidates for zero-latency mov-elimination.

See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for a mention of trying to experimentally measure PRF size, and having to account for physical registers used to hold architectural state, including hidden registers.

In the front-end, after the decoders but before the issue/rename stage that renames registers onto the physical register file, the internal uop format uses register numbers similar to x86 register numbers, but with room to address these hidden registers.

The uop format is somewhat different inside the out-of-order core (ROB and RS), aka the back-end (after the issue/rename stage). The int/FP physical register files each have 168 entries in Haswell, so each register field in a uop needs to be wide enough to address that many (at least 8 bits, since 2^7 = 128 < 168).

Since the renamer is there in the HW, we'd probably be better off using it, instead of feeding statically scheduled instructions directly to the back-end. So we'd get to work with a set of registers as large as the x86 architectural registers + microcode temporaries, not more than that.

The back-end is designed to work with a front-end renamer that avoids WAW / WAR hazards, so we couldn't use it like an in-order CPU even if we wanted to. It doesn't have interlocks to detect those dependencies; that's handled by issue/rename.

It might be neat if we could feed uops into the back-end without the bottleneck of the issue/rename stage (the narrowest point in modern Intel pipelines, e.g. 4-wide on Skylake vs. 4 ALU + 2 load + 1 store ports in the back-end). But if you did that, I don't think you can statically schedule code to avoid register reuse and stepping on a result that's still needed if a cache-miss stalled a load for a long time.

So we pretty much need to feed uops to the issue/rename stage, probably only bypassing decode, not the uop cache or IDQ. Then we get normal OoO exec with sane hazard detection. The register allocation table is only designed to rename 16 + a few integer registers onto the 168-entry integer PRF. We couldn't expect the HW to rename a larger set of logical registers onto the same number of physical registers; that would take a larger RAT.


geo

The answer is simple. Intel isn't developing CPUs for developers! They're developing them for the people who make the purchasing decisions, which, BTW, is what every company in the world does!

Intel long ago made the commitment that (within reason, of course) their CPUs would remain backward compatible. People want to know that, when they buy a new Intel-based computer, all of their current software will run exactly the same as it did on their old computer (although, hopefully, faster!).

Furthermore, Intel knows exactly how important that commitment is, because they once tried to go a different way. Exactly how many people do you know with an Itanium CPU?!?

You may not like it, but that one decision, to stay with the x86, is what made Intel one of the most recognizable business names in the world!


I disagree with the insinuation that Intel processors are not developer-friendly. Having programmed PowerPC and x86 for many years, I've come to believe that CISC is much more programmer-friendly. (I work for Intel now, but I made up my mind on this issue before I was hired.)
@Jeff That wasn't my intention at all! The question was, why hasn't Intel opened the RISC instruction set so that developers can use it. I didn't say anything about x86 being non-developer friendly. What I said was that decisions such as this weren't decided with developers in mind, but, rather, were strictly business decisions.
CoolOppo

Intel was the leader for an extremely long time, up until very recently. They had no reason to change their architecture because the iterative changes they could make every year with better internal optimization kept them ahead. That, and AMD--their only real competitor in the space of desktop and server CPUs--also uses x86. So essentially all either of the two ONLY companies in this field has to do is beat the other at optimizing x86 code each year.

Creating a new architecture, and an instruction set to go along with it, is a large risk for a company, because it means giving up their foothold in the x86 optimization race to invest talent in creating a new architecture that will need extensive support from Microsoft and/or Linux to maintain even slight compatibility. Partnering with Microsoft to put binary translation in the Windows OS (a necessity) could be seen as anti-competitive, unless both manufacturers agreed to sign on and work together to create a standard architecture that Microsoft can make its translation layer target.

Apple just recently released their new M1 chips, which are really just ARM, but these are RISC at heart, and what you write in assembly is what runs on the CPU. This took close cooperation between Apple and the manufacturers, something their company has always done pretty well (which has its pros and cons). One thing they're able to do with such strict control over both the software and the hardware is create the exact translation layer needed for the specific hardware they want things to run on.

My prediction is that AMD and Intel will introduce RISC-only CPUs in the near future, because there is no doubt that Apple is going to continue improving its "M" line of chips, creating better and better ahead-of-time optimizations on the compiler/software side to make sure its chips have exactly the code they need when they need it. This approach is clearly better, but like I said before: Intel and AMD were caught in lockstep with one another and couldn't afford to make the move. Now their hands are being forced.

As for the main question of why they hide the internal RISC architecture? I think the question is slightly "off". It's not like they're purposely "hiding" it from you...that implies the intent to keep you away from it. The real reason you don't have access is that it would require significantly more work for them to allow you to use two architectures on the same core. You need two pipelines where code can come in as data. Do you sync up the clocks? Can they interoperate with one another? If they're segregated, do you lose an x86 core and get a RISC core instead? Or can the same core just run both at once? What about potential security vulnerabilities...can we have RISC code interfere with the x86 code in a way that messes with the internal optimizer? I could go on and on, but I think you see my point: it's way too hard to have two architectures available for programming the thing.

That leaves us only one option: we have to choose which architecture we're gonna support. As I have explained way up there somewhere a few paragraphs up, there are quite a few reasons that they can't just deliver a RISC processor. So we're bestowed x86 by our tech overlords.


KOLANICH

Why don't they allow us compile programs so they will bypass CISC instructions and use RISC core directly?

In addition to the previous answers, another reason is market segmentation. Some instructions are thought to be implemented in microcode rather than in hardware, so allowing anyone to execute arbitrary micro-operations could undermine sales of new CPUs with "new", more performant CISC instructions.


I don't think this makes sense. A RISC can use microcode, especially if we're talking about just adding RISC decoders to an x86 frontend.
That's still wrong. The AES new instructions (and the upcoming SHA instructions), and other stuff like PCLMULQDQ have dedicated hardware. On Haswell, AESENC decodes to a single uop (agner.org/optimize), so it's definitely not microcoded at all. (The decoders only need to activate the microcode ROM sequencer for instructions that decode to more than 4 uops.)
You're right that some new instructions do just use existing functionality in a way that isn't available with x86 instructions. A good example would be BMI2 SHLX, which lets you do variable-count shifts without putting the count in CL, and without incurring the extra uops required to handle the crappy x86 flag semantics (flags are unmodified if the shift count is zero, so SHL r/m32, cl has an input dependency on FLAGS, and decodes to 3 uops on Skylake. It was only 1 uop on Core2/Nehalem, though, according to Agner Fog's testing.)
Thank you for your comments.