In the x86-64 Tour of Intel Manuals, I read:
Perhaps the most surprising fact is that an instruction such as MOV EAX, EBX automatically zeroes upper 32 bits of RAX register.
The Intel documentation (3.4.1.1 General-Purpose Registers in 64-Bit Mode in manual Basic Architecture) quoted at the same source tells us:
64-bit operands generate a 64-bit result in the destination general-purpose register. 32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register. 8-bit and 16-bit operands generate an 8-bit or 16-bit result. The upper 56 bits or 48 bits (respectively) of the destination general-purpose register are not modified by the operation. If the result of an 8-bit or 16-bit operation is intended for 64-bit address calculation, explicitly sign-extend the register to the full 64-bits.
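As a rough illustration of the rules quoted above, the following Python sketch models a write to a 64-bit general-purpose register. The function name `write_gpr` is hypothetical, and the 8-bit case only models the low byte (AL-style), not the high-byte registers such as AH.

```python
MASK64 = (1 << 64) - 1

def write_gpr(old: int, value: int, width: int) -> int:
    """Return the new 64-bit register contents after writing `width` bits."""
    if width not in (8, 16, 32, 64):
        raise ValueError("width must be 8, 16, 32 or 64")
    low_mask = (1 << width) - 1
    value &= low_mask
    if width >= 32:
        # 32-bit results are zero-extended; 64-bit results fill the register.
        return value
    # 8- and 16-bit results are merged: the upper bits of `old` are kept.
    return (old & ~low_mask & MASK64) | value

rax = 0xFFFF_FFFF_FFFF_FFFF
print(hex(write_gpr(rax, 7, 32)))  # models mov eax, 7
print(hex(write_gpr(rax, 7, 16)))  # models mov ax, 7
print(hex(write_gpr(rax, 7, 8)))   # models mov al, 7
```

Only the 8-bit and 16-bit branches read `old`, which is exactly the dependency on the previous register value that the answers below discuss.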
In x86-32 and x86-64 assembly, 16 bit instructions such as
mov ax, bx
don't show this kind of "strange" behaviour that the upper word of eax is zeroed.
Thus: what is the reason why this behaviour was introduced? At first glance it seems illogical (but the reason might be that I am used to the quirks of x86-32 assembly).
r32 destination operand zero the high 32, rather than merging. For example, some assemblers will replace pmovmskb r64, xmm with pmovmskb r32, xmm, saving a REX, because the 64-bit destination version behaves identically. Even though the Operation section of the manual lists all 6 combinations of 32/64-bit dest and 64/128/256b source separately, the implicit zero-extension of the r32 form duplicates the explicit zero-extension of the r64 form. I'm curious about the HW implementation...
xor eax,eax or xor r8d,r8d is the best way to zero RAX or R8 (saving a REX prefix for RAX, and 64-bit XOR isn't even handled specially on Silvermont). Related: How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
I'm not AMD or speaking for them, but I would have done it the same way, because zeroing the high half doesn't create a dependency on the previous value that the CPU would have to wait on. The register renaming mechanism would essentially be defeated if it weren't done that way.
This way you can write fast code using 32-bit values in 64-bit mode without having to explicitly break dependencies all the time. Without this behaviour, every single 32-bit instruction in 64-bit mode would have to wait on something that happened before, even though that high part would almost never be used. (Making int 64-bit would waste cache footprint and memory bandwidth; x86-64 most efficiently supports 32 and 64-bit operand sizes.)
The behaviour for 8 and 16-bit operand sizes is the strange one. The dependency madness is one of the reasons that 16-bit instructions are avoided now. x86-64 inherited this from 8086 for 8-bit and 386 for 16-bit, and decided to have 8 and 16-bit registers work the same way in 64-bit mode as they do in 32-bit mode.
See also Why doesn't GCC use partial registers? for practical details of how writes to 8 and 16-bit partial registers (and subsequent reads of the full register) are handled by real CPUs.
It simply saves space in the instructions, and the instruction set. You can move small immediate values to a 64-bit register by using existing (32-bit) instructions.
It also saves you from having to encode 8-byte values for MOV RAX, 42, when MOV EAX, 42 can be reused.
This optimization is not as important for 8 and 16 bit ops (because they are smaller), and changing the rules there would also break old code.
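To make the size argument concrete, here is a small Python sketch that assembles the relevant encodings by hand and prints their lengths. The helper names are made up for illustration; the opcode bytes are the standard encodings for these forms with EAX/RAX as the destination.

```python
import struct

def mov_eax_imm32(imm: int) -> bytes:
    # B8+rd: opcode selecting EAX, followed by a 32-bit immediate.
    return bytes([0xB8]) + struct.pack("<I", imm)

def mov_rax_simm32(imm: int) -> bytes:
    # REX.W prefix, opcode C7, ModRM C0 (register RAX), sign-extended imm32.
    return bytes([0x48, 0xC7, 0xC0]) + struct.pack("<i", imm)

def mov_rax_imm64(imm: int) -> bytes:
    # REX.W prefix, opcode B8+rd, followed by a full 64-bit immediate.
    return bytes([0x48, 0xB8]) + struct.pack("<Q", imm)

XOR_EAX_EAX = bytes([0x31, 0xC0])        # no prefix needed
XOR_RAX_RAX = bytes([0x48, 0x31, 0xC0])  # same opcode plus REX.W

encodings = [
    ("mov eax, 42", mov_eax_imm32(42)),
    ("mov rax, 42 (simm32)", mov_rax_simm32(42)),
    ("mov rax, 42 (imm64)", mov_rax_imm64(42)),
    ("xor eax, eax", XOR_EAX_EAX),
    ("xor rax, rax", XOR_RAX_RAX),
]
for name, enc in encodings:
    print(f"{name:<22} {len(enc):2d} bytes  {enc.hex(' ')}")
```

Because the 32-bit forms zero-extend, the shortest encoding in each group has the same architectural effect on RAX as the longer ones for values that fit in an unsigned 32-bit immediate.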
XOR EAX, EAX because XOR RAX, RAX would need a REX prefix.
[rsi + edx] isn't allowed). Of course, avoiding false dependencies / partial-register stalls (the other answer) is another major reason.
Without zero-extension to 64 bits, an instruction reading from rax would have two dependencies for its rax operand: the instruction that writes to eax and the instruction that wrote to rax before it. This would result in a partial register stall, which gets tricky when there are three possible widths. It therefore helps that writes to rax and eax both write the full register, so the 64-bit instruction set doesn't introduce any new layers of partial renaming.
mov rdx, 1
mov rax, 6
imul rax, rdx
mov rbx, rax
mov eax, 7 ; independent of the imul, so it can execute before imul rax, rdx finishes
mov rdx, rax ; without zero-extension, this would have to wait for both imul rax, rdx and mov eax, 7 before dispatch to the execution units, even though the higher-order bits are identical anyway
The only benefit of not zero-extending is that the higher-order bits of rax are preserved. For instance, if rax originally contains 0xffffffffffffffff, the result of mov eax, 7 would be 0xffffffff00000007. But there's very little reason for the ISA to make this guarantee at such an expense, and it's more likely that the zero-extended value is what is actually wanted, so zero extension saves the extra mov rax, 0. By guaranteeing that the result is always zero-extended to 64 bits, compilers can rely on this axiom, and in mov rdx, rax, rax only has to wait for its single dependency, meaning it can begin execution sooner and retire, freeing up execution units. Furthermore, it also allows for more efficient zero idioms like xor eax, eax to zero rax without requiring a REX byte.
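The dependency argument can be sketched in Python as a toy model. It only asks which earlier writes contribute bits to a later full 64-bit read of the same register; it is not a description of how any real CPU renames registers, and the function name `read_dependencies` is hypothetical.

```python
def read_dependencies(writes, zero_extend_32):
    """Return the writes that a later full 64-bit read depends on.

    writes: list of (label, width) pairs in program order, all targeting
    the same architectural register.
    zero_extend_32: True models x86-64 (32-bit writes replace the whole
    register); False models a hypothetical merge of the upper 32 bits.
    """
    deps = []
    # Walk backwards until we hit a write that replaced the whole register.
    for label, width in reversed(writes):
        deps.append(label)
        if width == 64 or (width == 32 and zero_extend_32):
            break
    return list(reversed(deps))

writes_to_rax = [
    ("mov rax, 6", 64),
    ("imul rax, rdx", 64),
    ("mov eax, 7", 32),
]
print(read_dependencies(writes_to_rax, zero_extend_32=True))
print(read_dependencies(writes_to_rax, zero_extend_32=False))
```

With zero-extension the final read depends only on mov eax, 7; under the hypothetical merging rule it would also depend on the preceding imul.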
cmovbe is 2 uops but cmovb is 1). But no CPU that does any partial-register renaming does it the way you suggest. Instead they insert a merging uop if a partial reg is renamed separately from the full reg (i.e. is "dirty"). See Why doesn't GCC use partial registers? and How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL to make it independent of AH. The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX
I can't find an example of the 'merging uop' that would be used to solve this though, same for a partial flag stall
mov al, [mem] is a micro-fused load+ALU-merge, only renaming AH, and an AH-merging uop still issues alone. The partial-flag merging mechanisms in these CPUs vary, e.g. Core2/Nehalem still just stall for partial-flags, unlike partial-reg.
From a hardware perspective, the ability to update half a register has always been somewhat expensive, but on the original 8088, it was useful to allow hand-written assembly code to treat the 8088 as having either two non-stack-related 16-bit registers and eight 8-bit registers, six non-stack-related 16-bit registers and zero 8-bit registers, or other intermediate combinations of 16-bit and 8-bit registers. Such usefulness was worth the extra cost.
When the 80386 added 32-bit registers, no facilities were provided to access just the top half of a register, but an instruction like ROR ESI,16 would be fast enough that there could still be value in being able to hold two 16-bit values in ESI and switch between them.
With the migration to x64 architecture, the increased register set and other architectural enhancements reduced the need for programmers to squeeze the maximum amount of information into each register. Further, register renaming increased the cost of doing partial register updates. If code were to do something like:
mov rax,[whatever]
mov [something],rax
mov rax,[somethingElse]
mov [yetAnother],rax
register renaming and related logic would make it possible to have the CPU record the fact that the value loaded from [whatever] will need to be written to [something], and then, so long as the last two addresses are different, allow the load of [somethingElse] and store to [yetAnother] to be processed without having to wait for the data to actually be read from [whatever]. If the third instruction were mov eax,[somethingElse], however, and it were specified as leaving the upper bits unaffected, the fourth instruction couldn't store RAX until the first load was completed, and even allowing the load of EAX to occur would be difficult, since the processor would have to keep track of the fact that while the lower half was available, the upper half wasn't.
mov eax, 1 (opcode + imm32) work as a way to set the full 64-bit register, instead of needing 7-byte mov rax, sign_extended_imm32 (REX + opcode + modrm + imm32) or 10-byte mov rax, imm64 (REX + opcode + imm64). And many other cases where zero-extending for free is useful, e.g. when using an unsigned 32-bit integer as an array index (part of an addressing mode), or a signed integer that's known to be non-negative.
int, 64-bit pointers). Related: MOVZX missing 32 bit register to 64 bit register - some ISAs like MIPS64 have made different choices, like keeping narrow values sign-extended.
imul slower on K8 (64-bit multiply wasn't as fast as 32), unless that also set the operand-size and truncated / extended the result from 32-bit to fill a reg.) But comments here aren't the place to discuss further :/