
Why does Windows64 use a different calling convention from all other OSes on x86-64?

AMD has an ABI specification that describes the calling convention to use on x86-64. All OSes follow it, except for Windows, which has its own x86-64 calling convention. Why?

Does anyone know the technical, historical, or political reasons for this difference, or is it purely a matter of NIH syndrome?

I understand that different OSes may have different needs for higher level things, but that doesn't explain why for example the register parameter passing order on Windows is rcx - rdx - r8 - r9 - rest on stack while everyone else uses rdi - rsi - rdx - rcx - r8 - r9 - rest on stack.

P.S. I am aware of how these calling conventions differ generally and I know where to find details if I need to. What I want to know is why.

Edit: for the how, see e.g. the wikipedia entry and links from there.

Well, just for the first register: rcx: ecx was the "this" parameter for the msvc __thiscall x86 convention. So probably just to ease porting their compiler to x64, they started with rcx as the first. That everything else would then be different too was just a consequence of that initial decision.
@Chris: I've added a reference to the AMD64 ABI supplement document (and some explanations what it actually is) below.
I haven't found a rationale from MS but I found some discussion here

Community

Choosing four argument registers on x64 - common to UN*X / Win64

One of the things to keep in mind about x86 is that the register name to "reg number" encoding is not obvious; in terms of instruction encoding (the MOD R/M byte, see http://www.c-jump.com/CIS77/CPU/x86/X77_0060_mod_reg_r_m_byte.htm), register numbers 0...7 are - in that order - ?AX, ?CX, ?DX, ?BX, ?SP, ?BP, ?SI, ?DI.
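The numbering described above can be written down as a small lookup table. This is only an illustrative sketch: the helper names (reg_name, needs_rex_extension, modrm_field) are made up for this example and come from no real library. It also shows why R8..R15 cost an extra prefix byte: their numbers do not fit into a 3-bit ModR/M field, so the fourth bit has to travel in a REX prefix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hardware register numbers 0..15 as used in x86-64 instruction encoding.
 * Note that the order of the first eight is A, C, D, B, not alphabetical. */
static const char *const REG_NAMES[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

/* Hypothetical helper: name of the 64-bit register with this number. */
const char *reg_name(int number)
{
    assert(number >= 0 && number < 16);
    return REG_NAMES[number];
}

/* Registers 8..15 need a fourth encoding bit, which lives in a REX prefix. */
bool needs_rex_extension(int number)
{
    assert(number >= 0 && number < 16);
    return number >= 8;
}

/* The low three bits are what goes into the ModR/M reg or r/m field. */
int modrm_field(int number)
{
    assert(number >= 0 && number < 16);
    return number & 7;
}
```

With this table, numbers 0, 1 and 2 map to rax, rcx and rdx, which is the A/C/D ordering the answer refers to.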

Hence choosing A/C/D (regs 0..2) for return value and the first two arguments (which is the "classical" 32bit __fastcall convention) is a logical choice. As far as going to 64bit is concerned, the "higher" regs are ordered, and both Microsoft and UN*X/Linux went for R8 / R9 as the first ones.

Keeping that in mind, Microsoft's choice of RAX (return value) and RCX, RDX, R8, R9 (arg[0..3]) is an understandable selection if you choose four registers for arguments.

I don't know why the AMD64 UN*X ABI chose RDX before RCX.

Choosing six argument registers on x64 - UN*X specific

UN*X, on RISC architectures, has traditionally done argument passing in registers - specifically, for the first six arguments (that's so on PPC, SPARC, MIPS at least). Which might be one of the major reasons why the AMD64 (UN*X) ABI designers chose to use six registers on that architecture as well.

So if you want six registers to pass arguments in, and it's logical to choose RCX, RDX, R8 and R9 for four of them, which other two should you pick ?

The "higher" regs require an additional instruction prefix byte to select them and therefore have a bigger instruction size footprint, so you wouldn't want to choose any of those if you have options. Of the classical registers, due to the implicit meaning of RBP and RSP these aren't available, and RBX traditionally has a special use on UN*X (global offset table) which seemingly the AMD64 ABI designers didn't want to needlessly become incompatible with.
Ergo, the only choices were RSI / RDI.

So if you have to take RSI / RDI as argument registers, which arguments should they be ?

Making them arg[0] and arg[1] has some advantages. See cHao's comment.
?SI and ?DI are string instruction source / destination operands, and as cHao mentioned, their use as argument registers means that with the AMD64 UN*X calling conventions, the simplest possible strcpy() function, for example, only consists of the two CPU instructions repz movsb; ret because the source/target addresses have been put into the correct registers by the caller. There is, particularly in low-level and compiler-generated "glue" code (think, for example, some C++ heap allocators zero-filling objects on construction, or the kernel zero-filling heap pages on sbrk(), or copy-on-write pagefaults) an enormous amount of block copy/fill, hence it'll be useful for code so frequently used to save the two or three CPU instructions that'd otherwise load such source/target address arguments into the "correct" registers.
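To make the string-instruction point concrete, here is a hedged sketch of a byte copy built on rep movsb, using GCC/Clang inline assembly. The function name copy_bytes is an assumption for this example. As a later comment in this thread notes, rep movsb also needs the byte count in rcx, so the routine that fits this pattern is memcpy rather than strcpy. The sketch assumes the direction flag is clear, as the ABIs require at function entry, and falls back to a plain loop on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memcpy-like helper: copies n bytes from src to dst. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    void *d = dst;
    /* "D" = rdi/edi (destination), "S" = rsi/esi (source), "c" = rcx/ecx (count).
     * All three are updated by the instruction, hence the "+" constraints. */
    __asm__ volatile ("rep movsb"
                      : "+D"(d), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return dst;
}
```

Under the AMD64 UN*X convention the caller has already placed dst in rdi and src in rsi, so only the count has to be moved into rcx before the rep movsb.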

So in a way, UN*X and Win64 are only different in that UN*X "prepends" two additional arguments, in purposefully chosen RSI/RDI registers, to the natural choice of four arguments in RCX, RDX, R8 and R9.
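The two register sequences can be put side by side in a few lines of C. This is a sketch for illustration only: the enum and the function name int_arg_register are invented here, and it covers integer and pointer arguments only (floating-point arguments go in XMM registers under both conventions and are ignored).

```c
#include <assert.h>
#include <stddef.h>

enum abi { ABI_WIN64, ABI_SYSV };

/* Hypothetical helper: register that carries the integer/pointer argument
 * with the given 0-based index, or NULL if it is passed on the stack. */
const char *int_arg_register(enum abi abi, int index)
{
    static const char *const win64[] = { "rcx", "rdx", "r8", "r9" };
    static const char *const sysv[]  = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };

    assert(index >= 0);
    if (abi == ABI_WIN64)
        return index < 4 ? win64[index] : NULL;
    return index < 6 ? sysv[index] : NULL;
}
```

For example, argument index 2 lands in r8 on Win64 but in rdx on UN*X, and index 4 is already on the stack on Win64 while UN*X still has r8 for it.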

Beyond that ...

There are more differences between the UN*X and Windows x64 ABIs than just the mapping of arguments to specific registers. For the overview on Win64, check:

http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx

Win64 and AMD64 UN*X also differ strikingly in the way stack space is used. On Win64, for example, the caller must allocate stack space for function arguments even though args 0...3 are passed in registers. On UN*X, on the other hand, a leaf function (i.e. one that doesn't call other functions) is not required to allocate stack space at all if it needs no more than 128 bytes of it (yes, you own and can use a certain amount of stack without allocating it ... well, unless you're kernel code, a source of nifty bugs). All these are particular optimization choices; most of the rationale for them is explained in the full ABI references that the original poster's Wikipedia reference points to.
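The stack-space difference can be expressed as simple arithmetic. The sketch below uses invented names and makes two simplifying assumptions: every argument is an integer or pointer that fits one 8-byte slot, and alignment padding is ignored.

```c
#include <assert.h>

enum {
    SLOT_BYTES          = 8,   /* one stack slot per argument */
    WIN64_HOME_SLOTS    = 4,   /* slots reserved for the register args */
    SYSV_INT_ARG_REGS   = 6,   /* integer args passed in registers */
    SYSV_RED_ZONE_BYTES = 128  /* usable below rsp without adjusting it */
};

/* Win64: the caller always reserves room for four slots, even if the
 * callee takes fewer arguments; further arguments follow them. */
int win64_outgoing_arg_bytes(int nargs)
{
    assert(nargs >= 0);
    int slots = nargs > WIN64_HOME_SLOTS ? nargs : WIN64_HOME_SLOTS;
    return slots * SLOT_BYTES;
}

/* AMD64 UN*X: only arguments beyond the sixth take stack space. */
int sysv_outgoing_arg_bytes(int nargs)
{
    assert(nargs >= 0);
    return nargs > SYSV_INT_ARG_REGS ? (nargs - SYSV_INT_ARG_REGS) * SLOT_BYTES : 0;
}

/* AMD64 UN*X: a leaf function whose locals fit the red zone need not move rsp. */
int sysv_leaf_fits_red_zone(int local_bytes)
{
    assert(local_bytes >= 0);
    return local_bytes <= SYSV_RED_ZONE_BYTES;
}
```

Under these assumptions a two-argument call costs the caller 32 bytes of stack on Win64 and none on UN*X.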


About register names: That prefix byte may be a factor. But then it would be more logical for MS to choose rcx - rdx - rdi - rsi as argument registers. But the numerical value of the first eight could guide you if you're designing an ABI from scratch, but there's no reason to change them if a perfectly fine ABI already exists, that only leads to more confusion.
On RSI/RDI: These instructions will usually be inlined, in which case calling convention doesn't matter. Otherwise, there's only one copy (or maybe a few) of that function systemwide, so it only saves a handful of bytes in total. Not worth it. On other differences / call stack: The usefulness of specific choices is explained in the ABI references, but they don't make a comparison. They don't tell why other optimizations were not chosen - e.g. why doesn't Windows have the 128 byte red zone, and why doesn't the AMD ABI have the extra stack slots for arguments?
@Somejan: Win64 and Win32 __fastcall are 100% identical for the case of having no more than two arguments no larger than 32bit and returning a value no larger than 32bit. That's not a small class of functions. No such backward compatibility at all is possible between the UN*X ABIs for i386 / amd64.
Why is RDX passed before RCX in the System V ABI? strcpy is not 2 instructions then but 3 (plus a mov rcx, rdx)?
@szx: I just found the relevant mailing list thread from Nov 2000, and posted an answer summarizing the reasoning. Note that it's memcpy that could be implemented that way, not strcpy.
Peter Cordes

IDK why Windows did what they did. See the end of this answer for a guess. I was curious about how the SysV calling convention was decided on, so I dug into the mailing list archive and found some neat stuff.

It's interesting reading some of those old threads on the AMD64 mailing list, since AMD architects were active on it. e.g. Choosing register names was one of the hard parts: AMD considered renaming the original 8 registers r0-r7, or calling the new registers UAX etc.

Also, feedback from kernel devs identified things that made the original design of syscall and swapgs unusable. That's how AMD updated the instruction to get this sorted out before releasing any actual chips. It's also interesting that in late 2000, the assumption was that Intel probably wouldn't adopt AMD64.

The SysV (Linux) calling convention, and the decision on how many registers should be callee-preserved vs. caller-save, was made initially in Nov 2000, by Jan Hubicka (a gcc developer). He compiled SPEC2000 and looked at code size and number of instructions. That discussion thread bounces around some of the same ideas as answers and comments on this SO question. In a 2nd thread, he proposed the current sequence as optimal and hopefully final, generating smaller code than some alternatives.

He's using the term "global" to mean call-preserved registers, that have to be push/popped if used.

The choice of rdi, rsi, rdx as the first three args was motivated by:

minor code-size saving in functions that call memset or other C string function on their args (where gcc inlines a rep string operation?)

rbx is call-preserved because having two call-preserved regs accessible without REX prefixes (rbx and rbp) is a win. Presumably chosen because they're the only "legacy" registers that aren't implicitly used by any common instruction. (rep string, shift count, and mul/div outputs/inputs touch everything else).

None of the registers that common instructions force you to use are call-preserved (see prev point), so a function that wants to use a variable-count shift or division might have to move function args somewhere else, but doesn't have to save/restore the caller's value. cmpxchg16b and cpuid need RBX, but are rarely used so not a big factor. (cmpxchg16b wasn't part of original AMD64, but RBX would still have been the obvious choice. cmpxchg8b exists but was obsoleted by qword cmpxchg)

Quoting the proposal: "We are trying to avoid RCX early in the sequence, since it is register used commonly for special purposes, like EAX, so it has same purpose to be missing in the sequence. Also it can't be used for syscalls and we would like to make syscall sequence to match function call sequence as much as possible."

(background: syscall / sysret unavoidably destroy rcx (with rip) and r11 (with RFLAGS), so the kernel can't see what was originally in rcx when syscall ran.)

The kernel system-call ABI was chosen to match the function call ABI, except for r10 instead of rcx, so libc wrapper functions like mmap(2) can just mov %rcx, %r10 / mov $0x9, %eax / syscall.
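A rough sketch of what such a wrapper does, written with GCC/Clang inline assembly. The names raw_syscall6 and map_anonymous_page are invented for this example, and this is not how any particular libc implements it. The raw path only applies on x86-64 Linux; elsewhere the sketch simply calls the regular mmap so that it still builds. Syscall number 9 is the mmap number from the mov $0x9, %eax above.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#if defined(__linux__) && defined(__x86_64__)
/* Hypothetical helper: issue a six-argument system call directly. */
static long raw_syscall6(long nr, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    long ret;
    register long r10 __asm__("r10") = a4; /* 4th arg goes in r10, not rcx */
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory"); /* syscall clobbers rcx, r11 */
    return ret;
}

void *map_anonymous_page(size_t length)
{
    assert(length > 0);
    long ret = raw_syscall6(9 /* mmap on x86-64 Linux */, 0, (long)length,
                            PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* The raw kernel interface reports errors as small negative values. */
    return (ret < 0 && ret >= -4095) ? MAP_FAILED : (void *)ret;
}
#else
void *map_anonymous_page(size_t length)
{
    assert(length > 0);
    return mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
#endif
```

Apart from the fourth argument, every value sits in the same register a normal function call would use (rdi, rsi, rdx, r8, r9), which is the match between the two sequences that the design aimed for.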

Note that the SysV calling convention used by i386 Linux sucks compared to Windows' 32-bit __vectorcall. It passes everything on the stack, and only returns in edx:eax for int64, not for small structs. It's no surprise little effort was made to maintain compatibility with it. When there was no reason not to, they did things like keeping rbx call-preserved, since they decided that having another call-preserved register among the original 8 (which don't need a REX prefix) was good.

Making the ABI optimal is much more important long-term than any other consideration. I think they did a pretty good job. I'm not totally sure about returning structs packed into registers, instead of different fields in different regs. I guess code that passes them around by value without actually operating on the fields wins this way, but the extra work of unpacking seems silly. They could have had more integer return registers, more than just rdx:rax, so returning a struct with 4 members could return them in rdi, rsi, rdx, rax or something.

They considered passing integers in vector regs, because SSE2 can operate on integers. Fortunately they didn't do that. Integers are used as pointer offsets very often, and a round-trip to stack memory is pretty cheap. Also SSE2 instructions take more code bytes than integer instructions.

I suspect Windows ABI designers might have been aiming to minimize differences between 32 and 64bit for the benefit of people that have to port asm from one to the other, or that can use a couple #ifdefs in some ASM so the same source can more easily build a 32 or 64bit version of a function.

Minimizing changes in the toolchain seems unlikely. An x86-64 compiler needs a separate table of which register is used for what, and what the calling convention is. Having a small overlap with 32bit is unlikely to produce significant savings in toolchain code size / complexity.


I think I have read somewhere on Raymond Chen's blog about the rationale for choosing those registers after benchmarking from MS side but I can't find it anymore. However, some reasons regarding the home zone were explained here: blogs.msdn.microsoft.com/oldnewthing/20160623-00/?p=93735 blogs.msdn.microsoft.com/freik/2006/03/06/…
@phuclv: See also Is it valid to write below ESP?. Raymond's comments on my answer there pointed out some SEH details I didn't know which explain why x86 32/64 Windows doesn't currently have a de-facto red zone. His blog post has some plausible cases for the same code page-in handler possibility I mentioned in that answer :) So yeah, Raymond did a better job of explaining it than I did (unsurprisingly because I started from knowing very little about Windows), and the table of red-zone sizes for non-x86 is really neat.
@PeterCordes 'Presumably chosen because it's the only other reg that isn't implicitly used by any instruction' Which are the registers that are not implicitly used by any instructions in r0-r7? I thought none, that's why they have special names like rax, rcx etc.
@SouravKannanthaB: yes, all the legacy registers have some implicit uses. (Why are rbp and rsp called general purpose registers?) What I really meant to say is that there are no common instructions you'd want to use for other reasons (like shl rax, cl, mul) that require you to use RBX or RBP. Only cmpxchg16b and cpuid need RBX, and RBP is only used implicitly by leave (and the unusably-slow enter instruction). So for RBP, the only implicit uses are just manipulating RBP, and not something you'd want if not using it as a frame pointer.
Michael Burr

Remember that Microsoft was initially "officially noncommittal toward the early AMD64 effort" (from "A History of Modern 64-bit Computing" by Matthew Kerner and Neil Padgett) because they were strong partners with Intel on the IA64 architecture. I think that this meant that even if they would have otherwise been open to working with GCC engineers on an ABI to use both on Unix and Windows, they wouldn't have done so as it would mean publicly supporting the AMD64 effort when they hadn't yet officially done so (and would have probably upset Intel).

On top of that, back in those days Microsoft had absolutely no leanings toward being friendly with open source projects. Certainly not Linux or GCC.

So why would they have cooperated on an ABI? I'd guess that the ABIs are different simply because they were designed at more or less the same time and in isolation.

Another quote from "A History of Modern 64-bit Computing":

In parallel with the Microsoft collaboration, AMD also engaged the open source community to prepare for the chip. AMD contracted with both Code Sorcery and SuSE for tool chain work (Red Hat was already engaged by Intel on the IA64 tool chain port). Russell explained that SuSE produced C and FORTRAN compilers, and Code Sorcery produced a Pascal compiler. Weber explained that the company also engaged with the Linux community to prepare a Linux port. This effort was very important: it acted as an incentive for Microsoft to continue to invest in the AMD64 Windows effort, and also ensured that Linux, which was becoming an important OS at the time, would be available once the chips were released. Weber goes so far as to say that the Linux work was absolutely crucial to AMD64’s success, because it enabled AMD to produce an end-to-end system without the help of any other companies if necessary. This possibility ensured that AMD had a worst-case survival strategy even if other partners backed out, which in turn kept the other partners engaged for fear of being left behind themselves.

This indicates that even AMD didn't feel that cooperation was necessarily the most important thing between MS and Unix, but that having Unix/Linux support was very important. Maybe even trying to convince one or both sides to compromise or cooperate wasn't worth the effort or risk(?) of irritating either of them? Perhaps AMD thought that even suggesting a common ABI might delay or derail the more important objective of simply having software support ready when the chip was ready.

Speculation on my part, but I think the major reason the ABIs are different was the political reason that MS and the Unix/Linux sides just didn't work together on it, and AMD didn't see that as a problem.


Nice perspective on the politics. I agree that it's not AMD's fault or responsibility. I blame Microsoft for choosing a worse calling convention. If their calling convention had turned out to be better, I'd have some sympathy, but they had to change from their initial ABI to __vectorcall because passing __m128 on the stack sucked. Having call-preserved semantics for the low 128b of some of the vector regs is also weird (partly Intel's fault for not designing an extensible save/restore mechanism with SSE originally, and still not with AVX.)
I don't really have any expertise or knowledge of how good the ABIs are. I just occasionally need to know what they are so I can understand/debug at the assembly level.
A good ABI minimizes code size and number of instructions, and keeps dependency chains low-latency by avoiding extra round-trips through memory. (for args, or for locals that need to be spilled/reloaded). There are tradeoffs. SysV's red-zone takes a couple extra instructions in one place (the kernel's signal-handler dispatcher), for a relatively large benefit for leaf functions of not having to adjust the stack pointer to get some scratch space. So that's a clear win with near-zero downside. It was adopted with pretty much no discussion after it was proposed for SysV.
@dgnuff: Right, that's the answer to Why can't kernel code use a Red Zone. Interrupts use the kernel stack, not the user-space stack, even if they arrive when the CPU is running user-space code. The kernel doesn't trust user-space stacks because another thread in the same user-space process could modify it, thus taking over control of the kernel!
@DavidA.Gray: yeah, the ABI doesn't say you have to use RBP as a frame pointer so optimized code usually doesn't (except in functions that use alloca or a few other cases). This is normal if you're used to gcc -fomit-frame-pointer being the default on Linux. The ABI defines stack-unwind metadata that allows exception handling to still work. (I assume it works something like GNU/Linux x86-64 System V's CFI stuff in .eh_frame). gcc -fomit-frame-pointer has been the default (with optimization enabled) since forever on x86-64, and other compilers (like MSVC) do the same thing.
cHao

Win32 has its own uses for ESI and EDI, and requires that they not be modified (or at least that they be restored before calling into the API). I'd imagine 64-bit code does the same with RSI and RDI, which would explain why they're not used to pass function arguments around.

I couldn't tell you why RCX and RDX are switched, though.


All calling conventions have some registers designated as scratch and some as preserved like ESI/EDI and RSI/RDI on Win64. But those are general purpose registers, Microsoft could have chosen without a problem to use them differently.
@Somejan: Sure, if they wanted to rewrite the whole API and have two different OSes. I wouldn't call that "without a problem", though. For dozens of years now, MS has made certain promises about what it will and won't do with x86 registers, and they've been more or less consistent and compatible all that time. They're not gonna toss all that out the window just because of some edict from AMD, especially one so arbitrary and outside the realm of "building a processor".
@Somejan: The AMD64 UN*X ABI was always exactly that - a UNIX-specific piece. The document, x86-64.org/documentation/abi.pdf, is titled System V Application Binary Interface, AMD64 Architecture Processor Supplement for a reason. The (common) UNIX ABIs (a multi-volume collection, sco.com/developers/devspecs) leave a section for processor-specific chapter 3 - the Supplement - which are the function calling conventions and data layout rules for a specific processor.
@Somejan: Microsoft Windows has never attempted to be particularly close to UN*X, and when it came to porting Windows to x64/AMD64 they simply chose to extend their own __fastcall calling convention. You claim Win32/Win64 aren't compatible, but then, look closely: For a function that takes two 32bit args and returns 32bit, Win64 and Win32 __fastcall actually are 100% compatible (same regs for passing two 32bit args, same return value). Even some binary(!) code may work in both operating modes. The UNIX side completely broke with "old ways". For good reasons, but a break is a break.
@Olof: It's more than just a compiler thing. I had issues with ESI and EDI when I did standalone stuff in NASM. Windows definitely cares about those registers. But yes, you can use them if you save them before you do and restore them before Windows needs them.
