Those familiar with x86 assembly programming are very used to the typical function prologue / epilogue:
push ebp ; Save old frame pointer.
mov ebp, esp ; Point frame pointer to top-of-stack.
sub esp, [size of local variables]
...
mov esp, ebp ; Remove stack space for locals.
pop ebp ; Restore old frame pointer.
ret
This same sequence of code can also be implemented with the ENTER and LEAVE instructions:
enter [size of local variables], 0
...
leave
ret
The ENTER instruction's second operand is the nesting level, which allows multiple parent frames to be accessed from the called function.
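To make the nesting level concrete, here is a rough C model of what ENTER and LEAVE do to the stack, following the operation pseudocode in Intel's instruction set reference. The names (struct machine, sim_enter, sim_leave) are made up for illustration, and the stack is modelled as an array of 32-bit slots indexed by slot number rather than by byte address, so treat it as a sketch under those assumptions and not as an emulator.

```c
#include <stdint.h>

enum { STACK_WORDS = 64 };  /* hypothetical stack size, in 32-bit slots */

/* Hypothetical model: sp and bp are slot indices; the stack grows downward. */
struct machine {
    uint32_t mem[STACK_WORDS];
    uint32_t sp;
    uint32_t bp;
};

static void push(struct machine *m, uint32_t v)
{
    m->mem[--m->sp] = v;  /* no overflow checks in this sketch */
}

/* Models "enter alloc, level", with alloc counted in slots instead of bytes. */
void sim_enter(struct machine *m, uint32_t alloc_words, unsigned level)
{
    level %= 32;                      /* the hardware masks the level to 0..31 */
    push(m, m->bp);                   /* save the caller's frame pointer */
    uint32_t frame_temp = m->sp;      /* this becomes the new frame pointer */

    if (level > 0) {
        /* Copy level-1 enclosing frame pointers from the caller's display. */
        for (unsigned i = 1; i < level; i++) {
            m->bp -= 1;
            push(m, m->mem[m->bp]);
        }
        push(m, frame_temp);          /* finally, a pointer to our own frame */
    }

    m->bp = frame_temp;
    m->sp -= alloc_words;             /* reserve space for locals */
}

/* Models "leave": mov esp, ebp ; pop ebp. */
void sim_leave(struct machine *m)
{
    m->sp = m->bp;
    m->bp = m->mem[m->sp++];
}
```

With an outer function entered via sim_enter(&m, 2, 1) and a nested one via sim_enter(&m, 1, 2), the slot m.mem[m.bp - 1] of the nested frame holds the outer function's frame pointer. That copied entry is what the nesting level buys you: the nested function can reach its parent's locals through it.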
This is not used in C because there are no nested functions; local variables have only the scope of the function they're declared in. This construct does not exist (although sometimes I wish it did):
void func_a(void)
{
    int a1 = 7;
    void func_b(void)
    {
        printf("a1 = %d\n", a1); /* a1 inherited from func_a() */
    }
    func_b();
}
Python however does have nested functions which behave this way:
def func_a():
    a1 = 7
    def func_b():
        print('a1 = %d' % a1)  # a1 inherited from func_a()
    func_b()
Of course Python code isn't translated directly to x86 machine code, and thus would be unable (unlikely?) to take advantage of this instruction.
Are there any languages which compile to x86 and provide nested functions? Are there compilers which will emit an ENTER instruction with a nonzero second operand?
Intel invested a nonzero amount of time/money into that nesting level operand, and basically I'm just curious if anyone uses it :-)
References:
Intel® 64 and IA-32 Architectures Software Developer’s Manual Vol 2: Instruction Set Reference
NASM Manual - ENTER: Create Stack Frame
grep-ing gcc-4.8.2/gcc/config/i386/i386.c:10339 shows that GCC simply never emits ENTER at all nowadays. And the comment at that line is quite clear: /* Note: AT&T enter does NOT have reversed args. Enter is probably slower on all targets. Also sdb doesn't like it. */
git log -p on their cvs->svn->git converted repository shows that it already existed in the initial check-in in 1992.
llvm/lib/Target/X86/X86FrameLowering.cpp:355 has a comment for the emitPrologue() method which reads in part: ; Spill general-purpose registers [for all callee-saved GPRs] pushq %<reg> [if not needs FP] .cfi_def_cfa_offset (offset from RETADDR) .seh_pushreg %<reg>. There are no mentions of ENTER, only pushes, and the enum constant for x86 ENTER occurs only 3 times in all of LLVM. It doesn't even look as though they have testcases for it.
enter is avoided in practice as it performs quite poorly - see the answers at "enter" vs "push ebp; mov ebp, esp; sub esp, imm" and "leave" vs "mov esp, ebp; pop ebp". There are a bunch of x86 instructions that are obsolete but are still supported for backwards compatibility reasons - enter is one of those. (leave is OK though, and compilers are happy to emit it.)
Implementing nested functions in full generality as in Python is actually a considerably more interesting problem than simply selecting a few frame management instructions - search for 'closure conversion' and 'upwards/downwards funarg problem' and you'll find many interesting discussions.
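As a rough illustration of closure conversion, the sketch below hand-translates a nested function that captures a1 into plain C: the free variable moves into an environment struct, and the function takes that environment as an explicit extra parameter. All names here (struct env, struct closure, make_adder, call_closure, free_closure) are invented for the example. It assumes the closure may outlive the function that created it (the upward funarg case), so the environment lives on the heap; it is not meant to show how any particular compiler does it.

```c
#include <stdlib.h>

/* Environment: the variables captured from the enclosing function. */
struct env {
    int a1;
};

/* A closure pairs the converted code with the environment it runs against. */
struct closure {
    int (*code)(const struct env *env, int arg);
    struct env *env;
};

/* The nested function after conversion: the free variable a1 is now env->a1. */
static int add_a1(const struct env *env, int arg)
{
    return env->a1 + arg;
}

/* Upward funarg: the closure is returned, so env cannot live in this frame. */
struct closure make_adder(int a1)
{
    struct env *env = malloc(sizeof *env);
    if (env == NULL)
        abort();  /* keep the sketch simple: no recovery path */
    env->a1 = a1;
    return (struct closure){ add_a1, env };
}

int call_closure(struct closure c, int arg)
{
    return c.code(c.env, arg);
}

void free_closure(struct closure c)
{
    free(c.env);
}
```

For example, call_closure(make_adder(7), 3) evaluates to 10. In the downward-only case, where the nested function never escapes, the environment can stay in the parent's stack frame, which is the situation a frame-pointer display is suited to.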
Note that the x86 was originally designed as a Pascal machine, which is why there are instructions to support nested functions (enter, leave), the pascal calling convention in which the callee pops a known number of arguments from the stack (ret K), bounds checking (bound), and so on. Many of these operations are now obsolete.
As Iwillnotexist Idonotexist pointed out, GCC does support nested functions in C, using the exact syntax I've shown above.
However, it does not use the ENTER instruction. Instead, variables which are used in nested functions are grouped together in the local variables area, and a pointer to this group is passed to the nested function. Interestingly, this "pointer to parent variables" is passed via a nonstandard mechanism: on x64 it is passed in r10, and on x86 (cdecl) it is passed in ecx, which is reserved for the this pointer in C++ (which doesn't support nested functions anyway).
#include <stdio.h>

void func_a(void)
{
    int a1 = 0x1001;
    int a2=2, a3=3, a4=4;
    int a5 = 0x1005;

    void func_b(int p1, int p2)
    {
        /* Use variables from func_a() */
        printf("a1=%d a5=%d\n", a1, a5);
    }

    func_b(1, 2);
}

int main(void)
{
    func_a();
    return 0;
}
Produces the following (snippet of) code when compiled for 64-bit:
00000000004004dc <func_b.2172>:
4004dc: push rbp
4004dd: mov rbp,rsp
4004e0: sub rsp,0x10
4004e4: mov DWORD PTR [rbp-0x4],edi
4004e7: mov DWORD PTR [rbp-0x8],esi
4004ea: mov rax,r10 ; ptr to calling function "shared" vars
4004ed: mov ecx,DWORD PTR [rax+0x4]
4004f0: mov eax,DWORD PTR [rax]
4004f2: mov edx,eax
4004f4: mov esi,ecx
4004f6: mov edi,0x400610
4004fb: mov eax,0x0
400500: call 4003b0 <printf@plt>
400505: leave
400506: ret
0000000000400507 <func_a>:
400507: push rbp
400508: mov rbp,rsp
40050b: sub rsp,0x20
40050f: mov DWORD PTR [rbp-0x1c],0x1001
400516: mov DWORD PTR [rbp-0x4],0x2
40051d: mov DWORD PTR [rbp-0x8],0x3
400524: mov DWORD PTR [rbp-0xc],0x4
40052b: mov DWORD PTR [rbp-0x20],0x1005
400532: lea rax,[rbp-0x20] ; Pass a1, a5 to the nested function
400536: mov r10,rax ; in r10 !
400539: mov esi,0x2
40053e: mov edi,0x1
400543: call 4004dc <func_b.2172>
400548: leave
400549: ret
Output from objdump --no-show-raw-insn -d -Mintel
This would be equivalent to something more verbose like this:
#include <stdio.h>

struct func_a_ctx
{
    int a1, a5;
};

void func_b(struct func_a_ctx *ctx, int p1, int p2)
{
    /* Use variables from func_a() */
    printf("a1=%d a5=%d\n", ctx->a1, ctx->a5);
}

void func_a(void)
{
    int a2=2, a3=3, a4=4;
    struct func_a_ctx ctx = {
        .a1 = 0x1001,
        .a5 = 0x1005,
    };

    func_b(&ctx, 1, 2);
}
gcc -O0 does. It's probably rare for gcc not to inline a nested function with optimization enabled. Although maybe if there are many call-sites in the outer function... (especially if you optimize for size with -Os.)
Our PARLANSE compiler (for fine-grain parallel programs on SMP x86) has lexical scoping.
PARLANSE tries to generate many, many small parallel grains of computation, and then multiplexes them on top of threads (1 per CPU). In fact, the stack frames are heap allocated; we didn't want to pay the price of a "big stack" for each grain since we have many, and we didn't want to put a limit on how deep anything could recurse. Because of parallel forks, the stack is actually a cactus stack.
Each procedure, on entry, builds a lexical display to enable access to surrounding lexical scopes. We considered using the ENTER instruction, but decided against it for two reasons:
As others have noted, it isn't particularly fast. MOV instructions do just as well.
We observed that the display is often sparse, and tends to be denser on the lexically deeper side. Most internal helper functions do fine with access only to their direct lexical parent; you don't always need access to all of your parents. Sometimes none.
Consequently, the compiler figures out exactly which lexical scopes a function needs access to, and generates, in the function prolog where ENTER would go, just the MOV instructions to copy the part of the parent's display that is actually needed. That often turns out to be 1 or 2 pairs of moves.
So we win twice on performance over using ENTER.
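The difference between an ENTER-style full display copy and the sparse copy described above can be sketched in C. This is only a model under stated assumptions: struct frame, MAX_DEPTH, build_full_display, build_sparse_display and the needed_mask bitmask are hypothetical names for illustration, not PARLANSE's actual data structures, and each pointer copy stands in for one load/store pair of MOVs.

```c
#include <stddef.h>

enum { MAX_DEPTH = 4 };  /* hypothetical maximum lexical nesting depth */

/* Hypothetical frame: display[d] points at the enclosing frame at depth d. */
struct frame {
    struct frame *display[MAX_DEPTH];
    int locals[4];
};

/* ENTER-style: copy every entry of the parent's display, needed or not.
 * Returns the number of pointer copies performed. */
size_t build_full_display(struct frame *callee, struct frame *parent,
                          size_t parent_depth)
{
    size_t copies = 0;
    for (size_t d = 0; d < parent_depth && d < MAX_DEPTH; d++) {
        callee->display[d] = parent->display[d];
        copies++;
    }
    if (parent_depth < MAX_DEPTH) {
        callee->display[parent_depth] = parent;  /* the direct parent itself */
        copies++;
    }
    return copies;
}

/* Sparse: copy only the depths whose bit is set in needed_mask, which the
 * compiler would know statically. Unneeded slots are left untouched. */
size_t build_sparse_display(struct frame *callee, struct frame *parent,
                            size_t parent_depth, unsigned needed_mask)
{
    size_t copies = 0;
    for (size_t d = 0; d <= parent_depth && d < MAX_DEPTH; d++) {
        if (!(needed_mask & (1u << d)))
            continue;
        callee->display[d] = (d == parent_depth) ? parent : parent->display[d];
        copies++;
    }
    return copies;
}
```

A helper at depth 2 that only touches its direct parent would be built with needed_mask = 1u << 1 and costs a single copy, where the full version costs two. In a real compiler the loop would not exist at run time; the mask is known at compile time, so only the selected moves would be emitted.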
IMHO, ENTER is now one of those legacy CISC instructions which seemed like a good idea at the time it was defined, but gets outperformed by RISC-style instruction sequences that even Intel x86 optimizes.
setcc r/m32, saving instructions to booleanize into an int instead of char)
I did some instruction counting statistics on Linux boots using the Simics virtual platform, and found that ENTER was never used. However, there were quite a few LEAVE instructions in the mix. There was almost a 1-1 correlation between CALL and LEAVE. That would seem to corroborate the idea that ENTER is just slow and expensive, while LEAVE is pretty handy. This was measured on a 2.6-series kernel.
The same experiments on a 4.4-series and a 3.14-series kernel showed zero use of either LEAVE or ENTER. Presumably, the gcc code generation for the newer gccs used to compile these kernels has stopped emitting LEAVE (or the machine options are set differently).
-fomit-frame-pointer is the default now. gcc still uses leave when it makes frame pointers. (It does so even in optimized code for functions with a VLA: godbolt.org/g/LF3Rrk). I tested with a few different -mtune= options, and they all used leave. clang doesn't use leave, though, ever. That's a missed optimization for -Os (optimize for size), because it's only 3 uops vs. at least 2 for mov/pop (and maybe a stack-sync uop).
enter even if you compile with -Os or -Oz. enter n,0 is 12 uops on Skylake, with 1 per 8 clocks throughput. On Ryzen, it's 12 uops with 1 per 16 clocks throughput. At -Oz (optimize for size at all costs), it might make sense for clang to use enter, because it does stuff like push 2 / pop rax to save 2 bytes vs. mov eax,2. (gcc didn't have a -Oz mode at the time; one was added in GCC 12.) See agner.org/optimize for instruction tables and a microarch guide to make sense of them. See also the SO x86 tag wiki.
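The code-size side of these trade-offs can be checked by counting encoding bytes. The sketch below hard-codes the 64-bit machine-code encodings of the sequences discussed here and compares their lengths; the array and function names are made up for illustration, and the 16-byte local area is an arbitrary choice small enough for an imm8 in sub.

```c
#include <stddef.h>

/* 64-bit mode encodings; the frame size of 16 bytes is an arbitrary example. */
static const unsigned char enter_16_0[] = {
    0xC8, 0x10, 0x00, 0x00              /* enter 16, 0 */
};
static const unsigned char push_mov_sub[] = {
    0x55,                               /* push rbp */
    0x48, 0x89, 0xE5,                   /* mov rbp, rsp */
    0x48, 0x83, 0xEC, 0x10              /* sub rsp, 16 */
};
static const unsigned char leave_insn[] = {
    0xC9                                /* leave */
};
static const unsigned char mov_pop[] = {
    0x48, 0x89, 0xEC,                   /* mov rsp, rbp */
    0x5D                                /* pop rbp */
};
static const unsigned char push_pop[] = {
    0x6A, 0x02,                         /* push 2 */
    0x58                                /* pop rax */
};
static const unsigned char mov_eax_2[] = {
    0xB8, 0x02, 0x00, 0x00, 0x00        /* mov eax, 2 */
};

/* Bytes saved by enter over push/mov/sub in the prologue. */
size_t prologue_bytes_saved(void)
{
    return sizeof push_mov_sub - sizeof enter_16_0;
}

/* Bytes saved by leave over mov/pop in the epilogue. */
size_t epilogue_bytes_saved(void)
{
    return sizeof mov_pop - sizeof leave_insn;
}

/* Bytes saved by push imm8 / pop over mov r32, imm32. */
size_t const_load_bytes_saved(void)
{
    return sizeof mov_eax_2 - sizeof push_pop;
}
```

Under these assumptions enter saves 4 bytes per prologue and leave saves 3 bytes per epilogue, compared with the 2 bytes saved by the push/pop trick. Whether that is worth enter's uop cost is exactly the size-versus-speed question being discussed.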
-Os is not really "optimize for size" but rather "optimize for performance without performing any optimizations which are likely to adversely affect size".