LEA or ADD instruction?

When I'm handwriting assembly, I generally choose the form

lea eax, [eax+4]

over the form

add eax, 4

I have heard that lea is a "0-clock" instruction (like NOP), while 'add' isn't. However, when I look at compiler-produced assembly I often see the latter form used instead of the former. I'm smart enough to trust the compiler, so can anyone shed some light on which one is better? Which one is faster? Why is the compiler choosing the latter form over the former?

How can any instruction be a "zero clock" instruction if it's actually doing useful work?
It's a zero clock instruction because all the work required is done in the decode step - when the CPU decodes the instruction, the offset is calculated from MODRM/SIB anyway. At least that's my theory. Also, I know exactly what the lea instruction is and what it does - my question is about lea vs. add, not lea vs. mov (there is a huge difference - you can't use displacements in 'mov' without accessing memory).
This was true a long time ago, back to the original Pentium. Modern compilers generate code for much later cores. The days of hand-optimizing machine code are over and done with.
It used to be cheap (not free) as it used dedicated address calculation hardware on old chips, and bought you some parallelism. On current CPUs both instructions will likely result in the same micro ops.
GCC 4.8 is using lea by default. lea vs mov: stackoverflow.com/questions/1699748/…

FrankH.

One significant difference between LEA and ADD on x86 CPUs is the execution unit which actually performs the instruction. Modern x86 CPUs are superscalar and have multiple execution units that operate in parallel, with the pipeline feeding them somewhat like round-robin (barring stalls). The thing is, LEA is processed by (one of) the unit(s) dealing with addressing (which happens at an early stage in the pipeline), while ADD goes to the ALU(s) (arithmetic/logical units), which sit later in the pipeline. That means a superscalar x86 CPU can concurrently execute a LEA and an arithmetic/logical instruction.
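
As a small illustration (assuming a core that really does execute LEA in its address-generation unit, which the comments below show is true only of some microarchitectures), a pair like this could be handled by different units in the same cycle:

lea eax, [ebx+ecx*2] ; handled by the address-generation logic
add edx, esi         ; handled by an ALU, potentially in parallel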

The fact that LEA goes through the address generation logic instead of the arithmetic units is also the reason why it used to be called "zero clocks"; it appears to take no time to execute because the address generation has already happened by the time the instruction reaches the execution stage.

It's not free, since address generation is a step in the execution pipeline, but it's got no execution overhead. And it doesn't occupy a slot in the ALU pipeline(s).

Edit: To clarify, LEA is not free. Even on CPUs that do not implement it via the arithmetic unit it takes time to execute due to instruction decode / dispatch / retire and/or other pipeline stages that all instructions go through. The time taken to do LEA just occurs in a different stage of the pipeline for CPUs that implement it via address generation.


@harold: Can you provide some references as to what you mean by "true" ? Historically, LEA and ADD being done in different ways (/by different units in the CPU) is true, and even today Intel's CPUs still have different latency/throughput timings for LEA vs. ADD.
Historically maybe, but just look here: agner.org/optimize/instruction_tables.pdf and look at what unit lea goes to (it's often 'alu', a few times 'agu') and the latency (never zero, sometimes more than 1). More detailed timings (but less analysis) here: instlatx64.atw.hu
@harold: That's precisely the reference ... "AGU" == "address generation unit", the point I was trying to stress. Also note that I've explicitly said it's not free, and put "zero clocks" in quotes. As I see it, the question here is largely about where in the pipeline the overhead for LEA occurs ... compared to ADD.
This answer is only correct for AMD K8/K10. (Intel P6/SnB/P4/Silvermont, and AMD Bulldozer-family/Bobcat/Jaguar all run LEA on their ALUs). CPUs like Atom (pre-Silvermont) and Via Nano run LEA on their AGU port, but with worse latency than ADD, according to Agner Fog's tables. Only AMD k8/k10 run LEA on their AGUs with good performance, but even then it's still 2 cycle latency in the AGU vs. 1 cycle latency for simple addressing modes that K10 runs in on an ALU port.
It is also correct for very old Intel CPUs. Certainly older than the Pentium 4, since the P4 dropped the AGU's barrel-shifter. Pentium, Pentium Pro, and Pentium II did the LEA computation in the AGU, not on the ALU, as this answer originally suggested. This led to nice optimization possibilities. There were situations where LEA was effectively free, if you knew how to take advantage of it.
Anonymous

I'm smart enough to trust the compiler, so can anyone shed some light on which one is better?

Yes, a little. Firstly, I'm taking this from the following message: https://groups.google.com/group/bsdnt-devel/msg/23a48bb18571b9a6

In this message a developer optimises some assembly I had written (badly) so that it runs remarkably fast on Intel Core 2 processors. As background to this project, it's bsdnt, a BSD-licensed bignum library which I and a few other developers have been involved in.

In this case, all that's being optimised is the addition of two arrays that look like this: uint64_t* x, uint64_t* y. Each "limb", or member of the array, represents part of the bignum; the basic process is to iterate over it starting from the least significant limb, add each pair up and continue upwards, passing the carry (any overflow) up each time. adc does this for you on the processor (I don't think it's possible to access the carry flag from portable C).
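
As a minimal sketch of that process (hypothetical register assignments, not the actual bsdnt code), adding the two lowest limbs of x (in rdi) and y (in rsi) with carry propagation looks like:

mov rax, [rsi]     ; rax = y[0]
add [rdi], rax     ; x[0] += y[0]; CF = carry out of limb 0
mov rax, [rsi+8]   ; rax = y[1]
adc [rdi+8], rax   ; x[1] += y[1] + CF; the carry propagates upward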

In that piece of code, a combination of lea something, [something+1] and jrcxz is used, which is apparently more efficient than the add something, size / jnz pair we might previously have used. I'm not sure if this was discovered as a result of simply testing different instructions, however; you'd have to ask.
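
A hedged sketch of how such a loop can be structured (hypothetical registers; the real routine differs): neither lea nor jrcxz touches the flags, so the carry chain survives from one adc to the next without saving and restoring CF.

clc                ; start with no carry
next_limb:
mov rax, [rsi]     ; load a limb of y
adc [rdi], rax     ; x[i] += y[i] + CF
lea rsi, [rsi+8]   ; advance the pointers without touching CF
lea rdi, [rdi+8]
lea rcx, [rcx-1]   ; decrement the counter, also flag-free
jrcxz done         ; jrcxz tests rcx directly, not the flags
jmp next_limb
done: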

However, in a later message, it is measured on an AMD chip and does not perform so well.

I'm also given to understand that different operations perform differently on different processors. I know, for example, that the GMP project detects the processor using cpuid and dispatches to different assembly routines based on the architecture, e.g. core2, nehalem.

The question you have to ask yourself is: does your compiler produce optimised output for your CPU architecture? The Intel compiler, for example, is known to do this, so it might be worth measuring performance and seeing what output it produces.


Very good response - I had no idea the lea instruction could be slower than the add instruction on AMD CPUs! I'm actually using MSVC 10, but I'm on an Intel CPU.
@Jakob from what I understand it's more a case that AMD K8s and K10s are blazingly fast at integer arithmetic... but it's not necessarily the case. It could be that it's the jrcxz instruction slowing down AMDs. Optimising assembly is an area I know very little about, but I get the impression you have to think about the whole algorithm, not just a single instruction. Still, I'd wait around; there are many clever people on SO and somebody may well know more than I do.
ok, I'll wait a bit to see what other answers SO can cough up. Thanks for the answer though!
I have checked your optimisation case... It seems unnecessary to use "lea rcx, [rcx+1]" instead of "inc rcx", because the inc instruction doesn't affect the carry flag, which was the purpose of the trick as I read in the comments.
I’m curious: what happened to the tenth?
Community

LEA isn't faster than the ADD instruction; the execution speed is the same.

But LEA sometimes offers more than ADD. If we need a simple, fast addition or multiplication combined with a second register, LEA can speed up program execution. On the other hand, LEA doesn't affect the CPU flags, so there is no overflow-detection possibility.
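
For example (a small sketch; overflow_handler is a hypothetical label): lea folds a scaled add into one flag-preserving instruction, while add must be used when the flags are wanted:

lea eax, [ebx+ecx*4+8] ; eax = ebx + ecx*4 + 8 in one instruction, flags untouched
add eax, ebx           ; eax += ebx and updates OF/CF...
jo  overflow_handler   ; ...so signed overflow can be detected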


Peter Cordes

The main reason is the following. As you can see if you look carefully at x86, this ISA is two-address. Every instruction accepts at most two arguments. Thus, the semantics of operations are:

DST = DST <operation> SRC

LEA is a kind of hack instruction, because it is nearly the only instruction in the x86 ISA which is actually three-address:

DST = SRC1 <operation> SRC2

It is a kind of hack instruction because it reuses the address-calculation circuitry of the x86 CPU to perform addition and shifting.

Compilers use LEA because this instruction allows them to replace several instructions with a single one in cases where it is beneficial to keep the contents of the summand registers unchanged. Note that whenever a compiler uses LEA, either the DST register differs from the SRC register, or the SRC argument exploits the complex address-calculation logic.

For example, it is almost impossible to find use cases like these in generated code:

LEA EAX, [EAX   ] // equivalent of NOP
LEA EAX, [ECX   ] // equivalent of MOV EAX, ECX
LEA EAX, [EAX+12] // equivalent of ADD EAX, 12

but the next use cases are common:

LEA EAX, [ECX      +12] // there is no single-instruction equivalent
LEA EAX, [ECX+EDX*4+12] // there is no single-instruction equivalent
LEA EDX, [ECX+EDX*4+12] // there is no single-instruction equivalent

Indeed, imagine the following scenario, assuming that the value of EBP should be preserved for future use:

LEA EAX, [EBP+12]
LEA EDX, [EBP+48]

Just two instructions! But in the absence of LEA the code would be:

MOV EAX, EBP
MOV EDX, EBP
ADD EAX, 12
ADD EDX, 48

I believe the benefit of using LEA should be evident now. You can try to replace this instruction

LEA EDX, [ECX+EDX*4+12] // there is no single-instruction equivalent

by ADD-based code.
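
For reference, one possible ADD-based expansion (a sketch; note it takes three instructions and clobbers the flags, where LEA takes one and preserves them):

SHL EDX, 2   // EDX = EDX_old * 4
ADD EDX, ECX // EDX = ECX + EDX_old * 4
ADD EDX, 12  // EDX = ECX + EDX_old * 4 + 12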


"LEA EAX, [EAX] // equivalent of NOP" isn't a NOP in 64-bit mode.
Michael, can you clarify the difference between IA-32 and AMD64 that you have mentioned?
In 64-bit mode if the destination of an operation is a 32-bit register, the CPU automatically zero extends the result across the entire 64-bit register.
Michael, thanks! I wasn't aware of this feature.
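
A tiny illustration of that behaviour (the starting value is hypothetical):

// suppose RAX = 0xFFFFFFFF00000001 beforehand
LEA EAX, [RAX] // writes EAX, so RAX becomes 0x0000000000000001;
               // a true NOP would leave the upper 32 bits alone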
Sebi2020

You can perform a lea instruction in the same clock cycle as an add operation, but if you use lea and add together you can perform an addition of three operands in only one cycle! With two separate add operations it could only be performed in 2 clock cycles:

mov eax, [esp+4]   ; get a from stack
mov edx, [esp+8]   ; get b from stack
mov ecx, [esp+12]  ; get c from stack
lea eax, [eax+edx] ; add a and b in the address decoding/fetch stage of the pipeline
add eax, ecx       ; Add c + eax in the execution stage of the pipeline
ret 12

Wrong, most superscalar CPUs can run multiple add instructions at the same time. Intel Haswell and later can run 4 add instructions per clock (if that's all it's doing). Or it can run 2 ADD and 2 LEA insns per clock. Also, ret with an immediate argument is slow. Also, you'd fold one of the loads into the add if you were optimizing: add eax, [esp+12].
No it's not wrong. Just because this doesn't apply to every CPU on the market doesn't make it wrong. Next, we're not talking about return strategies, we're talking about the add/lea instructions. I don't see any performance win in using 'add eax, [esp+12]' because it doesn't bypass the memory fetch stage. And again, we're not talking about how to get data from a to b but about the advantages of using lea instead of add instructions.
The perf win in using add eax, [esp+12] is code size and the number of instructions / uops. The whole function can be mov eax, [esp+4] / add eax, [esp+8] / add eax, [esp+12] / ret 12, which is 2 instructions shorter, so it decodes faster, etc. You do have a point if you're optimizing for Atom, though, which does LEA in the AGUs instead of the ALUs, so I think you're right that add could consume the lea result with lower latency. However, lea needs the data to be ready sooner (in an earlier stage of the pipeline), and in-order Atom can't do the preceding loads ahead of time.
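
Spelled out (taken directly from the comment above), the shorter version would be:

mov eax, [esp+4]   ; a
add eax, [esp+8]   ; a + b (the load is folded into the add)
add eax, [esp+12]  ; a + b + c
ret 12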
