
Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

Background:

While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.

To my surprise, removing the unnecessary instruction caused my program to slow down.

I found that adding arbitrary, useless MOV instructions increased performance even further.

The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.

I understand that the CPU does all kinds of optimizations and streamlining, but this seems more like black magic.

The data:

A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).

The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):

avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without:        1836.44 ms

The programs were run 25 times in a loop, with the run order changing randomly each time.

Excerpt:

{$asmmode intel}
procedure example_junkop_in_sha256;
  var s1, t2 : uint32;
  begin
    // Here are parts of the SHA-256 algorithm, in Pascal:
    // s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
    // s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
    // Here is how I translated them (side by side to show symmetry):
  asm
    MOV r8d, a                 ; MOV r9d, e
    ROR r8d, 2                 ; ROR r9d, 6
    MOV r10d, r8d              ; MOV r11d, r9d
    ROR r8d, 11    {13 total}  ; ROR r9d, 5     {11 total}
    XOR r10d, r8d              ; XOR r11d, r9d
    ROR r8d, 9     {22 total}  ; ROR r9d, 14    {25 total}
    XOR r10d, r8d              ; XOR r11d, r9d

    // Here is the extraneous operation that I removed, causing a speedup
    // s1 is the uint32 variable declared at the start of the Pascal code.
    //
    // I had cleaned up the code, so I no longer needed this variable, and 
    // could just leave the value sitting in the r11d register until I needed
    // it again later.
    //
    // Since copying to RAM seemed like a waste, I removed the instruction, 
    // only to discover that the code ran slower without it.
    {$IFDEF JUNKOPS}
    MOV s1,  r11d
    {$ENDIF}

    // The next part of the code just moves on to another part of SHA-256,
    // maj { r12d } := (a and b) xor (a and c) xor (b and c)
    mov r8d,  a
    mov r9d,  b
    mov r13d, r9d // Set aside a copy of b
    and r9d,  r8d

    mov r12d, c
    and r8d, r12d  { a and c }
    xor r9d, r8d

    and r12d, r13d { c and b }
    xor r12d, r9d

    // Copying the calculated value to the same s1 variable is another speedup.
    // As far as I can tell, it doesn't actually matter what register is copied,
    // but moving this line up or down makes a huge difference.
    {$IFDEF JUNKOPS}
    MOV s1,  r9d // after mov r12d, c
    {$ENDIF}

    // And here is where the two calculated values above are actually used:
    // T2 {r12d} := S0 {r10d} + Maj {r12d};
    ADD r12d, r10d
    MOV T2, r12d

  end
end;

Try it yourself:

The code is online at GitHub if you want to try it out yourself.

My questions:

Why would uselessly copying a register's contents to RAM ever increase performance?

Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?

Is this behavior something that could be exploited predictably by a compiler?

There are all sorts of 'useless' instructions that can actually serve to break dependency chains, mark physical registers as retired, etc. Exploiting these operations requires some knowledge of the microarchitecture. Your question should provide a short sequence of instructions as a minimal example, rather than directing people to github.
@BrettHale good point, thanks. I added a code excerpt with some commentary. Would copying a register's value to ram mark the register as retired, even if the value in it is used later?
Can you put the standard deviation on those averages? There's no actual indication in this post that there's a real difference.
Can you please try timing the instructions using the rdtscp instruction, and check the clock cycles for both versions?
Can it also be due to memory alignment? I didn't do the math myself (lazy :P) but adding some dummy instructions can cause your code to be memory aligned...

Community

The most likely cause of the speed improvement is that:

inserting a MOV shifts the subsequent instructions to different memory addresses

one of those moved instructions was an important conditional branch

that branch was being incorrectly predicted due to aliasing in the branch prediction table

moving the branch eliminated the alias and allowed the branch to be predicted correctly

Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.

This little branch prediction tutorial shows how branch prediction buffers work. The cache buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end up with aliasing, which causes many mispredicted branches (stalling the instruction pipeline and slowing your program).
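The aliasing effect can be sketched with a toy predictor: a table of two-bit saturating counters indexed by the low bits of the branch address. The table size, addresses, and branch behavior below are made up purely for illustration, not a model of the Core 2's actual predictor:

```python
# Toy branch predictor: 2-bit saturating counters indexed by the low bits
# of the branch address. Two uncorrelated branches whose addresses share
# those low bits alias to the same counter and spoil each other's history.
# All addresses and the table size are hypothetical.

TABLE_BITS = 4                      # 16-entry prediction table
table = [2] * (1 << TABLE_BITS)     # counters start "weakly taken"

def predict(addr):
    return table[addr & ((1 << TABLE_BITS) - 1)] >= 2   # >=2 means "taken"

def update(addr, taken):
    i = addr & ((1 << TABLE_BITS) - 1)
    table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)

def run(branches):
    """branches: list of (address, taken) events; returns mispredict count."""
    miss = 0
    for addr, taken in branches:
        if predict(addr) != taken:
            miss += 1
        update(addr, taken)
    return miss

# Branch A at 0x400123 is always taken; branch B at 0x400223 is never taken.
# 0x123 and 0x223 share the low 4 bits, so they fight over one counter.
aliased = [(0x400123, True), (0x400223, False)] * 1000
miss_aliased = run(aliased)

table = [2] * (1 << TABLE_BITS)     # reset the predictor
# Shift branch B by one byte (as inserting a MOV would): no more aliasing.
separate = [(0x400123, True), (0x400224, False)] * 1000
miss_separate = run(separate)

print(miss_aliased, miss_separate)  # → 1000 1
```

With aliasing, the never-taken branch is mispredicted every time because the always-taken branch keeps dragging their shared counter back toward "taken"; shifting one branch by a single byte moves it to a different table entry and the mispredictions vanish.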

If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643

Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.


Hmm. This sounds promising. The only conditional branches in this sha256 implementation are the checks for the end of the FOR loops. At the time, I had tagged this revision as an oddity in git and continued optimizing. One of my next steps was to rewrite the pascal FOR loop myself in assembly, at which point these extra instructions no longer had a positive effect. Perhaps free pascal's generated code was harder for the processor to predict than the simple counter I replaced it with.
@tangentstorm That sounds like a good summary. The branch prediction table isn't very large, so one table entry might refer to more than one branch. This can render some predictions useless. The problem is easily fixed if one of the conflicting branches moves to another part of the table. Almost any little change can make this happen :-)
I think this is the most reasonable explanation of the specific behavior I observed, so I'm going to mark this as the answer. Thanks. :)
There is an absolutely excellent discussion of a similar problem one of the contributors to Bochs ran into, you may want to add this to your answer: emulators.com/docs/nx25_nostradamus.htm
Insn alignment matters for a lot more than just branch targets. Decode bottlenecks are a huge issue for Core2 and Nehalem: it often has a hard time keeping its execution units busy. Sandybridge's introduction of the uop cache increased frontend throughput a huge amount. Aligning branch targets is done because of this issue, but it affects all code.
Jonas Maebe

You may want to read http://research.google.com/pubs/pub37077.html

TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even in case there are no dependency chains that are broken or obvious resource over-subscriptions whatsoever).


Interesting. But is the processor (or FPC) smart enough to see that writing to ram is a NOP in this case?
Assembler is not optimized.
Compilers could exploit it by doing incredibly expensive optimizations like repeatedly building and profiling and then varying the compiler output with a simulated annealing or genetic algorithm. I've read about some work in that area. But we're talking a minimum of 5-10 minutes of 100% CPU to compile, and the resulting optimizations would probably be CPU core model and even core or microcode revision specific.
I wouldn't call it random NOP insertion: the paper explains why NOPs can have a positive effect on performance (tl;dr: stackoverflow.com/a/5901856/357198), and random insertion of NOPs did result in performance degradation. What is interesting about the paper is that GCC's removal of 'strategic' NOPs had no effect on performance overall!
Peter Mortensen

I believe that in modern CPUs the assembly instructions, while being the last layer visible to a programmer for providing execution instructions to a CPU, are actually several layers removed from actual execution by the CPU.

Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC in behavior. Additionally, there are out-of-order execution analyzers, branch predictors, and Intel's "micro-op fusion," which tries to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make the code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).

CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much, much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than others.

The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require execution against many CPU architecture generations, perhaps using CPU-specific execution paths, and would only be desirable for really really important code sections, although if you're doing assembly, you probably already know that.


Your answer is kind of confusing. In many places it seems like you are guessing, although most of what you say is correct.
Maybe I should clarify. What I find confusing is the lack of certainty
guessing that makes sense and with good argumentation is completely valid.
No-one can really know for sure why the OP is observing this strange behavior, unless it was an engineer at Intel who had access to special diagnostic equipment. So all others can do is guess. That isn't @cowarldlydragon's fault.
Downvote; none of what you say explains the behaviour OP is seeing. Your answer is useless.
Maxim Masiutin

Preparing the cache

Move operations to memory can prepare the cache and make subsequent move operations faster. A CPU usually has two load units and one store unit. A load unit can read from memory into a register (one read per cycle), and a store unit stores from a register to memory. There are also other units that perform operations between registers. All the units work in parallel, so on each cycle we may do several operations at once, but no more than two loads, one store, and several register operations. Usually this is up to 4 simple operations with plain registers, up to 3 simple operations with XMM/YMM registers, and 1-2 complex operations with any kind of registers. Your code has lots of operations with registers, so one dummy memory store operation is free (since there are more than 4 register operations anyway), but it prepares the memory cache for the subsequent store operation. To find out how memory stores work, please refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual.

Breaking the false dependencies

Although this does not exactly apply to your case, 32-bit mov operations on a 64-bit processor (as in your code) are sometimes used to clear the higher bits (32-63) of a register and break dependency chains.

It is well known that under x86-64, using 32-bit operands clears the higher bits of the 64-bit register. Please read the relevant section (3.4.1.1) of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1:

32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register

So the mov instructions that may seem useless at first sight actually clear the higher bits of the appropriate registers. What does this give us? It breaks dependency chains and allows instructions to execute in parallel, out of order, via the out-of-order execution machinery implemented in CPUs since the Pentium Pro in 1995.

A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:

Code sequences that modify a partial register can experience some delay in their dependency chain, but this can be avoided by using dependency-breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero.

Assembly/Compiler Coding Rule 37 (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

MOVZX and MOV with 32-bit operands are equivalent in this respect on x64: both break dependency chains.

That's why your code executes faster. If there are no dependencies, the CPU can internally rename the registers: even though at first sight it may seem that the second instruction modifies a register used by the first instruction, so the two cannot execute in parallel, register renaming allows them to.

Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
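As a rough sketch of why renaming helps, here is a toy scheduler that charges write-after-read and write-after-write hazards only when renaming is off. Instruction latencies, the unlimited execution units, and the `schedule` helper are all hypothetical simplifications, not a model of any real CPU:

```python
# Toy illustration of how reusing an architectural register serializes
# execution unless the CPU renames it. Each instruction is (dest, sources)
# with 1-cycle latency and unlimited execution units; the only constraints
# modeled are data hazards.

def schedule(prog, rename=False):
    """Return total cycles to execute prog under the toy model."""
    ready = {}      # register -> cycle its value becomes available
    last_use = {}   # register -> last cycle it was read (for WAR hazards)
    finish = 0
    for dest, srcs in prog:
        # True (read-after-write) dependencies always apply.
        start = max([ready.get(s, 0) for s in srcs], default=0)
        if not rename:
            # Without renaming, a write must also wait for earlier readers
            # (WAR) and the earlier writer (WAW) of the same register.
            start = max(start, last_use.get(dest, 0), ready.get(dest, 0))
        done = start + 1
        for s in srcs:
            last_use[s] = max(last_use.get(s, 0), done)
        ready[dest] = done
        finish = max(finish, done)
    return finish

# Two independent chains that both recycle r8, as the question's code does:
prog = [
    ("r8",  []),        # mov r8d, a
    ("r10", ["r8"]),    # mov r10d, r8d
    ("r8",  []),        # mov r8d, a   (reuses r8 -- a false dependency)
    ("r9",  ["r8"]),    # and r9d, r8d
]
print(schedule(prog, rename=False), schedule(prog, rename=True))  # → 4 2
```

Without renaming, the second write to r8 must wait until the first chain has finished reading it, so the four instructions take 4 cycles; with renaming, the two chains overlap and finish in 2.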

I think you can now see that the effect is not so mysterious after all.


This is all true, but has nothing to do with the code presented in the question.
@CodyGray - thank you for your feedback. I have edited the reply and added a chapter about the case - that mov to memory surrounded by register operations prepares the cache and it's free since the store unit is idle anyway. So the subsequent store operation will be faster.
