ChatGPT解决这个技术问题 Extra ChatGPT

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:

nasm -felf64 p14.asm && gcc p14.o -o p14

The C++ was compiled with:

g++ p14.cpp -o p14

Assembly, p14.asm:

section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret

C++, p14.cpp:

#include <iostream>

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}

I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).

The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.

But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.

Execution times

My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).

g++ (unoptimized): avg 1272 ms.

g++ -O3: avg 578 ms.

asm (div) (original): avg 2650 ms.

asm (shr): avg 679 ms.

@johnfound asm (assembled with NASM): avg 501 ms.

@hidefromkgb asm: avg 200 ms.

@hidefromkgb asm, optimized by @Peter Cordes: avg 145 ms.

@Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.

Have you examined the assembly code that GCC generates for your C++ program?
Compile with -S to get the assembly that the compiler generated. The compiler is smart enough to realize that the modulus does the division at the same time.
I think your options are 1. Your measuring technique is flawed, 2. The compiler writes better assembly that you, or 3. The compiler uses magic.
@jefferson The compiler can use faster brute force. For example maybe with SSE instructions.

P
Peter Cordes

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the tag wiki for more perf links.

See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).

Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.

As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shifts instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).

gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.

Helping the compiler

(summary for this case: use uint64_t n)

First of all, it's only interesting to look at optimized compiler output. (-O3).
-O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.)

This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.

Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)

In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)

You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).

BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:

 # from gcc5.4 -O3  plus my comments

 # edx= count=1
 # rax= uint64_t n

.L9:                   # do{
    lea    rcx, [rax+1+rax*2]   # rcx = 3*n + 1
    mov    rdi, rax
    shr    rdi         # rdi = n>>1;
    test   al, 1       # set flags based on n%2 (aka n&1)
    mov    rax, rcx
    cmove  rax, rdi    # n= (n%2) ? 3*n+1 : n/2;
    add    edx, 1      # ++count;
    cmp    rax, 1
    jne   .L9          #}while(n!=1)

  cmp/branch to update max and maxi, and then do the next n

The inner loop is branchless, and the critical path of the loop-carried dependency chain is:

3-component LEA (3 cycles)

cmov (2 cycles on Haswell, 1c on Broadwell or later).

Total: 5 cycle per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).

The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.

The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.

Beating the compiler

GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.

 ### Hand-optimized version of what gcc does
.L9:                       #do{
    lea     rcx, [rax+1+rax*2] # rcx = 3*n + 1
    shr     rax, 1         # n>>=1;    CF = n&1 = n%2
    cmovc   rax, rcx       # n= (n&1) ? 3*n+1 : n/2;
    inc     edx            # ++count;
    cmp     rax, 1
    jne     .L9            #}while(n!=1)

See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)

Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.

Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).

So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)

Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.

The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.

You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth with 256b vector (4x uint64_t).

I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.

Untested idea for manual vectorization

# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:  count vector

.inner_loop:
    vpaddq    ymm1, ymm0, xmm0
    vpaddq    ymm1, ymm1, xmm0
    vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1= 3*n + 1.  Maybe could do this more efficiently?

    vpsllq    ymm3, ymm0, 63                # shift bit 1 to the sign bit

    vpsrlq    ymm0, ymm0, 1                 # n /= 2

    # FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
    vpblendvpd ymm0, ymm0, ymm1, ymm3       # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

    # ymm0 = updated n  in each element.

    vpcmpeqq ymm1, ymm0, set1_epi64(1)
    vpandn   ymm4, ymm1, ymm4         # zero out elements of ymm4 where the compare was true

    vpaddq   ymm5, ymm5, ymm4         # count++ in elements where n has never been == 1

    vptest   ymm4, ymm4
    jnz  .inner_loop
    # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

    vextracti128 ymm0, ymm5, 1
    vpmaxq .... crap this doesn't exist
    # Actually just delay doing a horizontal max until the very very end.  But you need some way to record max and maxi.

You can and should implement this with intrinsics instead of hand-written asm.

Algorithmic / implementation improvement:

Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)

@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.

So the loop might look like this:

goto loop_entry;  // C++ structured like the asm, for illustration only
do {
   n = n*3 + 1;
  loop_entry:
   shift = _tzcnt_u64(n);
   n >>= shift;
   count += shift;
} while(n != 1);

This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).

It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).

We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).

TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)

Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.

Improvements to ideas / code from other answers:

@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this more even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.

I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)

Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.

@Veedrac's code

I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.

code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.

See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)


I've tried out the vectorized approach a while ago, it didn't help (because you can do much better in scalar code with tzcnt and you're locked to the longest-running sequence among your vector-elements in the vectorized case).
@EOF: no, I meant breaking out of the inner loop when any one of the vector elements hits 1, instead of when they all have (easily detectable with PCMPEQ/PMOVMSK). Then you use PINSRQ and stuff to fiddle with the one element that terminated (and its counters), and jump back into the loop. That can easily turn into a loss, when you're breaking out of the inner loop too often, but it does mean you're always getting 2 or 4 elements of useful work done every iteration of the inner loop. Good point about memoization, though.
@jefferson Best I managed is godbolt.org/g/1N70Ib. I was hoping I could do something smarter, but it seems not.
The thing that amazes me about incredible answers such as this is the knowledge shown to such detail. I will never know a language or system to that level and I wouldn't know how. Well done sir.
@csch: thanks. I'm glad so many people got something out of what I wrote. I'm pretty proud of it, and think it does a good job of explaining some optimization basics and specific details relevant for this problem.
A
Adrian Mole

Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake. And especially in this case. The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.

The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.

(The below code is 32-bit, but can be easily converted to 64-bit)

For example, the sequence function can be optimized to only 5 instructions:

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

The whole code looks like:

include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"

start:
        InitializeAll
        mov ecx, 999999
        xor edi, edi        ; max
        xor ebx, ebx        ; max i

    .main_loop:

        xor     esi, esi
        mov     eax, ecx

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

        cmp     edi, esi
        cmovb   edi, esi
        cmovb   ebx, ecx

        dec     ecx
        jnz     .main_loop

        OutputValue "Max sequence: ", edi, 10, -1
        OutputValue "Max index: ", ebx, 10, -1

        FinalizeAll
        stdcall TerminateAll, 0

In order to compile this code, FreshLib is needed.

In my tests, (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.

The output of both programs is the same: max sequence = 525 on i = 837799.


Huh, that's clever. SHR sets ZF only if EAX was 1 (or 0). I missed that when optimizing gcc's -O3 output, but I did spot all other optimizations you made to the inner loop. (But why do you use LEA for the counter increment instead of INC? It's ok to clobber flags at that point, and lead to a slowdown on anything except maybe P4 (false dependency on old flags for both INC and SHR). LEA can't run on as many ports, and could lead to resource conflicts delaying the critical path more often.)
Oh, actually Bulldozer might bottleneck on throughput with the compiler output. It has lower latency CMOV and 3-component LEA than Haswell (which I was considering), so the loop-carried dep chain is only 3 cycles in your code. It also doesn't have zero-latency MOV instructions for integer registers, so g++'s wasted MOV instructions actually increase the latency of the critical path, and are a big deal for Bulldozer. So yeah, hand-optimization really does beat the compiler in a significant way for CPUs that aren't ultra-modern enough to chew through the useless instructions.
"Claiming the C++ compiler better is very bad mistake. And especially in this case. The human always can make the code better that the and this particular problem is good illustration of this claim." You can reverse it and it would be just as valid. "Claiming a human is better is very bad mistake. And especially in this case. The human always can make the code worse that the and this particular question is good illustration of this claim." So I don't think you have a point here, such generalizations are wrong.
@luk32 - But the author of the question can not be any argument at all, because his knowledge of assembly language is close to zero. Every arguments about human vs compiler, implicitly assume human with at least some middle level of asm knowledge. More: The theorem "The human written code will always be better or the same as the compiler generated code" is very easy to be formally proved.
@luk32: A skilled human can (and usually should) start with compiler output. So as long as you benchmark your attempts to make sure they're actually faster (on the target hardware you're tuning for), you can't do worse than the compiler. But yeah, I have to agree it's a bit of a strong statement. Compilers usually do much better than novice asm coders. But it's usually possible to save an instruction or two compared to what compilers come up with. (Not always on the critical path, though, depending on uarch) . They're highly useful pieces of complex machinery, but they're not "smart".
U
UmNyobe

For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:

while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
    n = (3*n + 1) / 2;
    if (n % 2 == 0) {
        do n /= 2; while (n % 2 == 0);
        if (n == 1) break;
    }
}

Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001 then the next steps are:

3n+1 -> ???? 0000 0100
/ 2  -> ???? ?000 0010
/ 2  -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2  -> ???? ???0 0010
/ 2  -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2  -> ???? ???? ?010
/ 2  -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2  -> ???? ???? ???0
/ 2  -> ???? ???? ????

So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:

k = n / 256;
m = n % 256;

switch (m) {
    case 0: n = 1 * k + 0; break;
    case 1: n = 81 * k + 1; break; 
    case 2: n = 81 * k + 1; break; 
    ...
    case 155: n = 729 * k + 425; break;
    ...
}

Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are need to reach 1.

PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like

static const unsigned int multipliers [256] = { ... }
static const unsigned int adders [256] = { ... }

while (n > 128) {
    size_t lastBits = n % 256;
    n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}

In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I excect a slowdown when the tables don't fit into L1 cache anymore.

PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).

We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).

If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.


Or make data lookup tables for the multiply and add constants, instead of a switch. Indexing two 256-entry tables is faster than a jump table, and compilers probably aren't looking for that transformation.
Hmm, I thought for a minute this observation might prove the Collatz conjecture, but no, of course not. For every possible trailing 8 bits, there's a finite number of steps until they're all gone. But some of those trailing 8-bit patterns will lengthen the rest of the bitstring by more than 8, so this can't rule out unbounded growth or a repeating cycle.
To update count, you need a third array, right? adders[] doesn't tell you how many right-shifts were done.
For larger tables, it would be worth using narrower types to increase cache density. On most architectures, a zero-extending load from a uint16_t is very cheap. On x86, it's just as cheap as zero-extending from 32-bit unsigned int to uint64_t. (MOVZX from memory on Intel CPUs only needs a load-port uop, but AMD CPUs do need the ALU as well.) Oh BTW, why are you using size_t for lastBits? It's a 32-bit type with -m32, and even -mx32 (long mode with 32-bit pointers). It's definitely the wrong type for n. Just use unsigned.
h
hidefromkgb

On a rather unrelated note: more performance hacks!

[the first «conjecture» has been finally debunked by @ShreevatsaR; removed]

When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first): [even] [odd] [odd] [even] [even] [even] To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively. Let`s prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1. Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1, and the following holds: (N << 1) + N + 1: (N >> 1) + N + 1: b10 b1 b1 b + 1 + 1 ---- --- bBb0 bBb where B = !b. Right-shifting the first result gives us exactly what we want. Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1. As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.

[even] [odd]

[odd] [even]

[even] [even]

The resulting algorithm looks like this:

uint64_t sequence(uint64_t size, uint64_t *path) {
    uint64_t n, i, c, maxi = 0, maxc = 0;

    for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
        c = 2;
        while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
            c += 2;
        if (n == 2)
            c++;
        if (c > maxc) {
            maxi = i;
            maxc = c;
        }
    }
    *path = maxc;
    return maxi;
}

int main() {
    uint64_t maxi, maxc;

    maxi = sequence(1000000, &maxc);
    printf("%llu, %llu\n", maxi, maxc);
    return 0;
}

Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.

[EDIT:]

Let`s translate this into assembly!

MOV RCX, 1000000;



DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;

@main:
  XOR RSI, RSI;
  LEA RDI, [RCX + 1];

  @loop:
    ADD RSI, 2;
    LEA RDX, [RDI + RDI*2 + 2];
    SHR RDX, 1;
    SHRD RDI, RDI, 2;    ror rdi,2   would do the same thing
    CMOVL RDI, RDX;      Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
    CMOVS RDI, RDX;
    CMP RDI, 2;
  JA @loop;

  LEA RDX, [RSI + 1];
  CMOVE RSI, RDX;

  CMP RAX, RSI;
  CMOVB RAX, RSI;
  CMOVB RBX, RCX;

  SUB RCX, 2;
JA @main;



MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;

@itoa:
  XOR RDX, RDX;
  DIV RCX;
  ADD RDX, '0';
  PUSH RDX;
  TEST RAX, RAX;
JNE @itoa;

  PUSH RCX;
  LEA RAX, [RBX + 1];
  TEST RBX, RBX;
  MOV RBX, RDI;
JNE @itoa;

POP RCX;
INC RDI;
MOV RDX, RDI;

@outp:
  MOV RSI, RSP;
  MOV RAX, RDI;
  SYSCALL;
  POP RAX;
  TEST RAX, RAX;
JNE @outp;

LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;

Use these commands to compile:

nasm -f elf64 file.asm
ld -o file file.o

See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)


There is no integral Q such that 12 = 3Q + 1. Your first point is not right, methinks.
@Veedrac: Been playing around with this: It can be implemented with better asm than the implementation in this answer, using ROR / TEST and only one CMOV. This asm code infinite-loops on my CPU, since it apparently relies on OF, which is undefined after SHRD or ROR with count > 1. It also goes to great lengths to try to avoid mov reg, imm32, apparently to save bytes, but then it uses the 64-bit version of register everywhere, even for xor rax, rax, so it has lots of unnecessary REX prefixes. We obviously only need REX on the regs holding n in the inner loop to avoid overflow.
Timing results (from a Core2Duo E6600: Merom 2.4GHz. Complex-LEA=1c latency, CMOV=2c). The best single-step asm inner-loop implementation (from Johnfound): 111ms per run of this @main loop. Compiler output from my de-obfuscated version of this C (with some tmp vars): clang3.8 -O3 -march=core2: 96ms. gcc5.2: 108ms. From my improved version of clang's asm inner loop: 92ms (should see a much bigger improvement on SnB-family, where complex LEA is 3c not 1c). From my improved + working version of this asm loop (using ROR+TEST, not SHRD): 87ms. Measured with 5 reps before printing
Here are the first 66 record-setters (A006877 on OEIS); I've marked the even ones in bold: 2, 3, 6, 7, 9, 18, 25, 27, 54, 73, 97, 129, 171, 231, 313, 327, 649, 703, 871, 1161, 2223, 2463, 2919, 3711, 6171, 10971, 13255, 17647, 23529, 26623, 34239, 35655, 52527, 77031, 106239, 142587, 156159, 216367, 230631, 410011, 511935, 626331, 837799, 1117065, 1501353, 1723519, 2298025, 3064033, 3542887, 3732423, 5649499, 6649279, 8400511, 11200681, 14934241, 15733191, 31466382, 36791535, 63728127, 127456254, 169941673, 226588897, 268549803, 537099606, 670617279, 1341234558
@hidefromkgb Great! And I appreciate your other point better too now: 4k+2 → 2k+1 → 6k+4 = (4k+2) + (2k+1) + 1, and 2k+1 → 6k+4 → 3k+2 = (2k+1) + (k) + 1. Nice observation!
4
4444

C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.

However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:

Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network). Your datasize must be greater in size. Your test must run for something more than 5-10 seconds. Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.


Yes I haven't done any formal profiling but I have ran them both a few times and am capable of telling 2 seconds from 3 seconds. Anyway thanks for answering. I already picked up a good deal of info here
It's probably not just a measurement error, the hand-written asm code is using a 64-bit DIV instruction instead of a right-shift. See my answer. But yes, measuring correctly is important, too.
Bullet points are more appropriate formatting than a code block. Please stop putting your text into a code block, because it's not code and doesn't benefit from a monospaced font.
I don't really see how this answers the question. This isn't a vague question about whether assembly code or C++ code might be faster---it is a very specific question about actual code, which he's helpfully provided in the question itself. Your answer doesn't even mention any of that code, or do any type of comparison. Sure, your tips on how to benchmark are basically correct, but not enough to make an actual answer.
P
Ped7g

From comments:

But, this code never stops (because of integer overflow) !?! Yves Daoust

For many numbers it will not overflow.

If it will overflow - for one of those unlucky initial seeds, the overflown number will very likely converge toward 1 without another overflow.

Still this poses interesting question, is there some overflow-cyclic seed number?

Any simple final converging series starts with power of two value (obvious enough?).

2^64 will overflow to zero, which is undefined infinite loop according to algorithm (ends only with 1), but the most optimal solution in answer will finish due to shr rax producing ZF=1.

Can we produce 2^64? If the starting number is 0x5555555555555555, it's odd number, next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically in undefined state of algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in infinite loop (QED variant 1, "cheapo" through undefined 0 number).

How about some more complex number, which will create cycle without 0? Frankly, I'm not sure, my Math theory is too hazy to get any serious idea, how to deal with it in serious way. But intuitively I would say the series will converge to 1 for every number : 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of original number (or intermediate) into some power of 2, sooner or later. So we don't need to worry about infinite loop for original series, only overflow can hamper us.

So I just put few numbers into sheet and took a look on 8 bit truncated numbers.

There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, other two progressing toward 85).

But there's no value creating cyclic overflow seed.

Funnily enough I did a check, which is the first number to suffer from 8 bit truncation, and already 27 is affected! It does reach value 9232 in proper non-truncated series (first truncated value is 322 in 12th step), and the maximum value reached for any of the 2-255 input numbers in non-truncated way is 13120 (for the 255 itself), maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to count, etc...).

Interestingly enough (for me) the number 9232 is maximum for many other source numbers, what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.

Unfortunately I can't get any deep grasp of this series, why does it converge and what are the implications of truncating them to k bits, but with cmp number,1 terminating condition it's certainly possible to put the algorithm into infinite loop with particular input value ending as 0 after truncation.

But the value 27 overflowing for 8 bit case is sort of alerting, this looks like if you count the number of steps to reach value 1, you will get wrong result for majority of numbers from the total k-bit set of integers. For the 8 bit integers the 146 numbers out of 256 have affected series by truncation (some of them may still hit the correct number of steps by accident maybe, I'm too lazy to check).


"the overflown number will very likely converge toward 1 without another overflow": the code never stops. (That's a conjecture as I cannot wait until the end of times to be sure...)
@YvesDaoust oh, but it does?... for example the 27 series with 8b truncation looks like this: 82 41 124 62 31 94 47 142 71 214 107 66 (truncated) 33 100 50 25 76 38 19 58 29 88 44 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1 (rest of it works without truncation). I don't get you, sorry. It would never stop if the truncated value would be equal to some of previously reached in currently ongoing series, and I can't find any such value vs k-bit truncation (but I either can't figure out the Math theory behind, why this holds up for 8/16/32/64 bits truncation, just intuitively I think it works).
I should have checked the original problem description sooner: "Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1." ... ok, no wonder I can't get grasp of it with my limited hazy Math knowledge... :D And from my sheet experiments I can assure you it does converge for every 2-255 number, either without truncation (to 1), or with 8 bit truncation (to either expected 1 or to 0 for three numbers).
Hem, when I say that it never stops, I mean... that it does not stop. The given code runs forever if you prefer.
Upvoted for analysis of what happens on overflow. The CMP-based loop could use cmp rax,1 / jna (i.e. do{}while(n>1)) to also terminate on zero. I thought about making an instrumented version of the loop that records the max n seen, to give an idea of how close we get to overflow.
D
Damon

You did not post the code generated by the compiler, so there' some guesswork here, but even without having seen it, one can say that this:

test rax, 1
jpe even

... has a 50% chance of mispredicting the branch, and that will come expensive.

The compiler almost certainly does both computations (which costs neglegibly more since the div/mod is quite long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.


There is some pattern to the branching; e.g. an odd number is always followed by an even number. But sometimes 3n+1 leaves multiple trailing zero bits, and that's when this will mispredict. I started writing about division in my answer, and didn't address this other big red flag in the OP's code. (Note also that using a parity condition is really weird, compared to just JZ or CMOVZ. It's also worse for the CPU, because Intel CPUs can macro-fuse TEST/JZ, but not TEST/JPE. Agner Fog says AMD can fuse any TEST/CMP with any JCC, so in that case it's only worse for human readers)
E
Emanuel Landeholm

For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization (https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.

Example python implementation:

import sys

inner_loop = 0

def collatz_sequence(N, cache):
    global inner_loop

    l = [ ]
    stop = False
    n = N

    tails = [ ]

    while not stop:
        inner_loop += 1
        tmp = n
        l.append(n)
        if n <= 1:
            stop = True  
        elif n in cache:
            stop = True
        elif n % 2:
            n = 3*n + 1
        else:
            n = n // 2
        tails.append((tmp, len(l)))

    for key, offset in tails:
        if not key in cache:
            cache[key] = l[offset:]

    return l

def gen_sequence(l, cache):
    for elem in l:
        yield elem
        if elem in cache:
            yield from gen_sequence(cache[elem], cache)
            raise StopIteration

if __name__ == "__main__":
    le_cache = {}

    for n in range(1, 4711, 5):
        l = collatz_sequence(n, le_cache)
        print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))

    print("inner_loop = {}".format(inner_loop))

gnasher's answer shows that you can do much more than just cache the tails: high bits don't affect what happens next, and add / mul only propagate carry to the left, so high bits don't affect what happens to the low bits. i.e. you can use LUT lookups to go 8 (or any number) of bits at a time, with multiply and add constants to apply to the rest of the bits. memoizing the tails is certainly helpful in a lot of problems like this, and for this problem when you haven't thought of the better approach yet, or haven't proved it correct.
If I understand gnasher's idea above correctly, I think tail memoization is an orthogonal optimization. So you could conceivably do both. It would be interesting to investigate how much you could gain from adding memoization to gnasher's algorithm.
We can maybe make memoization cheaper by only storing the dense part of the results. Set an upper limit on N, and above that, don't even check memory. Below that, use hash(N) -> N as the hash function, so key = position in the array, and doesn't need to be stored. An entry of 0 means not present yet. We can further optimize by only storing odd N in the table, so the hash function is n>>1, discarding the 1. Write the step code to always end with a n>>tzcnt(n) or something to make sure it's odd.
That's based on my (untested) idea that very large N values in the middle of a sequence are less likely to be common to multiple sequences, so we don't miss out on too much from not memoizing them. Also that a reasonably-sized N will be part of many long sequences, even ones that start with very large N. (This may be wishful thinking; if it's wrong then only caching a dense range of consecutive N may lose out vs. a hash table that can store arbitrary keys.) Have you done any kind of hit-rate testing to see if nearby starting N tend to have any similarity in their sequence values?
You can just store pre-computed results for all n < N, for some large N. So you don't need the overhead of a hash table. The data in that table will be used eventually for every starting value. If you just want to confirm that the Collatz sequence always ends in (1, 4, 2, 1, 4, 2, ...): This can be proven to be equivalent to proving that for n > 1, the sequence will eventually be less than the original n. And for that, caching tails will not help.
g
gnasher729

As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.

Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high level language, inline assembly is often useful in these cases and still lets you use a high level language.

In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.


Totally agree with this. gcc -O3 made code that was within 20% of optimal on Haswell, for that exact algorithm. (Getting those speedups were the main focus of my answer only because that's what the question asked, and has an interesting answer, not because it's the right approach.) Much bigger speedups were obtained from transformations that the compiler would be extremely unlikely to look for, like deferring right shifts, or doing 2 steps at a time. Far bigger speedups than that can be had from memoization / lookup-tables. Still exhaustive testing, but not pure brute force.
Still, having a simple implementation that's obviously correct is extremely useful for testing other implementations. What I'd do is probably just look at the asm output to see if gcc did it branchlessly like I expected (mostly out of curiousity), and then move on to algorithmic improvements.
D
Dmitry Rubanovich

Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>=1 and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.

Edit: your milage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that the modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then compiler optimizing /=2 as >>=1 would be the best 1st place to look.

On an unrelated note, if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to

{
   n = (n*3+1) >> 1;
   count += 2;
}

So the whole statement would then be

if (n & 1)
{
    n = (n*3 + 1) >> 1;
    count += 2;
}
else
{
    n >>= 1;
    ++count;
}

Integer division is not actually faster than FP division on modern x86 CPUs. I think this is due to Intel/AMD spending more transistors on their FP dividers, because it's a more important operation. (Integer division by constants can be optimized to a multiply by a modular inverse). Check Agner Fog's insn tables, and compare DIVSD (double-precision float) with DIV r32 (32-bit unsigned integer) or DIV r64 (much slower 64-bit unsigned integer). Especially for throughput, FP division is much faster (single uop instead of micro-coded, and partially pipelined), but latency is better too.
e.g. on the OP's Haswell CPU: DIVSD is 1 uop, 10-20 cycles latency, one per 8-14c throughput. div r64 is 36 uops, 32-96c latency, and one per 21-74c throughput. Skylake has even faster FP division throughput (pipelined at one per 4c with not much better latency), but not much faster integer div. Things are similar on AMD Bulldozer-family: DIVSD is 1M-op, 9-27c latency, one per 4.5-11c throughput. div r64 is 16M-ops, 16-75c latency, one per 16-75c throughput.
Isn't FP division basically the same as integer-subtract exponents, integer-divide mantissa's, detect denormals? And those 3 steps can be done in parallel.
@MSalters: yeah, that sounds right, but with a normalization step at the end ot shift bits between exponent and mantiss. double has a 53-bit mantissa, but it's still significantly slower than div r32 on Haswell. So it's definitely just a matter of how much hardware Intel/AMD throw at the problem, because they don't use the same transistors for both integer and fp dividers. The integer one is scalar (there's no integer-SIMD divide), and the vector one handles 128b vectors (not 256b like other vector ALUs). The big thing is that integer div is many uops, big impact on surrounding code.
Err, not shift bits between mantissa and exponent, but normalize the mantissa with a shift, and add the shift amount to the exponent.
T
Tyler Durden

The simple answer:

doing a MOV RBX, 3 and MUL RBX is expensive; just ADD RBX, RBX twice

ADD 1 is probably faster than INC here

MOV 2 and DIV is very expensive; just shift right

64-bit code is usually noticeably slower than 32-bit code and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code

If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.


1): adding 3 times would be dumb compared to LEA. Also mul rbx on the OP's Haswell CPU is 2 uops with 3c latency (and 1 per clock throughput). imul rcx, rbx, 3 is only 1 uop, with the same 3c latency. Two ADD instructions would be 2 uops with 2c latency.
2) ADD 1 is probably faster than INC here. Nope, the OP is not using a Pentium4. Your point 3) is the only correct part of this answer.
4) sounds like total nonsense. 64-bit code can be slower with pointer-heavy data structures, because larger pointers means bigger cache footprint. But this code is working only in registers, and code alignment issues are the same in 32 and 64 bit mode. (So are data alignment issues, no clue what you're talking about with alignment being a bigger issue for x86-64). Anyway, the code doesn't even touch memory inside the loop.
The commenter has no idea what is talking about. Do a MOV+MUL on a 64-bit CPU will be roughly three times slower than adding a register to itself twice. His other remarks are equally incorrect.
Well MOV+MUL is definitely dumb, but MOV+ADD+ADD is still silly (actually doing ADD RBX, RBX twice would multiply by 4, not 3). By far the best way is lea rax, [rbx + rbx*2]. Or, at the cost of making it a 3-component LEA, do the +1 as well with lea rax, [rbx + rbx*2 + 1] (3c latency on HSW instead of 1, as I explained in my answer) My point was that 64-bit multiply is not very expensive on recent Intel CPUs, because they have insanely fast integer multiply units (even compared to AMD, where the same MUL r64 is 6c latency, with one per 4c throughput: not even fully pipelined.