ChatGPT解决这个技术问题 Extra ChatGPT

Using this pointer causes strange deoptimization in hot loop

I recently came across a strange deoptimization (or rather missed optimization opportunity).

Consider this function for efficient unpacking of arrays of 3-bit integers to 8-bit integers. It unpacks 16 ints in each loop iteration:

void unpack3bit(uint8_t* target, char* source, int size) {
   while(size > 0){
      uint64_t t = *reinterpret_cast<uint64_t*>(source);
      target[0] = t & 0x7;
      target[1] = (t >> 3) & 0x7;
      target[2] = (t >> 6) & 0x7;
      target[3] = (t >> 9) & 0x7;
      target[4] = (t >> 12) & 0x7;
      target[5] = (t >> 15) & 0x7;
      target[6] = (t >> 18) & 0x7;
      target[7] = (t >> 21) & 0x7;
      target[8] = (t >> 24) & 0x7;
      target[9] = (t >> 27) & 0x7;
      target[10] = (t >> 30) & 0x7;
      target[11] = (t >> 33) & 0x7;
      target[12] = (t >> 36) & 0x7;
      target[13] = (t >> 39) & 0x7;
      target[14] = (t >> 42) & 0x7;
      target[15] = (t >> 45) & 0x7;
      source+=6;
      size-=6;
      target+=16;
   }
}

Here is the generated assembly for parts of the code:

 ...
 367:   48 89 c1                mov    rcx,rax
 36a:   48 c1 e9 09             shr    rcx,0x9
 36e:   83 e1 07                and    ecx,0x7
 371:   48 89 4f 18             mov    QWORD PTR [rdi+0x18],rcx
 375:   48 89 c1                mov    rcx,rax
 378:   48 c1 e9 0c             shr    rcx,0xc
 37c:   83 e1 07                and    ecx,0x7
 37f:   48 89 4f 20             mov    QWORD PTR [rdi+0x20],rcx
 383:   48 89 c1                mov    rcx,rax
 386:   48 c1 e9 0f             shr    rcx,0xf
 38a:   83 e1 07                and    ecx,0x7
 38d:   48 89 4f 28             mov    QWORD PTR [rdi+0x28],rcx
 391:   48 89 c1                mov    rcx,rax
 394:   48 c1 e9 12             shr    rcx,0x12
 398:   83 e1 07                and    ecx,0x7
 39b:   48 89 4f 30             mov    QWORD PTR [rdi+0x30],rcx
 ...

It looks quite efficent. Simply a shift right followed by an and, and then a store to the target buffer. But now, look what happens when I change the function to a method in a struct:

struct T{
   uint8_t* target;
   char* source;
   void unpack3bit( int size);
};

void T::unpack3bit(int size) {
        while(size > 0){
           uint64_t t = *reinterpret_cast<uint64_t*>(source);
           target[0] = t & 0x7;
           target[1] = (t >> 3) & 0x7;
           target[2] = (t >> 6) & 0x7;
           target[3] = (t >> 9) & 0x7;
           target[4] = (t >> 12) & 0x7;
           target[5] = (t >> 15) & 0x7;
           target[6] = (t >> 18) & 0x7;
           target[7] = (t >> 21) & 0x7;
           target[8] = (t >> 24) & 0x7;
           target[9] = (t >> 27) & 0x7;
           target[10] = (t >> 30) & 0x7;
           target[11] = (t >> 33) & 0x7;
           target[12] = (t >> 36) & 0x7;
           target[13] = (t >> 39) & 0x7;
           target[14] = (t >> 42) & 0x7;
           target[15] = (t >> 45) & 0x7;
           source+=6;
           size-=6;
           target+=16;
        }
}

I thought the generated assembly should be quite the same, but it isn't. Here is a part of it:

...
 2b3:   48 c1 e9 15             shr    rcx,0x15
 2b7:   83 e1 07                and    ecx,0x7
 2ba:   88 4a 07                mov    BYTE PTR [rdx+0x7],cl
 2bd:   48 89 c1                mov    rcx,rax
 2c0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2c3:   48 c1 e9 18             shr    rcx,0x18
 2c7:   83 e1 07                and    ecx,0x7
 2ca:   88 4a 08                mov    BYTE PTR [rdx+0x8],cl
 2cd:   48 89 c1                mov    rcx,rax
 2d0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2d3:   48 c1 e9 1b             shr    rcx,0x1b
 2d7:   83 e1 07                and    ecx,0x7
 2da:   88 4a 09                mov    BYTE PTR [rdx+0x9],cl
 2dd:   48 89 c1                mov    rcx,rax
 2e0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2e3:   48 c1 e9 1e             shr    rcx,0x1e
 2e7:   83 e1 07                and    ecx,0x7
 2ea:   88 4a 0a                mov    BYTE PTR [rdx+0xa],cl
 2ed:   48 89 c1                mov    rcx,rax
 2f0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 ...

As you see, we introduced an additional redundant load from memory before each shift (mov rdx,QWORD PTR [rdi]). It seems like the target pointer (which is now a member instead of a local variable) has to be always reloaded before storing into it. This slows down the code considerably (around 15% in my measurements).

First I thought maybe the C++ memory model enforces that a member pointer may not be stored in a register but has to be reloaded, but this seemed like an awkward choice, as it would make a lot of viable optimizations impossible. So I was very surprised that the compiler did not store target in a register here.

I tried caching the member pointer myself into a local variable:

void T::unpack3bit(int size) {
    while(size > 0){
       uint64_t t = *reinterpret_cast<uint64_t*>(source);
       uint8_t* target = this->target; // << ptr cached in local variable
       target[0] = t & 0x7;
       target[1] = (t >> 3) & 0x7;
       target[2] = (t >> 6) & 0x7;
       target[3] = (t >> 9) & 0x7;
       target[4] = (t >> 12) & 0x7;
       target[5] = (t >> 15) & 0x7;
       target[6] = (t >> 18) & 0x7;
       target[7] = (t >> 21) & 0x7;
       target[8] = (t >> 24) & 0x7;
       target[9] = (t >> 27) & 0x7;
       target[10] = (t >> 30) & 0x7;
       target[11] = (t >> 33) & 0x7;
       target[12] = (t >> 36) & 0x7;
       target[13] = (t >> 39) & 0x7;
       target[14] = (t >> 42) & 0x7;
       target[15] = (t >> 45) & 0x7;
       source+=6;
       size-=6;
       this->target+=16;
    }
}

This code also yields the "good" assembler without additional stores. So my guess is: The compiler is not allowed to hoist the load of a member pointer of a struct, so such a "hot pointer" should always be stored in a local variable.

So, why is the compiler unable to optimize out these loads?

Is it the C++ memory model that forbids this? Or is it simply a shortcoming of my compiler?

Is my guess correct or what is the exact reason why the optimization can't be performed?

The compiler in use was g++ 4.8.2-19ubuntu1 with -O3 optimization. I also tried clang++ 3.4-1ubuntu3 with similar results: Clang is even able to vectorize the method with the local target pointer. However, using the this->target pointer yields the same result: An extra load of the pointer before each store.

I checked the assembler of some similar methods and the result is the same: It seems that a member of this always has to be reloaded before a store, even if such a load could simply be hoisted outside the loop. I will have to rewrite a lot of code to get rid of these additional stores, mainly by caching the pointer myself into a local variable that is declared above the hot code. But I always thought fiddling with such details as caching a pointer in a local variable would surely qualify for premature optimization in these days where compilers have gotten so clever. But it seems I am wrong here. Caching a member pointer in a hot loop seems to be a necessary manual optimization technique.

Not sure why this got a down-vote - it's an interesting question. FWIW I've seen similar optimisation problems with non-pointer member variables where the solution has been similar, i.e. cache the member variable in a local variable for the lifetime of the method. I'm guessing it's something to do with aliasing rules ?
Looks like the compiler don't optimize because he cannot ensure that the member is not accessed through some "external" code. So if the member can be modified outside, then it should be reloaded each time is accessed. Seems to be considered like a kind of volatile...
No not using this-> is just syntactic sugar. The problem is related to the nature of the variables (local vs member) and the things that the compiler deduces from this fact.
Anything to do with pointer aliases ?
As a more semantic matter, "premature optimization" applies only to optimization that is premature, i.e., before profiling has found it to be an issue. In this case, you diligently profiled and decompiled and found the source of an issue and formulated and profiled a solution. It is absolutely not "premature" to apply that solution.

a
akim

Pointer aliasing seems to be the problem, ironically between this and this->target. The compiler is taking into account the rather obscene possibility that you initialized:

this->target = &this

In that case, writing to this->target[0] would alter the contents of this (and thus, this->target).

The memory aliasing problem is not restricted to the above. In principle, any use of this->target[XX] given an (in)appropriate value of XX might point to this.

I am better versed in C, where this can be remedied by declaring pointer variables with the __restrict__ keyword.


I can confirm this! Changing target from uint8_t to uint16_t (so that strict aliasing rules kick in) changed it. With uint16_t, the load is always optimized out.
Changing the contents of this is not what you mean (it is not a variable); you mean changing the contents of *this.
@gexicide mind elaborating how strict alias kicks in and fixes the issue?
P
Peter Mortensen

Strict aliasing rules allows char* to alias any other pointer. So this->target may alias with this, and in your code method, the first part of the code,

target[0] = t & 0x7;
target[1] = (t >> 3) & 0x7;
target[2] = (t >> 6) & 0x7;

is in fact

this->target[0] = t & 0x7;
this->target[1] = (t >> 3) & 0x7;
this->target[2] = (t >> 6) & 0x7;

as this may be modified when you modify this->target content.

Once this->target is cached into a local variable, the alias is no longer possible with the local variable.


So, can we say as a general rule: Whenever you have a char* or void* in your struct, be sure to cache it in a local variable before writing to it?
In fact it is when you use a char*, not necessary as a member.
C
Community

The issue here is strict aliasing which says that we are allowed to alias through a char* and so that prevents compiler optimization in your case. We are not allowed to alias through a pointer of a different type which would be undefined behavior, normally on SO we see this problem which is users attempting to alias through incompatible pointer types.

It would seem reasonable to implement uint8_t as a unsigned char and if we look at cstdint on Coliru it includes stdint.h which typedefs uint8_t as follows:

typedef unsigned char       uint8_t;

if you used another non-char type then the compiler should be able to optimize.

This is covered in the draft C++ standard section 3.10 Lvalues and rvalues which says:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined

and includes the following bullet:

a char or unsigned char type.

Note, I posted a comment on possible work arounds in a question that asks When is uint8_t ≠ unsigned char? and the recommendation was:

The trivial workaround, however, is to use the restrict keyword, or to copy the pointer to a local variable whose address is never taken so that the compiler does not need to worry about whether the uint8_t objects can alias it.

Since C++ does not support the restrict keyword you have to rely on compiler extension, for example gcc uses __restrict__ so this is not totally portable but the other suggestion should be.


This is an example of a place where the Standard is worse for optimizers than would be a rule would allow a compiler to assume that between two accesses to an object of type T, or such an access and the start or end of a loop/function wherein it occurs, all accesses to the storage will use the same object unless an intervening operation uses that object (or a pointer/reference to it) to derive a pointer or reference to some other object. Such a rule would eliminate the need for the "character-type exception" which can kill the performance of code that works with sequences of bytes.