Is this loop pattern with two end pointers a missed optimization

ghz 8months ago ⋅ 92 views

With Clang -O3, the function f1

typedef unsigned char u8;

u8 f1(u8* p, u8* end1, u8* end2)
{
    while (p < end1 && p < end2) {
        if (*p & 0x80)
            return 0xff;
        p++;
    }
    return 0x7f;
}

compiles to

f1:                                     # @f1
        cmp     rdi, rsi
        setae   cl
        cmp     rdi, rdx
        setae   r8b
        mov     al, 127
        or      r8b, cl
        jne     .LBB0_6
        inc     rdi
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        cmp     byte ptr [rdi - 1], 0
        js      .LBB0_5
        cmp     rdi, rsi
        jae     .LBB0_6
        lea     rcx, [rdi + 1]
        cmp     rdi, rdx
        mov     rdi, rcx
        jb      .LBB0_4
.LBB0_6:
        ret
.LBB0_5:
        mov     al, -1
        ret

while the function f2

u8 f2(u8* p, u8* end1, u8* end2)
{
    if (end2 < end1)
        end1 = end2;
    while (p < end1) {
        if (*p & 0x80)
            return 0xff;
        p++;
    }
    return 0x7f;
}

compiles to

f2:                                     # @f2
        cmp     rdx, rsi
        cmovb   rsi, rdx
        mov     al, 127
        cmp     rsi, rdi
        jbe     .LBB1_4
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
        cmp     byte ptr [rdi], 0
        js      .LBB1_3
        inc     rdi
        cmp     rdi, rsi
        jb      .LBB1_2
.LBB1_4:
        ret
.LBB1_3:
        mov     al, -1
        ret

which has less instructions in the loop.

With GCC -O3 generated code is different but similar, there're more instructions in f1's loop.

Live demo: https://godbolt.org/z/edb5Eda5b

Is there corner case that prevents the compiler from optimizing f1 as f2, e.g. are they nonequivalent?

Or is it a balance/trade-off to provide a close mapping from source code to binary code so that developers have more control?

I tried adding assumptions like ASSUME(p <= end1 && p <= end2) but nothing changed.

Answers

The functions f1 and f2 are not strictly equivalent due to a subtle difference in their loop conditions.

In f1:

while (p < end1 && p < end2)

In f2:

while (p < end1)

In f1, the loop condition checks both p < end1 and p < end2. This means that the loop will terminate as soon as either condition becomes false. However, in f2, the loop condition only checks p < end1, and the check for p < end2 is handled separately outside the loop.

This difference may affect the behavior of the loop and hence the optimization opportunities for the compiler. Specifically, in f1, the compiler needs to ensure that p < end2 is checked on every iteration of the loop, which may limit certain optimizations.

The reason why f2 has fewer instructions in the loop is likely due to the simpler loop condition, which allows the compiler to generate more optimized code.

To achieve similar optimization in f1, you can try explicitly checking p < end2 outside the loop, similar to how it's done in f2. This may allow the compiler to generate more efficient code by simplifying the loop condition.

In general, compiler optimizations are complex and can be influenced by various factors, including the specific code patterns, compiler versions, optimization settings, and target architecture. The goal of optimizations is to balance between code size, execution speed, and correctness, and the compiler may make different decisions based on these factors. Therefore, it's not always possible to predict or force specific optimizations, and sometimes, minor changes in the code may lead to significant differences in the generated code.