With Clang -O3
, the function f1
typedef unsigned char u8;
u8 f1(u8* p, u8* end1, u8* end2)
{
while (p < end1 && p < end2) {
if (*p & 0x80)
return 0xff;
p++;
}
return 0x7f;
}
compiles to
f1: # @f1
cmp rdi, rsi
setae cl
cmp rdi, rdx
setae r8b
mov al, 127
or r8b, cl
jne .LBB0_6
inc rdi
.LBB0_4: # =>This Inner Loop Header: Depth=1
cmp byte ptr [rdi - 1], 0
js .LBB0_5
cmp rdi, rsi
jae .LBB0_6
lea rcx, [rdi + 1]
cmp rdi, rdx
mov rdi, rcx
jb .LBB0_4
.LBB0_6:
ret
.LBB0_5:
mov al, -1
ret
while the function f2
u8 f2(u8* p, u8* end1, u8* end2)
{
if (end2 < end1)
end1 = end2;
while (p < end1) {
if (*p & 0x80)
return 0xff;
p++;
}
return 0x7f;
}
compiles to
f2: # @f2
cmp rdx, rsi
cmovb rsi, rdx
mov al, 127
cmp rsi, rdi
jbe .LBB1_4
.LBB1_2: # =>This Inner Loop Header: Depth=1
cmp byte ptr [rdi], 0
js .LBB1_3
inc rdi
cmp rdi, rsi
jb .LBB1_2
.LBB1_4:
ret
.LBB1_3:
mov al, -1
ret
which has less instructions in the loop.
With GCC -O3
generated code is different but similar, there're more instructions in f1
's loop.
Live demo: https://godbolt.org/z/edb5Eda5b
Is there corner case that prevents the compiler from optimizing f1
as f2
, e.g. are they nonequivalent?
Or is it a balance/trade-off to provide a close mapping from source code to binary code so that developers have more control?
I tried adding assumptions like ASSUME(p <= end1 && p <= end2)
but nothing changed.
Answers
The functions f1
and f2
are not strictly equivalent due to a subtle difference in their loop conditions.
In f1
:
while (p < end1 && p < end2)
In f2
:
while (p < end1)
In f1
, the loop condition checks both p < end1
and p < end2
. This means that the loop will terminate as soon as either condition becomes false. However, in f2
, the loop condition only checks p < end1
, and the check for p < end2
is handled separately outside the loop.
This difference may affect the behavior of the loop and hence the optimization opportunities for the compiler. Specifically, in f1
, the compiler needs to ensure that p < end2
is checked on every iteration of the loop, which may limit certain optimizations.
The reason why f2
has fewer instructions in the loop is likely due to the simpler loop condition, which allows the compiler to generate more optimized code.
To achieve similar optimization in f1
, you can try explicitly checking p < end2
outside the loop, similar to how it's done in f2
. This may allow the compiler to generate more efficient code by simplifying the loop condition.
In general, compiler optimizations are complex and can be influenced by various factors, including the specific code patterns, compiler versions, optimization settings, and target architecture. The goal of optimizations is to balance between code size, execution speed, and correctness, and the compiler may make different decisions based on these factors. Therefore, it's not always possible to predict or force specific optimizations, and sometimes, minor changes in the code may lead to significant differences in the generated code.