Please note that this is not an endorsement or criticism of either of these languages; it’s simply something I found interesting in how LLVM handles code generation for the two. This is an implementation quirk, not a language issue.
Update (April 9, 2021): A bug report was filed and a fix was pushed!
The LLVM project is a modular set of tools that makes designing and implementing a compiler significantly easier. The best-known part of LLVM is its intermediate representation, IR for short. LLVM’s IR is an extremely powerful tool, designed to make optimizing and targeting many architectures as easy as possible. Many tools use LLVM IR; the Clang C++ compiler and the Rust compiler (rustc) are both notable examples. However, despite this shared foundation, code generation can still vary wildly depending on how the IR is used. Some time ago, I stumbled upon this tweet discussing Rust’s implementation of clamping compared to C++:
Rust 1.50 is out and has f32.clamp. I had extremely low expectations for performance based on C++ experience but as usual Rust proves to be "C++ done right".
— Arseny Kapoulkine (@zeuxcg) February 11, 2021
Of course Zig already has clamp and also gets codegen right. pic.twitter.com/0WI1fLrQaB
Rust’s code generation on the latest version of LLVM is far superior to that of an equivalent Clang version using std::clamp, even though they use the same underlying IR.
With f32.clamp:
pub fn clamp(v: f32) -> f32 {
v.clamp(-1.0, 1.0)
}
The corresponding assembly is shown below. It is short, concise, and pretty much the best you’re going to get. We can see two memory accesses to get the clamp bounds and efficient use of x86 instructions.
.LCPI0_0:
.long 0xbf800000
.LCPI0_1:
.long 0x3f800000
example::clamp:
movss xmm1, dword ptr [rip + .LCPI0_0]
maxss xmm1, xmm0
movss xmm0, dword ptr [rip + .LCPI0_1]
minss xmm0, xmm1
ret
Next is a short C++ program using std::clamp:
#include <algorithm>
float clamp2(float v) {
return std::clamp(v, -1.f, 1.f);
}
The corresponding assembly is shown below. It is significantly longer with many more data accesses, conditional moves, and is in general uglier.
.LCPI1_0:
.long 0x3f800000 # float 1
.LCPI1_1:
.long 0xbf800000 # float -1
clamp2(float): # @clamp2(float)
movss dword ptr [rsp - 4], xmm0
mov dword ptr [rsp - 8], -1082130432
mov dword ptr [rsp - 12], 1065353216
ucomiss xmm0, dword ptr [rip + .LCPI1_0]
lea rax, [rsp - 12]
lea rcx, [rsp - 4]
cmova rcx, rax
movss xmm1, dword ptr [rip + .LCPI1_1] # xmm1 = mem[0],zero,zero,zero
ucomiss xmm1, xmm0
lea rax, [rsp - 8]
cmovbe rax, rcx
movss xmm0, dword ptr [rax] # xmm0 = mem[0],zero,zero,zero
ret
Interestingly enough, reimplementing std::clamp causes this issue to disappear:
float clamp(float v, float lo, float hi) {
v = (v < lo) ? lo : v;
v = (v > hi) ? hi : v;
return v;
}
float clamp1(float v) {
return clamp(v, -1.f, 1.f);
}
The assembly generated here is the same as with Rust’s implementation:
.LCPI0_0:
.long 0xbf800000 # float -1
.LCPI0_1:
.long 0x3f800000 # float 1
clamp1(float): # @clamp1(float)
movss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
maxss xmm1, xmm0
movss xmm0, dword ptr [rip + .LCPI0_1] # xmm0 = mem[0],zero,zero,zero
minss xmm0, xmm1
ret
Clearly, something is off between std::clamp and our implementation. According to the C++ reference, std::clamp takes its three arguments by const reference along with a predicate (which defaults to std::less) and returns a reference. Functionally, the only difference between our code and std::clamp is that we do not use reference types. Knowing this, we can reproduce the issue:
const float& bad_clamp(const float& v, const float& lo, const float& hi) {
return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
return bad_clamp(v, -1.f, 1.f);
}
Once again, we’ve generated the same bad code as with std::clamp:
.LCPI1_0:
.long 0x3f800000 # float 1
.LCPI1_1:
.long 0xbf800000 # float -1
clamp2(float): # @clamp2(float)
movss dword ptr [rsp - 4], xmm0
mov dword ptr [rsp - 8], -1082130432
mov dword ptr [rsp - 12], 1065353216
ucomiss xmm0, dword ptr [rip + .LCPI1_0]
lea rax, [rsp - 12]
lea rcx, [rsp - 4]
cmova rcx, rax
movss xmm1, dword ptr [rip + .LCPI1_1] # xmm1 = mem[0],zero,zero,zero
ucomiss xmm1, xmm0
lea rax, [rsp - 8]
cmovbe rax, rcx
movss xmm0, dword ptr [rax] # xmm0 = mem[0],zero,zero,zero
ret
LLVM IR and Clang
LLVM IR is a static single assignment (SSA) intermediate representation, meaning every variable is assigned exactly once. To represent conditional assignments, SSA form uses a special type of instruction called a “phi” node, which picks a value based on which block was previously executing. However, Clang does not initially emit phi nodes. Instead, to make initial code generation easier, local variables are allocated on the stack using alloca instructions, and reads and writes to a variable become load and store instructions on its alloca, respectively:
int main() {
float x = 0;
}
In this unoptimized IR, we can see an alloca instruction that then has the float value 0 stored to it:
define dso_local i32 @main() #0 {
%1 = alloca float, align 4
store float 0.000000e+00, float* %1, align 4
ret i32 0
}
LLVM will then (hopefully) optimize away the alloca instructions with a relevant pass, such as SROA (Scalar Replacement of Aggregates).
LLVM IR and reference types
Reference types are represented as pointers in LLVM IR:
void test(float& x2) {
x2 = 1;
}
In this optimized IR, we can see that the reference has been converted to a pointer with specific attributes.
define dso_local void @_Z4testRf(float* nocapture nonnull align 4 dereferenceable(4) %0) local_unnamed_addr #0 {
store float 1.000000e+00, float* %0, align 4, !tbaa !2
ret void
}
When a function takes a reference-type argument, it is passed the underlying object’s address instead of the object itself, along with some metadata about the reference. For example, the nonnull and dereferenceable attributes are attached to the argument because the C++ standard dictates that a reference must always be bound to a valid object. For us, this means the alloca instructions are passed directly to the clamp function:
__attribute__((noinline)) const float& bad_clamp(const float& v, const float& lo, const float& hi) {
return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
return bad_clamp(v, -1.f, 1.f);
}
In this optimized IR, we can see the alloca instructions corresponding to the variables passed by reference handed directly to bad_clamp:
define linkonce_odr dso_local nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %0, float* nonnull align 4 dereferenceable(4) %1, float* nonnull align 4 dereferenceable(4) %2) local_unnamed_addr #2 comdat {
%4 = load float, float* %0, align 4
%5 = load float, float* %1, align 4
%6 = fcmp olt float %4, %5
%7 = load float, float* %2, align 4
%8 = fcmp ogt float %4, %7
%9 = select i1 %8, float* %2, float* %0
%10 = select i1 %6, float* %1, float* %9
ret float* %10
}
define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #1 {
%2 = alloca float, align 4
%3 = alloca float, align 4
%4 = alloca float, align 4
store float %0, float* %2, align 4
store float -1.000000e+00, float* %3, align 4
store float 1.000000e+00, float* %4, align 4
%6 = call nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %2, float* nonnull align 4 dereferenceable(4) %3, float* nonnull align 4 dereferenceable(4) %4)
%7 = load float, float* %6, align 4
ret float %7
}
Lifetime annotations are omitted to make the IR a bit clearer.
In this example, the noinline attribute was used to demonstrate passing references to functions. If we remove the attribute, the call is inlined into the caller:
const float& bad_clamp(const float& v, const float& lo, const float& hi) {
return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
return bad_clamp(v, -1.f, 1.f);
}
However, even after optimization, the alloca instructions are still there for seemingly no good reason. They should have been optimized away by LLVM’s passes; they’re not used anywhere else, and there are no tricky stores or lifetime problems.
define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #0 {
%2 = alloca float, align 4
%3 = alloca float, align 4
%4 = alloca float, align 4
store float %0, float* %2, align 4, !tbaa !2
store float -1.000000e+00, float* %3, align 4, !tbaa !2
store float 1.000000e+00, float* %4, align 4, !tbaa !2
%5 = fcmp olt float %0, -1.000000e+00
%6 = fcmp ogt float %0, 1.000000e+00
%7 = select i1 %6, float* %4, float* %2
%8 = select i1 %5, float* %3, float* %7
%9 = load float, float* %8, align 4, !tbaa !2
ret float %9
}
The only candidates here are the two sequential select instructions, as they operate on the pointers created by the alloca instructions rather than on the underlying values. However, LLVM also has a pass for this: when possible, LLVM will try to “speculate” across select instructions whose results are loaded.
select instructions are essentially ternary operators that pick one of their last two operands (float pointers in our case) based on the value of the first operand.
Select speculation: where things go wrong
A few calls down the chain, this speculation logic calls isDereferenceableAndAlignedPointer, which determines whether a pointer can safely be dereferenced. The code here exposes the main issue: a select instruction is never considered dereferenceable. As such, when there are two selects in sequence (as with our std::clamp), LLVM will not speculate across them and will not remove the alloca instructions.
Fix 1: libcxx
A potential fix is modifying the original libcxx code so that it does not produce select instructions in the same way. For example, we can mimic our original implementation using pointers instead of value types. Though the change in IR output is relatively small, this gives us the code generation we want without modifying the LLVM codebase:
const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
const float *out;
out = (v < lo) ? &lo : &v;
out = (*out > hi) ? &hi : out;
return *out;
}
float clamp3(float v) {
return better_ref_clamp(v, -1.f, 1.f);
}
As you can see, the IR generated after the call is inlined is significantly shorter and more efficient than before:
define dso_local float @_Z6clamp3f(float %0) local_unnamed_addr #1 {
%2 = fcmp olt float %0, -1.000000e+00
%3 = select i1 %2, float -1.000000e+00, float %0
%4 = fcmp ogt float %3, 1.000000e+00
%5 = select i1 %4, float 1.000000e+00, float %3
ret float %5
}
And the corresponding assembly is back to what we want it to be:
.LCPI1_0:
.long 0xbf800000 # float -1
.LCPI1_1:
.long 0x3f800000 # float 1
clamp3(float): # @clamp3(float)
movss xmm1, dword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero,zero,zero
maxss xmm1, xmm0
movss xmm0, dword ptr [rip + .LCPI1_1] # xmm0 = mem[0],zero,zero,zero
minss xmm0, xmm1
ret
Fix 2: LLVM
A much more general approach is fixing the code generation issue in LLVM itself, which could be as simple as this:
diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index d8f954f575838d9886fce0df2d40407b194e7580..affb55c7867f48866045534d383b4d7ba19773a3 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -103,6 +103,14 @@ static bool isDereferenceableAndAlignedPointer(
CtxI, DT, TLI, Visited, MaxDepth);
}
+ // For select instructions, both operands need to be dereferenceable.
+ if (const SelectInst *SelInst = dyn_cast<SelectInst>(V))
+ return isDereferenceableAndAlignedPointer(SelInst->getOperand(1), Alignment,
+ Size, DL, CtxI, DT, TLI,
+ Visited, MaxDepth) &&
+ isDereferenceableAndAlignedPointer(SelInst->getOperand(2), Alignment,
+ Size, DL, CtxI, DT, TLI,
+ Visited, MaxDepth);
// For gc.relocate, look through relocations
if (const GCRelocateInst *RelocateInst = dyn_cast<GCRelocateInst>(V))
return isDereferenceableAndAlignedPointer(RelocateInst->getDerivedPtr(),
All it does is add select instructions to the list of instruction types considered potentially dereferenceable. Though it seems to fix the issue (and Alive2 seems to like it), this change is otherwise untested. The codegen still isn’t perfect, either: though the redundant memory accesses are removed, there are still many more instructions than in our libcxx fix (and Rust’s implementation):
.LCPI0_0:
.long 0x3f800000 # float 1
.LCPI0_1:
.long 0xbf800000 # float -1
clamp2(float): # @clamp2(float)
movss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
minss xmm1, xmm0
movss xmm2, dword ptr [rip + .LCPI0_1] # xmm2 = mem[0],zero,zero,zero
cmpltss xmm0, xmm2
movaps xmm3, xmm0
andnps xmm3, xmm1
andps xmm0, xmm2
orps xmm0, xmm3
ret
However, this is a consequence of how the ternary operators in the original libcxx clamp are written:
template<class _Tp, class _Compare>
const _Tp& clamp(const _Tp& __v, const _Tp& __lo, const _Tp& __hi, _Compare __comp)
{
_LIBCPP_ASSERT(!__comp(__hi, __lo), "Bad bounds passed to std::clamp");
return __comp(__v, __lo) ? __lo : __comp(__hi, __v) ? __hi : __v;
}
The reason this doesn’t look as good is that LLVM needs to keep the original value of __v around for the second comparison. Due to this, it can’t optimize the second part of the computation into a maxss, as that would produce different behavior when __lo is greater than __hi and __v is negative.
#include <cstdio>

const float& ref_clamp(const float& v, const float& lo, const float& hi) {
return (v < lo) ? lo : (v > hi) ? hi : v;
}
const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
const float *out;
out = (v < lo) ? &lo : &v;
out = (*out > hi) ? &hi : out;
return *out;
}
int main() {
printf("%f\n", ref_clamp(-2.f, 1.f, -1.f)); // this prints 1.000
printf("%f\n", better_ref_clamp(-2.f, 1.f, -1.f)); // this prints -1.000
}
Even though we know that passing bounds with __lo greater than __hi is undefined behavior for std::clamp in C++, LLVM doesn’t have enough information to know that. Adjusting code generation accordingly would be no easy task either. Despite all of this, though, it does show how versatile LLVM truly is; relatively simple changes can have significant results.