
Recently, there’s been a lot of chatter online about compilers “ruining” hand-tuned branchless code. Developers expect a ternary expression or a manual cmov trick to produce branchless assembly; instead, the compiler happily emits a branch. What gives?
I had my own confusion…see my original article here.
Branch Prediction vs. Conditional Moves (cmov) – A Nuanced Trade-off
The reality is that cmov is not free. While it avoids mispredictions, it lengthens dependency chains. Modern out-of-order cores are good at hiding branch cost when the prediction rate is decent (Agner Fog suggests ~75% or better). If the branch is predictable (often the case in real-world workloads), a predicted branch can outperform a conditional move.
To recap:
- cmov avoids branch mispredictions (good for unpredictable data)
- cmov extends dependency chains (bad for high-ILP code)
- Branches with high predictability (over ~75%) are often faster than cmov
- Unpredictable branches are catastrophic for performance
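The predictability trade-off above can be demonstrated with a small sketch. The helper name `sum_above` and the data setup are hypothetical, chosen only to illustrate the point: the same loop over the same values is branch-predictor-friendly when the data is sorted and hostile when it is shuffled.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Sum of elements at or above a threshold; the comparison is a branch
// the predictor must learn. (Hypothetical helper for illustration.)
int64_t sum_above(const std::vector<int>& v, int threshold) {
    int64_t sum = 0;
    for (int x : v) {
        if (x >= threshold) sum += x;
    }
    return sum;
}

// On shuffled data the branch is ~50% taken and mispredicts constantly;
// after sorting, the branch direction flips exactly once, so the
// predictor is nearly perfect. Timing the two calls (e.g. with
// std::chrono) typically shows a large gap despite identical work.
void demo() {
    std::vector<int> data(1 << 16);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : data) x = dist(rng);

    int64_t unsorted_sum = sum_above(data, 128);
    std::sort(data.begin(), data.end());
    int64_t sorted_sum = sum_above(data, 128);
    (void)unsorted_sum;
    (void)sorted_sum;  // same value either way: only predictability differs
}
```

Wrap the two calls in a timer to see the effect on your own hardware; the sums are identical, so any difference comes from prediction.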
The Compiler’s View of the Ternary Operator
Many developers assume that writing:

```cpp
r = cond ? a : b;
```

forces the compiler to generate a cmov or equivalent branchless code. This is wrong. Both GCC and Clang lower this to IR using a branch, not a cmov. Why? Because `a` and `b` could have side effects, and the C++ language requires that:

- Only `a` executes if `cond` is true.
- Only `b` executes if `cond` is false.
This makes the ternary operator semantically equivalent to:
```cpp
if (cond) r = a; else r = b;
```
As a result, the optimizer sees a branch in the IR, not a branchless cmov, and the backend codegen can’t magically turn it into a cmov unless it can prove both sides are side-effect-free and cheap.
When to Force cmov (and When Not To)
If you really want a cmov, you need to force the compiler’s hand by writing low-level code that eliminates the ambiguity:

```cpp
r = cond * a + (1 - cond) * b; // classic trick, but awkward
```

Or, with intrinsics:

```cpp
r = _mm_blendv_epi8(b, a, mask); // for SIMD
```
Here is a complete example for Compiler Explorer:
```cpp
#include <immintrin.h>
#include <cstdint>

// Example inputs
bool cond = true;
int32_t a = 42;
int32_t b = 24;

// Case 1: Plain ternary (compiler decides)
int32_t ternary_example() {
    return cond ? a : b;
}

// Case 2: Classic arithmetic trick to force cmov-like behavior
int32_t arithmetic_trick() {
    return cond * a + (1 - cond) * b;
}

// Case 3: SIMD blend (forces a vectorized conditional move if supported)
int32_t simd_blend() {
    __m128i va = _mm_set1_epi32(a);
    __m128i vb = _mm_set1_epi32(b);
    __m128i mask = _mm_set1_epi32(cond ? -1 : 0); // -1 = all bits set (true)
    __m128i result = _mm_blendv_epi8(vb, va, mask);
    return _mm_extract_epi32(result, 0); // extract the first lane
}
```
This produces distinctly different assembly for each case:

```asm
ternary_example():
        lea     rax, [rip + a]
        lea     rcx, [rip + b]
        cmp     byte ptr [rip + cond], 0
        cmovne  rcx, rax
        mov     eax, dword ptr [rcx]
        ret

arithmetic_trick():
        movzx   eax, byte ptr [rip + cond]
        mov     ecx, dword ptr [rip + a]
        imul    ecx, eax
        xor     eax, 1
        imul    eax, dword ptr [rip + b]
        add     eax, ecx
        ret

simd_blend():
        movzx   eax, byte ptr [rip + cond]
        neg     eax
        movd    xmm1, dword ptr [rip + b]
        movd    xmm2, dword ptr [rip + a]
        movd    xmm0, eax
        pblendvb xmm1, xmm2, xmm0
        movd    eax, xmm1
        ret

cond:
        .byte   1
a:
        .long   42
b:
        .long   24
```
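For scalar code where SIMD is overkill, another common portable trick (not shown above; sketched here as an assumption about what a reader might reach for) is the XOR/mask select, which avoids both the branch and the multiplies of the arithmetic version:

```cpp
#include <cstdint>

// Branchless select via bitmask: when cond is true the mask is all
// ones and b ^ ((a ^ b) & mask) cancels to a; when cond is false the
// mask is zero and the expression collapses to b.
int32_t mask_select(bool cond, int32_t a, int32_t b) {
    int32_t mask = -static_cast<int32_t>(cond);  // 0x00000000 or 0xFFFFFFFF
    return b ^ ((a ^ b) & mask);
}
```

Like the arithmetic trick, this removes any side-effect ambiguity, so the compiler has no reason to emit a branch; whether it beats a predicted branch still depends on the data.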
Summary
| Case | Best Option |
|---|---|
| Highly predictable (>=75%) | Branch (compiler default) |
| Highly unpredictable | cmov (forced if necessary) |
| Side-effectful expressions | Branch (required by language) |
Final Word
Compilers aren’t out to sabotage your performance. They’re making reasonable trade-offs based on heuristics and analysis. If in doubt, profile real prediction rates and measure performance before jumping to conclusions.
This section is part of the ongoing updates to Branch Prediction – The Definitive Guide for High-Performance C++.