A previous article described a solution to a typical l33tcode interview question, and mainly explored 1) how modern CPUs prefer predictable, cache-friendly memory access patterns, and 2) how algorithmic complexity can impact runtime. We alluded to SIMD-accelerated variants of the solution, but didn’t explore them in depth.
This followup article describes the SIMD implementations of the original “Third Largest” question, with performance results. In the earlier article, we wanted to know: what values of k cause the asymptotically faster Heap version of the algorithm to actually be faster than the Sort version? The question explored in this article: with the aid of SIMD, how much faster can the Sort version go? And is it worth the effort?
Quoting from the previous article:
So far, we’ve explored the design space that a good coder might be able to explore during a coding interview. We’ve been good computer science students and used our algorithm analysis skills, replacing an O(Nk) algorithm with an O(N lg k) algorithm. But what if we give ourselves permission to revisit the sorting algorithm, with the help of SIMD? Can the insertion step for Insertion Sort be implemented in O(1) time on a small array? Yes. Yes, it can. But exploring that implementation strategy will be left for another article.
And here we are.
It bears mention that the SIMD-accelerated implementations of this function only work for small k, namely k<=8. We’ll touch on implementation strategies for any k, and I believe SIMD instruction sets will deliver higher performance than equivalent scalar implementations, but today we are only talking about small k.
An AVX register can hold a maxSoFar array of up to size 8, and with the right instruction sequence, we can identify where a candidate value must be inserted, conditionally shift the SIMD lanes of the maxSoFar register, and insert the incoming value in sorted order, all using a handful of SIMD instructions.
inline void insertNewMax_x4( __m128i& v_maxSoFar, __m128i x ) {
    // all-ones in every lane whose current value is less than the candidate
    __m128i v_cmplt = _mm_cmplt_epi32( v_maxSoFar, x );
    // XOR with the mask shifted down one lane isolates the single insertion lane
    __m128i v_insertion_mask = _mm_xor_si128( v_cmplt, _mm_srli_si128( v_cmplt, 4 ) );
    // values shifted down one lane, to make room for the candidate
    __m128i v_shift1 = _mm_srli_si128( v_maxSoFar, 4 );
    v_maxSoFar = _mm_blendv_epi8( v_maxSoFar, v_shift1, v_cmplt );
    v_maxSoFar = _mm_blendv_epi8( v_maxSoFar, x, v_insertion_mask );
}
inline void insertNewMax( __m128i& v_maxSoFar, int32_t x ) {
    insertNewMax_x4( v_maxSoFar, _mm_set1_epi32( x ) );
}
NOTE: The _mm_srli_si128() intrinsic has two properties that are not obvious at first blush. First, the immediate parameter specifies a byte count. Secondly, the 256-bit analog of this function performs the requested shift on the upper and lower halves of the 256-bit register, placing _mm256_srli_si256() firmly in the Set Of Annoying Nonorthogonal x86 SIMD instructions.
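As a quick illustration (this snippet is mine, not from the article’s source), a shift count of 4 moves each int32_t down by exactly one lane, and the 256-bit variant applies that shift to each 128-bit half independently:

__m128i v  = _mm_setr_epi32( 10, 20, 30, 40 );                     // lanes 0..3 = 10, 20, 30, 40
__m128i s  = _mm_srli_si128( v, 4 );                               // lanes 0..3 = 20, 30, 40, 0
__m256i w  = _mm256_setr_epi32( 10, 20, 30, 40, 50, 60, 70, 80 );
__m256i s2 = _mm256_srli_si256( w, 4 );                            // lanes 0..7 = 20, 30, 40, 0, 60, 70, 80, 0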
The above function updates maxSoFar, not in O(lg k) time, but in a few instructions: we broadcast the candidate value across one SIMD register, then compare against the values already in the maxSoFar register. If the candidate is “less than” all of the values in maxSoFar, the result is a mask of 0’s, and the subsequent operations do nothing; but if the candidate needs to be inserted into maxSoFar, the mask is 1’s in the lanes where the maxSoFar values are less than the candidate, then 0’s. The lane where the candidate must be inserted can be isolated with a shift and XOR operation. We then use the VPBLENDVB instruction (wrapped with the _mm_blendv_epi8() intrinsic) to conditionally overwrite maxSoFar, first with a shifted version of itself to make room for the candidate if it will be inserted, then with the candidate value in the lane where it must land.
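As a sanity check, here is a tiny test harness I wrote (it is not part of the original source; it assumes <immintrin.h>, <limits>, <cstdio>, and the insertNewMax functions above are available):

int main()
{
    // lanes 0..3 hold { 2, 5, 7, INT_MAX }: the three maxima so far, plus the sentinel
    __m128i v_maxSoFar = _mm_setr_epi32( 2, 5, 7, std::numeric_limits<int32_t>::max() );
    insertNewMax( v_maxSoFar, 6 );   // 6 belongs between 5 and 7; the 2 falls off the bottom
    alignas(16) int32_t lanes[4];
    _mm_store_si128( (__m128i *) lanes, v_maxSoFar );
    printf( "%d %d %d %d\n", lanes[0], lanes[1], lanes[2], lanes[3] );   // prints: 5 6 7 2147483647
    return 0;
}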
As with most SIMD coding, this code is not work-efficient! It is comparing and shifting and masking and blending 4 SIMD lanes at a time, zeroing in on the lanes that need to be shifted and/or replaced (and if we needed a bigger maxSoFar array, we could rewrite to use AVX2 or AVX512 sized registers, which would hold 8 or 16 values, respectively).
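For the 8-element case, a cross-lane permute has to stand in for the byte shift, since _mm256_srli_si256() only shifts within each 128-bit half. The following is my own sketch of one workable 8-lane formulation - it is not necessarily the insertNewMax_x8() used later in this article - assuming v_maxSoFar is sorted ascending across lanes 0..7:

inline void insertNewMax_x8( __m256i& v_maxSoFar, __m256i x ) {
    // all-ones in every lane whose current value is less than the candidate
    __m256i v_cmplt = _mm256_cmpgt_epi32( x, v_maxSoFar );
    // shift the mask down one lane with vpermd, then clear lane 7 so the top
    // lane can still be an insertion point when the candidate is a new maximum
    const __m256i v_shiftIdx = _mm256_setr_epi32( 1, 2, 3, 4, 5, 6, 7, 7 );
    __m256i v_cmpltShifted = _mm256_permutevar8x32_epi32( v_cmplt, v_shiftIdx );
    v_cmpltShifted = _mm256_blend_epi32( v_cmpltShifted, _mm256_setzero_si256(), 0x80 );
    // isolate the single lane where the candidate must land
    __m256i v_insertion_mask = _mm256_xor_si256( v_cmplt, v_cmpltShifted );
    // shift the values down one lane to make room (the smallest value falls off)
    __m256i v_shift1 = _mm256_permutevar8x32_epi32( v_maxSoFar, v_shiftIdx );
    v_maxSoFar = _mm256_blendv_epi8( v_maxSoFar, v_shift1, v_cmplt );
    v_maxSoFar = _mm256_blendv_epi8( v_maxSoFar, x, v_insertion_mask );
}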
The AVX2 formulation of the third-largest problem, using an SSE-sized (128-bit) maxSoFar array, is as follows:
int32_t
thirdLargest_avx2( const std::vector<int32_t>& v )
{
    int32_t minMax;
    __m128i v_maxSoFar;
    {
        v_maxSoFar = _mm_set_epi32( std::numeric_limits<int32_t>::max(),
                                    std::numeric_limits<int32_t>::min(),
                                    std::numeric_limits<int32_t>::min(),
                                    std::numeric_limits<int32_t>::min() );
        insertNewMax( v_maxSoFar, v[0] );
        insertNewMax( v_maxSoFar, v[1] );
        insertNewMax( v_maxSoFar, v[2] );
    }
    minMax = _mm_cvtsi128_si32( v_maxSoFar );
    size_t i = 3;
    while ( i < v.size() ) {
        if ( v[i] > minMax ) {
            insertNewMax( v_maxSoFar, v[i] );
            minMax = _mm_cvtsi128_si32( v_maxSoFar );
        }
        ++i;
    }
    return minMax;
}
The initialization of v_maxSoFar uses sentinels to ensure the first 3 input values are inserted correctly:
v_maxSoFar = _mm_set_epi32( std::numeric_limits<int32_t>::max(),
                            std::numeric_limits<int32_t>::min(),
                            std::numeric_limits<int32_t>::min(),
                            std::numeric_limits<int32_t>::min() );
For k==2, we’d instead initialize as:
v_maxSoFar = _mm_set_epi32( std::numeric_limits<int32_t>::max(),
                            std::numeric_limits<int32_t>::max(),
                            std::numeric_limits<int32_t>::min(),
                            std::numeric_limits<int32_t>::min() );
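More generally, the low k lanes start at INT32_MIN and the remaining lanes get INT32_MAX sentinels that can never be displaced. A hypothetical helper (mine, not part of the article’s source) that builds the initializer for any k<=4 might look like this:

inline __m128i initMaxSoFar( int k )   // hypothetical helper; valid for 1 <= k <= 4
{
    alignas(16) int32_t lanes[4];
    for ( int lane = 0; lane < 4; lane++ ) {
        lanes[lane] = (lane < k) ? std::numeric_limits<int32_t>::min()
                                 : std::numeric_limits<int32_t>::max();
    }
    return _mm_load_si128( (const __m128i *) lanes );
}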
The code then scans the input array, checking to see if the new candidate element v[i] should be inserted into v_maxSoFar:
minMax = _mm_cvtsi128_si32( v_maxSoFar );
size_t i = 3;
while ( i < v.size() ) {
    if ( v[i] > minMax ) {
        insertNewMax( v_maxSoFar, v[i] );
        minMax = _mm_cvtsi128_si32( v_maxSoFar );
    }
    ++i;
}
Using clang per Godbolt, this code features only 9 AVX2 instructions in the inner loop:
.LBB0_2:
        mov     edi, dword ptr [rcx + 4*rsi]
        cmp     edi, eax
        jle     .LBB0_4
        vmovd   xmm1, edi
        vpbroadcastd xmm1, xmm1
        vpcmpgtd xmm2, xmm1, xmm0
        vpsrldq xmm3, xmm2, 4
        vpsrldq xmm4, xmm0, 4
        vblendvps xmm0, xmm0, xmm4, xmm2
        vpxor   xmm2, xmm3, xmm2
        vpblendvb xmm0, xmm0, xmm1, xmm2
        vmovd   eax, xmm0
        jmp     .LBB0_4
With SIMD, of course, a further optimization is possible: instead of checking for v[i] > minMax, why not check the next eight (8) incoming values to see if any of them require insertion? If none of them do, we can skip ahead by 8 values, not 1. But if any of them do, we have to find the value or values inside the AVX2 register that need to be inserted. Superficially, it seems the likelihood of each new candidate requiring insertion into v_maxSoFar should go down as the algorithm progresses.
The loop structure looks something like this:
__m256i v_minMax = _mm256_broadcastd_epi32( v_maxSoFar );
size_t Nleft = v.size() - 3;
while ( Nleft >= 8 ) {
    __m256i v_next = _mm256_loadu_si256( (__m256i *) (v.data() + i) );
    __m256i v_cmplt = _mm256_cmpgt_epi32( v_next, v_minMax );
    int mask_lt = _mm256_movemask_ps( _mm256_castsi256_ps( v_cmplt ) );
    // if mask_lt has any set bits, some lanes must be inserted into v_maxSoFar
    v_minMax = _mm256_broadcastd_epi32( v_maxSoFar );
    Nleft -= 8;
    i += 8;
}
This loop structure assumes that all 8 lanes of v_next will be processed in each loop iteration (hence the decrement of Nleft and increment of i by 8 at the end of the loop). That needn’t necessarily be the case; we could insert just the “first” qualifying lane (since x86 is little-endian, the least-significant one), and advance the index so the next iteration restarts just after the just-inserted element. Since we are using unaligned loads, the code would work; it would do some redundant comparisons, but those comparisons would be against the updated v_maxSoFar array, so may result in less work being done. (There’s that pesky data dependence again!)
Here’s the loop with that logic filled in. The ffs() function (a GCC built-in) returns 0 for an input of zero; otherwise it returns the position of the least significant set bit, plus one. For example, ffs(12)==3. The resulting code - I call it kthLargest_avx2_least, because it inserts the lowest qualifying lane of the candidate input - looks like this:
while ( Nleft >= 8 ) {
    __m256i v_next = _mm256_loadu_si256( (__m256i *) (v.data() + i) );
    __m256i v_cmplt = _mm256_cmpgt_epi32( v_next, v_minMax );
    int mask_lt = _mm256_movemask_ps( _mm256_castsi256_ps( v_cmplt ) );
    int lsb = ffs( mask_lt );
    if ( lsb ) {
        __m256i v_newCandidate = _mm256_permutevar8x32_epi32( v_next,
            _mm256_broadcastd_epi32( _mm_cvtsi32_si128( lsb-1 ) ) );
        insertNewMax_x8( v_maxSoFar, v_newCandidate );
        minMax = _mm_cvtsi128_si32( _mm256_castsi256_si128( v_maxSoFar ) );
        v_minMax = _mm256_set1_epi32( minMax );
        Nleft -= lsb;
        i += lsb;
    }
    else {
        Nleft -= 8;
        i += 8;
    }
}
Alternatively, if mask_lt is nonzero, indicating that there is at least one lane whose value must be inserted into v_maxSoFar, we can fully process the candidate input, inserting each lane for which the corresponding bit in mask_lt is set. The while loop in this code never executes if mask_lt==0. I call this variant kthLargest_avx2_peel(), because it peels off and inserts each qualifying lane in turn, lowest first:
while ( Nleft >= 8 ) {
    __m256i v_next = _mm256_loadu_si256( (__m256i *) (v.data() + i) );
    __m256i v_cmplt = _mm256_cmpgt_epi32( v_next, v_minMax );
    int mask_lt = _mm256_movemask_ps( _mm256_castsi256_ps( v_cmplt ) );
    int mask_done = -1;
    while ( 0 != (mask_lt & mask_done) ) {
        int lsb = ffs( mask_done & mask_lt );
        __m128i v_newCandidate = _mm256_castsi256_si128(
            _mm256_permutevar8x32_epi32( v_next,
                _mm256_broadcastd_epi32( _mm_cvtsi32_si128( lsb-1 ) ) ) );
        insertNewMax_x4( v_maxSoFar, v_newCandidate );
        mask_done = ~((1<<lsb)-1);
    }
    v_minMax = _mm256_broadcastd_epi32( v_maxSoFar );
    Nleft -= 8;
    i += 8;
}
A final variant would unconditionally insert all 8 lanes, if any of them would trigger insertion. This variant relies heavily on the insertNewMax_x4() function being a NOP for inputs that should not be inserted. The code uses a lambda to avoid code duplication:
while ( Nleft >= 8 ) {
    __m256i v_next = _mm256_loadu_si256( (__m256i *) (v.data() + i) );
    __m256i v_cmplt = _mm256_cmpgt_epi32( v_next, v_minMax );
    int mask_lt = _mm256_movemask_ps( _mm256_castsi256_ps( v_cmplt ) );
    auto insert_lane = [&v_maxSoFar, v_next]( int i ) {
        __m128i v_nexti = _mm256_castsi256_si128(
            _mm256_permutevar8x32_epi32( v_next,
                _mm256_broadcastd_epi32( _mm_cvtsi32_si128( i ) ) ) );
        insertNewMax_x4( v_maxSoFar, v_nexti );
    };
    if ( mask_lt ) {
        insert_lane( 0 );
        insert_lane( 1 );
        insert_lane( 2 );
        insert_lane( 3 );
        insert_lane( 4 );
        insert_lane( 5 );
        insert_lane( 6 );
        insert_lane( 7 );
    }
    v_minMax = _mm256_broadcastd_epi32( v_maxSoFar );
    Nleft -= 8;
    i += 8;
}
All of these AVX2 implementations have the following properties in common:
- they hold the maxSoFar ‘array’ in a SIMD register (which we call v_maxSoFar);
- they can insert a candidate value into the correct lane of v_maxSoFar in a fixed handful of instructions - no looping required;
- the technique works for any k<=8, because an AVX2 register can hold up to 8x int32_t.
For k>8, multiple AVX2 registers may be used (up to 16, if running in 64-bit mode); or AVX512 may be used, doubling the number of elements per register to 16 and the number of registers to 32. If worse comes to worst, an array of __m256i or __m512i elements can be used, but for any k<=512, an AVX512 implementation could keep v_maxSoFar in registers.
Next: Let’s find out whether all (any?) of this SIMD optimization was worth the effort!
Performance Analysis
We measure performance of the scalar implementations as well as the AVX2 implementations: the register formulation thirdLargest(); kthLargest_heap() and kthLargest_sort() with k==3, even though those can reasonably be expected to perform similarly; and the AVX2 implementations described in the previous section:
thirdLargest_avx2() inserts each input element in turn,
thirdLargest_avx2_least() reads the next 8 input elements and, if any need to be inserted, inserts the first and advances the array index to point to the next, and
thirdLargest_avx2_peel() processes 8 input elements at a time, inserting each element that was greater than the “minimum maximum”. Note, depending on the inputs, this formulation performs some redundant work, since processing elements earlier in the input array may obviate the need to process some subsequent elements.
For random inputs, across ten (10) runs, the various implementations run at the following speeds (in gigaelements per second):
nth - nth_element() STL
heap - kthElement_heap<3>
sort - kthElement_sort<3>
swap - thirdElement<false>
dlgt - thirdElement<true>, the Hacker's delight conditional swap
avx2 - naive AVX2 implementation, uses an AVX2 register only for v_maxSoFar
_x8 - AVX2 implementation that checks 8 elements at a time, and skips ahead by 8 if none of the incoming 8 candidates need to be inserted into v_maxSoFar
peel - AVX2 implementation that checks 8 elements at a time and inserts each incoming candidate that needs to be inserted into v_maxSoFar
least - AVX2 implementation that checks 8 elements at a time and inserts the first into v_maxSoFar, then restarts the scan at the next element
Run nth heap sort swap dlgt avx2 _x8 peel least
1 0.27 5.43 5.43 2.76 2.75 2.75 14.55 14.41 14.51
2 0.90 5.44 5.43 2.75 2.76 2.76 14.45 14.50 14.47
3 0.28 5.43 5.43 2.75 2.76 2.75 14.62 14.49 14.61
4 0.41 5.43 5.44 2.75 2.75 2.75 14.55 14.42 14.41
5 0.25 5.43 5.44 2.75 2.76 2.75 14.51 14.41 14.44
6 0.48 5.43 5.44 2.75 2.76 2.76 14.64 14.33 14.52
7 0.69 5.43 5.44 2.75 2.76 2.76 14.62 14.45 14.56
8 0.45 5.43 5.44 2.75 2.76 2.75 14.60 14.46 14.43
9 0.43 5.44 5.42 2.75 2.75 2.75 14.69 14.49 14.51
10 0.29 5.44 5.44 2.76 2.76 2.76 14.45 14.34 14.55
Some takeaways from this performance data:
- nth_element() exhibits the most variable performance, ranging from about 0.25 to 0.90 billion elements per second on a full-sized data set (average 0.44, standard deviation 0.20).
- The benefits of sweeping the input array are clear: even the slowest implementation is about 10x faster than nth_element().
- The AVX2 implementations that “look ahead,” and skip 8 elements at a time when possible, all are about 50x faster than nth_element() and about 5x faster than thirdElement().
- The futility of the Hacker’s Delight cleverness (dlgt as opposed to swap) is laid bare: it is no faster than the ‘naive’ swap, which is much more readable.
- Interestingly, the generic kthElement_heap() and kthElement_sort() both run at the same speed, and both are noticeably faster than the thirdElement() versions. One would expect the implementations specific to k==3 to be a bit faster!
For reverse-sorted input (minimum work per element), the results are similar:
Run nth heap sort swap dlgt avx2 _x8 peel least
1 2.03 5.42 5.43 2.76 2.76 2.76 14.54 14.56 14.46
2 2.08 5.42 5.42 2.75 2.76 2.75 14.67 14.59 14.38
3 2.08 5.42 5.43 2.76 2.75 2.76 14.59 14.47 14.29
4 2.08 5.41 5.43 2.76 2.75 2.75 14.52 14.44 14.44
5 2.08 5.42 5.43 2.76 2.75 2.75 14.67 14.57 14.34
6 2.08 5.40 5.42 2.75 2.75 2.75 14.65 14.27 14.43
7 2.08 5.43 5.42 2.75 2.75 2.75 14.62 14.59 14.47
8 2.08 5.42 5.43 2.75 2.75 2.76 14.71 14.49 14.48
9 2.07 5.41 5.42 2.75 2.76 2.75 14.59 14.47 14.41
10 2.08 5.42 5.42 2.75 2.76 2.76 14.53 14.52 14.48
The main takeaways from this data are:
- The scalar, and especially the AVX2 implementations that “look ahead,” barely run any faster than on randomized data; this is a surprise, since they should be running through the input, detecting that no updates to v_maxSoFar are needed after the initial setup.
- Intriguingly, nth_element() is both more level (not as much variance in speed) and markedly faster; is that because a median-of-three pivot selection on a reverse-sorted array amounts to pivoting on the actual median every time?
For sorted input (maximum work per element), the timing results are even more perplexing:
Run nth heap sort swap dlgt avx2 _x8 peel least
1 1.82 0.27 1.20 2.10 2.09 0.97 1.00 0.62 0.20
2 1.82 0.27 1.17 2.09 2.11 0.98 1.00 0.62 0.20
3 1.82 0.27 1.17 2.10 2.10 0.98 1.00 0.62 0.20
4 1.82 0.27 1.20 2.10 2.09 0.97 1.00 0.62 0.20
5 1.82 0.27 1.17 2.10 2.10 0.98 1.00 0.62 0.20
6 1.82 0.27 1.20 2.10 2.09 0.98 1.00 0.62 0.20
7 1.82 0.27 1.20 2.07 2.07 0.98 1.00 0.62 0.20
8 1.82 0.27 1.17 2.09 2.06 0.98 1.00 0.62 0.20
9 1.82 0.27 1.20 2.06 2.06 0.97 1.00 0.62 0.20
10 1.82 0.27 1.20 2.10 2.10 0.98 1.00 0.62 0.20
Holy smokes! On this CPU, on an already-sorted input, the STL’s nth_element() is faster than any of the AVX2 implementations! Why might that be? One explanation is that the clever handful of AVX2 instructions needed to insert a new candidate into v_maxSoFar is slower than the instructions needed to rearrange the tiny 3-element arrays on the stack for kthElement_sort() and kthElement_heap(). Note that since a new maximum is being inserted into the tiny 3-element heap or sorted array on every iteration, the CPU likely is executing few, if any, mispredicted branches.
Another explanation for the performance difference is that SIMD instructions typically cause CPUs to reduce their clock rate - an argument against a hybrid approach that would do the “lookahead” eight elements at a time, and use scalar instructions to rearrange the maxSoFar array. I haven’t measured the slowdown for the CPU used for these measurements (an 8-core AMD Ryzen 7 7700X, which is one of AMD’s earlier AVX512-capable CPUs).
Conclusions
I’ll leave it to you to decide whether the SIMD implementation was worth pursuing; the workload is synthetic, after all. For random and reverse-sorted inputs (where insertions into the top 3 become rare early in the scan), the AVX2 implementations are 5-7x faster. For already-sorted inputs, their performance suffers, in ways that bear investigation.
The source code for these articles has been pushed to the parallelprogrammer repository on GitHub.
If you want to play with holding v_maxSoFar in multiple registers, investigating the perverse behavior on sorted input, and/or porting the code to AVX512, have at it! And let me know how it goes.