interlocked-bin-without-wave-prereduce
Status: shipped (Phase 4) — see CHANGELOG.
What it detects
Calls to InterlockedAdd, InterlockedOr, or InterlockedXor against a groupshared array or a UAV buffer with a small fixed bin count (≤ 32 indices) where the index is computed per-thread but the rule can prove the bin set is small enough that wave-level pre-reduction with WaveActiveSum or WavePrefixSum would coalesce most of the contention. Histograms with up to 32 bins, summary counters, per-material lane counters, and per-cluster occupancy maps are the canonical examples.
Why it matters on a GPU
InterlockedAdd to LDS / groupshared memory issues a ds_add_u32 (AMD RDNA) or ATOMS.ADD (NVIDIA) instruction per active lane, but the hardware serialises atomic ops that target the same address. On AMD RDNA 2/3, LDS atomics to one bank-aligned address are issued at one-per-clock from the LDS arbiter; if 32 lanes target one bin, the operation takes 32 clocks rather than the single clock a non-conflicting batch would take. NVIDIA Turing and Ada serialise warp-wide shared-memory atomics through the Locked Atomic Throughput pipe, with similar one-per-clock per-bank serialisation. Intel Xe-HPG's SLM atomics serialise through the L1$ bank arbiter at comparable rates. UAV atomics are even worse because they cross the L2 / memory hierarchy.
When the bin set is small (think 16 luminance buckets for an exposure histogram, 32 material IDs for a clustering pass, 8 visibility flags), almost every wave will have multiple lanes contributing to the same bin. The pre-reduction pattern collapses these into a single atomic per (bin, wave) pair: each lane builds a per-lane bin index, the wave uses WaveMatch(bin) (SM 6.5+) or a WaveActiveBallot cascade to find the set of lanes contributing to each unique bin, one designated lane sums the contributions with WaveActiveSum, and only that lane issues the atomic. For a 32-lane wave with a 16-bin histogram, the worst-case atomic count drops from 32 per wave to at most 16 per wave; the typical case with skewed bin distributions drops to 1-4 per wave.
The performance impact is dramatic on real workloads. Auto-exposure histogram passes on a 1080p frame typically issue 8 million LDS atomics; the pre-reduced form drops this to under 200k by aggregating contributions across each 32-lane wave. Measured on RDNA 2 (RX 6800) the reduction is from 1.4 ms to 0.2 ms for a typical luminance histogram. The same pattern appears in light-list builds, decal binning, and visibility-buffer compaction. The cost of the pre-reduction itself — one WaveMatch or one WaveActiveBallot per bin — is amortised against eliminating dozens of serialised atomics.
Examples
Bad
groupshared uint g_LumHistogram[16];
[numthreads(64, 1, 1)]
void cs_histogram(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex) {
if (gi < 16) g_LumHistogram[gi] = 0;
GroupMemoryBarrierWithGroupSync();
float lum = ComputeLuminance(dtid);
uint bin = clamp((uint)(lum * 16.0), 0u, 15u);
// Worst case: 64 lanes target up to 16 bins — atomics serialise heavily.
InterlockedAdd(g_LumHistogram[bin], 1u);
}Good
groupshared uint g_LumHistogram[16];
[numthreads(64, 1, 1)]
void cs_histogram_prereduce(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex) {
if (gi < 16) g_LumHistogram[gi] = 0;
GroupMemoryBarrierWithGroupSync();
float lum = ComputeLuminance(dtid);
uint bin = clamp((uint)(lum * 16.0), 0u, 15u);
// SM 6.5+: WaveMatch returns a 4-uint mask of lanes sharing this bin.
uint4 peers = WaveMatch(bin);
uint count = countbits(peers.x) + countbits(peers.y)
+ countbits(peers.z) + countbits(peers.w);
// One designated lane per unique bin issues the atomic for the whole wave.
if (WaveMultiPrefixCountBits(true, peers) == 0) {
InterlockedAdd(g_LumHistogram[bin], count);
}
}Options
none
Fix availability
suggestion — The pre-reduced form is semantically equivalent for the additive case but rewrites the shader's structure. The diagnostic flags the atomic and recommends the WaveMatch / WaveActiveSum template; a human must select the right wave intrinsic for their target SM.
See also
- Related rule: interlocked-float-bit-cast-trick — float atomics by hand-rolled bit cast vs. SM 6.6 native
- Related rule: groupshared-stride-32-bank-conflict — bank conflicts in groupshared memory
- Related rule: wave-intrinsic-non-uniform — wave intrinsic uniformity hazards
- HLSL intrinsic reference:
WaveMatch,WaveActiveSum,InterlockedAddin the DirectX HLSL Intrinsics documentation - Companion blog post: workgroup overview
© 2026 NelCit, CC-BY-4.0.