wave-prefix-sum-vs-scan-with-atomics

Status: stub. The full-length analysis is queued for a v1.0.x patch release per ADR 0018, section 5, criterion #6. The companion rule page at docs/rules/wave-prefix-sum-vs-scan-with-atomics.md contains the canonical detection logic and GPU reasoning.

TL;DR

Prefix-sum scan is the second-most-common cross-lane primitive in compute shaders (after WaveActiveSum). Every modern GPU provides dedicated cross-lane hardware for it: AMD RDNA 2/3 issues WavePrefixSum through DPP (Data-Parallel Primitives) in log₂(wave_size) cross-lane steps (5 on a wave32 RDNA 3, 6 on a wave64 RDNA 2); NVIDIA Turing and Ada Lovelace expose the equivalent through warp shuffles, also 5 steps per 32-lane warp; Intel Xe-HPG's subgroup prefix-scan completes in log₂(subgroup_size) steps on the cross-lane unit. The HLSL WavePrefixSum intrinsic, an exclusive scan over the active lanes of a wave, compiles to those primitives directly; from the shader's point of view it needs no groupshared memory and no barriers.
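As a rough illustration of the intrinsic in isolation, the sketch below scans one value per lane and derives the inclusive result from the exclusive one. The resource names (g_Values, g_Exclusive, g_Inclusive) and the 32-thread group size are hypothetical choices for this example, not taken from the rule page. It assumes Shader Model 6.0 or later.

```hlsl
// Minimal sketch, not the canonical rule-page example.
// Resource names and the group size are hypothetical.
StructuredBuffer<uint>   g_Values    : register(t0);
RWStructuredBuffer<uint> g_Exclusive : register(u0);
RWStructuredBuffer<uint> g_Inclusive : register(u1);

[numthreads(32, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint value = g_Values[dtid.x];

    // Exclusive scan: each lane receives the sum of `value` over all
    // active lanes with a lower lane index; the first active lane gets 0.
    uint exclusive = WavePrefixSum(value);

    // An inclusive scan only needs the lane's own contribution added back.
    uint inclusive = exclusive + value;

    g_Exclusive[dtid.x] = exclusive;
    g_Inclusive[dtid.x] = inclusive;
}
```

For lanes holding 3, 1, 4, 1 the exclusive results are 0, 3, 4, 8 and the inclusive results are 3, 4, 8, 9. The scan covers a single wave only: on hardware whose wave is narrower than the thread group, the running sum restarts at each wave boundary, which is why a cross-wave step is needed for workgroup-wide scans.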

What the rule fires on

A hand-rolled prefix sum (exclusive or inclusive scan) implemented as a multi-pass groupshared-plus-barrier sequence. Pattern shapes detected: (a) a Hillis–Steele ladder of the form for (uint stride = 1; stride < N; stride <<= 1) { if (gi >= stride) g_Scan[gi] += g_Scan[gi - stride]; GroupMemoryBarrierWithGroupSync(); }, (b) a Blelloch up-sweep / down-sweep with the equivalent barrier ladder, and (c) any scan implemented as a sequence of InterlockedAdd calls against a running counter, where lanes consume monotone slot indices. All three patterns can be replaced by WavePrefixSum within a wave (adding the lane's own value when an inclusive scan is needed) plus at most one barrier-and-broadcast step across the waves of a workgroup.
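The replacement described above could look roughly like the following workgroup-wide exclusive scan. It is a sketch under stated assumptions, not the rule page's canonical fix: GROUP_SIZE, g_Input, g_Output and g_WaveTotals are hypothetical names, the input length is assumed to be a multiple of GROUP_SIZE, and the wave index is derived as SV_GroupIndex / WaveGetLaneCount(), which assumes lanes are packed into waves in linear group-index order (typical for 1D thread groups, but not guaranteed by the HLSL specification).

```hlsl
// Sketch only: all names below are hypothetical.
#define GROUP_SIZE 256
// WaveGetLaneCount() is at least 4, so this bounds the waves per group.
#define MAX_WAVES  (GROUP_SIZE / 4)

StructuredBuffer<uint>   g_Input  : register(t0);
RWStructuredBuffer<uint> g_Output : register(u0);

groupshared uint g_WaveTotals[MAX_WAVES];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint gi : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID)
{
    uint value = g_Input[dtid.x];

    // Step 1: scan inside the wave, entirely in registers.
    uint wavePrefix = WavePrefixSum(value);   // exclusive within the wave
    uint waveTotal  = WaveActiveSum(value);   // sum over the whole wave

    // ASSUMPTION: lanes are packed into waves in SV_GroupIndex order.
    uint waveIndex = gi / WaveGetLaneCount();

    // Step 2: one lane per wave publishes its wave's total.
    if (WaveIsFirstLane())
        g_WaveTotals[waveIndex] = waveTotal;

    // The single barrier: make every wave's total visible to the group.
    GroupMemoryBarrierWithGroupSync();

    // Step 3: add the totals of all preceding waves to the in-wave prefix.
    uint waveBase = 0;
    for (uint w = 0; w < waveIndex; ++w)
        waveBase += g_WaveTotals[w];

    g_Output[dtid.x] = waveBase + wavePrefix;
}
```

Compared with the ladder in pattern (a), which issues one barrier per stride (8 for N = 256), this version issues exactly one. The result is scanned per workgroup only; carrying sums across workgroups is a separate concern that this sketch does not address.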

See the What it detects section of the rule page for the full pattern definition.

Why it matters

The full GPU-mechanism analysis lives in the Why it matters on a GPU section of the companion rule page.

Examples

The bad / good code snippets are kept canonical on the rule page; see wave-prefix-sum-vs-scan-with-atomics.md -> Examples.

See also


This is a v1.0-ship stub. Full analysis pending; track issue link TBD.

© 2026 NelCit — Apache-2.0 (code), CC-BY-4.0 (docs).