groupshared-write-then-no-barrier-read

Status: shipped (Phase 4) — see CHANGELOG.

What it detects

Compute shader code paths where a thread writes a groupshared location and a different thread (or the same thread on a subsequent iteration of a loop with cross-iteration dependence) reads from the same array region without a GroupMemoryBarrierWithGroupSync or AllMemoryBarrierWithGroupSync between the write and the read. The rule analyses index expressions to determine when reads can target a slot another thread has written: writes indexed by SV_GroupIndex followed by reads indexed by anything other than the same SV_GroupIndex (a neighbour offset, a constant, a transposed coordinate) are the canonical hits.

Why it matters on a GPU

The D3D12 compute model guarantees that within a thread group, all threads share LDS / groupshared memory, but it does NOT guarantee any ordering between threads' memory operations unless an explicit barrier is issued. On AMD RDNA 2/3, threads in a wave run in lock-step, but threads in different waves of the same group are scheduled independently and can be hundreds of cycles apart at any given instruction; without a barrier, a read from another wave's slot may return the previous frame's value, the LDS-uninitialised pattern (zero on AMD, undefined on NVIDIA), or a torn write halfway through a non-atomic store. On NVIDIA Turing / Ada, the same applies across warps in a thread block: the SM scheduler issues warps in arbitrary order and the L1/SHM coherence boundary is the explicit barrier instruction. On Intel Xe-HPG, the EU thread scheduler likewise serialises across barrier points only.

The bug class is insidious because the wave size happens to align with the access pattern in many cases. A reduction algorithm that writes Sum[gi] = value and then reads Sum[gi ^ 1] works correctly on hardware where the wave size is at least the group size — both writer and reader are in the same wave and lock-step execution provides accidental ordering — but breaks on hardware with a smaller wave size where the writer and reader land in different waves. RDNA 2/3 supports both wave32 and wave64 modes selected by [WaveSize(32)] or compiler heuristics; a shader that "works" on one wave size can deadlock or produce wrong output when the driver chooses the other. The bug also manifests after a driver upgrade that changes wave-size selection.

The fix is straightforward: insert GroupMemoryBarrierWithGroupSync() between the write phase and the read phase. The barrier is a single instruction that costs roughly 8-20 cycles depending on group size — modest compared to the cost of any reduction or transpose pass — and provides the cache-flush + execution-fence semantics required for the cross-thread visibility. For algorithms that perform many alternating write/read passes (like log-step reductions or radix sort phases), each phase boundary needs its own barrier; omitting any of them reintroduces the race.

Examples

Bad

hlsl

groupshared float g_Sum[64];

[numthreads(64, 1, 1)]
void cs_reduce(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex) {
    g_Sum[gi] = Input[dtid.x];
    // No barrier — readers in another wave (or even the same wave under a
    // future scheduler change) may see uninitialised LDS values.
    if (gi < 32) {
        g_Sum[gi] += g_Sum[gi + 32];
    }
}

Good

hlsl

groupshared float g_Sum[64];

[numthreads(64, 1, 1)]
void cs_reduce_safe(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex) {
    g_Sum[gi] = Input[dtid.x];
    // Make the writes globally visible before any thread reads a neighbour slot.
    GroupMemoryBarrierWithGroupSync();
    if (gi < 32) {
        g_Sum[gi] += g_Sum[gi + 32];
    }
    // Each subsequent reduction step needs its own barrier.
    GroupMemoryBarrierWithGroupSync();
    if (gi < 16) g_Sum[gi] += g_Sum[gi + 16];
    GroupMemoryBarrierWithGroupSync();
    if (gi <  8) g_Sum[gi] += g_Sum[gi +  8];
    // ... and so on.
}

Options

none

Fix availability

none — Inserting a barrier changes execution timing and program structure. Although the canonical fix is a single barrier insertion at the write/read boundary, identifying the correct insertion point requires understanding the producer/consumer schedule of the algorithm. The diagnostic flags the write site, the read site, and the absence of an intervening barrier on the dominator path between them.

groupshared-write-then-no-barrier-read ​

What it detects ​

Why it matters on a GPU ​

Examples ​

Bad ​

Good ​

Options ​

Fix availability ​

See also ​