coopvec-non-uniform-matrix-handle
Status: shipped (Phase 4) — see CHANGELOG.
(via ADR 0010)
What it detects
A cooperative-vector matrix-multiply call (MatrixMul, MatrixVectorMul, OuterProductAccumulate) whose matrix-handle, base-offset, stride, or interpretation argument is wave-divergent. The SM 6.9 cooperative-vector spec marks these arguments as preferring uniform values; non-uniform arguments either serialise across the wave (perf) or, on stricter implementations, produce undefined behaviour. The Phase 4 uniformity analysis (shared with wave-active-all-equal-precheck and cbuffer-divergent-index) tracks divergence on each argument.
Why it matters on a GPU
The cooperative-vector matrix engine on every supporting IHV (Ada tensor cores, RDNA 3/4 WMMA, Xe-HPG XMX) executes one matmul per wave, drawing operands from one source matrix per call. The matrix handle, offset, stride, and interpretation arguments parameterise that single matmul; the engine fetches operands once for the whole wave and broadcasts them to lanes.
When any of those arguments is divergent across the wave — different lanes pointing at different matrices, different offsets, different layouts — the engine cannot service the call as a single matmul. On NVIDIA Ada Lovelace's tensor cores, the driver serialises by re-issuing the matmul once per unique tuple of arguments, multiplying the cost by the number of distinct argument tuples in the wave. On AMD RDNA 3/4 WMMA, the same serialisation pattern applies; on Intel Xe-HPG XMX, the implementation may also reject the call as undefined. NVIDIA's cooperative-vector blog cites a 4-32x cost cliff when the matrix handle is divergent across a wave.
The fix is to ensure the matrix handle / offset / stride / interpretation arguments are uniform — typically by hoisting them out of any branch and proving they're computed from cbuffer or wave-uniform values. The diagnostic names the offending argument and the divergence source.
Examples
Bad
// Matrix offset is per-lane (depends on tid) — wave serialises.
ByteAddressBuffer g_Weights : register(t0);
[numthreads(32, 1, 1)]
void main(uint tid : SV_DispatchThreadID) {
using namespace dx::linalg;
vector<float, 16> input = LoadInput(tid);
vector<float, 16> output;
uint matOffset = tid * 64; // per-lane — divergent
MatrixVectorMul(output, input,
g_Weights, matOffset, 64,
MATRIX_LAYOUT_INFERENCING_OPTIMAL);
}Good
// Matrix offset is uniform across the wave.
ByteAddressBuffer g_Weights : register(t0);
cbuffer Cfg : register(b0) { uint g_MatOffset; }
[numthreads(32, 1, 1)]
void main(uint tid : SV_DispatchThreadID) {
using namespace dx::linalg;
vector<float, 16> input = LoadInput(tid);
vector<float, 16> output;
MatrixVectorMul(output, input,
g_Weights, g_MatOffset, 64,
MATRIX_LAYOUT_INFERENCING_OPTIMAL);
}Options
none
Fix availability
suggestion — Hoisting the argument out of a divergent expression is sometimes mechanical, but more often requires restructuring the calling code. The diagnostic names the divergence source and emits a candidate rewrite when the hoist is straightforward.
See also
- Related rule: wave-active-all-equal-precheck — same uniformity analysis on a different surface
- Related rule: cbuffer-divergent-index — same analysis on cbuffer indexing
- Related rule: coopvec-non-optimal-matrix-layout — companion cooperative-vector rule
- HLSL specification: proposal 0029 Cooperative Vector
- Companion blog post: cooperative-vector overview
© 2026 NelCit, CC-BY-4.0.