coopvec-stride-mismatch
Status: shipped (Phase 3) — see CHANGELOG.
(via ADR 0010)
What it detects
A cooperative-vector matrix-load call (MatrixMul, MatrixVectorMul, OuterProductAccumulate) whose constant-folded stride argument does not equal the natural row-stride implied by the matrix dimensions and the component type (rows * sizeof(component) or cols * sizeof(component) depending on layout). The SM 6.9 cooperative-vector specification requires the stride to match the matrix layout exactly when the layout enum is not OPTIMAL; mismatches produce undefined behaviour because the matrix-engine fetcher walks the wrong number of bytes per row.
Why it matters on a GPU
When a cooperative-vector call uses a generic row-major or column-major layout, the matrix engine on each IHV (Ada tensor cores, RDNA 3/4 WMMA, Xe-HPG XMX) walks the source buffer using the stride argument as the per-row byte advance. The engine assumes the stride is the natural one for the matrix shape and component type; if it isn't, the engine reads garbage bytes from outside the matrix or from the wrong row, and produces NaN-laced or zero results. There is no error signalled at runtime — the tensor engine has no concept of buffer bounds beyond what the stride tells it.
DXC's validator catches the simplest forms (literal-stride mismatch on a literal-shape matrix) but misses forms where the stride or shape is computed. Catching this at lint time uses constant-folding over the AST to recover the literal stride and shape and verifies the relationship, surfacing the mismatch with a precise diagnostic before the developer hits the silent-NaN failure mode.
For OPTIMAL layouts (MATRIX_LAYOUT_INFERENCING_OPTIMAL / MATRIX_LAYOUT_TRAINING_OPTIMAL), the stride argument is ignored by the runtime; the matrix engine uses the vendor-swizzle layout's intrinsic step. The rule does not fire in that case.
Examples
Bad
// 16x16 row-major float matrix should have stride = 64 (16 * sizeof(float)),
// not 32. Tensor engine misreads.
ByteAddressBuffer g_Weights : register(t0);
[numthreads(32, 1, 1)]
void main(uint tid : SV_DispatchThreadID) {
using namespace dx::linalg;
vector<float, 16> input = LoadInput(tid);
vector<float, 16> output;
MatrixVectorMul(output, input,
g_Weights, /*offset*/ 0, /*stride*/ 32, // WRONG
MATRIX_LAYOUT_ROW_MAJOR);
}Good
// Stride matches the row width.
ByteAddressBuffer g_Weights : register(t0);
[numthreads(32, 1, 1)]
void main(uint tid : SV_DispatchThreadID) {
using namespace dx::linalg;
vector<float, 16> input = LoadInput(tid);
vector<float, 16> output;
MatrixVectorMul(output, input,
g_Weights, /*offset*/ 0, /*stride*/ 64,
MATRIX_LAYOUT_ROW_MAJOR);
}Options
none
Fix availability
none — The right stride depends on the upload-side intent (which dimension is the row); the diagnostic names the mismatch and the expected value.
See also
- Related rule: coopvec-base-offset-misaligned — companion validation
- Related rule: coopvec-non-optimal-matrix-layout — perf rule for non-optimal layouts
- Related rule: coopvec-fp8-with-non-optimal-layout — companion validation
- HLSL specification: proposal 0029 Cooperative Vector
- Companion blog post: cooperative-vector overview
© 2026 NelCit, CC-BY-4.0.