coopvec-fp4-fp6-blackwell-layout

Status: shipped (Phase 8) — see CHANGELOG.

What it detects

A cooperative-vector matrix operation (MatrixMul, MatrixVectorMul, OuterProductAccumulate) whose element type is FP4 or FP6 (COMPONENT_TYPE_FLOAT_E2M1, COMPONENT_TYPE_FLOAT_E3M2, COMPONENT_TYPE_FLOAT_E2M3) and whose matrix uses a non-OPTIMAL layout. The rule runs only when the configuration sets target = "blackwell" under [experimental].

Why it matters on a GPU

NVIDIA Blackwell's 5th-generation Tensor Cores support FP4 and FP6 natively, and their optimal matrix layout differs from the one used for FP8 on Hopper. According to the Blackwell Architecture v1.1 white paper and the microbenchmarking paper arXiv 2512.02189, FP4/FP6 reaches ~96.3% of theoretical peak only when the matrix is in MATRIX_LAYOUT_INFERENCING_OPTIMAL (or TRAINING_OPTIMAL); a row-major or column-major layout pays a swizzle cost on every fetch.

Options

None. The rule is active only under [experimental] target = "blackwell".

Fix availability

suggestion: switch the matrix to MATRIX_LAYOUT_INFERENCING_OPTIMAL or TRAINING_OPTIMAL.
