
maybereorderthread-without-payload-shrink

Status: shipped (Phase 7) -- see CHANGELOG.

(via ADR 0010)

What it detects

A dx::MaybeReorderThread(...) call whose surrounding payload struct carries fields that are dead across the reorder: values written before the reorder that are neither consumed by the reorder's downstream invocation nor read after it. The Phase 7 IR-level live-range analysis (shared with live-state-across-traceray) walks per-lane lifetimes across the reorder and flags those fields.

Why it matters on a GPU

SER's runtime spills the entire ray payload at the reorder point: MaybeReorderThread reorganises lanes, so the per-lane state (the payload, plus any caller-side live registers) has to follow each lane to its new position. NVIDIA's Indiana Jones path-tracer case study quantified this: the reorder's spill traffic is proportional to live-state size, and the case study reported 10-25% performance gains from shrinking the payload from 64 bytes to 16 bytes around the reorder, even when the larger payload was needed before and after.

The pattern is "fat across the trace, lean across the reorder": fill the payload once before the trace, decode it into a lean record before the reorder, run the reorder and the invoke on the lean record, then refill the fat payload afterwards. The canonical fix is to migrate the fields that the Phase 7 IR-level analysis identifies to a side-buffer indexed by DispatchRaysIndex().

The rule is suggestion-tier because the side-buffer migration changes the application's read pattern; the diagnostic ranks the offending fields by per-lane byte cost and emits the candidate refactor as a comment.

This is research-grade per ADR 0010's Phase 7 placement; the rule ships alongside the existing live-state-across-traceray rule and relies on the same IR-reader infrastructure.

Examples

Bad

hlsl
// 60-byte payload (3 x float3 + float4 + uint + float) spilled across the reorder;
// only `radiance` is read after.
struct FatPayload {
    float3 radiance;
    float3 worldPos;
    float3 worldNormal;
    float4 debugColor;
    uint   bounceFlags;
    float  pdf;
};

[shader("raygeneration")]
void RayGen() {
    FatPayload p = (FatPayload)0;
    dx::HitObject hit = dx::HitObject::TraceRay(g_BVH, /*...*/, p);
    dx::MaybeReorderThread(hit);          // spills 60 bytes per lane
    dx::HitObject::Invoke(hit, p);        // Invoke is a static method taking the hit object
    g_Output[dispatchIndex] = p.radiance; // only radiance is read after
}

Good

hlsl
// 16-byte payload across the reorder; debug + worldPos / worldNormal in side-buffer.
struct LeanPayload {
    float3 radiance;
    float  pdf;
};

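// Debug fields live here instead of in the payload, so they never cross the reorder.
RWStructuredBuffer<DebugRecord> g_DebugSide : register(u3);

[shader("raygeneration")]
void RayGen() {
    LeanPayload p = (LeanPayload)0;
    dx::HitObject hit = dx::HitObject::TraceRay(g_BVH, /*...*/, p);
    dx::MaybeReorderThread(hit);          // spills 16 bytes per lane
    dx::HitObject::Invoke(hit, p);        // Invoke is a static method taking the hit object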
    g_Output[dispatchIndex] = p.radiance;
}
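
The Good example shows only the ray-generation side. As a minimal sketch of the other half of the side-buffer migration, the closest-hit shader below writes the migrated fields to a buffer slot derived from DispatchRaysIndex(). The DebugRecord layout, the SideBufferIndex helper, the register slot, and the placeholder values are assumptions made for illustration; they are not taken from the rule's implementation or from any particular renderer.

```hlsl
// Hypothetical side-buffer record: the fields dropped from FatPayload.
struct DebugRecord {
    float3 worldPos;
    float3 worldNormal;
    float4 debugColor;
};

// Same 16-byte payload as in the Good example.
struct LeanPayload {
    float3 radiance;
    float  pdf;
};

// Register slot is an assumption; match it to your root signature.
RWStructuredBuffer<DebugRecord> g_DebugSide : register(u3);

// Hypothetical helper: flatten the 2D launch index so that each ray
// owns exactly one slot in the side-buffer.
uint SideBufferIndex() {
    uint2 idx = DispatchRaysIndex().xy;
    return idx.y * DispatchRaysDimensions().x + idx.x;
}

[shader("closesthit")]
void ClosestHit(inout LeanPayload p,
                in BuiltInTriangleIntersectionAttributes attr) {
    DebugRecord rec;
    rec.worldPos    = WorldRayOrigin() + RayTCurrent() * WorldRayDirection();
    rec.worldNormal = -WorldRayDirection();               // stand-in for a real shading normal
    rec.debugColor  = float4(attr.barycentrics, 0.0f, 1.0f);

    // The record goes to memory here, after the reorder, so it never
    // has to travel with the lane across MaybeReorderThread.
    g_DebugSide[SideBufferIndex()] = rec;

    // Only the fields that are read after Invoke stay in the payload.
    p.radiance = float3(1.0f, 1.0f, 1.0f);                // stand-in shading result
    p.pdf      = 1.0f;
}
```

Any pass that needs the migrated fields can read them back from g_DebugSide with the same index once Invoke has returned. This assumes one slot per launch index is enough; a path tracer that needs per-bounce records would have to extend the indexing.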

Options

  • min-shrink-bytes (integer, default: 16): minimum estimated savings, in bytes, before the rule fires.

Fix availability

suggestion — The fix is a structural refactor (side-buffer migration). The diagnostic ranks the offending fields by per-lane byte cost.



© 2026 NelCit — Apache-2.0 (code), CC-BY-4.0 (docs).