vllm: [Bug][DSV4][dynamo] nvidia/ops/attention.py wo_a access is dynamo-unsafe; forces --enforce-eager on Option-Y MTP artifacts (~10× decode slowdown on SM 12.0)



Summary

vllm/models/deepseek_v4/nvidia/ops/attention.py:370 accesses self.wo_a.weight_scale_inv (or the weight_scale fallback from #43290) via a getattr(..., None) pattern that trips dynamo's _getattr_static during torch.compile tracing. The trace fails with ObservedAttributeError when the MTP block (preserved at BF16 in Option-Y artifacts) has no scale attribute at all. Users are then forced to set --enforce-eager, which costs roughly 10× in decode throughput on SM 12.0 (Blackwell consumer/server).

Repro

Artifact: canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP — main expert weights at W4A16, attention path at FP8_BLOCK, MTP block (layer 43) preserved at BF16 per Option Y.

Hardware: NVIDIA RTX PRO 6000 Blackwell Server Edition (SM 12.0), TP=2.

vLLM build: jasl/vllm@ds4-sm120-preview-dev (SHA c79225692), which has the post-refactor vllm/models/deepseek_v4/nvidia/ops/attention.py layout.

Launch any serve command without --enforce-eager. The cudagraph profile run crashes at:

File "vllm/models/deepseek_v4/nvidia/ops/attention.py", line 370, in forward
    wo_a_scale = self.wo_a.weight_scale
torch._dynamo.exc.ObservedAttributeError: 'ColumnParallelLinear' object has no attribute 'weight_scale'

The MTP block's wo_a is a plain ColumnParallelLinear(weight=BF16) with neither weight_scale_inv nor weight_scale registered as a parameter. PR #43290 added a getattr fallback, but getattr(obj, name, None) is not dynamo-safe: dynamo intercepts the attribute lookup with _getattr_static, which inspects only the class type, not the instance's dynamically registered params/buffers.
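The difference between a dynamic and a static attribute lookup can be sketched with the standard library alone. This is an analogy, not dynamo's internal code: TinyModule is a hypothetical stand-in that keeps parameters in a per-instance dict the way the issue describes, and inspect.getattr_static plays the role of a lookup that never runs __getattr__.

```python
import inspect


class TinyModule:
    """Hypothetical stand-in for a module that keeps parameters in a dict."""

    def __init__(self):
        # Store the registry directly in the instance dict.
        object.__setattr__(self, "_parameters", {})

    def register_parameter(self, name, value):
        self._parameters[name] = value

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for registered names.
        params = self.__dict__.get("_parameters", {})
        if name in params:
            return params[name]
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}"
        )


# A quantized layer registers its scale at load time.
fp8_layer = TinyModule()
fp8_layer.register_parameter("weight_scale_inv", [0.5, 0.25])

# Dynamic lookup runs __getattr__ and finds the registered value.
dynamic_hit = getattr(fp8_layer, "weight_scale_inv", None)

# Static lookup skips __getattr__, so the registered value is invisible.
static_hit = inspect.getattr_static(fp8_layer, "weight_scale_inv", None)

# A BF16 layer registers no scale; the dynamic lookup returns the default.
bf16_layer = TinyModule()
dynamic_miss = getattr(bf16_layer, "weight_scale_inv", None)

print(dynamic_hit, static_hit, dynamic_miss)
```

In this sketch the static lookup cannot tell a layer with a registered scale from one without, which is why branching on attribute presence is fragile under a tracer that resolves attributes statically.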

Why this matters

This forces every Option-Y MTP artifact running on SM 12.0 hardware to use --enforce-eager, with a measured ~10× decode slowdown vs torch.compile + cudagraph. Empirical impact (bs=1 MTP-spec k=1 on a single TP=2 replica of 2× RTX PRO 6000 Blackwell, same artifact):

| Mode | output tok/s | TPOT median (ms) |
| --- | --- | --- |
| --enforce-eager | 11.57 | 82.70 |
| cudagraph (with dtype-fix below) | 98.83 | 8.55 |

Same hardware, same artifact, same kernel set: the only difference is that the dynamo-unsafe attribute access blocks cudagraph capture.
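As a quick check on the "~10×" figure, the ratios can be computed directly from the two measurements. This sketch assumes the table rows read as 11.57 tok/s and 82.70 ms for --enforce-eager, and 98.83 tok/s and 8.55 ms for cudagraph; the variable names are illustrative only.

```python
# Measurements as read from the table above (bs=1, MTP-spec k=1, TP=2).
eager = {"tok_s": 11.57, "tpot_ms": 82.70}
cudagraph = {"tok_s": 98.83, "tpot_ms": 8.55}

# Higher tok/s is better, so divide cudagraph by eager.
throughput_gain = cudagraph["tok_s"] / eager["tok_s"]

# Lower TPOT is better, so divide eager by cudagraph.
tpot_gain = eager["tpot_ms"] / cudagraph["tpot_ms"]

print(f"output throughput: {throughput_gain:.2f}x")
print(f"median TPOT:       {tpot_gain:.2f}x")
```

Under that reading the gain is about 8.5× in output throughput and about 9.7× in median TPOT, consistent with the rounded "~10×" quoted in the issue.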

Proposed fix

Replace the runtime attribute-presence check with a static dtype check that dynamo CAN constant-fold:

# Before
wo_a_scale = getattr(self.wo_a, "weight_scale_inv", None)
if wo_a_scale is None:
    wo_a_scale = self.wo_a.weight_scale

# After (dynamo-safe)
if self.wo_a.weight.dtype == torch.bfloat16:
    # BF16 wo_a (e.g. Option-Y MTP block) — route to BF16 reference path
    # already used on ROCm (rocm_inv_rope_einsum)
    z = rocm_inv_rope_einsum(
        self.rotary_emb, o, positions, self.rope_head_dim,
        self.n_local_groups, self.o_lora_rank, self.wo_a,
    )
    return self.wo_b(z.flatten(1))

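# else fall through to the FP8 einsum (existing path)
wo_a_fp8 = self.wo_a.weight
# Explicit None check: `tensor or fallback` would call bool() on the scale
# tensor, which raises for multi-element tensors.
wo_a_scale = getattr(self.wo_a, "weight_scale_inv", None)
if wo_a_scale is None:
    wo_a_scale = self.wo_a.weight_scale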

tensor.dtype == torch.bfloat16 is statically resolvable by dynamo because weight is a known declared parameter on ColumnParallelLinear and its dtype attribute is a constant at trace time. Dynamo treats it as a guard condition and specializes per dtype.
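The control-flow shape of the proposed fix can be sketched without torch. FakeLinear and select_wo_a_path are hypothetical names invented for illustration, not vLLM classes, and the dtype is modeled as a plain string; the point is only that the first branch keys on a field every layer declares, and the scale lookup happens solely on the quantized path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeLinear:
    """Hypothetical stand-in for a linear layer; not a vLLM class."""

    weight_dtype: str                        # always present, like weight.dtype
    weight_scale_inv: Optional[list] = None  # set only on some quantized layers
    weight_scale: Optional[list] = None      # alternative scale name


def select_wo_a_path(layer: FakeLinear):
    # Branch on the field every layer declares, never on attribute presence.
    if layer.weight_dtype == "bfloat16":
        return "bf16_reference", None

    # Quantized path: prefer weight_scale_inv, with an explicit None fallback.
    scale = layer.weight_scale_inv
    if scale is None:
        scale = layer.weight_scale
    if scale is None:
        raise ValueError("quantized layer has no scale registered")
    return "fp8_einsum", scale


print(select_wo_a_path(FakeLinear("bfloat16")))
print(select_wo_a_path(FakeLinear("float8_e4m3fn", weight_scale=[2.0])))
```

The BF16 layer never reaches the scale lookup, so a missing scale cannot fail on that path. The real layers differ in that the scale attribute is absent rather than None, which is the case the dtype check is meant to sidestep.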

Generalizes cleanly to:

  • W4A16+FP8+MTP (this artifact) ✓ verified
  • NVFP4+FP8+MTP (sibling artifact, not yet ported to RTX 6000 Pro)
  • Any future Option-Y MTP scheme

Recipe linkage

We carry this as a patch in canada-quant/dsv4-flash-w4a16-fp8-mtp/scripts/patch_wo_a_bf16_path.sh and document the rationale in RECIPE_RTX6000PRO.md §3.3. Happy to file a PR if the maintainer-side direction is "yes, please upstream the dtype-check shape" — wanted to file the issue first per the project's standard contribution flow.

Related

  • #43290 — added the getattr(..., None) fallback; this issue is the dynamo-safety follow-up.
  • #43319 — auto-detect BF16 MTP from safetensors index → skip quant_config on the MTP draft tower at load time. This issue is the runtime/forward analogue.
  • #31085 — SM 12.0 NVFP4 MoE backend selector. Different code path, same Blackwell-consumer-class concern about ensuring cudagraphs actually fire.
