vllm - 💡(How to fix) Fix [DSv4] [SM 12.0] fp8_einsum has no SM 12.0 fallback — blocks mainline serve on consumer Blackwell (follow-up to #41834) [5 pull requests]



DSv4 attention fp8_einsum has no SM 12.0 fallback in mainline — blocks consumer Blackwell serve

Disclosure (per AGENTS.md): drafted with AI assistance; all references verified by hand against vllm-project/vllm@e19b9b104 and jasl/vllm@27fd665b source trees on 2026-05-27. The human submitter (@pasta-paul / canada-quant) defends every claim.

Summary

On RTX PRO 6000 Blackwell (SM 12.0), vllm-project/vllm@main with #43722, #43723, #41834 (full overlay), #40923, and #43655 (rebased) applied gets DSv4-Flash artifacts past engine init, Marlin/Triton dispatch, and cudagraph capture, then fails at the first forward pass with:

RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../heuristics/../../utils/layout.hpp:39): t.dim() == N
  at vllm/utils/deep_gemm.py:317 fp8_einsum
  at vllm/models/deepseek_v4/attention.py (forward)

Root cause: vllm.utils.deep_gemm.fp8_einsum resolves to DeepGEMM's Hopper-only fp8_einsum impl on consumer Blackwell, which asserts a tensor rank that the SM 12.0 call site doesn't satisfy.
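To confirm that a machine falls into the affected class, it can help to check its CUDA compute capability first. The helper below is a minimal sketch: `is_sm12x` and `describe` are hypothetical names, not vLLM functions, and the capability tuple is passed in as an argument so the logic can be exercised without a GPU. On real hardware the tuple would come from `torch.cuda.get_device_capability()`.

```python
from typing import Tuple


def is_sm12x(capability: Tuple[int, int]) -> bool:
    """Return True for compute capability 12.x (the SM 12.0 class in this issue)."""
    major, _minor = capability
    return major == 12


def describe(capability: Tuple[int, int]) -> str:
    """Summarize whether a device is in the class this issue is about."""
    major, minor = capability
    if is_sm12x(capability):
        return f"SM {major}.{minor}: needs an SM 12.x fallback for fp8_einsum"
    return f"SM {major}.{minor}: not covered by this issue"


# On real hardware (assumption: PyTorch with CUDA support is installed):
#   import torch
#   print(describe(torch.cuda.get_device_capability()))
print(describe((12, 0)))
print(describe((9, 0)))
```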

Why this is separate from #41834

#41834 cleanly addresses one of the DeepGEMM Hopper-only call sites — tf32_hc_prenorm_gemm — via _tf32_hc_prenorm_gemm_sm12x dispatch in vllm/utils/deep_gemm.py and a Triton fallback in vllm/models/deepseek_v4/nvidia/ops/sm12x_deep_gemm_fallbacks.py. With #41834's full overlay applied, that first crash is gone. But the second DeepGEMM call, fp8_einsum, has no SM 12.0 dispatch in mainline OR in #41834.

Where jasl/vllm solves this

jasl/vllm@27fd665b ships these files in vllm/models/deepseek_v4/nvidia/ops/ that are absent from mainline:

| File | Purpose | LoC |
| --- | --- | --- |
| fp8_einsum.py | SM 12.0 Triton fallback (_deepseek_v4_sm12x_fp8_einsum_kernel) + a direct_register_custom_op(deepseek_v4_fp8_einsum) that dispatches between deepseek_v4_sm12x_fp8_einsum (Triton, SM 12.0) and the DeepGEMM fp8_einsum (Hopper) | 294 |
| cutedsl_utils.py | CuteDSL helpers used by the SM 12.0 dispatch glue | 178 |

The dispatch pattern (in jasl's fp8_einsum.py):

def deepseek_v4_fp8_einsum(equation, a, a_scale, b, b_scale, out, *, recipe):
    if _use_deepseek_v4_sm12x_triton_fp8_einsum(equation, recipe, b_scale):
        deepseek_v4_sm12x_fp8_einsum(a, a_scale, b, b_scale, out)
    else:
        fp8_einsum(equation, (a, a_scale), (b, b_scale), out, recipe=tuple(recipe))

And jasl's attention.py:32:

from vllm.models.deepseek_v4.nvidia.ops.fp8_einsum import (
    deepseek_v4_fp8_einsum_config,
)
# ... then forward calls torch.ops.vllm.deepseek_v4_fp8_einsum(...)

In other words, the dispatch lives inside a registered custom op, not in deep_gemm.py. Mainline's vllm/models/deepseek_v4/attention.py (post-#43149 refactor) calls fp8_einsum(...) from vllm.utils.deep_gemm directly, with no custom-op wrapper.
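The difference between the two call styles can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: `make_fp8_einsum_dispatcher`, the stub backends, the two-argument predicate, the equation string, and the recipe values are all placeholders, and no real vLLM, DeepGEMM, or Triton code is invoked. The point is only that the call site talks to a single entry point, which picks a backend on each call.

```python
from typing import Callable, Sequence, Tuple

Backend = Callable[..., str]
Predicate = Callable[[str, Tuple[int, ...]], bool]


def make_fp8_einsum_dispatcher(
    use_sm12x_fallback: Predicate,
    sm12x_backend: Backend,
    default_backend: Backend,
) -> Backend:
    """Build one entry point that selects a backend on every call."""

    def dispatch(equation, a, a_scale, b, b_scale, out, *, recipe: Sequence[int]):
        recipe_t = tuple(recipe)
        if use_sm12x_fallback(equation, recipe_t):
            # Fallback path: takes the tensors and scales directly.
            return sm12x_backend(a, a_scale, b, b_scale, out)
        # Default path: takes (tensor, scale) pairs plus the recipe.
        return default_backend(
            equation, (a, a_scale), (b, b_scale), out, recipe=recipe_t
        )

    return dispatch


# Stub backends that only report which path ran (hypothetical stand-ins).
def triton_stub(a, a_scale, b, b_scale, out) -> str:
    return "sm12x-triton"


def deepgemm_stub(equation, a_pair, b_pair, out, *, recipe) -> str:
    return "deepgemm"


def dispatcher_for(capability: Tuple[int, int]) -> Backend:
    # Hypothetical predicate: route on compute capability only.
    return make_fp8_einsum_dispatcher(
        lambda equation, recipe: capability[0] == 12, triton_stub, deepgemm_stub
    )


# Placeholder equation and recipe; the real values are not given in this issue.
args = ("ij,jk->ik", None, None, None, None, None)
print(dispatcher_for((12, 0))(*args, recipe=[1, 128, 128]))
print(dispatcher_for((9, 0))(*args, recipe=[1, 128, 128]))
```

With this shape, adding or removing a fallback only touches the dispatcher, and the attention forward pass keeps a single call.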

What's needed for mainline SM 12.0 viability

Port jasl's pattern to mainline:

  1. Add vllm/models/deepseek_v4/fp8_einsum.py (note: top-level after #43149, not nvidia/ops/) containing:
    • _deepseek_v4_sm12x_fp8_einsum_kernel Triton kernel
    • deepseek_v4_sm12x_fp8_einsum wrapper
    • deepseek_v4_fp8_einsum dispatcher
    • direct_register_custom_op("deepseek_v4_fp8_einsum", ...)
  2. Add vllm/models/deepseek_v4/cutedsl_utils.py (CuteDSL helpers)
  3. Update vllm/models/deepseek_v4/attention.py (line ~32) to import from the new module and call the registered op instead of bare fp8_einsum.

This unblocks mainline DSv4 serve on SM 12.0 (after #41834, #43722, #43723, #40923, #43655 also land). Estimated diff size: ~500 lines (most of it is the existing Triton kernel + a custom-op wrapper).

What we'd offer

We have an RTX PRO 6000 Server Edition box, canada-quant artifacts, and a verified jasl-based reference, and can test any proposed mainline port within 24h. The jasl files are Apache-2.0 licensed (same as vLLM) and can be copy-ported with attribution.

Cross-references

cc @jasl @mgoin @pavanimajety @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @zyongye.
