vllm - 💡(How to fix) Fix [DSv4] [SM 12.0] fp8_einsum has no SM 12.0 fallback — blocks mainline serve on consumer Blackwell (follow-up to #41834) [5 pull requests]

vllm, 2026-05-27 03:46:16


On RTX PRO 6000 Blackwell (SM 12.0), vllm-project/vllm@main + #43722 + #43723 + #41834 (full overlay) + #40923 + #43655 (rebased) gets DSv4-Flash artifacts past engine init + Marlin/Triton dispatch + cudagraph capture, then fails at first forward pass with:

RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../heuristics/../../utils/layout.hpp:39): t.dim() == N
  at vllm/utils/deep_gemm.py:317 fp8_einsum
  at vllm/models/deepseek_v4/attention.py (forward)

Root cause: vllm.utils.deep_gemm.fp8_einsum resolves to DeepGEMM's Hopper-only fp8_einsum impl on consumer Blackwell, which asserts a tensor rank that the SM 12.0 call site doesn't satisfy.
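To make the architecture condition concrete, here is a minimal sketch of a guard that fails early with a readable message instead of reaching the DeepGEMM rank assertion. The names `is_sm12x` and `check_fp8_einsum_supported` are hypothetical and do not exist in vLLM. The capability tuple is passed in explicitly so the sketch runs without a GPU; on a real system it would typically come from `torch.cuda.get_device_capability()`.

```python
from typing import Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (12, 0) for SM 12.0


def is_sm12x(capability: Capability) -> bool:
    """Return True for compute capability 12.x (hypothetical helper)."""
    major, _minor = capability
    return major == 12


def check_fp8_einsum_supported(
    capability: Capability, has_sm12x_fallback: bool
) -> None:
    """Hypothetical guard: raise a clear error when SM 12.x has no fallback."""
    if is_sm12x(capability) and not has_sm12x_fallback:
        major, minor = capability
        raise NotImplementedError(
            f"fp8_einsum has no SM {major}.{minor} fallback; "
            "the DeepGEMM path is expected to fail its rank assertion"
        )


# Hopper (SM 9.0) passes the guard; SM 12.0 without a fallback does not.
check_fp8_einsum_supported((9, 0), has_sm12x_fallback=False)
```

A guard like this does not fix the crash. It only turns the opaque `t.dim() == N` assertion into an error that names the missing fallback.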


Fix Action

Fix / Workaround

#41834 cleanly addresses one of the DeepGEMM Hopper-only call sites — tf32_hc_prenorm_gemm — via _tf32_hc_prenorm_gemm_sm12x dispatch in vllm/utils/deep_gemm.py and a Triton fallback in vllm/models/deepseek_v4/nvidia/ops/sm12x_deep_gemm_fallbacks.py. With #41834's full overlay applied, that first crash is gone. But the second DeepGEMM call, fp8_einsum, has no SM 12.0 dispatch in mainline OR in #41834.



DSv4 attention `fp8_einsum` has no SM 12.0 fallback in mainline — blocks consumer Blackwell serve

Disclosure (per AGENTS.md): drafted with AI assistance; all references verified by hand against vllm-project/vllm@e19b9b104 and jasl/vllm@27fd665b source trees on 2026-05-27. The human submitter (@pasta-paul / canada-quant) defends every claim.



Where jasl/vllm solves this

jasl/vllm@27fd665b ships these files in vllm/models/deepseek_v4/nvidia/ops/ that are absent from mainline:

| File | Purpose | LoC |
| --- | --- | --- |
| `fp8_einsum.py` | SM 12.0 Triton fallback (`_deepseek_v4_sm12x_fp8_einsum_kernel`) + a `direct_register_custom_op(deepseek_v4_fp8_einsum)` that dispatches between `deepseek_v4_sm12x_fp8_einsum` (Triton, SM 12.0) and the DeepGEMM `fp8_einsum` (Hopper) | 294 |
| `cutedsl_utils.py` | CuteDSL helpers used by the SM 12.0 dispatch glue | 178 |

The dispatch pattern (in jasl's fp8_einsum.py):

def deepseek_v4_fp8_einsum(equation, a, a_scale, b, b_scale, out, *, recipe):
    if _use_deepseek_v4_sm12x_triton_fp8_einsum(equation, recipe, b_scale):
        deepseek_v4_sm12x_fp8_einsum(a, a_scale, b, b_scale, out)
    else:
        fp8_einsum(equation, (a, a_scale), (b, b_scale), out, recipe=tuple(recipe))

And jasl's attention.py:32:

from vllm.models.deepseek_v4.nvidia.ops.fp8_einsum import (
    deepseek_v4_fp8_einsum_config,
)
# ... then forward calls torch.ops.vllm.deepseek_v4_fp8_einsum(...)

i.e. the dispatch lives inside a registered custom op, not in deep_gemm.py. Mainline's vllm/models/deepseek_v4/attention.py (post-#43149 refactor) calls fp8_einsum(...) from vllm.utils.deep_gemm directly, with no custom-op wrapper.

What's needed for mainline SM 12.0 viability

Port jasl's pattern to mainline:

- Add vllm/models/deepseek_v4/fp8_einsum.py (note: top-level after #43149, not nvidia/ops/) containing:
  - _deepseek_v4_sm12x_fp8_einsum_kernel Triton kernel
  - deepseek_v4_sm12x_fp8_einsum wrapper
  - deepseek_v4_fp8_einsum dispatcher
  - direct_register_custom_op("deepseek_v4_fp8_einsum", ...)
- Add vllm/models/deepseek_v4/cutedsl_utils.py (CuteDSL helpers)
- Update vllm/models/deepseek_v4/attention.py (line ~32) to import from the new module and call the registered op instead of bare fp8_einsum.

This unblocks mainline DSv4 serve on SM 12.0 (after #41834, #43722, #43723, #40923, #43655 also land). Estimated diff size: ~500 lines (most of it is the existing Triton kernel + a custom-op wrapper).

What we'd offer

We have an RTX PRO 6000 Server Edition box + canada-quant artifacts + verified jasl-based reference. Can test any proposed mainline port within 24h. The jasl files are MIT/Apache-2.0 licensed (same as vLLM) and can be copy-ported with attribution.

Cross-references

Parent tracker #43564 — full Phase A summary at https://github.com/vllm-project/vllm/issues/43564#issuecomment-4550184475
PR #41834 — fixes tf32_hc_prenorm_gemm SM 12.0 dispatch; this issue is its follow-up for fp8_einsum
jasl source: https://github.com/jasl/vllm/blob/ds4-sm120-preview-dev/vllm/models/deepseek_v4/nvidia/ops/fp8_einsum.py + https://github.com/jasl/vllm/blob/ds4-sm120-preview-dev/vllm/models/deepseek_v4/nvidia/ops/cutedsl_utils.py
Today's verified working stack on jasl@27fd665b: Card D AIME-30 c=4 thinking = 24/30 correct, 0 CUDA errors, 91.61% MTP acceptance — bench JSON

cc @jasl @mgoin @pavanimajety @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @zyongye.
