vllm: [deepseek_v4 / DeepGEMM] paged_mqa_logits kernel asserts on next_n=3 → num_speculative_tokens capped at 1 on Hopper


vllm serve --speculative-config '{"method":"mtp","num_speculative_tokens":2}' crashes during profile_cudagraph_memory on Hopper (H200, sm_90a) with:

RuntimeError: Worker failed with error 'Assertion error
  (/home/ubuntu/src/vllm/.deps/deepgemm-src/csrc/apis/../jit_kernels/impls/smxx_fp8_fp4_paged_mqa_logits.hpp:233):
  next_n == 1 or next_n == 2'

vLLM passes next_n = num_speculative_tokens + 1 into DeepGEMM's smxx_fp8_fp4_paged_mqa_logits kernel (k draft tokens plus 1 main-model verifier position in the lookahead window). The assertion therefore enforces num_speculative_tokens <= 1: with k=1, next_n=2 passes; with k=2, next_n=3 fails. The error message is misleading: it reads as if the kernel allows up to 2 speculative tokens, but it actually allows up to 2 total lookahead positions.
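
The off-by-one between the configured value and the asserted value can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the helper names are hypothetical and are not part of vLLM or DeepGEMM.

```python
# Values accepted by the assertion quoted above: next_n == 1 or next_n == 2.
SUPPORTED_NEXT_N = (1, 2)


def next_n_for(num_speculative_tokens: int) -> int:
    """Lookahead positions: k draft tokens plus 1 verifier position."""
    return num_speculative_tokens + 1


def kernel_accepts(num_speculative_tokens: int) -> bool:
    """Hypothetical helper that mirrors the assertion for a given k."""
    return next_n_for(num_speculative_tokens) in SUPPORTED_NEXT_N


for k in (0, 1, 2):
    print(k, next_n_for(k), kernel_accepts(k))
# 0 1 True
# 1 2 True
# 2 3 False
```

So the limit in the assertion message applies to next_n, and the largest usable num_speculative_tokens is one less than that.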


Reproducible on

  • vLLM ~/src/vllm HEAD 50d9dd902 with cherry-picked PRs #43248+#43288+#43290+#43319
  • H200 SXM5 (Hopper, sm_90a), TP=2
  • DSv4-Flash W4A16+FP8+MTP artifact (canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP, public)
  • With --attention-backend FLASHINFER_MLA_SPARSE the same assertion fires (the paged_mqa_logits kernel is logits-side, not attention-backend-specific)

What I've ruled out

DeepGEMM has separate Blackwell-only assertions in attention.hpp:210 and :338 that constrain arch_major == 10 and next_n == 1. Those don't fire on Hopper. The paged_mqa_logits.hpp:233 assertion is the active one on Hopper and accepts next_n in {1, 2}.

Impact

num_speculative_tokens=1 is the practical maximum on Hopper in this build. We measured a 1.49× decode speedup at bs=1 with k=1 (TPOT median 6.02 ms with MTP vs 8.93 ms without). At k=2 the theoretical ceiling rises to ~1.85-2.0×, matching the sibling NVFP4-FP8-MTP B300 artifact's published 2.03× number. We're leaving ~25% of the MTP speedup on the table due to this assertion.
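
As a rough check of these figures, the arithmetic can be reproduced from the numbers quoted above. This is a sketch only: the inputs are the rounded values from this report, so the recomputed speedup (about 1.48×) differs slightly from the reported 1.49×, presumably due to rounding of the medians. The function names are illustrative.

```python
# Rounded figures quoted in the Impact paragraph.
tpot_without_mtp_ms = 8.93
tpot_with_mtp_ms = 6.02
reported_speedup_k1 = 1.49
ceiling_k2_range = (1.85, 2.0)


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Decode speedup as a ratio of per-token latencies."""
    return baseline_ms / optimized_ms


def fraction_unrealized(achieved: float, ceiling: float) -> float:
    """Share of the ceiling speedup that is not reached at the achieved level."""
    return 1.0 - achieved / ceiling


print(f"recomputed k=1 speedup: {speedup(tpot_without_mtp_ms, tpot_with_mtp_ms):.2f}x")
for ceiling in ceiling_k2_range:
    gap = fraction_unrealized(reported_speedup_k1, ceiling)
    print(f"ceiling {ceiling:.2f}x -> {gap:.1%} unrealized")
```

Under these inputs the gap is about 19% against the 1.85× end of the range and about 25% against the 2.0× end, so the ~25% figure corresponds to the upper ceiling.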

Raw throughput data:

Suggested fix

Two options:

  1. Widen the assertion in smxx_fp8_fp4_paged_mqa_logits.hpp:233 to allow next_n up to whatever the kernel actually supports (presumably 4 or higher; DSv4-Flash's MTP block is designed for higher k). This requires DeepGEMM-side changes in deepseek-ai/DeepGEMM.

  2. Surface the constraint in vLLM's config validation: when --speculative-config sets num_speculative_tokens > 1 and the model uses the paged_mqa_logits kernel path on Hopper, emit a clear error at startup rather than letting the cudagraph capture fail with a confusing assertion message.

Option 2 is the cheaper path on the vLLM side. Option 1 unlocks the actual speedup.
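
A minimal sketch of what the Option 2 check could look like follows. Everything here is an assumption made for illustration: the function name, its parameters, and the constants are hypothetical and do not correspond to existing vLLM config code. The limit of 2 is taken from the assertion quoted in this report.

```python
HOPPER_ARCH_MAJOR = 9      # sm_90a
MAX_NEXT_N_ON_HOPPER = 2   # from the assertion at smxx_fp8_fp4_paged_mqa_logits.hpp:233


def check_speculative_tokens(
    num_speculative_tokens: int,
    arch_major: int,
    uses_paged_mqa_logits: bool,
) -> None:
    """Hypothetical startup check; raises before cudagraph capture would fail."""
    if not uses_paged_mqa_logits or arch_major != HOPPER_ARCH_MAJOR:
        return  # Other paths have their own constraints, not modelled here.

    next_n = num_speculative_tokens + 1  # k draft tokens + 1 verifier position
    if next_n > MAX_NEXT_N_ON_HOPPER:
        max_k = MAX_NEXT_N_ON_HOPPER - 1
        raise ValueError(
            f"num_speculative_tokens={num_speculative_tokens} gives next_n={next_n}, "
            f"but the paged_mqa_logits kernel on Hopper accepts next_n <= "
            f"{MAX_NEXT_N_ON_HOPPER}. Set num_speculative_tokens <= {max_k}."
        )


check_speculative_tokens(1, arch_major=9, uses_paged_mqa_logits=True)  # passes
try:
    check_speculative_tokens(2, arch_major=9, uses_paged_mqa_logits=True)
except ValueError as err:
    print(err)
```

Stating the limit in terms of num_speculative_tokens, rather than next_n, is what removes the ambiguity in the current assertion message.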

Tracking

Filed by the canada-quant team during W4A16+FP8+MTP quantization work. See FINDINGS_FOR_SIBLING.md §C15 for the full diagnosis. The sibling artifact (B300, NVFP4) is on arch_major == 10 paths and may hit different assertions; happy to coordinate if there's a cross-arch fix.
