vllm - 💡(How to fix) Fix [deepseek_v4 / DeepGEMM] paged_mqa_logits kernel asserts on next_n=3 → num_speculative_tokens capped at 1 on Hopper

vllm, 2026-05-23 00:01:18

vllm serve --speculative-config '{"method":"mtp","num_speculative_tokens":2}' crashes during profile_cudagraph_memory on Hopper (H200, sm_90a) with:

RuntimeError: Worker failed with error 'Assertion error
  (/home/ubuntu/src/vllm/.deps/deepgemm-src/csrc/apis/../jit_kernels/impls/smxx_fp8_fp4_paged_mqa_logits.hpp:233):
  next_n == 1 or next_n == 2'

vLLM passes next_n = num_speculative_tokens + 1 into DeepGEMM's smxx_fp8_fp4_paged_mqa_logits kernel: k draft positions plus 1 main-model verifier position in the lookahead window. The assertion therefore enforces num_speculative_tokens <= 1. With k=1, next_n=2 passes; with k=2, next_n=3 fails. The assertion text is easy to misread: it looks as if the kernel allows up to 2 speculative tokens, but it actually allows up to 2 total lookahead positions.
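The off-by-one between the assertion text and the user-facing setting can be sketched in a few lines. This is only an illustration of the arithmetic described above: the helper names (next_n_for, kernel_accepts) and the constant SUPPORTED_NEXT_N are hypothetical and do not exist in vLLM or DeepGEMM; the accepted values are copied from the assertion message.

```python
# Lookahead widths accepted by the kernel, taken from the assertion text
# "next_n == 1 or next_n == 2" (smxx_fp8_fp4_paged_mqa_logits.hpp:233).
SUPPORTED_NEXT_N = (1, 2)


def next_n_for(num_speculative_tokens: int) -> int:
    """Lookahead width: k draft positions plus 1 verifier position."""
    return num_speculative_tokens + 1


def kernel_accepts(num_speculative_tokens: int) -> bool:
    """Hypothetical helper mirroring the kernel-side assertion."""
    return next_n_for(num_speculative_tokens) in SUPPORTED_NEXT_N


if __name__ == "__main__":
    for k in (0, 1, 2):
        print(f"k={k} next_n={next_n_for(k)} accepted={kernel_accepts(k)}")
```

Read this way, the limit in the message applies to next_n, so the largest usable num_speculative_tokens is one less than the largest value the assertion names.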

Error Message

RuntimeError: Worker failed with error 'Assertion error (/home/ubuntu/src/vllm/.deps/deepgemm-src/csrc/apis/../jit_kernels/impls/smxx_fp8_fp4_paged_mqa_logits.hpp:233): next_n == 1 or next_n == 2'

Reproducible on

vLLM ~/src/vllm HEAD 50d9dd902 with cherry-picked PRs #43248+#43288+#43290+#43319
H200 SXM5 (Hopper, sm_90a), TP=2
DSv4-Flash W4A16+FP8+MTP artifact (canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP, public)
--attention-backend FLASHINFER_MLA_SPARSE: same assertion fires (the paged_mqa_logits kernel is logits-side, not attention-backend-specific)

What I've ruled out

DeepGEMM has separate Blackwell-only assertions in attention.hpp:210 and :338 that constrain arch_major == 10 and next_n == 1. Those don't fire on Hopper. The paged_mqa_logits.hpp:233 assertion is the active one on Hopper and accepts next_n in {1, 2}.

Impact

num_speculative_tokens=1 is the practical max on Hopper in this build. We measured 1.49× decode speedup at bs=1 with k=1 (TPOT median 6.02ms with MTP vs 8.93ms without). At k=2 the theoretical ceiling rises to ~1.85-2.0×, matching the sibling NVFP4-FP8-MTP B300 artifact's published 2.03× number. We're leaving ~25% of the MTP speedup on the table due to this assertion.
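As a rough check on the figures above, the speedup and the remaining headroom can be recomputed from the quoted numbers. This is a sketch using only the values stated in this report; the function names are illustrative, and the k=2 ceiling is the report's estimate, not a measurement.

```python
# Figures quoted in the Impact paragraph (bs=1, median TPOT in ms).
TPOT_BASELINE_MS = 8.93       # without MTP
TPOT_MTP_K1_MS = 6.02         # with MTP, num_speculative_tokens=1
REPORTED_K1_SPEEDUP = 1.49    # speedup as stated in the report
K2_CEILING = (1.85, 2.0)      # estimated range for k=2, not measured


def speedup(baseline_ms: float, mtp_ms: float) -> float:
    """Decode speedup as the ratio of per-token latencies."""
    return baseline_ms / mtp_ms


def headroom(ceiling: float, current: float) -> float:
    """Fraction of additional speedup available relative to the current one."""
    return ceiling / current - 1.0


if __name__ == "__main__":
    ratio = speedup(TPOT_BASELINE_MS, TPOT_MTP_K1_MS)
    print(f"k=1 speedup from medians: {ratio:.2f}x")
    for ceiling in K2_CEILING:
        gain = headroom(ceiling, REPORTED_K1_SPEEDUP)
        print(f"ceiling {ceiling:.2f}x -> {gain:.0%} headroom over k=1")
```

The ratio of the two medians lands close to the reported k=1 speedup, and the lower end of the k=2 estimate corresponds to roughly the 25% headroom mentioned above.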

Raw throughput data:

Suggested fix

Two options:

1. Widen the assertion in smxx_fp8_fp4_paged_mqa_logits.hpp:233 to allow next_n up to whatever the kernel actually supports (presumably 4 or higher; DSv4-Flash's MTP block is designed for higher k). This requires DeepGEMM-side changes in deepseek-ai/DeepGEMM.
2. Surface the constraint in vLLM's config validation: when --speculative-config sets num_speculative_tokens > 1 AND the model uses the paged_mqa_logits kernel path on Hopper, emit a clear error at startup rather than letting the cudagraph capture fail with a confusing assertion message.

Option 2 is the cheaper path on the vLLM side. Option 1 unlocks the actual speedup.
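A minimal sketch of what the startup check in option 2 might look like, assuming the limit stays at next_n <= 2. Everything here is hypothetical: SpecConfigSketch, validate_spec_config, and the uses_paged_mqa_logits flag are illustrative names, not vLLM's actual config classes or validation hooks. The only facts taken from this report are the next_n = k + 1 relation and the accepted values; Hopper is identified by compute capability major version 9 (sm_90a).

```python
from dataclasses import dataclass

# Largest next_n accepted by the kernel per the assertion text; this would
# need updating if DeepGEMM widens the limit.
MAX_PAGED_MQA_NEXT_N = 2
HOPPER_ARCH_MAJOR = 9  # sm_90 / sm_90a


@dataclass
class SpecConfigSketch:
    """Hypothetical stand-in for the parsed --speculative-config value."""
    method: str
    num_speculative_tokens: int


def validate_spec_config(cfg: SpecConfigSketch,
                         uses_paged_mqa_logits: bool,
                         arch_major: int) -> None:
    """Fail fast with a readable message instead of a kernel assertion."""
    if arch_major != HOPPER_ARCH_MAJOR or not uses_paged_mqa_logits:
        return  # constraint described in this report does not apply
    next_n = cfg.num_speculative_tokens + 1  # k draft + 1 verifier
    if next_n > MAX_PAGED_MQA_NEXT_N:
        raise ValueError(
            f"num_speculative_tokens={cfg.num_speculative_tokens} gives "
            f"next_n={next_n}, but the paged_mqa_logits kernel accepts "
            f"next_n <= {MAX_PAGED_MQA_NEXT_N} on Hopper; use "
            f"num_speculative_tokens <= {MAX_PAGED_MQA_NEXT_N - 1}."
        )


if __name__ == "__main__":
    validate_spec_config(SpecConfigSketch("mtp", 1), True, 9)  # passes
    try:
        validate_spec_config(SpecConfigSketch("mtp", 2), True, 9)
    except ValueError as err:
        print(err)
```

Stating the limit in terms of num_speculative_tokens rather than next_n keeps the message aligned with the setting the user actually controls.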

Tracking

Filed by the canada-quant team during W4A16+FP8+MTP quantization work. See FINDINGS_FOR_SIBLING.md §C15 for the full diagnosis. The sibling artifact (B300, NVFP4) is on arch_major == 10 paths and may hit different assertions; happy to coordinate if there's a cross-arch fix.
