vllm - ✅(Solved) Fix [Bug]: MTP × TurboQuant × CUDA graph capture produces degenerate output on Qwen3-Next hybrid (not closed by v7.13 ngram fix tree) [1 pull request, 2 comments, 2 participants]

vllm · 2026-04-25 16:20:43


GitHub stats

vllm-project/vllm#40880 • Fetched 2026-04-26 05:06:14


Author

noonghunna

Participants

noonghunna

Sandermage

Timeline (top)

subscribed ×11, mentioned ×10, commented ×2, closed ×1

MTP speculative decoding × TurboQuant KV cache × CUDA graph capture produces degenerate output on Qwen3-Next hybrid models. The four upstream PRs that fixed the ngram path (#40738 + #36138 + #40783 + #39055, plus the prompt_lookup_min=8 config trick from #40875) do not close this for MTP — the MTP forward path goes through a different proposer (EagleProposer configured with method="mtp") that the ngram-scoped fixes don't cover.

Filed at @Sandermage's explicit suggestion: "we did not test MTP at all in the v7.13 cycle... your data shows that assumption is wrong."


Fix / Workaround

Image: vllm/vllm-openai@sha256:9bba4628a3b943e0dd33caefb94b811569ba1e97bdf23bee19a265c31b947ccb (v0.19.2rc1.dev21+g893611813)
Hardware: 1× RTX 3090 (Ampere SM 8.6)
Model: Lorbus/Qwen3.6-27B-int4-AutoRound
Patches: Genesis v7.13 backports applied via Sandermage/genesis-vllm-patches@852b73f. The log confirms "Genesis Results: 25 applied, 16 skipped, 0 failed", including P60 (#40738 Phase 1) and P60b (#40738 Phase 2 Triton kernel offset).


PR fix notes

PR #40738: [Bugfix] Fix GDN conv + SSM state corruption with ngram spec decode

Repository: vllm-project/vllm
Author: tdoublep
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40738

Description (problem / solution / changelog)

Summary

Fix output corruption when using ngram speculative decoding with hybrid GDN models (e.g., Qwen3.5) in mamba_cache_mode="none".

After a spec decode step accepts N tokens, the next non-spec decode step must read SSM state from block N-1 and conv state from an offset position. Two bugs prevented this:

num_accepted_tokens was not passed to SSM metadata builders on non-spec steps
causal_conv1d_fn had no mechanism to offset conv state reads based on accepted tokens

Changes

gdn_attn.py: Compute spec_decode_src_indices for SSM state correction; pad num_accepted_tokens with 1s for prefill sequences in mixed batches
gdn_linear_attn.py: Pre-copy SSM state from accepted block to block 0; pass num_accepted_tokens to conv kernels gated on whether correction is needed
causal_conv1d.py: Add IS_SPEC_DECODING path to _causal_conv1d_fwd_kernel that offsets conv state reads/writes by num_accepted_tokens - 1
gpu_model_runner.py: Pass num_accepted_tokens to GDN/Mamba2 builders on non-spec steps
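The state-recovery rule described above (read from block N-1 after accepting N tokens, pre-copy that block to block 0, pad prefill sequences with 1) can be pictured with a toy model. This is only a sketch under stated assumptions: the function names, the list-of-strings "state", and the block layout are hypothetical stand-ins for illustration and are not taken from the vLLM code in this PR.

```python
from typing import List


def source_block_indices(num_accepted_tokens: List[int]) -> List[int]:
    """Per-sequence index of the state block to read on the next step.

    A sequence that accepted N tokens reads from block N - 1. Prefill
    sequences are assumed to be padded with N = 1, so they read block 0.
    """
    if any(n < 1 for n in num_accepted_tokens):
        raise ValueError("num_accepted_tokens entries must be >= 1")
    return [n - 1 for n in num_accepted_tokens]


def precopy_accepted_state(state_blocks: List[List[str]],
                           num_accepted_tokens: List[int]) -> None:
    """Copy each sequence's accepted block into block 0, in place."""
    indices = source_block_indices(num_accepted_tokens)
    for blocks, src in zip(state_blocks, indices):
        blocks[0] = blocks[src]


# Toy data (hypothetical): each string stands in for the recurrent state
# after that many tokens of the speculative window have been processed.
state_blocks = [
    ["s+1", "s+2", "s+3", "s+4"],  # sequence 0: spec decode, 3 tokens accepted
    ["p+1", "p+2", "p+3", "p+4"],  # sequence 1: prefill, padded to 1
]
num_accepted = [3, 1]

precopy_accepted_state(state_blocks, num_accepted)
print([blocks[0] for blocks in state_blocks])
```

Without the pre-copy, a step that always reads block 0 would pick up "s+1" for sequence 0, which is the stale-state pattern the PR description attributes to the ngram path.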

Test plan

Single-prompt: baseline vs ngram match token-for-token (Qwen3.5-0.8B)
Mixed-batch: short prompt matches baseline; long prompt generates coherent output
Kernel fix verified necessary: disabling conv offset causes regression
Existing GDN + spec decode CI tests
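When a token-for-token comparison like the one in this test plan fails, it can help to know where the two runs first part ways. The helper below is a small sketch for that purpose; `first_divergence` and the sample token IDs are hypothetical and not part of the PR.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[int],
                     candidate: Sequence[int]) -> Optional[int]:
    """Return the first index where the two token sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    strict prefix of the other, the length of the shorter one is returned.
    """
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None


# Made-up token IDs standing in for baseline and speculative outputs.
baseline_ids = [11, 42, 7, 7, 93]
spec_ids = [11, 42, 7, 5, 5]

idx = first_divergence(baseline_ids, spec_ids)
print("PASS" if idx is None else f"FAIL at token {idx}")
```

The returned index points at the first output token that differs, which narrows down where to start looking for the step that went wrong.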

Reproducers

Single-prompt:

from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen3.5-0.8B"
PROMPT = "<code>\nclass Calculator:\n    def add(self, a, b):\n        return a + b\n</code>\n<update>\nAdd subtract and multiply methods\n</update>"
ARGS = dict(model=MODEL, trust_remote_code=True, enforce_eager=True, enable_chunked_prefill=True, max_model_len=4096)
SPEC = {"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 10, "prompt_lookup_min": 2}
S = SamplingParams(max_tokens=200, temperature=0)  # greedy, so runs are comparable token-for-token

# Baseline run without speculative decoding.
b = list(LLM(**ARGS).generate([PROMPT], S)[0].outputs[0].token_ids)
# Same prompt with ngram speculative decoding enabled.
n = list(LLM(**ARGS, speculative_config=SPEC).generate([PROMPT], S)[0].outputs[0].token_ids)
print("PASS" if b == n else "FAIL")

Mixed-batch (short + long prompt, max_num_batched_tokens=64): see reproduce_gdn_ngram_mixed.py in the branch.

Fixes #39273

AI-assisted: Yes (Claude). Not duplicating any existing PR.

Changed files

vllm/model_executor/layers/mamba/gdn_linear_attn.py (modified, +27/-0)
vllm/model_executor/layers/mamba/ops/causal_conv1d.py (modified, +20/-2)
vllm/v1/attention/backends/gdn_attn.py (modified, +32/-2)
vllm/v1/worker/gpu_model_runner.py (modified, +11/-0)


Summary

Filed at @Sandermage's explicit suggestion: "we did not test MTP at all in the v7.13 cycle... your data shows that assumption is wrong."

Reproducer

vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
  --quantization auto_round --dtype float16 \
  --tensor-parallel-size 1 --max-model-len 125000 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype turboquant_3bit_nc \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  # cudagraph + torch.compile both ON (default — no overrides)

Test request:

import json, urllib.request
body = {
  "model": "qwen3.6-27b-autoround",
  "messages": [{"role": "user", "content": "What is the weather in Paris in Celsius? Use the tool."}],
  "tools": [{"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object", "properties": {
      "city": {"type": "string"}, "unit": {"type": "string"}}, "required": ["city","unit"]}}}],
  "tool_choice": "auto", "max_tokens": 256, "chat_template_kwargs": {"enable_thinking": False},
}
r = urllib.request.urlopen(urllib.request.Request(
  "http://localhost:8020/v1/chat/completions",
  data=json.dumps(body).encode(), headers={"content-type":"application/json"}))
print(json.loads(r.read())["choices"][0]["message"])

Expected: tool_calls=[{name=get_weather, args={"city":"Paris","unit":"Celsius"}}]

Actual:

tool_calls=[]
content="<tool_call>\n<function=...><parameter=...>...</parameter>...</parameter>...
        (cascading inline-text emission, never populates tool_calls[])"

The long-context needle test fails the same way: the model retrieves the first token of the secret, then the output ends:

expected: 'silver platypus 22'
got:      'silver '
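The tool-call failure above has a recognizable signature: `tool_calls` is empty while `content` carries raw `<tool_call>` markup. A small checker makes that easy to flag when running the test request repeatedly. This is a sketch only; `classify_message` and its labels are hypothetical, and the sample messages are shaped after the expected and actual outputs in this report, assuming the usual OpenAI-compatible message layout.

```python
from typing import Any, Dict


def classify_message(message: Dict[str, Any]) -> str:
    """Label a chat completion message by how the tool call was emitted."""
    tool_calls = message.get("tool_calls") or []
    content = message.get("content") or ""
    if tool_calls:
        return "structured"          # parser populated tool_calls
    if "<tool_call>" in content:
        return "inline_tool_markup"  # markup leaked into plain text
    return "plain_text"              # no tool call attempted


# Shaped after the "Actual" output in this report.
broken = {"tool_calls": [], "content": "<tool_call>\n<function=..."}
# Shaped after the "Expected" output (field layout is an assumption).
healthy = {
    "content": None,
    "tool_calls": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": '{"city": "Paris", "unit": "Celsius"}',
        },
    }],
}

print(classify_message(broken))
print(classify_message(healthy))
```

In the reproducer, the dict printed on the last line of the test request is what would be passed to `classify_message`.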

Why this is separate from the v7.13 ngram fix

The four backported PRs scope to the ngram path's GDN state recovery (gdn_attn.py + gdn_linear_attn.py). Genesis P60/P60b explicitly target the ngram-affected paths.
MTP runs through vllm/v1/spec_decode/eagle.py with method="mtp" and Qwen3_5MTP model class — distinct machinery from ngram's path.
Empirically, with the same config, the same Genesis patches loaded, and the same image, switching method from mtp to ngram with prompt_lookup_min=8 flips the outcome from broken to clean. The four backports are therefore working; they just don't cover this code path.

Confirmed in #40831 probe ladder Test A vs Test C on the same rig.

Working configurations that avoid this bug

--compilation-config '{"cudagraph_mode":"NONE"}' — keeps torch.compile inductor on, disables CUDA graphs. Restores correctness at ~33 TPS (vs ~85 broken).
Switch to ngram + prompt_lookup_min=8 per #40875.
Drop TurboQuant entirely (use fp8_e5m2 KV) — caps context at ~32-40K on a 24 GB card but unaffected.
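The three workarounds above differ from the failing reproducer by one setting each, which is easier to see when the commands are generated from a shared base. The script below is a sketch under assumptions: the ngram values other than prompt_lookup_min=8 are placeholders borrowed from the PR #40738 reproducer, the 32768 context length is just one value inside the reported ~32-40K range, and keeping MTP enabled in the fp8_e5m2 variant is an assumption. It only prints command lines and does not start a server.

```python
import json
import shlex
from typing import Dict, List

# Flags shared by every variant, copied from the reproducer.
BASE = [
    "vllm", "serve", "Lorbus/Qwen3.6-27B-int4-AutoRound",
    "--quantization", "auto_round", "--dtype", "float16",
    "--tensor-parallel-size", "1",
    "--gpu-memory-utilization", "0.92",
]

MTP = {"method": "mtp", "num_speculative_tokens": 3}
# Assumption: only prompt_lookup_min=8 comes from the issue text; the other
# two values are placeholders taken from the PR #40738 reproducer.
NGRAM = {"method": "ngram", "num_speculative_tokens": 5,
         "prompt_lookup_max": 10, "prompt_lookup_min": 8}


def compact(obj: Dict) -> str:
    """Serialize a config dict as compact JSON for a CLI argument."""
    return json.dumps(obj, separators=(",", ":"))


def build(kv_dtype: str, spec: Dict, max_model_len: int = 125000,
          cudagraphs: bool = True) -> List[str]:
    """Assemble one `vllm serve` argument list."""
    cmd = BASE + [
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", kv_dtype,
        "--speculative-config", compact(spec),
    ]
    if not cudagraphs:
        cmd += ["--compilation-config", compact({"cudagraph_mode": "NONE"})]
    return cmd


WORKAROUNDS = {
    "no_cudagraph": build("turboquant_3bit_nc", MTP, cudagraphs=False),
    "ngram": build("turboquant_3bit_nc", NGRAM),
    # Assumption: 32768 is an illustrative value within the ~32-40K cap.
    "fp8_kv": build("fp8_e5m2", MTP, max_model_len=32768),
}

for name, cmd in WORKAROUNDS.items():
    print(f"# {name}\n{shlex.join(cmd)}\n")
```

Generating the variants from one base keeps an A/B comparison honest, since each command changes only the setting under test.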

Suggested investigation entry-point

Whether Qwen3_5MTP.forward correctly reads SSM/conv state from block[num_accepted-1] after spec-decode acceptance — same shape of bug #40738 fixed for ngram, possibly fixable with the same spec_decode_src_indices pattern adapted to the MTP path.

References

#40831 — broader bug class; closed for ngram path; this issue is the scoped MTP follow-up
#40875 — ngram prompt_lookup_min=8 config-only fix
#40738 — @tdoublep's GDN+ngram state recovery (scope-limited to ngram)
Genesis v7.13 tree — backport tree applied for this test

cc @Sandermage @tdoublep — per the handoff in #40831.

extent analysis

TL;DR

The issue can be mitigated by disabling CUDA graphs or switching to the ngram method with prompt_lookup_min=8, but a permanent fix requires adapting the spec_decode_src_indices pattern to the MTP path.

Guidance

Investigate whether Qwen3_5MTP.forward correctly reads SSM/conv state from block[num_accepted-1] after spec-decode acceptance, similar to the fix for the ngram path in #40738.
Disable CUDA graphs with --compilation-config '{"cudagraph_mode":"NONE"}' to restore correctness, at reduced throughput (~33 TPS in the report).
Switch to the ngram method with prompt_lookup_min=8 as a temporary workaround, as suggested in #40875.
Consider dropping TurboQuant entirely and using fp8_e5m2 KV, but be aware that this caps context at roughly 32-40K tokens on a 24 GB card.

Example

No code snippet is provided, as the issue requires a deeper understanding of the MTP path and the Qwen3_5MTP.forward method.

Notes

The issue is specific to the MTP path and the Qwen3_5MTP model class, which is distinct from the ngram path. The four backported PRs (#40738, #36138, #40783, and #39055) do not cover this code path.

Recommendation

Apply the workaround by disabling CUDA graphs or switching to the ngram method with prompt_lookup_min=8, as these are the most straightforward ways to mitigate the issue. A permanent fix will require adapting the spec_decode_src_indices pattern to the MTP path.
