vllm - ✅(Solved) Fix [Bug]: MTP × TurboQuant × CUDA graph capture produces degenerate output on Qwen3-Next hybrid (not closed by v7.13 ngram fix tree) [1 pull request, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#40880 (fetched 2026-04-26 05:06:14)
Comments: 2 · Participants: 2 · Timeline events: 25 · Reactions: 0
Timeline (top): subscribed ×11, mentioned ×10, commented ×2, closed ×1

MTP speculative decoding × TurboQuant KV cache × CUDA graph capture produces degenerate output on Qwen3-Next hybrid models. The four upstream PRs that fixed the ngram path (#40738 + #36138 + #40783 + #39055, plus the prompt_lookup_min=8 config trick from #40875) do not close this for MTP — the MTP forward path goes through a different proposer (EagleProposer configured with method="mtp") that the ngram-scoped fixes don't cover.

Filed at @Sandermage's explicit suggestion: "we did not test MTP at all in the v7.13 cycle... your data shows that assumption is wrong."

PR fix notes

PR #40738: [Bugfix] Fix GDN conv + SSM state corruption with ngram spec decode

Description (problem / solution / changelog)

Summary

Fix output corruption when using ngram speculative decoding with hybrid GDN models (e.g., Qwen3.5) in mamba_cache_mode="none".

After a spec decode step accepts N tokens, the next non-spec decode step must read SSM state from block N-1 and conv state from an offset position. Two bugs prevented this:

  1. num_accepted_tokens was not passed to SSM metadata builders on non-spec steps
  2. causal_conv1d_fn had no mechanism to offset conv state reads based on accepted tokens
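
To make the indexing concrete, here is a minimal sketch in plain Python. It is not vLLM's implementation: the function name and the data layout are hypothetical, and it assumes (following the summary above) that a speculative step leaves one state snapshot per processed token in consecutive blocks and that every sequence accepts at least one token.

```python
def state_read_positions(num_accepted_tokens):
    """Return (ssm_block_index, conv_offset) for each sequence.

    Hypothetical helper. Assumes block i holds the SSM state after the
    (i+1)-th token of the speculative step, so after accepting N tokens
    the valid state sits in block N-1 and the conv state read is shifted
    by N-1 positions.
    """
    positions = []
    for n in num_accepted_tokens:
        if n < 1:
            # Assumption of this sketch: at least one token is accepted.
            raise ValueError("num_accepted_tokens must be >= 1")
        positions.append((n - 1, n - 1))
    return positions


# Three sequences that accepted 1, 3 and 4 tokens respectively.
print(state_read_positions([1, 3, 4]))
```

A sequence that accepted a single token reads block 0 with no offset, which is why skipping the correction only shows up as corruption once more than one token is accepted.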

Changes

  • gdn_attn.py: Compute spec_decode_src_indices for SSM state correction; pad num_accepted_tokens with 1s for prefill sequences in mixed batches
  • gdn_linear_attn.py: Pre-copy SSM state from accepted block to block 0; pass num_accepted_tokens to conv kernels gated on whether correction is needed
  • causal_conv1d.py: Add IS_SPEC_DECODING path to _causal_conv1d_fwd_kernel that offsets conv state reads/writes by num_accepted_tokens - 1
  • gpu_model_runner.py: Pass num_accepted_tokens to GDN/Mamba2 builders on non-spec steps
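
The padding and pre-copy steps listed above can be sketched on toy data as follows. This is an illustration only: both helpers are hypothetical, states are plain Python lists rather than GPU tensors, and the sketch assumes decode sequences precede prefill sequences in the batch.

```python
def pad_num_accepted(num_accepted_decode, num_prefills):
    """Append 1 for each prefill sequence in a mixed batch.

    Prefill sequences had no speculative step, so 1 means "no offset".
    Assumes decode sequences come first (an assumption of this sketch).
    """
    return list(num_accepted_decode) + [1] * num_prefills


def precopy_accepted_state(ssm_blocks, num_accepted):
    """Copy each sequence's accepted state into block 0, in place.

    ssm_blocks[seq][block] is a stand-in for one SSM state snapshot.
    """
    for seq, n in enumerate(num_accepted):
        src = n - 1
        if src > 0:  # correction is only needed when more than 1 token was accepted
            ssm_blocks[seq][0] = ssm_blocks[seq][src]
    return ssm_blocks


blocks = [
    ["s0_t1", "s0_t2", "s0_t3"],  # decode sequence, accepted 3 tokens
    ["s1_t1", "s1_t2", "s1_t3"],  # decode sequence, accepted 1 token
    ["p0", None, None],           # prefill sequence
]
accepted = pad_num_accepted([3, 1], num_prefills=1)
precopy_accepted_state(blocks, accepted)
print(accepted, [row[0] for row in blocks])
```

After the copy, downstream code can keep reading block 0 unconditionally, which matches the intent of the pre-copy described for gdn_linear_attn.py.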

Test plan

  • Single-prompt: baseline vs ngram match token-for-token (Qwen3.5-0.8B)
  • Mixed-batch: short prompt matches baseline; long prompt generates coherent output
  • Kernel fix verified necessary: disabling conv offset causes regression
  • Existing GDN + spec decode CI tests
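
When a token-for-token comparison like the first item fails, it can help to know where the two runs first differ. The helper below is a small sketch for that purpose; its name is hypothetical and it is not part of the PR.

```python
def first_divergence(baseline_ids, candidate_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline_ids, candidate_ids)):
        if a != b:
            return i
    # One sequence is a strict prefix of the other: they diverge where
    # the shorter one ends.
    if len(baseline_ids) != len(candidate_ids):
        return min(len(baseline_ids), len(candidate_ids))
    return None


print(first_divergence([1, 2, 3], [1, 2, 3]))
print(first_divergence([1, 2, 3], [1, 9, 3]))
print(first_divergence([1, 2, 3], [1, 2]))
```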

Reproducers

Single-prompt:

from vllm import LLM, SamplingParams
MODEL, PROMPT = "Qwen/Qwen3.5-0.8B", "<code>\nclass Calculator:\n    def add(self, a, b):\n        return a + b\n</code>\n<update>\nAdd subtract and multiply methods\n</update>"
ARGS = dict(model=MODEL, trust_remote_code=True, enforce_eager=True, enable_chunked_prefill=True, max_model_len=4096)
SPEC = {"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 10, "prompt_lookup_min": 2}
# temperature=0 makes decoding greedy, so the two runs are directly comparable
S = SamplingParams(max_tokens=200, temperature=0)
# b: baseline token IDs without speculative decoding
b = list(LLM(**ARGS).generate([PROMPT], S)[0].outputs[0].token_ids)
# n: token IDs with ngram speculative decoding enabled
n = list(LLM(**ARGS, speculative_config=SPEC).generate([PROMPT], S)[0].outputs[0].token_ids)
print("PASS" if b == n else "FAIL")

Mixed-batch (short + long prompt, max_num_batched_tokens=64): see reproduce_gdn_ngram_mixed.py in the branch.

Fixes #39273

AI-assisted: Yes (Claude). Not duplicating any existing PR.

Changed files

  • vllm/model_executor/layers/mamba/gdn_linear_attn.py (modified, +27/-0)
  • vllm/model_executor/layers/mamba/ops/causal_conv1d.py (modified, +20/-2)
  • vllm/v1/attention/backends/gdn_attn.py (modified, +32/-2)
  • vllm/v1/worker/gpu_model_runner.py (modified, +11/-0)


Reproducer

  • Image: vllm/vllm-openai@sha256:9bba4628a3b943e0dd33caefb94b811569ba1e97bdf23bee19a265c31b947ccb (v0.19.2rc1.dev21+g893611813)
  • Hardware: 1× RTX 3090 (Ampere SM 8.6)
  • Model: Lorbus/Qwen3.6-27B-int4-AutoRound
  • Patches: Genesis v7.13 backports applied via Sandermage/genesis-vllm-patches@852b73f. The log confirms "Genesis Results: 25 applied, 16 skipped, 0 failed", including P60 (#40738 Phase 1) and P60b (#40738 Phase 2 Triton kernel offset).

vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
  --quantization auto_round --dtype float16 \
  --tensor-parallel-size 1 --max-model-len 125000 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype turboquant_3bit_nc \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  # cudagraph + torch.compile both ON (default — no overrides)

Test request:

import json, urllib.request
body = {
  "model": "qwen3.6-27b-autoround",
  "messages": [{"role": "user", "content": "What is the weather in Paris in Celsius? Use the tool."}],
  "tools": [{"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object", "properties": {
      "city": {"type": "string"}, "unit": {"type": "string"}}, "required": ["city","unit"]}}}],
  "tool_choice": "auto", "max_tokens": 256, "chat_template_kwargs": {"enable_thinking": False},
}
r = urllib.request.urlopen(urllib.request.Request(
  "http://localhost:8020/v1/chat/completions",
  data=json.dumps(body).encode(), headers={"content-type":"application/json"}))
print(json.loads(r.read())["choices"][0]["message"])

Expected: tool_calls=[{name=get_weather, args={"city":"Paris","unit":"Celsius"}}]

Actual:

tool_calls=[]
content="<tool_call>\n<function=...><parameter=...>...</parameter>...</parameter>...
        (cascading inline-text emission, never populates tool_calls[])"

The long-context needle test fails the same way: the model retrieves the first token of the secret, then the output ends:

expected: 'silver platypus 22'
got:      'silver '
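
Both symptoms above can be checked mechanically. The sketch below is a hypothetical pair of helpers, not something from the issue: the first takes the message dict printed by the test request and flags tool-call markup that leaked into content, the second flags output that stops partway through the expected string.

```python
def classify_tool_response(message):
    """Classify a chat completion message from the tool-call test."""
    tool_calls = message.get("tool_calls") or []
    content = message.get("content") or ""
    if tool_calls:
        return "ok"
    if "<tool_call>" in content:
        # Markup was emitted as inline text instead of populating tool_calls.
        return "leaked_tool_markup"
    return "no_tool_call"


def is_truncated(expected, got):
    """True when got is a strict prefix of expected."""
    return got != expected and expected.startswith(got)


broken = {"tool_calls": [], "content": "<tool_call>\n<function=..."}
print(classify_tool_response(broken))
print(is_truncated("silver platypus 22", "silver "))
```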

Why this is separate from the v7.13 ngram fix

  • The four backported PRs scope to the ngram path's GDN state recovery (gdn_attn.py + gdn_linear_attn.py). Genesis P60/P60b explicitly target the ngram-affected paths.
  • MTP runs through vllm/v1/spec_decode/eagle.py with method="mtp" and Qwen3_5MTP model class — distinct machinery from ngram's path.
  • Empirically: same config, same Genesis loaded, same image. Switching method from mtp to ngram + prompt_lookup_min=8 flips outcome from broken to clean. So the four backports are working; they just don't cover this code path.

Confirmed in #40831 probe ladder Test A vs Test C on the same rig.

Working configurations that avoid this bug

  1. --compilation-config '{"cudagraph_mode":"NONE"}' — keeps torch.compile inductor on, disables CUDA graphs. Restores correctness at ~33 TPS (vs ~85 broken).
  2. Switch to ngram + prompt_lookup_min=8 per #40875.
  3. Drop TurboQuant entirely (use fp8_e5m2 KV) — caps context at ~32-40K on a 24 GB card but unaffected.
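
As a rough sketch, the three options translate into the following flag sets, to be combined with the remaining arguments of the reproducer command. The helper is hypothetical, and several values are assumptions rather than facts from this issue: num_speculative_tokens=3 for ngram is simply carried over from the MTP reproducer, and option 3 is assumed to keep MTP enabled.

```python
import json
import shlex

MTP = {"method": "mtp", "num_speculative_tokens": 3}


def workaround_flags(option):
    """Return the vllm serve flags that differ for each workaround."""
    if option == "no_cudagraph":
        # Option 1: keep MTP and TurboQuant, disable CUDA graphs only.
        return ["--kv-cache-dtype", "turboquant_3bit_nc",
                "--speculative-config", json.dumps(MTP),
                "--compilation-config", json.dumps({"cudagraph_mode": "NONE"})]
    if option == "ngram":
        # Option 2: num_speculative_tokens here is an assumption.
        ngram = {"method": "ngram", "num_speculative_tokens": 3,
                 "prompt_lookup_min": 8}
        return ["--kv-cache-dtype", "turboquant_3bit_nc",
                "--speculative-config", json.dumps(ngram)]
    if option == "fp8_kv":
        # Option 3: --max-model-len must also be lowered to fit the card.
        return ["--kv-cache-dtype", "fp8_e5m2",
                "--speculative-config", json.dumps(MTP)]
    raise ValueError(f"unknown option: {option}")


for name in ("no_cudagraph", "ngram", "fp8_kv"):
    print(name, "->", shlex.join(workaround_flags(name)))
```

Using shlex.join keeps the JSON arguments correctly quoted when the printed line is pasted into a shell.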

Suggested investigation entry-point

Whether Qwen3_5MTP.forward correctly reads SSM/conv state from block[num_accepted-1] after spec-decode acceptance — same shape of bug #40738 fixed for ngram, possibly fixable with the same spec_decode_src_indices pattern adapted to the MTP path.

References

  • #40831 — broader bug class; closed for ngram path; this issue is the scoped MTP follow-up
  • #40875 — ngram prompt_lookup_min=8 config-only fix
  • #40738 — @tdoublep's GDN+ngram state recovery (scope-limited to ngram)
  • Genesis v7.13 tree — backport tree applied for this test

cc @Sandermage @tdoublep — per the handoff in #40831.

extent analysis

TL;DR

The issue can be mitigated by disabling CUDA graphs or switching to the ngram method with prompt_lookup_min=8, but a permanent fix requires adapting the spec_decode_src_indices pattern to the MTP path.

Guidance

  • Investigate whether Qwen3_5MTP.forward correctly reads SSM/conv state from block[num_accepted-1] after spec-decode acceptance, similar to the fix for the ngram path in #40738.
  • Try disabling CUDA graphs using --compilation-config '{"cudagraph_mode":"NONE"}' to restore correctness, at lower throughput (~33 TPS versus ~85 TPS in the broken configuration).
  • Switch to the ngram method with prompt_lookup_min=8 as a temporary workaround, as suggested in #40875.
  • Consider dropping TurboQuant entirely and using fp8_e5m2 KV, but be aware that this caps context at roughly 32-40K on a 24 GB card.

Example

No code snippet is provided, as the issue requires a deeper understanding of the MTP path and the Qwen3_5MTP.forward method.

Notes

The issue is specific to the MTP path and the Qwen3_5MTP model class, which is distinct from the ngram path. The four backported PRs (#40738, #36138, #40783, and #39055) do not cover this code path.

Recommendation

Apply the workaround by disabling CUDA graphs or switching to the ngram method with prompt_lookup_min=8, as these are the most straightforward ways to mitigate the issue. A permanent fix will require adapting the spec_decode_src_indices pattern to the MTP path.
