vllm - ✅(Solved) Fix [Bug]: Ngram speculative decoding produces corrupted output on hybrid GDN (Qwen3.5) models [1 pull requests, 2 comments, 1 participants]

vllm · 2026-04-08 06:28:15

GitHub stats

vllm-project/vllm#39273 • Fetched 2026-04-09 07:52:12

Author: bhaktatejas922
Participants: bhaktatejas922
Timeline (top): commented ×2, referenced ×2

Error Message

No crash or error — the model runs but produces corrupted output. The corruption pattern shows repeated/truncated fragments that degrade progressively, suggesting SSM state corruption rather than a sampling issue.

Root Cause

We traced this to how mamba_utils.postprocess_mamba handles token rejection during ngram speculative decoding on hybrid GDN models.

The problem (vllm/v1/worker/mamba_utils.py, lines ~242-243):

When ngram proposes N speculative tokens and some are rejected:

1. The forward pass runs on all N proposed tokens, advancing GDN SSM state by N steps
2. The rejection sampler correctly identifies which tokens were accepted (e.g., 2 of 4)
3. postprocess_mamba computes num_tokens_running_state and new_num_computed_tokens to determine which SSM state block to copy:

num_tokens_running_state = (
    num_computed_tokens + num_scheduled_tokens - num_draft_tokens
)
new_num_computed_tokens = num_tokens_running_state + num_accepted_tokens - 1

However, the SSM state has already been advanced through all N tokens during the forward pass, and there is no mechanism to roll the GDN state back to the position after only the accepted tokens. The state copy uses accept_token_bias to select which intermediate state to preserve, but this relies on the block-aligned state checkpoints being correct. They are not, because the SSM kernel (fused_recurrent) wrote states for all N tokens into contiguous block slots.
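
As a worked example of the two expressions quoted above, the sketch below plugs in hypothetical numbers. The function name and the values are illustrative assumptions, not taken from vLLM, and it assumes the scheduled tokens consist of one regular token plus the drafts. Only the two expressions themselves come from the issue.

```python
def running_state_positions(
    num_computed_tokens: int,
    num_scheduled_tokens: int,
    num_draft_tokens: int,
    num_accepted_tokens: int,
) -> tuple[int, int]:
    """Evaluate the two expressions quoted from postprocess_mamba."""
    num_tokens_running_state = (
        num_computed_tokens + num_scheduled_tokens - num_draft_tokens
    )
    new_num_computed_tokens = num_tokens_running_state + num_accepted_tokens - 1
    return num_tokens_running_state, new_num_computed_tokens


# Hypothetical step: 100 tokens already computed, 5 scheduled
# (assumed to be 1 regular token + 4 drafts), num_accepted_tokens = 2.
running, new_computed = running_state_positions(100, 5, 4, 2)
print(running, new_computed)  # 101 102
```

With these numbers the forward pass processed all 5 scheduled tokens, while the bookkeeping points at 102 computed tokens, which is the mismatch the analysis describes.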

Why MTP doesn't have this issue: MTP draft tokens are generated after the base model step completes, so the SSM state evolution and draft proposal are decoupled. With ngram, drafts are pre-computed from prompt history but the forward pass still speculatively evolves SSM state for all proposed tokens.

Key files involved:

vllm/v1/worker/mamba_utils.py — postprocess_mamba() state copy logic
vllm/v1/attention/backends/gdn_attn.py — GDN metadata passes num_accepted_tokens but no state rollback
vllm/v1/worker/gpu_model_runner.py — _update_states_after_model_execute()
vllm/model_executor/layers/mamba/abstract.py — num_speculative_blocks allocation

PR fix notes

PR #39463: [Bugfix] Fix ngram spec decode corrupted output on hybrid GDN models

Repository: vllm-project/vllm
Author: AjAnubolu
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39463

Description (problem / solution / changelog)

Summary

Closes #39273. When ngram spec decode accepts more than one token on a hybrid GDN model but the next step has no draft tokens, the non-spec path reads stale SSM state from slot 0 instead of the accepted state at the speculative offset. This PR copies the accepted state back via postprocess_mamba for mamba_cache_mode="none".
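
The stale-slot behavior described in the summary can be illustrated with a toy model. This is a sketch under loud assumptions: the integer recurrence, the slot layout (slot i holds the state after token i), and every function name below are hypothetical stand-ins, not vLLM code or the PR's implementation.

```python
def step(state: int, token: int) -> int:
    """Toy recurrence standing in for one SSM/GDN state update."""
    return (state * 31 + token) % 1_000_003


def spec_forward(slots: list[int], tokens: list[int]) -> None:
    """Process every scheduled token; slots[i] holds the state after token i."""
    state = slots[0]
    for i, token in enumerate(tokens):
        state = step(state, token)
        slots[i] = state


def copy_back_accepted(slots: list[int], num_accepted: int) -> None:
    """Move the state after the last accepted token into slot 0."""
    slots[0] = slots[num_accepted - 1]


tokens = [7, 11, 13, 17]  # 4 tokens run in one speculative step
num_accepted = 2          # only the first 2 survive verification

slots = [0, 0, 0, 0]      # slot 0 starts with the running state
spec_forward(slots, tokens)

stale = slots[0]          # what a later non-spec step would read without a fix
copy_back_accepted(slots, num_accepted)
fixed = slots[0]

expected = step(step(0, 7), 11)  # state after exactly the accepted tokens
print(stale == expected, fixed == expected)  # False True
```

In this toy, reading slot 0 without the copy-back yields the state after only one token, so every later step continues from the wrong state.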

Changed files

vllm/v1/worker/gpu_model_runner.py (modified, +21/-7)
vllm/v1/worker/mamba_utils.py (modified, +14/-0)

Raw issue body

Your current environment

vLLM version: 0.18.1rc1.dev43+gdebd6e768 (also reproducible on latest nightly 0.18.1rc1.dev236)
GPU: NVIDIA GH200 480GB
Model: Qwen3.5-9b (Qwen3.5 architecture, model_type: qwen3_5_text, hybrid GDN + full attention)

Model description

Qwen3.5-based model with hybrid architecture: 24 GatedDeltaNet (linear attention) layers + 8 full attention layers. FP8 quantized via compressed-tensors. Config includes layer_types: [linear_attention, linear_attention, linear_attention, full_attention, ...] repeating pattern.

How to reproduce the error

Without ngram (correct output):

# Start vLLM with no speculative decoding
vllm serve model-fp8 \
  --trust-remote-code \
  --enforce-eager \
  --enable-chunked-prefill \
  --max-model-len 131072 \
  --max-num-batched-tokens 131072 \
  --additional-config '{"gdn_prefill_backend": "triton"}'

# Test
curl -s http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "model-fp8",
  "prompt": "<code>\nclass Calculator:\n    def add(self, a, b):\n        return a + b\n</code>\n<update>\nAdd subtract and multiply methods\n</update>",
  "max_tokens": 300, "temperature": 0
}'

Output (correct):

class Calculator:
    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b

    def multiply(self, a, b):
        return a * b

With ngram (corrupted output):

vllm serve model-fp8 \
  --trust-remote-code \
  --enforce-eager \
  --enable-chunked-prefill \
  --max-model-len 131072 \
  --max-num-batched-tokens 131072 \
  --additional-config '{"gdn_prefill_backend": "triton"}' \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 64, "prompt_lookup_max": 10, "prompt_lookup_min": 2}'

Same prompt produces degenerate output with repeated fragments:

class Calculator:
    def add(self, a, b):
        return a + b

    def add(self, a, b):
        return a + b

    add(self, a, b):
        return a + b

, a, b):
        return a + b
...
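
To pinpoint where the corrupted output departs from the reference, a small comparison helper can be useful. The sketch below is an assumption-laden illustration: first_divergence is a hypothetical helper, and the two strings are truncated transcriptions of the outputs shown above, with whitespace assumed.

```python
def first_divergence(reference: str, candidate: str) -> int:
    """Return the index of the first differing character, or -1 if equal."""
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if a != b:
            return i
    if len(reference) != len(candidate):
        # One string is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return -1


# Truncated transcriptions of the outputs in the issue (whitespace assumed).
reference = (
    "class Calculator:\n"
    "    def add(self, a, b):\n"
    "        return a + b\n"
    "\n"
    "    def subtract(self, a, b):\n"
    "        return a - b\n"
)
corrupted = (
    "class Calculator:\n"
    "    def add(self, a, b):\n"
    "        return a + b\n"
    "\n"
    "    def add(self, a, b):\n"
    "        return a + b\n"
)

idx = first_divergence(reference, corrupted)
print(idx, repr(reference[idx:idx + 8]), repr(corrupted[idx:idx + 8]))
```

On these strings the outputs agree through the original add method and split at the start of the second method name, which fits the picture of the model re-emitting earlier content.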

Expected behavior

Ngram speculative decoding should produce identical output to non-speculative decoding (with temperature=0, output should be deterministic and match).
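
One way to check this expectation is to send the same greedy request to a baseline server and an ngram-enabled server and compare the text. The sketch below assumes both servers expose the OpenAI-compatible /v1/completions endpoint used in the reproduction. The function names are hypothetical, and any second port is an assumption, since the issue only shows port 8080.

```python
import json
import urllib.request

PROMPT = (
    "<code>\nclass Calculator:\n    def add(self, a, b):\n"
    "        return a + b\n</code>\n<update>\n"
    "Add subtract and multiply methods\n</update>"
)


def build_request(base_url: str, prompt: str = PROMPT) -> urllib.request.Request:
    """Build the same greedy completion request used in the reproduction."""
    payload = {
        "model": "model-fp8",
        "prompt": prompt,
        "max_tokens": 300,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_text(base_url: str) -> str:
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(build_request(base_url)) as response:
        return json.load(response)["choices"][0]["text"]


def outputs_match(baseline_url: str, ngram_url: str) -> bool:
    """True when both servers return identical text for the prompt."""
    return completion_text(baseline_url) == completion_text(ngram_url)
```

For example, outputs_match("http://localhost:8080", "http://localhost:8081") would compare a baseline server against an ngram server, assuming the second one was started on port 8081.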

Before submitting a new issue...

I have searched existing issues and confirmed this is not a duplicate
I have verified the issue persists on the latest nightly build
I have included reproduction steps and root cause analysis

extent analysis

TL;DR

The most likely fix involves modifying the postprocess_mamba function in vllm/v1/worker/mamba_utils.py to correctly handle the GDN SSM state rollback after token rejection during ngram speculative decoding.

Guidance

Review the postprocess_mamba function to understand how it handles token rejection and SSM state evolution.
Modify the num_tokens_running_state and new_num_computed_tokens calculations to account for the accepted tokens only.
Implement a mechanism to rollback or revert the GDN SSM state to the position after only the accepted tokens, potentially by using the accept_token_bias to select the correct intermediate state.
Verify the changes by testing the ngram speculative decoding with the modified postprocess_mamba function.

Example

# Modified postprocess_mamba function
def postprocess_mamba(...):
    # ...
    num_tokens_running_state = (
        num_computed_tokens + num_scheduled_tokens - num_draft_tokens
    )
    new_num_computed_tokens = num_tokens_running_state + num_accepted_tokens - 1
    
    # Rollback GDN SSM state to the position after accepted tokens
    gdn_state = rollback_gdn_state(gdn_state, num_accepted_tokens)
    
    # ...

Note: rollback_gdn_state is a hypothetical helper, not an existing vLLM function. It is left unimplemented here because writing it would require additional information about the GDN SSM state and how it evolves.

Notes

The provided root cause analysis suggests that the issue is specific to the hybrid GDN models and the ngram speculative decoding. The fix should be verified on the latest nightly build to ensure it resolves the issue without introducing new problems.

Recommendation

Until a fix lands, the most direct workaround is to patch postprocess_mamba locally so that it handles the GDN SSM state correctly after token rejection. PR #39463 was still open and unmerged when this page was fetched, so upgrading to a fixed release is not yet an option.
