vllm - 💡(How to fix) Fix GDN attention backend crashes with ngram speculative decoding on mixed decode batches [3 comments, 2 participants]

GitHub stats
vllm-project/vllm#38196 (fetched 2026-04-08 01:31:45)
Comments: 3 · Participants: 2 · Timeline events: 4 (commented ×3, subscribed ×1) · Reactions: 0

The GDN (Gated Delta Network) attention backend in vllm/v1/attention/backends/gdn_attn.py crashes with an AssertionError when ngram speculative decoding produces a batch containing both regular decode tokens and speculative decode tokens.

Error Message

File ".../vllm/v1/attention/backends/gdn_attn.py", line 310, in build
    assert not (num_decodes > 0 and num_spec_decodes > 0), (
AssertionError: num_decodes: 1, num_spec_decodes: 1

Root Cause

Line 310 in gdn_attn.py:

# Function code counted on either presency non-spec decode or spec decode,
# but not both.
assert not (num_decodes > 0 and num_spec_decodes > 0), (
    f"num_decodes: {num_decodes}, num_spec_decodes: {num_spec_decodes}"
)

When ngram speculative decoding rejects some tokens, vLLM creates a batch with both:

  • Regular decode tokens (from rejected/verified sequences)
  • Speculative decode tokens (new speculative proposals)

The GDN attention builder assumes these are mutually exclusive, but they're not when spec decoding has partial rejections.

The CUDA graph preparation code below the assertion also assumes mutual exclusivity — there are separate branches for num_decodes == 0 (spec-only) and num_spec_decodes == 0 (decode-only), with no handling for mixed batches.
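The condition that trips the assertion can be illustrated with a small standalone model. This is only a sketch, not vLLM's actual bookkeeping: it assumes a request counts as a speculative decode when it carries at least one draft token in the current step and as a regular decode otherwise. The names `BatchCounts`, `count_decode_kinds`, and `is_mixed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BatchCounts:
    num_decodes: int       # requests with no draft tokens in this step
    num_spec_decodes: int  # requests with at least one draft token


def count_decode_kinds(draft_tokens_per_request: list[int]) -> BatchCounts:
    """Count regular vs. speculative decode requests in one scheduling step."""
    num_spec = sum(1 for n in draft_tokens_per_request if n > 0)
    return BatchCounts(
        num_decodes=len(draft_tokens_per_request) - num_spec,
        num_spec_decodes=num_spec,
    )


def is_mixed(counts: BatchCounts) -> bool:
    """True when both kinds are present, i.e. the asserted-against case."""
    return counts.num_decodes > 0 and counts.num_spec_decodes > 0


print(is_mixed(count_decode_kinds([4, 4])))  # spec-only batch: False
print(is_mixed(count_decode_kinds([0, 4])))  # one request without drafts: True
```

Under this model, the reported `num_decodes: 1, num_spec_decodes: 1` corresponds to a two-request batch in which only one request has draft tokens.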

Environment

  • vLLM version: 0.17.2.dev0+g95c0f928c.d20260313 (nightly)
  • GPU: NVIDIA GH200 480GB
  • Model: Qwen3.5 9B (uses GDN linear attention)
  • Config: FP8 online quantization, ngram speculative decoding (128 tokens, prompt lookup)

Behavior

  • First 1-2 requests succeed (all spec tokens accepted → pure spec decode batches)
  • Subsequent requests crash when a token rejection creates a mixed batch
  • Engine dies permanently after the assertion failure

Steps to Reproduce

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-9B",  # Any Qwen3.5 model with GDN layers
    trust_remote_code=True,
    quantization="fp8",
    max_model_len=32768,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 128,
        "prompt_lookup_max": 12,
        "prompt_lookup_min": 2,
    },
)

# Run multiple requests — will crash after 1-2 succeed
for i in range(5):
    output = llm.generate("def hello():\n", SamplingParams(max_tokens=100))

Impact

This effectively makes speculative decoding unusable with Qwen3.5 models (and any other model that uses GDN attention layers). That is a significant performance regression, because ngram speculative decoding provides significant speedups.

Expected Behavior

The GDN attention builder should handle mixed batches where both num_decodes > 0 and num_spec_decodes > 0, similar to how the standard attention backends handle this case.
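One way to picture mixed-batch handling is to partition the batch into the two groups and build per-group metadata. The sketch below is a plain-Python illustration under stated assumptions, not the vLLM implementation: each regular decode contributes one query token, each speculative decode contributes one token plus its draft tokens, and `split_mixed_batch` with its dictionary keys is a hypothetical name chosen for this example.

```python
from itertools import accumulate


def split_mixed_batch(draft_tokens_per_request: list[int]) -> dict[str, list[int]]:
    """Partition request indices and compute per-group query start offsets."""
    drafts = draft_tokens_per_request
    non_spec_idx = [i for i, n in enumerate(drafts) if n == 0]
    spec_idx = [i for i, n in enumerate(drafts) if n > 0]

    # Regular decodes: exactly one query token per request.
    non_spec_starts = list(range(len(non_spec_idx) + 1))

    # Speculative decodes: one verified token plus the draft tokens.
    spec_lens = [1 + drafts[i] for i in spec_idx]
    spec_starts = [0, *accumulate(spec_lens)]

    return {
        "non_spec_idx": non_spec_idx,
        "non_spec_query_start_loc": non_spec_starts,
        "spec_idx": spec_idx,
        "spec_query_start_loc": spec_starts,
    }


layout = split_mixed_batch([0, 3, 0, 2])
print(layout)
```

For the batch `[0, 3, 0, 2]`, requests 0 and 2 land in the regular-decode group and requests 1 and 3 in the speculative group, with speculative query offsets `[0, 4, 7]`. Whether a real fix keeps the groups separate or merges them into one path is a design decision for the backend.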

extent analysis

Fix Plan

To fix the issue, we need to modify the GDN attention builder to handle mixed batches. Here are the steps:

  • Modify the build function in gdn_attn.py to remove the assertion and add a new branch to handle mixed batches.
  • Update the CUDA graph preparation code to handle the mixed batch case.

Example code:

# Remove the assertion that forbids mixed batches:
# assert not (num_decodes > 0 and num_spec_decodes > 0), (
#     f"num_decodes: {num_decodes}, num_spec_decodes: {num_spec_decodes}"
# )

# Inside build(): dispatch on the batch composition.
# The `pass` statements are placeholders so the sketch is valid Python.
if num_decodes > 0 and num_spec_decodes > 0:
    # Mixed batch: new branch, including CUDA graph preparation for this case
    pass
elif num_decodes > 0:
    # Decode-only batch (existing code)
    pass
elif num_spec_decodes > 0:
    # Spec-only batch (existing code)
    pass

Verification

To verify the fix, run the following code:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-9B",  # Any Qwen3.5 model with GDN layers
    trust_remote_code=True,
    quantization="fp8",
    max_model_len=32768,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 128,
        "prompt_lookup_max": 12,
        "prompt_lookup_min": 2,
    },
)

# Run multiple requests; with the fix applied, none of them should crash
for _ in range(5):
    outputs = llm.generate("def hello():\n", SamplingParams(max_tokens=100))
    print(outputs[0].outputs[0].text)

If the fix is successful, all five requests complete without an AssertionError and each prints generated text.

Extra Tips

  • Make sure to test the fix with different input sequences and speculative decoding configurations to ensure that it works correctly in all cases.
  • Consider adding additional logging or debugging statements to help diagnose any issues that may arise during the fix.
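To cover several speculative decoding configurations, a small grid of `speculative_config` dictionaries can be generated and fed to the verification script one at a time. This is a sketch: the keys are the ones used in the reproduction script, while the helper name `ngram_config_grid` and every value other than 128, 12, and 2 are assumptions picked for illustration.

```python
from itertools import product


def ngram_config_grid(
    num_speculative_tokens=(4, 32, 128),
    prompt_lookup_max=(4, 12),
    prompt_lookup_min=(1, 2),
) -> list[dict]:
    """Build ngram speculative_config dicts, skipping min > max combinations."""
    configs = []
    for k, hi, lo in product(
        num_speculative_tokens, prompt_lookup_max, prompt_lookup_min
    ):
        if lo > hi:
            continue  # a lookup window with min > max is not meaningful
        configs.append(
            {
                "method": "ngram",
                "num_speculative_tokens": k,
                "prompt_lookup_max": hi,
                "prompt_lookup_min": lo,
            }
        )
    return configs


for cfg in ngram_config_grid():
    print(cfg)
```

Each dictionary can be passed as `speculative_config=cfg` when constructing `LLM`. Creating a fresh engine per configuration is slow, so a smaller grid may be more practical on a single GPU.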
