vllm - 💡(How to fix) Fix GDN attention backend crashes with ngram speculative decoding on mixed decode batches [3 comments, 2 participants]

vllm · 2026-03-26 06:41:06

vllm-project/vllm#38196 • Fetched 2026-04-08 01:31:45

Author

bhaktatejas922

Participants

bhaktatejas922

NJX-njx

Timeline (top)

commented ×3, subscribed ×1

The GDN (Gated Delta Network) attention backend in vllm/v1/attention/backends/gdn_attn.py crashes with an AssertionError when ngram speculative decoding produces a batch containing both regular decode tokens and speculative decode tokens.

Error Message

File ".../vllm/v1/attention/backends/gdn_attn.py", line 310, in build
    assert not (num_decodes > 0 and num_spec_decodes > 0), (
AssertionError: num_decodes: 1, num_spec_decodes: 1

Root Cause

Line 310 in gdn_attn.py:

# Function code counted on either presency non-spec decode or spec decode,
# but not both.
assert not (num_decodes > 0 and num_spec_decodes > 0), (
    f"num_decodes: {num_decodes}, num_spec_decodes: {num_spec_decodes}"
)

When ngram speculative decoding rejects some tokens, vLLM creates a batch with both:

Regular decode tokens (from rejected/verified sequences)
Speculative decode tokens (new speculative proposals)

The GDN attention builder assumes these are mutually exclusive, but they're not when spec decoding has partial rejections.

The CUDA graph preparation code below the assertion also assumes mutual exclusivity — there are separate branches for num_decodes == 0 (spec-only) and num_spec_decodes == 0 (decode-only), with no handling for mixed batches.
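The failure mode can be reproduced in isolation. The sketch below is a toy stand-in for the check, not vLLM code: `check_batch` and `try_batch` are hypothetical names, and only the assertion condition and its message are taken from the traceback above.

```python
def check_batch(num_decodes: int, num_spec_decodes: int) -> str:
    """Toy stand-in for the exclusivity check quoted above (not vLLM code)."""
    # Same condition and message as the assertion in the traceback.
    assert not (num_decodes > 0 and num_spec_decodes > 0), (
        f"num_decodes: {num_decodes}, num_spec_decodes: {num_spec_decodes}"
    )
    if num_spec_decodes > 0:
        return "spec-only"
    if num_decodes > 0:
        return "decode-only"
    return "empty"


def try_batch(num_decodes: int, num_spec_decodes: int) -> str:
    """Run the check and report the outcome instead of crashing."""
    try:
        return check_batch(num_decodes, num_spec_decodes)
    except AssertionError as exc:
        return f"AssertionError: {exc}"


if __name__ == "__main__":
    # Two pure batches pass; the mixed batch trips the assertion.
    for counts in [(2, 0), (0, 2), (1, 1)]:
        print(counts, "->", try_batch(*counts))
```

Only the `(1, 1)` case fails, which matches the counts reported in the error message.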

Description

Environment

vLLM version: 0.17.2.dev0+g95c0f928c.d20260313 (nightly)
GPU: NVIDIA GH200 480GB
Model: Qwen3.5 9B (uses GDN linear attention)
Config: FP8 online quantization, ngram speculative decoding (128 tokens, prompt lookup)

Behavior

First 1-2 requests succeed (all spec tokens accepted → pure spec decode batches)
Subsequent requests crash when a token rejection creates a mixed batch
Engine dies permanently after the assertion failure
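One way to picture how a batch becomes mixed is a toy model of a single decode step. The issue does not show how vLLM assigns a request to either group, so the rule used here (a request with at least one draft token is a spec decode, a request with none is a regular decode) is an assumption made only for illustration, and `count_batch` and `is_mixed` are hypothetical helpers.

```python
def count_batch(draft_counts: list[int]) -> tuple[int, int]:
    """Return (num_decodes, num_spec_decodes) for one decode step.

    Assumption for this toy model only: a request with at least one draft
    token counts as a spec decode, a request with none as a regular decode.
    """
    num_spec_decodes = sum(1 for n in draft_counts if n > 0)
    num_decodes = len(draft_counts) - num_spec_decodes
    return num_decodes, num_spec_decodes


def is_mixed(draft_counts: list[int]) -> bool:
    """True when the step contains both kinds of request."""
    num_decodes, num_spec_decodes = count_batch(draft_counts)
    return num_decodes > 0 and num_spec_decodes > 0


if __name__ == "__main__":
    print(count_batch([3, 5]), is_mixed([3, 5]))  # every request has drafts
    print(count_batch([3, 0]), is_mixed([3, 0]))  # one request has none
```

Under that assumption, a step stays pure as long as every request is in the same state, and a single request falling into the other state is enough to produce the `num_decodes: 1, num_spec_decodes: 1` case.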

Steps to Reproduce

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-9B",  # Any Qwen3.5 model with GDN layers
    trust_remote_code=True,
    quantization="fp8",
    max_model_len=32768,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 128,
        "prompt_lookup_max": 12,
        "prompt_lookup_min": 2,
    },
)

# Run multiple requests — will crash after 1-2 succeed
for i in range(5):
    output = llm.generate("def hello():\n", SamplingParams(max_tokens=100))

Impact

This effectively makes speculative decoding unusable with Qwen3.5 models (and any model using GDN attention layers). That is a significant performance loss, because ngram speculative decoding provides significant speedups.

Expected Behavior

The GDN attention builder should handle mixed batches where both num_decodes > 0 and num_spec_decodes > 0, similar to how the standard attention backends handle this case.

extent analysis

Fix Plan

To fix the issue, we need to modify the GDN attention builder to handle mixed batches. Here are the steps:

Modify the build function in gdn_attn.py to remove the assertion and add a new branch to handle mixed batches.
Update the CUDA graph preparation code to handle the mixed batch case.

Example code:

# Remove the assertion
# assert not (num_decodes > 0 and num_spec_decodes > 0), (
#     f"num_decodes: {num_decodes}, num_spec_decodes: {num_spec_decodes}"
# )

# Add a new branch to handle mixed batches
if num_decodes > 0 and num_spec_decodes > 0:
    # Handle mixed batch case
    # ... (add CUDA graph preparation code for mixed batches)
    pass  # placeholder so the skeleton is valid Python
elif num_decodes > 0:
    # Handle decode-only batch case
    # ... (existing code)
    pass  # placeholder
elif num_spec_decodes > 0:
    # Handle spec-only batch case
    # ... (existing code)
    pass  # placeholder
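As a rough illustration of what the mixed-batch branch would need as input, the sketch below splits request indices into the two groups so each could be prepared separately. It is not the actual vLLM fix: `split_batch`, `plan_batch`, and the per-request `is_spec` flags are hypothetical, and the real builder works on tensors and CUDA graph state that this sketch ignores.

```python
def split_batch(is_spec: list[bool]) -> tuple[list[int], list[int]]:
    """Partition request indices into (regular decode, spec decode)."""
    decode_idx = [i for i, spec in enumerate(is_spec) if not spec]
    spec_idx = [i for i, spec in enumerate(is_spec) if spec]
    return decode_idx, spec_idx


def plan_batch(is_spec: list[bool]) -> dict[str, list[int]]:
    """Describe which groups need preparing, without assuming exclusivity."""
    decode_idx, spec_idx = split_batch(is_spec)
    plan: dict[str, list[int]] = {}
    if decode_idx:
        plan["decode"] = decode_idx
    if spec_idx:
        plan["spec_decode"] = spec_idx
    return plan


if __name__ == "__main__":
    print(plan_batch([False, True, True]))  # mixed: both groups present
    print(plan_batch([True, True]))         # spec-only
    print(plan_batch([False]))              # decode-only
```

Treating the two groups as independent, possibly empty, sets lets the pure cases fall out as special cases of the mixed one instead of needing three separate branches.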

Verification

To verify the fix, run the following code:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-9B",  # Any Qwen3.5 model with GDN layers
    trust_remote_code=True,
    quantization="fp8",
    max_model_len=32768,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 128,
        "prompt_lookup_max": 12,
        "prompt_lookup_min": 2,
    },
)

# Run multiple requests; none of them should crash
for i in range(5):
    outputs = llm.generate("def hello():\n", SamplingParams(max_tokens=100))
    print(i, repr(outputs[0].outputs[0].text))

If the fix is successful, all five requests complete without an AssertionError and each prints its generated text.

Extra Tips

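Test the fix with varied prompts and speculative decoding settings (for example, different num_speculative_tokens, prompt_lookup_min, and prompt_lookup_max values) so that decode-only, spec-only, and mixed batches are all exercised.
Consider logging num_decodes and num_spec_decodes for each batch while developing the fix, so that any remaining failure can be tied to a specific batch composition.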
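A minimal sketch of such a debugging aid, using only the standard `logging` module. The function name `describe_batch` and the logger name `gdn_batch_debug` are hypothetical, and where to call it inside the builder is left open.

```python
import logging

logger = logging.getLogger("gdn_batch_debug")  # hypothetical logger name


def describe_batch(num_decodes: int, num_spec_decodes: int) -> str:
    """Classify a batch by its counts and log the result."""
    if num_decodes > 0 and num_spec_decodes > 0:
        kind = "mixed"
    elif num_spec_decodes > 0:
        kind = "spec-only"
    elif num_decodes > 0:
        kind = "decode-only"
    else:
        kind = "empty"
    # Mixed batches are the case that triggered the crash, so log them louder.
    level = logging.WARNING if kind == "mixed" else logging.DEBUG
    logger.log(
        level,
        "batch kind=%s num_decodes=%d num_spec_decodes=%d",
        kind, num_decodes, num_spec_decodes,
    )
    return kind


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    for counts in [(1, 0), (0, 1), (1, 1)]:
        describe_batch(*counts)
```

Logging mixed batches at WARNING keeps them visible even when DEBUG output is disabled.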
