vllm - ✅(Solved) Fix [Bug]: GDN attention backend crashes with mixed decode/spec_decode batch when serving Qwen3.5 family models with MTP [1 pull requests, 3 comments, 4 participants]

GitHub stats
vllm-project/vllm#36917 · Fetched 2026-04-08 00:43:38
Comments: 3 · Participants: 4 · Timeline: 6 · Reactions: 0
Timeline (top): commented ×3, cross-referenced ×1, labeled ×1, subscribed ×1

Error Message

(Worker_TP4 pid=826) AssertionError: num_decodes: 1, num_spec_decodes: 4

(Worker_TP4 pid=826) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3671, in execute_model
(Worker_TP4 pid=826)     self._build_attention_metadata(
(Worker_TP4 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2056, in _build_attention_metadata
(Worker_TP4 pid=826)     _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(Worker_TP4 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2007, in _build_attn_group_metadata
(Worker_TP4 pid=826)     attn_metadata_i = builder.build(
(Worker_TP4 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/gdn_attn.py", line 310, in build
(Worker_TP4 pid=826)     assert not (num_decodes > 0 and num_spec_decodes > 0), (

Root Cause

The assertion at vllm/v1/attention/backends/gdn_attn.py:310 fails:

assert not (num_decodes > 0 and num_spec_decodes > 0), (
    f"num_decodes: {num_decodes}, num_spec_decodes: {num_spec_decodes}"
)

The GDN attention backend does not support mixed batches containing both regular decode tokens and speculative decode tokens. However, the V1 scheduler produces exactly this kind of heterogeneous batch under certain conditions.

Fix Action

PR fix notes

PR #36918: [Bugfix][Core] Fix gdn kernel mixed batch spec decode crash

Description (problem / solution / changelog)

Purpose

Fix the GDN attention backend crash with mixed decode/spec_decode batches (this addresses issue #36917).

The V1 scheduler can produce batches containing both regular decode and speculative decode requests. This happens when concurrent requests approach max_model_len at different times: the scheduler correctly disables spec decode per request (clipping num_new_tokens to 1) but doesn't enforce batch-level uniformity. The GDN attention backend (gdn_attn.py:310) asserts that this can't happen and crashes:

AssertionError: num_decodes: 1, num_spec_decodes: 7

This kills the EngineCore and all in-flight requests. Affects any model with GatedDeltaNet layers (so the whole Qwen3.5 family) when MTP speculative decoding is enabled.

Fix: After the running-requests scheduling loop in scheduler.py, detect mixed batches and disable spec decode for all requests in that step. The cost is reasonable: one step without speculative decoding, only when a request is within num_spec_tokens of max_model_len.
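
A minimal sketch of the batch-level check described above, not the code from PR #36918. It assumes the two dictionaries shown in the scheduler dump (num_scheduled_tokens and scheduled_spec_decode_tokens, keyed by request ID). The function name strip_spec_tokens_if_mixed is hypothetical, and for simplicity every scheduled request is treated as a decode (prefill requests are not handled).

```python
def strip_spec_tokens_if_mixed(
    num_scheduled_tokens: dict[str, int],
    scheduled_spec_decode_tokens: dict[str, list[int]],
) -> bool:
    """Make a decode batch uniform by dropping draft tokens when it is mixed.

    Hypothetical helper: mutates both dicts in place and returns True if the
    batch contained both regular-decode and spec-decode requests.
    """
    # Requests that carry at least one draft token this step.
    spec_ids = {rid for rid, toks in scheduled_spec_decode_tokens.items() if toks}
    # Everything else is assumed to be a regular decode.
    plain_ids = set(num_scheduled_tokens) - spec_ids
    if not spec_ids or not plain_ids:
        return False  # already uniform, nothing to do

    # Remove the draft tokens from the per-request token counts ...
    for rid in spec_ids:
        num_scheduled_tokens[rid] -= len(scheduled_spec_decode_tokens[rid])
    # ... and disable spec decode for the whole step.
    scheduled_spec_decode_tokens.clear()
    return True
```

Applied to the batch from the issue, this would leave every request with one scheduled token and an empty scheduled_spec_decode_tokens, which is the uniform state the GDN backend expects.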

Test Plan

pytest tests/v1/core/test_scheduler.py::test_mixed_batch_disables_spec_decode -xvs

The unit test simulates two concurrent requests: one approaches max_model_len (clipped to 1 token, no spec decode) while the other still has spec decode active, which creates the mixed batch that triggers the crash. The test verifies that the fix strips spec decode tokens from the entire batch.

Test Result

New test passes with fix, fails without:

tests/v1/core/test_scheduler.py::test_mixed_batch_disables_spec_decode PASSED

All 12 existing spec-decode-related scheduler tests continue to pass:

=============== 12 passed, 83 deselected in 13.84s ================

Changed files

  • tests/v1/core/test_scheduler.py (modified, +109/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +23/-0)


Your current environment

The output of python collect_env.py

Key info: vLLM 0.17.1, PyTorch 2.10.0+cu129, GPU: NVIDIA GB10, OS: Ubuntu 22.04 aarch64.

🐛 Describe the bug

(if it helps) Environment where we first spotted the bug

  • vLLM version: v0.17.0rc1.dev164+gfff3711a2
  • GPU: NVIDIA B200 (DGX B200)
  • Model: Qwen/Qwen3.5-397B-A17B
  • TP: 8
  • Docker: yes (container wsd-qwen3.5-397b-a17b-vllm)
  • OS: Linux

When serving Qwen3.5-397B-A17B with the recommended MTP speculative decoding config from vLLM recipes (num_speculative_tokens=1), the GDN attention backend crashes with a fatal AssertionError during concurrent request serving. The engine dies and all in-flight requests fail with EngineDeadError. I could also reproduce this on my DGX Spark with smaller models from the Qwen3.5 family (see gist).

Root cause

The assertion at vllm/v1/attention/backends/gdn_attn.py:310 fails:

assert not (num_decodes > 0 and num_spec_decodes > 0), (
    f"num_decodes: {num_decodes}, num_spec_decodes: {num_spec_decodes}"
)

The GDN attention backend does not support mixed batches containing both regular decode tokens and speculative decode tokens. However, the V1 scheduler produces exactly this kind of heterogeneous batch under certain conditions.

How to trigger

The crash occurs when the scheduler creates a batch where at least one request is in regular decode mode while other requests are in speculative decode mode. In my case, this happens when a request approaches max_seq_len (262144): one request had num_computed_tokens=262142 and was scheduled for 1 regular decode token, while 4 other requests were scheduled with scheduled_spec_decode_tokens: [-1].

From the scheduler output dump:

num_scheduled_tokens={
    chatcmpl-930d910693e2b9b3: 1,    # <-- regular decode (1 token)
    chatcmpl-988529689500aeab: 2,     # <-- spec decode (1 + 1 MTP)
    chatcmpl-a4414e85c2ae3487: 2,     # <-- spec decode
    chatcmpl-b88f8e337b3083b1: 2,     # <-- spec decode
    chatcmpl-97d27dc64ac3ad97: 2      # <-- spec decode
}
scheduled_spec_decode_tokens={
    chatcmpl-97d27dc64ac3ad97: [-1],
    chatcmpl-a4414e85c2ae3487: [-1],
    chatcmpl-988529689500aeab: [-1],
    chatcmpl-b88f8e337b3083b1: [-1]
}
# Note: chatcmpl-930d910693e2b9b3 is NOT in scheduled_spec_decode_tokens
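
To make the link between this dump and the error message concrete, here is a small sketch that classifies the requests the same way the assertion message reports them. It only illustrates the counting; the real backend derives num_decodes and num_spec_decodes from its attention metadata, not from these dictionaries, and the helper name count_decode_kinds is hypothetical.

```python
def count_decode_kinds(num_scheduled_tokens, scheduled_spec_decode_tokens):
    """Count regular-decode and spec-decode requests in one scheduled batch."""
    # A request counts as spec decode if it has draft tokens scheduled.
    num_spec_decodes = sum(
        1 for rid in num_scheduled_tokens if scheduled_spec_decode_tokens.get(rid)
    )
    num_decodes = len(num_scheduled_tokens) - num_spec_decodes
    return num_decodes, num_spec_decodes


# Values copied from the scheduler output dump in the issue.
batch_tokens = {
    "chatcmpl-930d910693e2b9b3": 1,
    "chatcmpl-988529689500aeab": 2,
    "chatcmpl-a4414e85c2ae3487": 2,
    "chatcmpl-b88f8e337b3083b1": 2,
    "chatcmpl-97d27dc64ac3ad97": 2,
}
batch_spec = {
    "chatcmpl-97d27dc64ac3ad97": [-1],
    "chatcmpl-a4414e85c2ae3487": [-1],
    "chatcmpl-988529689500aeab": [-1],
    "chatcmpl-b88f8e337b3083b1": [-1],
}

num_decodes, num_spec_decodes = count_decode_kinds(batch_tokens, batch_spec)
is_mixed = num_decodes > 0 and num_spec_decodes > 0
print(f"num_decodes: {num_decodes}, num_spec_decodes: {num_spec_decodes}, mixed: {is_mixed}")
# prints "num_decodes: 1, num_spec_decodes: 4, mixed: True"
```

The counts match the AssertionError text in the stack trace (num_decodes: 1, num_spec_decodes: 4).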

Request chatcmpl-930d910693e2b9b3 has num_computed_tokens=262142, only 2 tokens away from max_seq_len=262144. With MTP num_spec_tokens=1, a speculative step would need to allocate 2 tokens (1 regular + 1 speculative), landing at exactly 262144. The verification of the speculative token could require going to 262145, which exceeds the limit. So the scheduler correctly disables spec decode for this individual request because there isn't enough headroom for producing these tokens.
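
The headroom arithmetic can be checked with a few lines. This sketch assumes the per-request clip has the form min(num_new_tokens, max_model_len - 1 - num_computed_tokens); that form reproduces the numbers in the dump, but the exact expression in scheduler.py may differ. The function name clip_new_tokens and the second request's length (1000) are made up for illustration.

```python
def clip_new_tokens(num_computed_tokens, num_new_tokens, max_model_len):
    """Assumed form of the per-request clip near the context limit."""
    return min(num_new_tokens, max_model_len - 1 - num_computed_tokens)


max_model_len = 262144
num_spec_tokens = 1
wanted = 1 + num_spec_tokens  # 1 regular token + 1 MTP draft token

# The request from the dump: 2 tokens away from the limit.
near_limit = clip_new_tokens(262142, wanted, max_model_len)
# A hypothetical request with plenty of headroom.
far_from_limit = clip_new_tokens(1000, wanted, max_model_len)

print(near_limit, far_from_limit)  # prints "1 2"
```

A result of 1 leaves no room for a draft token, so that request falls back to regular decode while the others keep their draft token.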

TL;DR: The scheduler doesn't account for the GDN attention backend's constraint that all requests in a batch must use the same decode mode. It produces a valid per-request schedule but an invalid batch-level state.

Expected behavior

The scheduler should not produce batches that violate the attention backend's invariants. Either:

  1. The scheduler should ensure all requests in a batch use the same decode mode when the GDN backend is active (split into separate batches or disable spec decode for all requests in the batch when any request can't use it), or
  2. The GDN attention backend should handle mixed decode/spec_decode batches gracefully.
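
The "split into separate batches" variant of option 1 could look roughly like the sketch below. It only shows the partitioning step on the dictionaries from the scheduler dump; how vLLM would actually execute two sub-batches is outside this sketch, and split_by_decode_mode is a hypothetical name.

```python
def split_by_decode_mode(num_scheduled_tokens, scheduled_spec_decode_tokens):
    """Partition one scheduled batch into uniform sub-batches.

    Returns a list with the regular-decode sub-batch first (if any),
    followed by the spec-decode sub-batch (if any).
    """
    plain, spec = {}, {}
    for rid, num_tokens in num_scheduled_tokens.items():
        # Requests with draft tokens go to the spec-decode sub-batch.
        target = spec if scheduled_spec_decode_tokens.get(rid) else plain
        target[rid] = num_tokens
    return [sub_batch for sub_batch in (plain, spec) if sub_batch]
```

A uniform batch comes back as a single sub-batch, so the split would only cost an extra step when the batch is actually mixed.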

How to reproduce

Run the two files from the following gist: https://gist.github.com/lulmer/8e8cb558c430bbe70f749aeea784d02f

Run ./trigger_bug_init.sh, then ./trigger_bug.sh

Suggested unit test approach

A faster way to reproduce would be a unit test that directly constructs a SchedulerOutput with mixed decode/spec_decode requests and passes it to the GDN attention metadata builder, or a test that sets max_seq_len to a very small value (e.g., 128) so the boundary is reached quickly (see gist).
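
To see why a small max_seq_len reaches the boundary quickly, here is a toy simulation of two decoding requests. It does not call vLLM. It assumes the clip min(1 + num_spec_tokens, max_model_len - 1 - num_computed_tokens), assumes every draft token is accepted, and uses made-up starting lengths; first_mixed_step is a hypothetical helper.

```python
def first_mixed_step(start_lengths, max_model_len=128, num_spec_tokens=1):
    """Return (step, scheduled) for the first mixed batch, or None if none occurs."""
    computed = dict(start_lengths)
    step = 0
    while computed:
        # Assumed per-request clip near the context limit.
        scheduled = {
            rid: min(1 + num_spec_tokens, max_model_len - 1 - n)
            for rid, n in computed.items()
        }
        # More than one scheduled token means the request carries draft tokens.
        modes = {n > 1 for n in scheduled.values()}
        if len(modes) == 2:
            return step, scheduled
        # Toy assumption: all scheduled tokens are accepted.
        for rid, n in scheduled.items():
            computed[rid] += n
        # Drop requests that have no headroom left.
        computed = {rid: n for rid, n in computed.items() if n < max_model_len - 1}
        step += 1
    return None


print(first_mixed_step({"a": 120, "b": 100}))  # prints "(3, {'a': 1, 'b': 2})"
print(first_mixed_step({"a": 121, "b": 101}))  # prints "None"
```

In this toy model the mixed batch appears only when a request lands exactly one token short of its limit, which is consistent with the crash being intermittent under real traffic, where the number of accepted draft tokens varies from step to step.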

Relevant config from the engine dump

model='Qwen/Qwen3.5-397B-A17B'
speculative_config=SpeculativeConfig(method='mtp', model='...', num_spec_tokens=1)
dtype=torch.bfloat16
max_seq_len=262144
tensor_parallel_size=8
enable_chunked_prefill=True
enable_prefix_caching=False

Full stack trace

(Worker_TP4 pid=826) AssertionError: num_decodes: 1, num_spec_decodes: 4

(Worker_TP4 pid=826) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3671, in execute_model
(Worker_TP4 pid=826)     self._build_attention_metadata(
(Worker_TP4 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2056, in _build_attention_metadata
(Worker_TP4 pid=826)     _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(Worker_TP4 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2007, in _build_attn_group_metadata
(Worker_TP4 pid=826)     attn_metadata_i = builder.build(
(Worker_TP4 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/gdn_attn.py", line 310, in build
(Worker_TP4 pid=826)     assert not (num_decodes > 0 and num_spec_decodes > 0), (

All 8 TP workers hit the same assertion simultaneously. The EngineCore then dies with EngineDeadError, killing all in-flight requests.

Current Workaround

Disable MTP speculative decoding entirely:

vllm serve Qwen/Qwen3.5-397B-A17B \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 0}' \
    ...

Related issues

  • #24660 : different GDN + MTP bug (cudagraph padding IndexError), but same general area

extent analysis

Fix Plan

One way to resolve the issue is to make the scheduler ensure that all requests in a batch use the same decode mode when the GDN backend is active. The plan below is illustrative: decode_mode is a hypothetical attribute, whereas PR #36918 applies the same idea by disabling spec decode for the whole step directly in scheduler.py. The steps are:

  • Add a decode_mode attribute to the SchedulerOutput class that indicates whether the batch uses regular decode or speculative decode.
  • In the schedule method, check whether the GDN backend is active and whether any request in the batch cannot use speculative decode. If so, set decode_mode to regular decode for all requests in the batch.
  • In the build_attention_metadata method, read the batch's decode_mode and build metadata for that single mode, so every request in the batch is handled the same way.

Example code:

from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical stand-in for a scheduled request.
    request_id: str
    can_use_speculative_decode: bool


@dataclass
class SchedulerOutput:
    # Illustrative only: decode_mode is a hypothetical attribute.
    requests: list
    decode_mode: str


def schedule(requests, gdn_backend_active):
    # With the GDN backend, fall back to regular decode for the whole batch
    # as soon as one request cannot use speculative decode.
    if gdn_backend_active and any(
        not request.can_use_speculative_decode for request in requests
    ):
        decode_mode = 'regular'
    else:
        decode_mode = 'speculative'

    return SchedulerOutput(requests, decode_mode)


def build_attention_metadata(scheduler_output):
    # Dispatch on the batch-wide decode mode; metadata construction is elided.
    if scheduler_output.decode_mode == 'regular':
        # Build attention metadata for regular decode
        pass
    elif scheduler_output.decode_mode == 'speculative':
        # Build attention metadata for speculative decode
        pass
    else:
        raise ValueError('Invalid decode mode')

Verification

To verify that the fix worked, run the trigger_bug.sh script and check that the engine no longer crashes with an AssertionError. Additionally, you can add logging statements to the build_attention_metadata method to verify that the correct decode mode is being used for each batch.

Extra Tips

  • Make sure to test the fix with different scenarios, including batches with mixed decode modes and batches with only regular or speculative decode requests.
  • Consider adding a unit test to verify that the SchedulerOutput class is correctly setting the decode_mode attribute based on the requests in the batch.
  • If you encounter any issues with the fix, try debugging the schedule method to ensure that it is correctly determining the decode mode for each batch.
