vllm - 💡(How to fix) Fix [Bug]: MTP + FULL_AND_PIECEWISE cudagraph deadlocks at HT batched-decode when bonus-token-only forward shape is scheduled

StepCodex · 2026-05-11T03:54:35Z


With vLLM v0.20.1 + cudagraph_mode=FULL_AND_PIECEWISE + MTP (num_speculative_tokens=1), the EngineCore deterministically deadlocks during batched-decode serving whenever the scheduler issues a step with ~15-18 requests all in scheduled_spec_decode_tokens=[-1] (MTP bonus-token-only mode).

The engine hangs at shm_broadcast.acquire_read._spin_condition.wait (engine waiting for worker reply); the TP workers go silent (no log output, no crash). After the dequeue timeout (~4 min), EngineDeadError fires.

The deadlock does not reproduce with cudagraph_mode=FULL_DECODE_ONLY (FDO), nor with single-user (c=1) MTP=2, nor with vllm bench serve as the client (which keeps all concurrent reqs hitting bonus-only at the same scheduler step).

Environment

- vLLM: v0.20.1 (vllm/vllm-openai:v0.20.1)
- Model: DeepSeek-V4-Flash (FP8 + MXFP4 MoE, custom model_type=deepseek_v4)
- Hardware: 4× NVIDIA H20 (Hopper, 96 GB HBM3), TP=4

Reproducer

Server:

vllm serve <model_path> \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8 --block-size 256 \
    --max-model-len 16384 --max-num-seqs 512 --max-num-batched-tokens 16384 \
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Client (SGLang bench_serving.py, any recent version):

python3 sglang/python/sglang/bench_serving.py \
    --backend vllm --base-url http://localhost:8000 \
    --model <model_path> --tokenizer <model_path> \
    --dataset-name random \
    --random-input-len 8192 --random-output-len 1024 --random-range-ratio 0.0 \
    --num-prompts 512 --max-concurrency 128

vllm bench serve with the same args does NOT reproduce the deadlock.

Observed log signature

INFO Engine 000: ... Running: 87 reqs, Avg generation throughput: 2389 tok/s
INFO Engine 000: ... Running: 15 reqs, Avg generation throughput: 2389 tok/s
INFO Engine 000: ... Running: 15 reqs, Avg generation throughput: 0.0 tok/s     <-- STUCK
INFO (EngineCore) [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
... (repeats every 60s for ~4 min) ...
ERROR (EngineCore) [core.py:1138] EngineCore encountered a fatal error.
Traceback:
  File ".../v1/executor/multiproc_executor.py", line 386, in get_response
    status, result = mq.dequeue(timeout=dequeue_timeout)
  File ".../device_communicators/shm_broadcast.py", line 755, in dequeue
    with self.acquire_read(timeout, indefinite) as buf:
  File ".../device_communicators/shm_broadcast.py", line 674, in acquire_read
    self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
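
The stall signature above (requests still running, generation throughput at zero) can be detected mechanically. The following is a minimal sketch, assuming the stats lines contain the `Running: N reqs` and `Avg generation throughput: X tok` fragments shown in the abbreviated log; `find_stall` is a hypothetical helper, not part of vLLM, and real log lines may carry additional fields.

```python
import re

# Fragments taken from the abbreviated log above; matched independently so
# that field order and the exact unit suffix do not matter.
RUNNING_RE = re.compile(r"Running: (\d+) reqs")
THROUGHPUT_RE = re.compile(r"Avg generation throughput: ([\d.]+) tok")


def find_stall(lines):
    """Return (line_number, running_reqs) for the first stats line that
    reports running requests but zero generation throughput, else None."""
    for lineno, line in enumerate(lines, start=1):
        running = RUNNING_RE.search(line)
        throughput = THROUGHPUT_RE.search(line)
        if not (running and throughput):
            continue  # not a stats line
        if int(running.group(1)) > 0 and float(throughput.group(1)) == 0.0:
            return lineno, int(running.group(1))
    return None


sample = [
    "INFO Engine 000: ... Running: 87 reqs, Avg generation throughput: 2389 tok/s",
    "INFO Engine 000: ... Running: 15 reqs, Avg generation throughput: 2389 tok/s",
    "INFO Engine 000: ... Running: 15 reqs, Avg generation throughput: 0.0 tok/s",
]
print(find_stall(sample))
```

A single zero-throughput line can also appear briefly during normal operation, so in practice it is probably worth requiring several consecutive matches before treating the engine as stuck.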

Scheduler dump at the stuck step (from dump_input.py):

SchedulerOutput(
    num_running_reqs=15,
    num_scheduled_tokens={...: 2, ... (15 entries, all 2)},
    total_num_scheduled_tokens=30,
    scheduled_spec_decode_tokens={
        # all 15 reqs:
        cmpl-...: [-1],
        cmpl-...: [-1],
        ...
    },
    num_output_tokens=[935, 901, 871, 841, 813, 805, 799, 791, 763, 763, 731, 725, 723, 721, 712],
    ...
)

Every one of the 15 stuck requests is in MTP bonus-token-only mode (scheduled_spec_decode_tokens=[-1]). The scheduler still reports 2 tokens per request (total_num_scheduled_tokens=30), but the effective forward shape is 15 reqs × 1 query token (= 15 total), as opposed to the normal-MTP shape of N reqs × 2 query tokens.
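
To check other dumps for the same condition, the "all requests bonus-only" test can be written down directly. This is a sketch over a plain dict shaped like the `scheduled_spec_decode_tokens` field in the dump; `is_bonus_only_step` and the request IDs are hypothetical, and treating `[-1]` as "bonus-only" follows this report's interpretation.

```python
def is_bonus_only_step(scheduled_spec_decode_tokens):
    """True when the step has at least one request and every request's
    speculative-token list is exactly [-1], as in the stuck-step dump."""
    return bool(scheduled_spec_decode_tokens) and all(
        tokens == [-1] for tokens in scheduled_spec_decode_tokens.values()
    )


# Hypothetical request IDs; the real ones are elided in the dump.
stuck_step = {f"cmpl-{i}": [-1] for i in range(15)}
mixed_step = {"cmpl-0": [-1], "cmpl-1": [1234]}

print(is_bonus_only_step(stuck_step))  # all 15 requests carry [-1]
print(is_bonus_only_step(mixed_step))  # one request has a real draft token
```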

Empirical pattern

Same server config across 5 client runs:

| Client | range_ratio | n | Result |
|---|---|---|---|
| vllm bench | n/a (fixed 1024) | 512 | 512/512 ✓ |
| sglang | 0.0 (random) | 512 | 0/512 ✗ deadlock at request 0 |
| sglang | 0.0 (random) | 512 | 497/512 (97 %) ⚠ tail deadlock, 15 stuck |
| sglang | 1.0 (fixed 1024) | 512 | 497/512 ⚠ same; refutes "varying output_lens triggers" |
| sglang | 0.0 | 2048 | 2030/2048 (99.1 %) ⚠ 18 stuck, same pattern at scale |
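
The stuck counts and percentages for the tail-deadlock runs follow directly from the completed/total figures. A small sketch of the arithmetic, with `summarize` as a hypothetical helper:

```python
def summarize(num_prompts, completed):
    """Return (stuck_requests, success_percent rounded to one decimal)."""
    stuck = num_prompts - completed
    percent = round(100.0 * completed / num_prompts, 1)
    return stuck, percent


# (n, completed) pairs for the two distinct tail-deadlock outcomes.
for n, completed in [(512, 497), (2048, 2030)]:
    stuck, percent = summarize(n, completed)
    print(f"n={n}: {completed}/{n} = {percent}% success, {stuck} stuck")
```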

Deadlock is independent of:

- output_lens distribution (random vs fixed)
- num_prompts (n=512 → 15 stuck, n=2048 → 18 stuck)
- whether stuck requests are near OSL boundary (n=2048 stuck reqs have num_output_tokens 712-935, well below OSL=1024)

Deadlock is specific to:

- cudagraph_mode=FULL_AND_PIECEWISE (FDO works)
- MTP enabled (num_speculative_tokens ≥ 1)
- Batched decode with >1 req in bonus-only mode at the same step (LL c=1 works fine)
- Client submission pattern that produces simultaneous-but-not-all-128 bonus-only transitions

Working hypothesis

cudagraph_capture_sizes = [1, 2, 4, 8, 16, 24, ...]. Graphs are captured during warmup using MTP-normal conditions where each request contributes 2 query tokens. The captured size-16 graph expects total_num_scheduled_tokens=32 and the corresponding NCCL collective sizes.

When the scheduler issues a step with 15 reqs all in bonus-only mode, total_num_scheduled_tokens=30 (or 15 effective query tokens). vLLM pads up to the nearest captured size (16), but the cudagraph's recorded collective ops expect the MTP-normal token count. The actual data doesn't match what the graph's NCCL calls expect → at least one rank waits for bytes that never arrive → the engine times out in _spin_condition.wait.
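
The size arithmetic in this hypothesis can be made concrete. The sketch below follows the report's model, in which a capture size counts requests and each request contributes 2 query tokens during warmup; that model is an assumption of the hypothesis, not confirmed vLLM behavior. `padded_capture_size` is a hypothetical helper, and only the listed prefix of the capture sizes is used because the rest is elided in the report.

```python
import bisect

# Prefix of cudagraph_capture_sizes quoted in the report; the tail is elided.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 24]


def padded_capture_size(n, sizes=CAPTURE_SIZES):
    """Smallest captured size >= n, or None if n exceeds the known sizes."""
    index = bisect.bisect_left(sizes, n)
    return sizes[index] if index < len(sizes) else None


num_reqs = 15
graph_size = padded_capture_size(num_reqs)

# Token counts under the report's model (assumption, see lead-in).
tokens_expected_by_graph = graph_size * 2  # captured under MTP-normal conditions
tokens_scheduled = num_reqs * 2            # total_num_scheduled_tokens in the dump
tokens_effective = num_reqs * 1            # bonus-only: 1 query token per request

print(graph_size, tokens_expected_by_graph, tokens_scheduled, tokens_effective)
```

Under these assumptions the replayed graph expects 32 tokens while the step supplies 30 scheduled (15 effective), which is the mismatch the hypothesis blames for the hang.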

This is consistent with:

- vllm bench serve --ignore-eos keeps all 128 concurrent reqs in lockstep, so all 128 enter bonus-only at the same step → the size-128 graph is used → either it happens to be captured correctly for that case, or the step runs eager.
- FDO captures only static-decode shapes (no graph involvement for MTP-spec paths) → falls back to eager → no shape mismatch.
- c=1 LL: only one request → only the size-1 cudagraph is involved → no inter-rank collective → no deadlock.

Suggested fixes

- Capture additional cudagraph entries for the MTP bonus-only forward shape during warmup. The capture set should cover both "N reqs × 2 tokens" (normal MTP) and "N reqs × 1 token" (bonus-only) for each size.
- Fall back to eager when a step's actual shape doesn't match the captured shape at the selected size (instead of replaying a mismatched graph).
- Scheduler-level batch shaping: avoid mixing normal-MTP and bonus-only reqs in the same step, OR force bonus-only batches to use a separate cudagraph family.
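
The first two suggestions can be illustrated with a toy dispatcher that keys captured graphs on both the padded size and the per-request token count, and runs eagerly when no exact match exists. This is a sketch only: `ShapeKeyedDispatcher`, its methods, and the `forward` stand-in are hypothetical and do not correspond to vLLM's actual cudagraph dispatch code.

```python
from typing import Callable, Dict, List, Tuple

# (padded number of requests, query tokens per request)
ShapeKey = Tuple[int, int]
Forward = Callable[[List[int]], List[int]]


class ShapeKeyedDispatcher:
    """Replay a captured graph only on an exact shape match; else run eager."""

    def __init__(self, eager_fn: Forward) -> None:
        self._eager_fn = eager_fn
        self._captured: Dict[ShapeKey, Forward] = {}

    def capture(self, key: ShapeKey, replay_fn: Forward) -> None:
        self._captured[key] = replay_fn

    def run(self, key: ShapeKey, batch: List[int]) -> Tuple[str, List[int]]:
        replay_fn = self._captured.get(key)
        if replay_fn is None:
            # No graph was captured for this exact shape: do not replay a
            # mismatched one, fall back to the eager path instead.
            return "eager", self._eager_fn(batch)
        return "graph", replay_fn(batch)


def forward(batch: List[int]) -> List[int]:
    """Stand-in for a model forward pass."""
    return [x + 1 for x in batch]


dispatcher = ShapeKeyedDispatcher(eager_fn=forward)
dispatcher.capture((16, 2), forward)  # warmup captured only the normal-MTP shape

normal_path, _ = dispatcher.run((16, 2), list(range(32)))
bonus_path, _ = dispatcher.run((16, 1), list(range(16)))
print(normal_path, bonus_path)
```

Capturing `(16, 1)` during warmup as well would correspond to the first suggestion; leaving it uncaptured and taking the eager branch corresponds to the second.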

How to verify a fix

Re-run the SGLang client reproducer above with the patched build. Expect 512/512 success on n=512 and 2048/2048 on n=2048.
