[Bug]: MTP + FULL_AND_PIECEWISE cudagraph deadlocks at HT batched-decode when bonus-token-only forward shape is scheduled


Summary

With vLLM v0.20.1, cudagraph_mode=FULL_AND_PIECEWISE, and MTP (num_speculative_tokens=1), the EngineCore deterministically deadlocks during batched-decode serving whenever the scheduler issues a step in which ~15-18 requests all have scheduled_spec_decode_tokens=[-1] (MTP bonus-token-only mode).

The engine hangs at shm_broadcast.acquire_read._spin_condition.wait (the engine is waiting for a worker reply); the TP workers go silent (no log output, no crash). After the dequeue timeout (~4 min), EngineDeadError fires.

The deadlock does not reproduce with cudagraph_mode=FULL_DECODE_ONLY (FDO), with single-user (c=1) MTP=2, or with vllm bench serve as the client (which keeps all concurrent requests hitting bonus-only at the same scheduler step).

Environment

  • vLLM: v0.20.1 (vllm/vllm-openai:v0.20.1)
  • Model: DeepSeek-V4-Flash (FP8 + MXFP4 MoE, custom model_type=deepseek_v4)
  • Hardware: 4× NVIDIA H20 (Hopper, 96 GB HBM3), TP=4

Reproducer

Server:

vllm serve <model_path> \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8 --block-size 256 \
    --max-model-len 16384 --max-num-seqs 512 --max-num-batched-tokens 16384 \
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Client (SGLang bench_serving.py, any recent version):

python3 sglang/python/sglang/bench_serving.py \
    --backend vllm --base-url http://localhost:8000 \
    --model <model_path> --tokenizer <model_path> \
    --dataset-name random \
    --random-input-len 8192 --random-output-len 1024 --random-range-ratio 0.0 \
    --num-prompts 512 --max-concurrency 128

vllm bench serve with the same args does NOT reproduce the deadlock.

Observed log signature

INFO Engine 000: ... Running: 87 reqs, Avg generation throughput: 2389 tok/s
INFO Engine 000: ... Running: 15 reqs, Avg generation throughput: 2389 tok/s
INFO Engine 000: ... Running: 15 reqs, Avg generation throughput: 0.0 tok/s     <-- STUCK
INFO (EngineCore) [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
... (repeats every 60s for ~4 min) ...
ERROR (EngineCore) [core.py:1138] EngineCore encountered a fatal error.
Traceback:
  File ".../v1/executor/multiproc_executor.py", line 386, in get_response
    status, result = mq.dequeue(timeout=dequeue_timeout)
  File ".../device_communicators/shm_broadcast.py", line 755, in dequeue
    with self.acquire_read(timeout, indefinite) as buf:
  File ".../device_communicators/shm_broadcast.py", line 674, in acquire_read
    self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())

Scheduler dump at the stuck step (from dump_input.py):

SchedulerOutput(
    num_running_reqs=15,
    num_scheduled_tokens={...: 2, ... (15 entries, all 2)},
    total_num_scheduled_tokens=30,
    scheduled_spec_decode_tokens={
        # all 15 reqs:
        cmpl-...: [-1],
        cmpl-...: [-1],
        ...
    },
    num_output_tokens=[935, 901, 871, 841, 813, 805, 799, 791, 763, 763, 731, 725, 723, 721, 712],
    ...
)

Every one of the 15 stuck requests is in MTP bonus-token-only mode (scheduled_spec_decode_tokens=[-1]). The dump reports 2 scheduled tokens per request (30 total), but the effective forward shape appears to be 15 reqs × 1 query token (= 15 total), as opposed to the normal-MTP shape of N reqs × 2 query tokens.
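The same reading of the dump can be restated mechanically. This is a minimal sketch, assuming the relevant fields have been copied into plain Python dicts; the request ids and the `is_bonus_only_step` helper are hypothetical and are not vLLM API.

```python
# Hypothetical stand-in for the dumped SchedulerOutput fields; not vLLM code.
num_scheduled_tokens = {f"cmpl-{i}": 2 for i in range(15)}
scheduled_spec_decode_tokens = {f"cmpl-{i}": [-1] for i in range(15)}


def is_bonus_only_step(spec_tokens: dict[str, list[int]]) -> bool:
    """True when every request in the step has spec tokens equal to [-1],
    which the issue calls MTP bonus-token-only mode."""
    return bool(spec_tokens) and all(t == [-1] for t in spec_tokens.values())


# Token count as reported by the dump: 15 requests x 2 tokens.
total_scheduled = sum(num_scheduled_tokens.values())
# Token count under the issue's reading: 1 effective query token per request.
effective_query = len(scheduled_spec_decode_tokens)

print(is_bonus_only_step(scheduled_spec_decode_tokens), total_scheduled, effective_query)
```

The two counts (30 and 15) are the ones the working hypothesis refers to.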

Empirical pattern

Same server config across 5 client runs:

| Client | range_ratio | n | Result |
|---|---|---|---|
| vllm bench | n/a (fixed 1024) | 512 | 512/512 ✓ |
| sglang | 0.0 (random) | 512 | 0/512 ✗ deadlock at request 0 |
| sglang | 0.0 (random) | 512 | 497/512 (97 %) ⚠ tail deadlock, 15 stuck |
| sglang | 1.0 (fixed 1024) | 512 | 497/512 ⚠ same; refutes "varying output_lens triggers" |
| sglang | 0.0 | 2048 | 2030/2048 (99.1 %) ⚠ 18 stuck, same pattern at scale |

Deadlock is independent of:

  • output_lens distribution (random vs fixed)
  • num_prompts (n=512 → 15 stuck, n=2048 → 18 stuck)
  • whether stuck requests are near OSL boundary (n=2048 stuck reqs have num_output_tokens 712-935, well below OSL=1024)

Deadlock is specific to:

  • cudagraph_mode=FULL_AND_PIECEWISE (FDO works)
  • MTP enabled (num_speculative_tokens ≥ 1)
  • Batched decode with >1 req in bonus-only mode at the same step (LL c=1 works fine)
  • Client submission pattern that produces simultaneous-but-not-all-128 bonus-only transitions

Working hypothesis

cudagraph_capture_sizes = [1, 2, 4, 8, 16, 24, ...]. Graphs are captured during warmup using MTP-normal conditions where each request contributes 2 query tokens. The captured size-16 graph expects total_num_scheduled_tokens=32 and the corresponding NCCL collective sizes.

When the scheduler issues a step with 15 reqs all in bonus-only mode, total_num_scheduled_tokens=30 (or 15 effective query tokens). vLLM selects the nearest captured size (16), but the cudagraph's recorded collective ops expect the MTP-normal token count. The actual data doesn't match what the graph's NCCL calls expect → at least one rank waits for bytes that never arrive → _spin_condition.wait timeout.
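The numbers in this hypothesis can be laid out with plain arithmetic. This is a minimal sketch, assuming the graph is chosen by rounding the request count up to the next captured size and that warmup used the normal-MTP shape; `pick_capture_size` is a hypothetical helper, and only the first few capture sizes quoted above are included.

```python
import bisect

# First capture sizes quoted in the hypothesis; the real list continues.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 24]
TOKENS_PER_REQ_AT_CAPTURE = 2  # assumption: warmup used the normal-MTP shape


def pick_capture_size(n: int, sizes: list[int] = CAPTURE_SIZES) -> int:
    """Smallest captured size >= n (hypothetical selection rule)."""
    idx = bisect.bisect_left(sizes, n)
    if idx == len(sizes):
        raise ValueError(f"no captured size covers {n}")
    return sizes[idx]


num_reqs = 15
size = pick_capture_size(num_reqs)                    # 16
expected_tokens = size * TOKENS_PER_REQ_AT_CAPTURE    # 32, per the hypothesis
actual_tokens = num_reqs * 2                          # 30, from the scheduler dump
print(size, expected_tokens, actual_tokens)
```

Whether this gap is what actually stalls the collectives is the unverified part of the hypothesis; the sketch only shows where 16, 32, and 30 come from.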

This is consistent with:

  • vllm bench --ignore-eos keeping all 128 concurrent reqs in lockstep so all 128 enter bonus-only at the same step → size-128 graph used → either it happens to be captured correctly for that case, or runs eager.
  • FDO capturing only static-decode shapes (no graph involvement for MTP-spec paths) → falls back to eager → no shape mismatch.
  • c=1 LL: only one request → only the size-1 cudagraph is involved → no inter-rank collective → no deadlock.

Suggested fixes

  1. Capture additional cudagraph entries for the MTP bonus-only forward shape during warmup. The capture set should cover both "N reqs × 2 tokens" (normal MTP) and "N reqs × 1 token" (bonus-only) for each size.
  2. Fall back to eager when a step's actual shape doesn't match the captured shape at the selected size (instead of replaying a mismatched graph).
  3. Scheduler-level batch shaping: avoid mixing normal-MTP and bonus-only reqs in the same step, OR force bonus-only batches to use a separate cudagraph family.
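Fix 2 could look roughly like the following. This is a sketch under stated assumptions, not vLLM's dispatcher: `GraphRegistry`, `run_eager`, and the `(num_reqs, tokens_per_req)` key are hypothetical names chosen for illustration. The idea is to key captured graphs by the full forward shape and to replay only on an exact match.

```python
from typing import Callable, Dict, Tuple

# Hypothetical key: (padded num_reqs, query tokens per request).
Shape = Tuple[int, int]


class GraphRegistry:
    """Toy registry: replay a captured graph only on an exact shape match."""

    def __init__(self, run_eager: Callable[[Shape], str]) -> None:
        self._graphs: Dict[Shape, Callable[[], str]] = {}
        self._run_eager = run_eager

    def capture(self, shape: Shape, replay: Callable[[], str]) -> None:
        """Record a replay callable for one exact shape."""
        self._graphs[shape] = replay

    def run(self, shape: Shape) -> str:
        replay = self._graphs.get(shape)
        if replay is None:
            # No graph recorded for this exact shape: run eagerly
            # instead of replaying a graph captured for another shape.
            return self._run_eager(shape)
        return replay()


registry = GraphRegistry(run_eager=lambda s: f"eager{s}")
registry.capture((16, 2), lambda: "graph(16, 2)")  # normal-MTP shape from warmup

print(registry.run((16, 2)))  # exact match, replays the captured graph
print(registry.run((16, 1)))  # bonus-only shape, falls back to eager
```

In this toy model, fix 1 corresponds to also calling `capture` for `(size, 1)` at every size during warmup, so the bonus-only shape gets its own graph instead of the eager path.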

How to verify a fix

Re-run the SGLang client reproducer above with the patched build. Expect 512/512 success on n=512 and 2048/2048 on n=2048.
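During such a re-run, the stall can be flagged from the server log without waiting for the ~4 min timeout. This is a minimal sketch, assuming the metrics lines keep the `Running: N reqs, Avg generation throughput: X tok/s` form shown in the observed log signature; `find_stall` is a hypothetical helper.

```python
import re
from typing import Iterable, Optional

# Matches the metrics lines shown in the observed log signature.
METRICS = re.compile(r"Running: (\d+) reqs, Avg generation throughput: ([\d.]+) tok/s")


def find_stall(lines: Iterable[str]) -> Optional[int]:
    """Return the running-request count of the first line that reports
    zero throughput while requests are still running, else None."""
    for line in lines:
        m = METRICS.search(line)
        if m and int(m.group(1)) > 0 and float(m.group(2)) == 0.0:
            return int(m.group(1))
    return None


log = [
    "INFO Engine 000: ... Running: 87 reqs, Avg generation throughput: 2389 tok/s",
    "INFO Engine 000: ... Running: 15 reqs, Avg generation throughput: 2389 tok/s",
    "INFO Engine 000: ... Running: 15 reqs, Avg generation throughput: 0.0 tok/s",
]
print(find_stall(log))
```

A zero-throughput reading may have other causes, so treat a hit as a prompt to check for the shm_broadcast warnings rather than as proof of the deadlock.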
