vllm - ✅(Solved) Fix [Bug]: V1 engine + MTP + GLM-5.1 (DSA + MoE + MLA) — workers hang under sustained traffic, sample_tokens RPC timeout, EngineDeadError [1 pull requests, 1 comments, 1 participants]

GitHub stats
vllm-project/vllm#40926 · Fetched 2026-04-27 05:29:16

Error Message

File ".../vllm/v1/engine/core.py", line 1205, in _process_engine_step
    outputs, model_executed = self.step_fn()
File ".../vllm/v1/engine/core.py", line 523, in step_with_batch_queue
    model_output = future.result()
...
File ".../vllm/v1/executor/multiproc_executor.py", line 388, in get_response
    raise TimeoutError(f"RPC call to {method} timed out.") from e
TimeoutError: RPC call to sample_tokens timed out.

The underlying cause:

File ".../vllm/distributed/device_communicators/shm_broadcast.py", line 756, in dequeue
    with self.acquire_read(timeout, indefinite) as buf:
File ".../shm_broadcast.py", line 675, in acquire_read
    self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
File ".../shm_broadcast.py", line 632, in timeout_ms
    raise TimeoutError

Workers are stuck upstream of the message queue — no output coming back. Scheduler dump confirms it:

SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, step_counter=0,
  kv_cache_usage=0.405, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, ...),
  spec_decoding_stats=None, ...)
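
The pattern in the tracebacks above, a caller blocking on a future whose worker never replies, can be reproduced in miniature. This is an illustrative sketch only: `call_with_timeout`, `hung_worker`, and the 0.05 s timeout are hypothetical stand-ins and not vLLM code.

```python
import concurrent.futures
import threading


def call_with_timeout(method, fn, timeout_s):
    """Run fn in a worker thread; raise TimeoutError if no reply arrives in time."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as e:
        # Same shape as the message in the traceback: chain the original cause.
        raise TimeoutError(f"RPC call to {method} timed out.") from e
    finally:
        pool.shutdown(wait=False)


release = threading.Event()


def hung_worker():
    # Stands in for a worker stuck upstream of the message queue.
    release.wait()
    return "tokens"


try:
    call_with_timeout("sample_tokens", hung_worker, timeout_s=0.05)
    message = None
except TimeoutError as err:
    message = str(err)
finally:
    release.set()  # let the stand-in worker exit so the interpreter can shut down
```

The caller only learns that no answer arrived; the timeout says nothing about where the worker is stuck, which is why the issue asks for a stack dump of the worker itself.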

Root Cause

Not yet determined. The issue body lists likely candidates (per jsboige's diagnosis) under "Likely candidates for the actual root cause".

Fix Action

Fix / Workaround

  • No fix has landed for this hang; PR #40303 addresses a different bug.
  • Disabling MTP (speculative_config=None) is the only configuration reported as stable.
  • --no-enable-flashinfer-autotune extends MTTF from ~38 minutes to ~5h 12min but does not eliminate the hang.

PR fix notes

PR #40303: [Bug] Fix shm_broadcast PyCFunction descriptor corruption under JIT loads

Description (problem / solution / changelog)

Summary

Fixes #35104.

Replaces the with _memory_fence_lock: (threading.Lock) memory barrier in shm_broadcast.memory_fence() with vllm.distributed.utils.sched_yield() — which is already imported in this same file (used by SpinCondition.wait) and provides equivalent memory-barrier guarantees without depending on the CPython class-method descriptor table.

Root cause

Under runtime C-extension loads (FlashInfer JIT autotune, Triton autotune, torch.compile), CPython 3.12's PyCFunction descriptor table can be corrupted for METH_METHOD class-bound descriptors. The next acquire on _thread.lock.__enter__ then crashes with:

SystemError: attempting to create PyCFunction with class but no METH_METHOD flag

This kills the worker, which surfaces as repeated shm_broadcast.py:733 No available shared memory broadcast block found in 60 seconds warnings (typically 3x), then EngineDeadError propagates and tears down the engine.

The exact failing line:

# vllm/distributed/device_communicators/shm_broadcast.py:72 (current main)
with _memory_fence_lock:
    pass

which is invoked from memory_fence() on every shared-memory message exchange.

We observed 9 such crashes in 50h of production traffic on Qwen3.6-35B-A3B-AWQ (v0.19.1.dev45+gf6983f01d) with --tensor-parallel-size 2 --enable-expert-parallel. Setting --no-enable-flashinfer-autotune reduced frequency (49 min uptime vs 25 min) but did not eliminate it — Triton autotune and torch.compile also dlopen .so at runtime.

Why sched_yield()

The original implementation relied on threading.Lock purely as a memory barrier (the lock is uncontended; with lock: pass is a hot no-op around the acquire/release). That puts a _thread.lock.__enter__ C-method call on every memory_fence() invocation, which is precisely the METH_METHOD class-bound descriptor type that gets corrupted in #35104.

sched_yield() already exists in vllm/distributed/utils.py:

def sched_yield():
    if USE_SCHED_YIELD:
        os.sched_yield()
    else:
        time.sleep(0)

It's already imported into shm_broadcast.py and used by SpinCondition.wait for the busy-loop. Using it for memory_fence() too:

  • Provides the same sequentially consistent memory barrier semantics — a kernel scheduling boundary is a full memory barrier on x86-64, ARM64, and POWER (the platforms vLLM cares about).
  • Comparable overhead to the original (the PR states ~20ns; the comment in utils.py measures os.sched_yield at ~3e-7 s, which is about 300 ns).
  • Avoids the METH_METHOD class-bound descriptor path entirely — os.sched_yield and time.sleep are module-level functions, not bound methods, so they don't have METH_METHOD set and aren't subject to the descriptor table corruption.

_memory_fence_lock is kept as an unused module-level symbol so any external code that touches it doesn't break.
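
A minimal sketch of the shape of the change described above, assuming `memory_fence()` simply delegates to the existing helper. It is not the actual diff. The `USE_SCHED_YIELD` definition here is a stand-in, since the source does not show how vLLM sets it.

```python
import os
import time

# Assumption: the real USE_SCHED_YIELD in vllm/distributed/utils.py comes from
# its own platform/version check; hasattr() is only a stand-in for this sketch.
USE_SCHED_YIELD = hasattr(os, "sched_yield")


def sched_yield():
    # Same body as the helper quoted in the PR description.
    if USE_SCHED_YIELD:
        os.sched_yield()
    else:
        time.sleep(0)


def memory_fence():
    """Sketch of the patched fence: yield to the scheduler instead of
    entering and leaving an uncontended threading.Lock."""
    sched_yield()


fence_result = memory_fence()
```

The public signature is unchanged: callers still invoke `memory_fence()` with no arguments and ignore the return value.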

Validation

Built a custom image from nightly v0.19.1.dev45+gf6983f01d with this patch applied and ran it under real production traffic on Qwen3.6-35B-A3B-AWQ:

  • TP=2 + EP=2, FP8 KV cache, 262K context, AWQ Marlin MoE
  • 655-1854 prompt tok/s, 87% prefix cache hit rate
  • --no-enable-flashinfer-autotune set defensively (orthogonal to this patch)
  • --gdn-prefill-backend triton set defensively (orthogonal)
Build                                  | MTBF
v0.19.1.dev45+gf6983f01d stock         | ~5 h (9 crashes / 50 h)
v0.19.1.dev45+gf6983f01d + this patch  | 3 h+ uptime, 0 crashes, watch ongoing
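
To put the 3 h figure in context, the arithmetic below estimates how often an unpatched build would also survive a soak of a given length. It assumes crashes arrive as a Poisson process (exponentially distributed gaps), which is a simplification of mine and not something the PR claims.

```python
import math

crashes = 9
hours_observed = 50.0
mtbf_hours = hours_observed / crashes  # about 5.6 h, reported as "~5 h"


def p_no_crash(soak_hours, mtbf):
    """Probability of zero crashes in soak_hours under the Poisson assumption."""
    return math.exp(-soak_hours / mtbf)


p_clean_3h = p_no_crash(3.0, mtbf_hours)    # about 0.58
p_clean_24h = p_no_crash(24.0, mtbf_hours)  # about 0.013
```

Under that assumption a clean 3 h run is weak evidence on its own, while a clean 24 h soak would be much stronger, which fits the PR's plan to report 24h and 48h results.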

Will update with 24h and 48h soak results in #35104.

Risk

Very low.

  • The change is isolated to vllm/distributed/device_communicators/shm_broadcast.py (+9 / -11).
  • Public function signature (memory_fence()) is unchanged.
  • Memory barrier semantics are equivalent.
  • Uses an existing helper that's already exercised in the same file.
  • _memory_fence_lock symbol kept (unused) for backward-compat.

History

The first version of this PR introduced a custom _make_memory_barrier() helper using ctypes to call libc.sched_yield / kernel32.SwitchToThread directly, with a threading.Lock fallback. After @gemini-code-assist caught a deadlock in the fallback (acquire() or release() short-circuits and never releases), I noticed the file already imports the much simpler vllm.distributed.utils.sched_yield() helper, which avoids the entire ctypes complexity. Force-pushed the simplified version.
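
The short-circuit described above is easy to reproduce. The sketch below is my reading of the parenthetical, not the code from the earlier revision of the PR: because `Lock.acquire()` returns True, the `or` expression never reaches `release()`.

```python
import threading

lock = threading.Lock()

# Lock.acquire() returns True on success, so `or` never evaluates release().
lock.acquire() or lock.release()
still_held = lock.locked()

# A second, non-blocking acquire fails; a blocking one would wait forever.
second_acquire = lock.acquire(blocking=False)

lock.release()  # clean up so the sketch leaves the lock free
```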

Test plan

  • 3h+ stability under real production load on Qwen3.6-35B-A3B (TP=2 + EP=2)
  • 24h soak (in progress — update on #35104)
  • 48h soak (in progress — update on #35104)
  • CI: existing shm_broadcast tests should pass unchanged

cc @kitaekatt @slippersss (per #35104 thread)

Changed files

  • vllm/distributed/device_communicators/shm_broadcast.py (modified, +9/-11)


🐛 Describe the bug

Under V1 engine + MTP speculative decoding + TP=8 + GLM-5.1-FP8 (GlmMoeDsaForCausalLM), workers hang under sustained production traffic. The scheduler is unable to advance — step_counter=0 in the dump, requests stuck in flight — and after 30s the sample_tokens RPC times out, killing EngineCore. The container's outer process survives so vLLM internally restarts the engine (~12-17 min downtime), but this is a recurring failure pattern.

This is the same bug jsboige diagnosed for new-TonyWang in #35104 — a different bug than what PR #40303 fixes (no SystemError, no memory_fence on the stack). new-TonyWang's report is buried in #35104 comments and has no standalone issue, so filing here.

Stack trace

File ".../vllm/v1/engine/core.py", line 1205, in _process_engine_step
    outputs, model_executed = self.step_fn()
File ".../vllm/v1/engine/core.py", line 523, in step_with_batch_queue
    model_output = future.result()
...
File ".../vllm/v1/executor/multiproc_executor.py", line 388, in get_response
    raise TimeoutError(f"RPC call to {method} timed out.") from e
TimeoutError: RPC call to sample_tokens timed out.

The underlying cause:

File ".../vllm/distributed/device_communicators/shm_broadcast.py", line 756, in dequeue
    with self.acquire_read(timeout, indefinite) as buf:
File ".../shm_broadcast.py", line 675, in acquire_read
    self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
File ".../shm_broadcast.py", line 632, in timeout_ms
    raise TimeoutError

Workers are stuck upstream of the message queue — no output coming back. Scheduler dump confirms it:

SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, step_counter=0,
  kv_cache_usage=0.405, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, ...),
  spec_decoding_stats=None, ...)

Reproduction (deterministic given enough wall clock)

Setup
Image: built from v0.20.0 tag, --target vllm-openai, CUDA 13.0, Py 3.12, torch 2.11
Hardware: 8× NVIDIA H200 (140 GB each)
Model: zai-org/GLM-5.1-FP8 (GlmMoeDsaForCausalLM — DeepSeek Sparse Attention + MoE + MLA)
TP=8, EP=true
KV: fp8, prefix-caching on, max-model-len 202752
Speculative: {"method":"mtp","num_speculative_tokens":2}
Allocator: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + --disable-custom-all-reduce (required to avoid torch 2.11 fragmentation OOM at boot)
LMCache enabled via kv_connector=LMCacheConnectorV1
Workload: real chat-completion traffic, ~0.2 RPS sustained, prompt sizes 1k–70k tokens
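
The setup above can be approximated as a launch command. This is a reconstruction from the listed settings, not the reporter's actual command, which the issue does not include. The LMCache connector is left out because its full configuration is not given, and `build_serve_command` is a hypothetical helper name.

```python
import json
import shlex


def build_serve_command(enable_mtp=True, flashinfer_autotune=True):
    """Assemble a `vllm serve` argv list from the settings listed in the issue."""
    cmd = [
        "vllm", "serve", "zai-org/GLM-5.1-FP8",
        "--tensor-parallel-size", "8",
        "--enable-expert-parallel",
        "--kv-cache-dtype", "fp8",
        "--enable-prefix-caching",
        "--max-model-len", "202752",
        "--disable-custom-all-reduce",
    ]
    if enable_mtp:
        spec = {"method": "mtp", "num_speculative_tokens": 2}
        cmd += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    if not flashinfer_autotune:
        cmd.append("--no-enable-flashinfer-autotune")
    return cmd


# Environment variable from the "Allocator" line of the setup.
ENV = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}

baseline = shlex.join(build_serve_command())
mtp_off = shlex.join(build_serve_command(enable_mtp=False))
```

Building the command from two switches keeps the configurations that the issue compares (MTP on or off, FlashInfer autotune on or off) identical in every other respect.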

MTTF measured

Config                               | Mean Time To Failure
Baseline (above)                     | ~38 minutes
+ --no-enable-flashinfer-autotune    | ~5h 12min
MTP off (speculative_config=None)    | stable indefinitely

--no-enable-flashinfer-autotune mitigates but does not fix. Disabling MTP entirely is the only path to stable serving.
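
A rough worked calculation of what these numbers mean for serving. It assumes every failure costs the full 12 to 17 minute restart quoted in the bug description and ignores all other downtime, so treat the availability figures as estimates of mine, not measurements.

```python
mttf_baseline_min = 38
mttf_no_autotune_min = 5 * 60 + 12  # 312 minutes

# How much longer the engine survives with FlashInfer autotune disabled.
improvement = mttf_no_autotune_min / mttf_baseline_min  # about 8.2x


def availability(mttf_min, downtime_min):
    """Fraction of wall-clock time spent serving between restarts."""
    return mttf_min / (mttf_min + downtime_min)


baseline_range = (availability(38, 17), availability(38, 12))        # ~0.69 to 0.76
no_autotune_range = (availability(312, 17), availability(312, 12))   # ~0.95 to 0.96
```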

What we ruled out

  • Not OOM: GPU memory peaked at ~133 GB (94% util) sustained without exhausting. expandable_segments:True resolves the torch 2.11 fragmentation regression we initially hit.
  • Not the descriptor-corruption bug fixed by #40303: no SystemError, memory_fence not on stack.
  • Not specific to autotune compiling kernels mid-request: disabling autotune extends MTTF 8× but doesn't eliminate the hang.
  • Not capacity-bound: only 2 in-flight requests when crash occurs, kv_cache_usage 0.40.
  • Not LMCache-specific: same crash signature on Plan B v3/v4 attempts (cu130-nightly + MTP) before LMCache wheel work.

Likely candidates for the actual root cause (per jsboige's diagnosis)

  1. MTP speculative decoding deadlock in V1 + step_with_batch_queue + the multi-token draft path. scheduled_spec_decode_tokens={req: [-1, -1, ...]} (all rejected on the failing iteration) suggests the spec decode draft itself may be the hang site.
  2. DSA (DeepSeek Sparse Attention) interaction with the MTP draft head. GLM-5.1's GlmMoeDsaForCausalLM uses DSA, and the MTP head appears to feed forward through this path.
  3. Some interaction with the V1 step_with_batch_queue async pipeline that's specific to MoE+MLA+MTP.
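
When comparing scheduler dumps across failures, a small helper can flag requests whose scheduled draft tokens are all -1, the pattern mentioned in the first candidate. The helper name and the sample dict are hypothetical, and the meaning of -1 follows the issue's reading rather than anything verified against vLLM internals.

```python
def requests_with_only_minus_one(scheduled_spec_decode_tokens):
    """Return request IDs whose scheduled draft-token list is non-empty and all -1."""
    return [
        req_id
        for req_id, tokens in scheduled_spec_decode_tokens.items()
        if tokens and all(t == -1 for t in tokens)
    ]


# Hypothetical dump contents for illustration only.
sample_dump = {
    "req-a": [-1, -1],
    "req-b": [1523, -1],
    "req-c": [],
}
flagged = requests_with_only_minus_one(sample_dump)
```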

Suggested debug

Per jsboige's recommended isolation in #35104:

  • py-spy dump --pid <Worker_TP0 PID> at the moment of hang to identify which subsystem the worker is actually stuck in.
  • Reduce to TP=4 to see if the hang persists (TP-related?).
  • Try MTP n=1 instead of n=2 — we observed n=1 still hangs in earlier testing.

We can collect py-spy dump next time it occurs in our deployment if helpful. Auto-rollback fires within 2s of the log signature, so we have a narrow window.
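
Given the 2 s window, the dump would need to be triggered automatically. The sketch below assumes a wrapper that reads engine log lines and knows the worker PID. The function names are hypothetical, the 1.5 s timeout is an arbitrary choice that fits inside the window, and py-spy must be installed with permission to attach to the worker.

```python
import subprocess

# Log signature taken from the traceback in this issue.
HANG_SIGNATURE = "RPC call to sample_tokens timed out"


def is_hang_signature(log_line):
    """True if a log line contains the timeout message from the traceback."""
    return HANG_SIGNATURE in log_line


def pyspy_dump_command(pid):
    """argv for the `py-spy dump --pid <PID>` invocation suggested in the issue."""
    return ["py-spy", "dump", "--pid", str(pid)]


def capture_dump(pid, timeout_s=1.5):
    """Run py-spy and return its stdout, or None if it is missing, slow, or fails."""
    try:
        proc = subprocess.run(
            pyspy_dump_command(pid),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return proc.stdout if proc.returncode == 0 else None
```

A wrapper would call `capture_dump(worker_pid)` as soon as `is_hang_signature(line)` returns True and write the result to disk before the rollback tears the container down.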

Cross-references

  • #35104: jsboige diagnosed new-TonyWang's report as this same bug (different from #35104's primary fix scope).
  • #34449 — closed; was the original GLM-5 + MTP malformed tool-calls bug, fixed in v0.20.0 (we confirmed: tool tests 16/16 with MTP). This issue is the next MTP blocker after that one.

extent analysis

TL;DR

Disable MTP speculative decoding to prevent workers from hanging under sustained production traffic.

Guidance

  • Investigate the MTP speculative decoding deadlock in V1 + step_with_batch_queue + the multi-token draft path as the likely root cause.
  • Do not expect a lower draft length to help: the reporter observed that num_speculative_tokens=1 still hangs in earlier testing.
  • Collect a py-spy dump at the moment of hang to identify which subsystem the worker is stuck in.
  • Consider reducing TP to 4 to determine if the issue is TP-related.

Example

No code change is proposed in the issue. The only stable workaround is a configuration change: run without a speculative config (speculative_config=None).

Notes

Disabling MTP speculative decoding is the only configuration reported as stable. --no-enable-flashinfer-autotune extends MTTF roughly 8× but does not eliminate the hang. The root cause is still under investigation, and a py-spy dump from a hung worker is the next piece of evidence needed.

Recommendation

Disable MTP speculative decoding until the root cause is found. If MTP must stay on, add --no-enable-flashinfer-autotune, which extended the Mean Time To Failure (MTTF) from ~38 minutes to ~5h 12min in the reporter's deployment.
