vllm - ✅(Solved) Fix [Bug]: V1 engine + MTP + GLM-5.1 (DSA + MoE + MLA) — workers hang under sustained traffic, sample_tokens RPC timeout, EngineDeadError [1 pull requests, 1 comments, 1 participants]

vllm, 2026-04-26 15:17:40

GitHub stats

vllm-project/vllm#40926 • Fetched 2026-04-27 05:29:16

Author

ccgibson

Participants

ccgibson

Error Message

File ".../vllm/v1/engine/core.py", line 1205, in _process_engine_step
    outputs, model_executed = self.step_fn()
File ".../vllm/v1/engine/core.py", line 523, in step_with_batch_queue
    model_output = future.result()
...
File ".../vllm/v1/executor/multiproc_executor.py", line 388, in get_response
    raise TimeoutError(f"RPC call to {method} timed out.") from e
TimeoutError: RPC call to sample_tokens timed out.

The underlying cause:

File ".../vllm/distributed/device_communicators/shm_broadcast.py", line 756, in dequeue
    with self.acquire_read(timeout, indefinite) as buf:
File ".../shm_broadcast.py", line 675, in acquire_read
    self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
File ".../shm_broadcast.py", line 632, in timeout_ms
    raise TimeoutError

Workers are stuck upstream of the message queue — no output coming back. Scheduler dump confirms it:

SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, step_counter=0,
  kv_cache_usage=0.405, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, ...),
  spec_decoding_stats=None, ...)

PR fix notes

PR #40303: [Bug] Fix shm_broadcast PyCFunction descriptor corruption under JIT loads

Repository: vllm-project/vllm
Author: jsboige
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40303

Description (problem / solution / changelog)

Summary

Fixes #35104.

Replaces the `with _memory_fence_lock:` (`threading.Lock`) memory barrier in `shm_broadcast.memory_fence()` with `vllm.distributed.utils.sched_yield()`, which is already imported in this same file (used by `SpinCondition.wait`) and provides equivalent memory-barrier guarantees without depending on the CPython class-method descriptor table.

Root cause

Under runtime C-extension loads (FlashInfer JIT autotune, Triton autotune, torch.compile), CPython 3.12's PyCFunction descriptor table can be corrupted for METH_METHOD class-bound descriptors. The next acquire on _thread.lock.__enter__ then crashes with:

SystemError: attempting to create PyCFunction with class but no METH_METHOD flag

This kills the worker, which surfaces as repeated `shm_broadcast.py:733` warnings ("No available shared memory broadcast block found in 60 seconds", typically 3x). `EngineDeadError` then propagates and tears down the engine.

The exact failing line:

# vllm/distributed/device_communicators/shm_broadcast.py:72 (current main)
with _memory_fence_lock:
    pass

which is invoked from memory_fence() on every shared-memory message exchange.

We observed 9 such crashes in 50h of production traffic on Qwen3.6-35B-A3B-AWQ (v0.19.1.dev45+gf6983f01d) with --tensor-parallel-size 2 --enable-expert-parallel. Setting --no-enable-flashinfer-autotune reduced frequency (49 min uptime vs 25 min) but did not eliminate it — Triton autotune and torch.compile also dlopen .so at runtime.

Why `sched_yield()`

The original implementation relied on `threading.Lock` purely as a memory barrier (the lock is uncontended; `with lock: pass` is a hot no-op around the acquire/release). That puts a `_thread.lock.__enter__` C-method call on every `memory_fence()` invocation, which is precisely the METH_METHOD class-bound descriptor type that gets corrupted in #35104.

sched_yield() already exists in vllm/distributed/utils.py:

def sched_yield():
    if USE_SCHED_YIELD:
        os.sched_yield()
    else:
        time.sleep(0)

It's already imported into shm_broadcast.py and used by SpinCondition.wait for the busy-loop. Using it for memory_fence() too:

Provides the same sequentially consistent memory barrier semantics — a kernel scheduling boundary is a full memory barrier on x86-64, ARM64, and POWER (the platforms vLLM cares about).
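Comparable overhead to the original, stated as ~20 ns. Note that the comment in `utils.py` measures `os.sched_yield` at ~3e-7 s, which is about 300 ns, so the two figures differ by roughly an order of magnitude.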
Avoids the METH_METHOD class-bound descriptor path entirely: `os.sched_yield` and `time.sleep` are module-level functions, not bound methods, so they don't have METH_METHOD set and aren't subject to the descriptor table corruption.

_memory_fence_lock is kept as an unused module-level symbol so any external code that touches it doesn't break.
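
The replacement described above can be sketched roughly as follows. This is a minimal illustration, not the patched file: the way `USE_SCHED_YIELD` is computed here (a plain `hasattr` check) is an assumption, since the PR description only shows the helper body and not how the flag is defined in `vllm/distributed/utils.py`.

```python
import os
import time

# Assumption: the real flag lives in vllm/distributed/utils.py and may be
# computed differently; here it only checks that the platform exposes the call.
USE_SCHED_YIELD = hasattr(os, "sched_yield")


def sched_yield() -> None:
    """Yield the CPU, mirroring the helper quoted in the PR description."""
    if USE_SCHED_YIELD:
        os.sched_yield()
    else:
        time.sleep(0)


def memory_fence() -> None:
    """Sketch of the patched fence: yield instead of `with lock: pass`."""
    sched_yield()


# Exercise the fence the way a hot message-exchange path would.
fence_calls = 0
for _ in range(1000):
    memory_fence()
    fence_calls += 1
```

The point of the sketch is only that no `threading.Lock` method is called on the fence path; whether a scheduler yield gives the required ordering guarantees is the PR's claim and is not demonstrated here.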

Validation

Built a custom image from nightly v0.19.1.dev45+gf6983f01d with this patch applied and ran it under real production traffic on Qwen3.6-35B-A3B-AWQ:

TP=2 + EP=2, FP8 KV cache, 262K context, AWQ Marlin MoE
655-1854 prompt tok/s, 87% prefix cache hit rate
--no-enable-flashinfer-autotune set defensively (orthogonal to this patch)
--gdn-prefill-backend triton set defensively (orthogonal)

Build	MTBF
`v0.19.1.dev45+gf6983f01d` stock	~5 h (9 crashes / 50 h)
`v0.19.1.dev45+gf6983f01d` + this patch	3 h+ uptime, 0 crashes, watch ongoing

Will update with 24h and 48h soak results in #35104.

Risk

Very low.

The change is isolated to vllm/distributed/device_communicators/shm_broadcast.py (+9 / -11).
Public function signature (memory_fence()) is unchanged.
Memory barrier semantics are equivalent.
Uses an existing helper that's already exercised in the same file.
_memory_fence_lock symbol kept (unused) for backward-compat.

History

The first version of this PR introduced a custom _make_memory_barrier() helper using ctypes to call libc.sched_yield / kernel32.SwitchToThread directly, with a threading.Lock fallback. After @gemini-code-assist caught a deadlock in the fallback (acquire() or release() short-circuits and never releases), I noticed the file already imports the much simpler vllm.distributed.utils.sched_yield() helper, which avoids the entire ctypes complexity. Force-pushed the simplified version.
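
The short-circuit problem mentioned above is easy to reproduce in isolation. The snippet below is a sketch of the pattern as described (`acquire() or release()`), not the code from the first version of the PR, which is not shown in this page; the function names are hypothetical.

```python
import threading


def broken_fence(lock: threading.Lock) -> None:
    # Lock.acquire() returns True on success, so `or` short-circuits
    # and release() is never evaluated: the lock stays held.
    lock.acquire() or lock.release()


def safe_fence(lock: threading.Lock) -> None:
    # The context manager always releases on exit.
    with lock:
        pass


lock = threading.Lock()

broken_fence(lock)
held_after_broken = lock.locked()               # lock is still held
second_acquire = lock.acquire(blocking=False)   # a blocking call would hang here
lock.release()                                  # clean up for the next check

safe_fence(lock)
held_after_safe = lock.locked()
```

A second blocking `broken_fence` call on the same lock would never return, which matches the deadlock described in the review.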

Test plan

3h+ stability under real production load on Qwen3.6-35B-A3B (TP=2 + EP=2)
24h soak (in progress — update on #35104)
48h soak (in progress — update on #35104)
CI: existing shm_broadcast tests should pass unchanged

cc @kitaekatt @slippersss (per #35104 thread)

Changed files

vllm/distributed/device_communicators/shm_broadcast.py (modified, +9/-11)

🐛 Describe the bug

Under V1 engine + MTP speculative decoding + TP=8 + GLM-5.1-FP8 (GlmMoeDsaForCausalLM), workers hang under sustained production traffic. The scheduler is unable to advance — step_counter=0 in the dump, requests stuck in flight — and after 30s the sample_tokens RPC times out, killing EngineCore. The container's outer process survives so vLLM internally restarts the engine (~12-17 min downtime), but this is a recurring failure pattern.

This is the same bug jsboige diagnosed for new-TonyWang in #35104 — a different bug than what PR #40303 fixes (no SystemError, no memory_fence on the stack). new-TonyWang's report is buried in #35104 comments and has no standalone issue, so filing here.

Stack trace

File ".../vllm/v1/engine/core.py", line 1205, in _process_engine_step
    outputs, model_executed = self.step_fn()
File ".../vllm/v1/engine/core.py", line 523, in step_with_batch_queue
    model_output = future.result()
...
File ".../vllm/v1/executor/multiproc_executor.py", line 388, in get_response
    raise TimeoutError(f"RPC call to {method} timed out.") from e
TimeoutError: RPC call to sample_tokens timed out.

The underlying cause:

File ".../vllm/distributed/device_communicators/shm_broadcast.py", line 756, in dequeue
    with self.acquire_read(timeout, indefinite) as buf:
File ".../shm_broadcast.py", line 675, in acquire_read
    self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
File ".../shm_broadcast.py", line 632, in timeout_ms
    raise TimeoutError

Workers are stuck upstream of the message queue — no output coming back. Scheduler dump confirms it:

SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, step_counter=0,
  kv_cache_usage=0.405, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, ...),
  spec_decoding_stats=None, ...)
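
For log triage, the reading of the dump above can be turned into a small check. This is a sketch under stated assumptions: the regex-based parsing and the `looks_stalled` heuristic (requests running while `step_counter` is 0) are illustrations of how the report interprets the dump, not a vLLM API.

```python
import re

DUMP = """SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, step_counter=0,
  kv_cache_usage=0.405, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, ...),
  spec_decoding_stats=None, ...)"""

NUMERIC_FIELDS = (
    "num_running_reqs",
    "num_waiting_reqs",
    "step_counter",
    "kv_cache_usage",
)


def parse_stats(dump: str) -> dict[str, float]:
    """Extract the numeric top-level fields that appear in the dump text."""
    stats: dict[str, float] = {}
    for key in NUMERIC_FIELDS:
        match = re.search(rf"\b{key}=(\d+(?:\.\d+)?)", dump)
        if match:
            stats[key] = float(match.group(1))
    return stats


def looks_stalled(stats: dict[str, float]) -> bool:
    """Heuristic from the report: work is in flight but no step has advanced."""
    return stats.get("num_running_reqs", 0) > 0 and stats.get("step_counter") == 0


stats = parse_stats(DUMP)
stalled = looks_stalled(stats)
```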

Reproduction (deterministic given enough wall clock)

Setup
Image: built from v0.20.0 tag, `--target vllm-openai`, CUDA 13.0, Py 3.12, torch 2.11
Hardware: 8× NVIDIA H200 (140 GB each)
Model: `zai-org/GLM-5.1-FP8` (`GlmMoeDsaForCausalLM` — DeepSeek Sparse Attention + MoE + MLA)
TP=8, EP=true
KV: fp8, prefix-caching on, max-model-len 202752
Speculative: `{"method":"mtp","num_speculative_tokens":2}`
Allocator: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` + `--disable-custom-all-reduce` (required to avoid torch 2.11 fragmentation OOM at boot)
LMCache enabled via `kv_connector=LMCacheConnectorV1`
Workload: real chat-completion traffic, ~0.2 RPS sustained, prompt sizes 1k–70k tokens
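
The setup above might translate into a launch command along these lines. This is a hedged sketch: the report does not include the actual command line, so the mapping from each setting to a `vllm serve` flag is an assumption, and the LMCache connector settings are left out because only the connector name is given. The `build_launch` helper is hypothetical.

```python
import json
import shlex

MODEL = "zai-org/GLM-5.1-FP8"


def build_launch(enable_mtp: bool) -> tuple[dict[str, str], list[str]]:
    """Return (environment overrides, argv) for the configuration in the report."""
    env = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}
    argv = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", "8",
        "--enable-expert-parallel",
        "--kv-cache-dtype", "fp8",
        "--enable-prefix-caching",
        "--max-model-len", "202752",
        "--disable-custom-all-reduce",
    ]
    if enable_mtp:
        # Speculative settings exactly as quoted in the report.
        spec = {"method": "mtp", "num_speculative_tokens": 2}
        argv += ["--speculative-config", json.dumps(spec)]
    return env, argv


env, argv = build_launch(enable_mtp=True)
command_line = " ".join(f"{k}={v}" for k, v in env.items()) + " " + shlex.join(argv)
print(command_line)
```

Calling `build_launch(enable_mtp=False)` yields the same command without the speculative config, which corresponds to the MTP-off configuration.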

MTTF measured

Config	Mean Time To Failure
Baseline (above)	~38 minutes
+ `--no-enable-flashinfer-autotune`	~5h 12min
MTP off (`speculative_config=None`)	stable indefinitely

--no-enable-flashinfer-autotune mitigates but does not fix. Disabling MTP entirely is the only path to stable serving.
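
To put the MTTF figures in perspective, the arithmetic can be sketched as below. The availability estimate is an assumption layered on the reported numbers: it treats every failure as costing one engine restart, using the 12 to 17 minute downtime range from the bug description, and ignores any other source of downtime.

```python
BASELINE_MTTF_MIN = 38
NO_AUTOTUNE_MTTF_MIN = 5 * 60 + 12   # ~5h 12min

# How much longer the engine survives with FlashInfer autotune disabled.
improvement = NO_AUTOTUNE_MTTF_MIN / BASELINE_MTTF_MIN


def availability(mttf_min: float, downtime_min: float) -> float:
    """Steady-state fraction of time serving: MTTF / (MTTF + downtime)."""
    return mttf_min / (mttf_min + downtime_min)


for label, mttf in (("baseline", BASELINE_MTTF_MIN),
                    ("no autotune", NO_AUTOTUNE_MTTF_MIN)):
    best = availability(mttf, 12)
    worst = availability(mttf, 17)
    print(f"{label}: {worst:.1%} to {best:.1%} available")

print(f"MTTF improvement: {improvement:.1f}x")
```

Under these assumptions the baseline configuration spends roughly a quarter to a third of its time restarting, which is why the mitigation alone is not a satisfactory fix.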

What we ruled out

Not OOM: GPU memory peaked at ~133 GB (94% util) sustained without exhausting. expandable_segments:True resolves the torch 2.11 fragmentation regression we initially hit.
Not the descriptor-corruption bug fixed by #40303: no SystemError, memory_fence not on stack.
Not specific to autotune compiling kernels mid-request: disabling autotune extends MTTF 8× but doesn't eliminate the hang.
Not capacity-bound: only 2 in-flight requests when crash occurs, kv_cache_usage 0.40.
Not LMCache-specific: same crash signature on Plan B v3/v4 attempts (cu130-nightly + MTP) before LMCache wheel work.

Likely candidates for the actual root cause (per `jsboige`'s diagnosis)

MTP speculative decoding deadlock in V1 + step_with_batch_queue + the multi-token draft path. scheduled_spec_decode_tokens={req: [-1, -1, ...]} (all rejected on the failing iteration) suggests the spec decode draft itself may be the hang site.
DSA (DeepSeek Sparse Attention) interaction with the MTP draft head. GLM-5.1's `GlmMoeDsaForCausalLM` uses DSA, and the MTP head appears to feed forward through this path.
Some interaction with the V1 step_with_batch_queue async pipeline that's specific to MoE+MLA+MTP.

Suggested debug

Per jsboige's recommended isolation in #35104:

py-spy dump --pid <Worker_TP0 PID> at the moment of hang to identify which subsystem the worker is actually stuck in.
Reduce to TP=4 to see if the hang persists (TP-related?).
Try MTP n=1 instead of n=2 — we observed n=1 still hangs in earlier testing.

We can collect py-spy dump next time it occurs in our deployment if helpful. Auto-rollback fires within 2s of the log signature, so we have a narrow window.
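
Given the narrow window before rollback, the capture could be automated. The following is a minimal sketch, assuming the worker PIDs are already known and that the timeout message from the stack trace is a usable trigger; `dump_on_hang` and its `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable, Iterable, Sequence

# Log line taken from the stack trace in this report.
HANG_SIGNATURE = "RPC call to sample_tokens timed out"


def pyspy_command(pid: int) -> list[str]:
    """Build the py-spy invocation suggested in the issue."""
    return ["py-spy", "dump", "--pid", str(pid)]


def dump_on_hang(
    log_lines: Iterable[str],
    worker_pids: Sequence[int],
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Dump every worker's stack the first time the hang signature appears."""
    for line in log_lines:
        if HANG_SIGNATURE in line:
            for pid in worker_pids:
                # check=False: a failed dump on one worker should not stop the rest.
                run(pyspy_command(pid), check=False)
            return True
    return False
```

In practice `log_lines` would be a stream such as `sys.stdin` fed from the container logs, and the dumps would need to be written somewhere that survives the rollback.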

Cross-references

#35104 — jsboige diagnosed new-TonyWang's report as this same bug (different from #35104's primary fix scope).
#34449 — closed; was the original GLM-5 + MTP malformed tool-calls bug, fixed in v0.20.0 (we confirmed: tool tests 16/16 with MTP). This issue is the next MTP blocker after that one.

extent analysis

TL;DR

Disable MTP speculative decoding to prevent workers from hanging under sustained production traffic.

Guidance

Investigate the MTP speculative decoding deadlock in V1 + step_with_batch_queue + the multi-token draft path as the likely root cause.
Reducing the number of speculative tokens (e.g., `num_speculative_tokens=1`) is unlikely to help: the reporter observed that n=1 still hangs in earlier testing.
Collect a py-spy dump at the moment of hang to identify which subsystem the worker is stuck in.
Consider reducing TP to 4 to determine if the issue is TP-related.

Example

The workaround is a configuration change rather than a code change: serve with MTP off (`speculative_config=None`), the only configuration the report measured as stable.

Notes

Disabling MTP speculative decoding is the only configuration reported as stable; `--no-enable-flashinfer-autotune` mitigates the hang but does not fix it. The root cause is still under investigation, and further debugging is required to determine the exact fix.

Recommendation

Disable MTP speculative decoding as a workaround until the root cause is determined. If MTP must stay on, `--no-enable-flashinfer-autotune` extends the Mean Time To Failure (MTTF) from ~38 minutes to ~5h 12min but does not eliminate the hang.
