vllm - 💡(How to fix) Fix [Bug]: EngineCore hangs in `_to_list` -> `cuEventSynchronize` under sustained traffic (Qwen3.6-35B-A3B-NVFP4 hybrid MoE, RTX 5090 sm_120; v0.21.0 + nightly dev39/dev42)

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

  • No exception, traceback, or other error appears anywhere in the log after the last successful step — the log just goes silent.
  • All other Python threads (NCCL watchdog/heartbeat, gloo runloop, ZMQ background, tqdm monitors, _report_usage_worker, etc.) are idle on futex_do_wait or do_epoll_wait. None are in error states.

Fix Action

Fix / Workaround

def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short term mitigation for issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams.
    # A cuda event sync would avoid such a situation. ...
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()      # ← never returns
    return pinned.tolist()

self.transfer_event.synchronize() is the call that does not return. The function itself is already a workaround introduced by #22754 / PR #22760 to avoid blocking other CUDA streams during the GPU→CPU copy of sampled token ids.

Code Example

Thread <pid> (active): "MainThread"
    cuEventSynchronize                                       (libcuda.so.595.45.04)   ← blocked
    cudaEventSynchronize                                     (libcudart.so.13)
    c10::cuda::impl::CUDAGuardImpl::synchronizeEvent         (libc10_cuda.so)
    THPEvent_synchronize                                     (libtorch_python.so)
    _to_list                                                 (vllm/v1/worker/gpu_model_runner.py:7102)*
    _bookkeeping_sync                                        (vllm/v1/worker/gpu_model_runner.py:3471)*
    sample_tokens                                            (vllm/v1/worker/gpu_model_runner.py:4376)*
    sample_tokens                                            (vllm/v1/worker/gpu_worker.py:780)
    collective_rpc                                           (vllm/v1/executor/uniproc_executor.py:93)
    sample_tokens                                            (vllm/v1/executor/uniproc_executor.py:125)
    step                                                     (vllm/v1/engine/core.py:426)
    _process_engine_step                                     (vllm/v1/engine/core.py:1213)
    run_busy_loop                                            (vllm/v1/engine/core.py:1174)
    run_engine_core                                          (vllm/v1/engine/core.py:1133)

---

def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short term mitigation for issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams.
    # A cuda event sync would avoid such a situation. ...
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()      # ← never returns
    return pinned.tolist()

---

vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --host 0.0.0.0 --port 8356 \
  --served-model-name qwen3.6-35b-a3b qwen3.6-35b-a3b-nothinker \
  --quantization compressed-tensors \
  --moe-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.847 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
RAW_BUFFERClick to expand / collapse

Symptom

After some hours of sustained chat-completion traffic, the engine stops producing tokens. The HTTP layer continues returning 200 OK on /health and on POST /v1/chat/completions, so the API server remains alive and keeps queueing new requests. Inside the engine, however:

  • No tokens are emitted for any in-flight or new request.
  • The periodic stats line (Avg prompt throughput: ... tokens/s) that vLLM normally logs every ~10 s stops appearing in the journal.
  • The last stats line before the silence consistently shows Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s with Running: N reqs for some N>0 — i.e. the engine reported zero throughput while still having sequences in flight, then stopped logging entirely.
  • No exception, traceback, or other error appears anywhere in the log after the last successful step — the log just goes silent.

The engine never recovers; only a restart clears the state.

Where it is stuck

py-spy dump --pid <EngineCore-pid> --native consistently shows the engine MainThread blocked inside the CUDA driver in cuEventSynchronize:

Thread <pid> (active): "MainThread"
    cuEventSynchronize                                       (libcuda.so.595.45.04)   ← blocked
    cudaEventSynchronize                                     (libcudart.so.13)
    c10::cuda::impl::CUDAGuardImpl::synchronizeEvent         (libc10_cuda.so)
    THPEvent_synchronize                                     (libtorch_python.so)
    _to_list                                                 (vllm/v1/worker/gpu_model_runner.py:7102)*
    _bookkeeping_sync                                        (vllm/v1/worker/gpu_model_runner.py:3471)*
    sample_tokens                                            (vllm/v1/worker/gpu_model_runner.py:4376)*
    sample_tokens                                            (vllm/v1/worker/gpu_worker.py:780)
    collective_rpc                                           (vllm/v1/executor/uniproc_executor.py:93)
    sample_tokens                                            (vllm/v1/executor/uniproc_executor.py:125)
    step                                                     (vllm/v1/engine/core.py:426)
    _process_engine_step                                     (vllm/v1/engine/core.py:1213)
    run_busy_loop                                            (vllm/v1/engine/core.py:1174)
    run_engine_core                                          (vllm/v1/engine/core.py:1133)

(* line numbers from 0.21.1rc1.dev39; the same call sites are at :7265 / :3529 / :4436 on 0.21.1rc1.dev42.)

The code at _to_list (gpu_model_runner.py) is:

def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short term mitigation for issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams.
    # A cuda event sync would avoid such a situation. ...
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()      # ← never returns
    return pinned.tolist()

self.transfer_event.synchronize() is the call that does not return. The function itself is already a workaround introduced by #22754 / PR #22760 to avoid blocking other CUDA streams during the GPU→CPU copy of sampled token ids.

The hang is steady-state: four py-spy dump --native snapshots spaced ~90 s apart, spanning 5 min 38 s of wall time on one captured incident, all show the identical stack. The engine does not progress at all in that interval.

State at the time of hang

  • nvidia-smi --query-gpu=utilization.gpu,utilization.memory,power.draw --format=csv,noheader100 %, 2 %, 142.45 W on an RTX 5090 (TDP 600 W). The card reports fully utilised compute while drawing roughly a quarter of its capability — consistent with a single stream parked on an event sync rather than doing real work.
  • nvidia-smi --query-compute-apps=pid,used_memory --format=csvVLLM::EngineCore holding the expected steady-state VRAM (~26 GiB for this config).
  • /proc/<pid>/status reports State: R (running) with wchan: 0 — the main thread is busy in userspace (inside libcuda), not blocked in a kernel syscall.
  • /metrics showed vllm:num_requests_running = 5 (or 6, depending on the incident), vllm:num_requests_waiting = 0, vllm:num_preemptions_total = 0. The engine had in-flight sequences and was not preempting anything.
  • gdb -p <pid> agrees with py-spy: thread 1 (the MainThread) is inside libcuda.so.1, with frames in ?? () (no exported symbols in libcuda).
  • All other Python threads (NCCL watchdog/heartbeat, gloo runloop, ZMQ background, tqdm monitors, _report_usage_worker, etc.) are idle on futex_do_wait or do_epoll_wait. None are in error states.

Versions where this has been observed

The same stack and the same external symptoms occurred on all three vLLM builds I have tested with this model on this hardware:

vLLM versionBuild kindAffected
0.21.0 (commit ad7125a)stable wheelyes
0.21.1rc1.dev39+g0fa888465nightlyyes
0.21.1rc1.dev42+g966903eb9nightlyyes

The most recent capture (dev42) was triggered automatically by a watchdog that monitors the stats-line heartbeat in the journal; the dump was taken 83 s after the last Avg prompt throughput line was emitted, and shows the identical _to_list → cuEventSynchronize stack.

Reproduction

Hardware: NVIDIA RTX 5090 (Blackwell, sm_120), 32 GiB VRAM, driver 595.45.04. Single-GPU host (CUDA_VISIBLE_DEVICES=0).

Software: CUDA 13.2 on the host, torch CUDA build cu130. PyTorch 2.11.0+cu130, FlashInfer 0.6.8.post1 (on 0.21.0) / 0.6.11.post2 (on the nightly). Python 3.13.13.

Model: RedHatAI/Qwen3.6-35B-A3B-NVFP4 — 40-layer hybrid MoE: 10 full-attention + 30 GDN linear-attention layers, NVFP4 (compressed-tensors) weights, bundled vision encoder.

Launch command (the one that produced the captured hangs; I have not bisected which flags are necessary):

vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --host 0.0.0.0 --port 8356 \
  --served-model-name qwen3.6-35b-a3b qwen3.6-35b-a3b-nothinker \
  --quantization compressed-tensors \
  --moe-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.847 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --trust-remote-code

The hangs appear after extended periods (hours) of multi-turn chat traffic with concurrency > 1. Single-request smoke tests against a freshly started engine do not reproduce. I have not narrowed which specific request or batch state triggers the hang.

I have also tested with --no-async-scheduling added to the command above; the hang still occurs and still terminates at the same _to_list → cuEventSynchronize stack (synchronous-scheduling code path, step() rather than step_with_batch_queue). Disabling async scheduling is not sufficient to prevent the hang on this stack.

What I have not verified

  • I have not bisected which flags are needed. The launch command above is what reproduces; I have not removed flags one at a time to narrow it down.
  • I have not reproduced on hardware other than RTX 5090 (sm_120). I do not know whether this is Blackwell-specific.
  • I have not captured a CUDA-side trace (nsys profile, cuda-gdb thread apply all bt) of the stuck driver call — only the Python+native userspace stack via py-spy and a high-level gdb thread view.
  • I do not have a small repro outside this serving config — every observation is from a real workload.

I have py-spy native dumps, gdb thread snapshots, /metrics dumps, and journals from several incidents (including one captured automatically by a watchdog on dev42 at 2026-05-18 00:30:07 +03:00). Happy to upload any of them or run additional captures (nsys, longer spans, alternate flag combinations) if it would help diagnosis — please say which.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: EngineCore hangs in `_to_list` -> `cuEventSynchronize` under sustained traffic (Qwen3.6-35B-A3B-NVFP4 hybrid MoE, RTX 5090 sm_120; v0.21.0 + nightly dev39/dev42)