vllm - 💡(How to fix) Fix [Bug]: EngineCore hangs in `_to_list` -> `cuEventSynchronize` under sustained traffic (Qwen3.6-35B-A3B-NVFP4 hybrid MoE, RTX 5090 sm_120; v0.21.0 + nightly dev39/dev42)

vllm2026-05-17 20:39:42

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Error Message

No exception, traceback, or other error appears anywhere in the log after the last successful step — the log just goes silent.
All other Python threads (NCCL watchdog/heartbeat, gloo runloop, ZMQ background, tqdm monitors, _report_usage_worker, etc.) are idle on futex_do_wait or do_epoll_wait. None are in error states.

Fix Action

Fix / Workaround

def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short term mitigation for issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams.
    # A cuda event sync would avoid such a situation. ...
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()      # ← never returns
    return pinned.tolist()

self.transfer_event.synchronize() is the call that does not return. The function itself is already a workaround introduced by #22754 / PR #22760 to avoid blocking other CUDA streams during the GPU→CPU copy of sampled token ids.

Code Example

Thread <pid> (active): "MainThread"
    cuEventSynchronize                                       (libcuda.so.595.45.04)   ← blocked
    cudaEventSynchronize                                     (libcudart.so.13)
    c10::cuda::impl::CUDAGuardImpl::synchronizeEvent         (libc10_cuda.so)
    THPEvent_synchronize                                     (libtorch_python.so)
    _to_list                                                 (vllm/v1/worker/gpu_model_runner.py:7102)*
    _bookkeeping_sync                                        (vllm/v1/worker/gpu_model_runner.py:3471)*
    sample_tokens                                            (vllm/v1/worker/gpu_model_runner.py:4376)*
    sample_tokens                                            (vllm/v1/worker/gpu_worker.py:780)
    collective_rpc                                           (vllm/v1/executor/uniproc_executor.py:93)
    sample_tokens                                            (vllm/v1/executor/uniproc_executor.py:125)
    step                                                     (vllm/v1/engine/core.py:426)
    _process_engine_step                                     (vllm/v1/engine/core.py:1213)
    run_busy_loop                                            (vllm/v1/engine/core.py:1174)
    run_engine_core                                          (vllm/v1/engine/core.py:1133)

---

def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short term mitigation for issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams.
    # A cuda event sync would avoid such a situation. ...
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()      # ← never returns
    return pinned.tolist()

---

vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --host 0.0.0.0 --port 8356 \
  --served-model-name qwen3.6-35b-a3b qwen3.6-35b-a3b-nothinker \
  --quantization compressed-tensors \
  --moe-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.847 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --trust-remote-code

RAW_BUFFERClick to expand / collapse

Symptom

After some hours of sustained chat-completion traffic, the engine stops producing tokens. The HTTP layer continues returning 200 OK on /health and on POST /v1/chat/completions, so the API server remains alive and keeps queueing new requests. Inside the engine, however:

No tokens are emitted for any in-flight or new request.
The periodic stats line (Avg prompt throughput: ... tokens/s) that vLLM normally logs every ~10 s stops appearing in the journal.
The last stats line before the silence consistently shows Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s with Running: N reqs for some N>0 — i.e. the engine reported zero throughput while still having sequences in flight, then stopped logging entirely.
No exception, traceback, or other error appears anywhere in the log after the last successful step — the log just goes silent.

The engine never recovers; only a restart clears the state.

Where it is stuck

py-spy dump --pid <EngineCore-pid> --native consistently shows the engine MainThread blocked inside the CUDA driver in cuEventSynchronize:

Thread <pid> (active): "MainThread"
    cuEventSynchronize                                       (libcuda.so.595.45.04)   ← blocked
    cudaEventSynchronize                                     (libcudart.so.13)
    c10::cuda::impl::CUDAGuardImpl::synchronizeEvent         (libc10_cuda.so)
    THPEvent_synchronize                                     (libtorch_python.so)
    _to_list                                                 (vllm/v1/worker/gpu_model_runner.py:7102)*
    _bookkeeping_sync                                        (vllm/v1/worker/gpu_model_runner.py:3471)*
    sample_tokens                                            (vllm/v1/worker/gpu_model_runner.py:4376)*
    sample_tokens                                            (vllm/v1/worker/gpu_worker.py:780)
    collective_rpc                                           (vllm/v1/executor/uniproc_executor.py:93)
    sample_tokens                                            (vllm/v1/executor/uniproc_executor.py:125)
    step                                                     (vllm/v1/engine/core.py:426)
    _process_engine_step                                     (vllm/v1/engine/core.py:1213)
    run_busy_loop                                            (vllm/v1/engine/core.py:1174)
    run_engine_core                                          (vllm/v1/engine/core.py:1133)

(* line numbers from 0.21.1rc1.dev39; the same call sites are at :7265 / :3529 / :4436 on 0.21.1rc1.dev42.)

The code at _to_list (gpu_model_runner.py) is:

def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short term mitigation for issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams.
    # A cuda event sync would avoid such a situation. ...
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()      # ← never returns
    return pinned.tolist()

The hang is steady-state: four py-spy dump --native snapshots spaced ~90 s apart, spanning 5 min 38 s of wall time on one captured incident, all show the identical stack. The engine does not progress at all in that interval.

State at the time of hang

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,power.draw --format=csv,noheader → 100 %, 2 %, 142.45 W on an RTX 5090 (TDP 600 W). The card reports fully utilised compute while drawing roughly a quarter of its capability — consistent with a single stream parked on an event sync rather than doing real work.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv → VLLM::EngineCore holding the expected steady-state VRAM (~26 GiB for this config).
/proc/<pid>/status reports State: R (running) with wchan: 0 — the main thread is busy in userspace (inside libcuda), not blocked in a kernel syscall.
/metrics showed vllm:num_requests_running = 5 (or 6, depending on the incident), vllm:num_requests_waiting = 0, vllm:num_preemptions_total = 0. The engine had in-flight sequences and was not preempting anything.
gdb -p <pid> agrees with py-spy: thread 1 (the MainThread) is inside libcuda.so.1, with frames in ?? () (no exported symbols in libcuda).
All other Python threads (NCCL watchdog/heartbeat, gloo runloop, ZMQ background, tqdm monitors, _report_usage_worker, etc.) are idle on futex_do_wait or do_epoll_wait. None are in error states.

Versions where this has been observed

The same stack and the same external symptoms occurred on all three vLLM builds I have tested with this model on this hardware:

vLLM version	Build kind	Affected
`0.21.0` (commit `ad7125a`)	stable wheel	yes
`0.21.1rc1.dev39+g0fa888465`	nightly	yes
`0.21.1rc1.dev42+g966903eb9`	nightly	yes

The most recent capture (dev42) was triggered automatically by a watchdog that monitors the stats-line heartbeat in the journal; the dump was taken 83 s after the last Avg prompt throughput line was emitted, and shows the identical _to_list → cuEventSynchronize stack.

Reproduction

Hardware: NVIDIA RTX 5090 (Blackwell, sm_120), 32 GiB VRAM, driver 595.45.04. Single-GPU host (CUDA_VISIBLE_DEVICES=0).

Software: CUDA 13.2 on the host, torch CUDA build cu130. PyTorch 2.11.0+cu130, FlashInfer 0.6.8.post1 (on 0.21.0) / 0.6.11.post2 (on the nightly). Python 3.13.13.

Model: RedHatAI/Qwen3.6-35B-A3B-NVFP4 — 40-layer hybrid MoE: 10 full-attention + 30 GDN linear-attention layers, NVFP4 (compressed-tensors) weights, bundled vision encoder.

Launch command (the one that produced the captured hangs; I have not bisected which flags are necessary):

vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --host 0.0.0.0 --port 8356 \
  --served-model-name qwen3.6-35b-a3b qwen3.6-35b-a3b-nothinker \
  --quantization compressed-tensors \
  --moe-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.847 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 4096 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --trust-remote-code

The hangs appear after extended periods (hours) of multi-turn chat traffic with concurrency > 1. Single-request smoke tests against a freshly started engine do not reproduce. I have not narrowed which specific request or batch state triggers the hang.

I have also tested with --no-async-scheduling added to the command above; the hang still occurs and still terminates at the same _to_list → cuEventSynchronize stack (synchronous-scheduling code path, step() rather than step_with_batch_queue). Disabling async scheduling is not sufficient to prevent the hang on this stack.

What I have not verified

I have not bisected which flags are needed. The launch command above is what reproduces; I have not removed flags one at a time to narrow it down.
I have not reproduced on hardware other than RTX 5090 (sm_120). I do not know whether this is Blackwell-specific.
I have not captured a CUDA-side trace (nsys profile, cuda-gdb thread apply all bt) of the stuck driver call — only the Python+native userspace stack via py-spy and a high-level gdb thread view.
I do not have a small repro outside this serving config — every observation is from a real workload.

I have py-spy native dumps, gdb thread snapshots, /metrics dumps, and journals from several incidents (including one captured automatically by a watchdog on dev42 at 2026-05-18 00:30:07 +03:00). Happy to upload any of them or run additional captures (nsys, longer spans, alternate flag combinations) if it would help diagnosis — please say which.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #configuration error #environment variable #network issue #logging issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: EngineCore hangs in `_to_list` -> `cuEventSynchronize` under sustained traffic (Qwen3.6-35B-A3B-NVFP4 hybrid MoE, RTX 5090 sm_120; v0.21.0 + nightly dev39/dev42)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fix / Workaround

Code Example

Symptom

Where it is stuck

State at the time of hang

Versions where this has been observed

Reproduction

What I have not verified

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: EngineCore hangs in `_to_list` -> `cuEventSynchronize` under sustained traffic (Qwen3.6-35B-A3B-NVFP4 hybrid MoE, RTX 5090 sm_120; v0.21.0 + nightly dev39/dev42)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fix / Workaround

Code Example

Symptom

Where it is stuck

State at the time of hang

Versions where this has been observed

Reproduction

What I have not verified

Still need to ship something?

RELATED_DISCOVERY

TRENDING