[Bug]: Qwen3.6-27B-FP8 on GB10: get_output() spin-loop on stuck CUDA stream during prefill burst (run<6, KV<5%)


Your current environment

Environment

  • Hardware: NVIDIA GB10 Spark (4 hosts, identical config)
  • Kernel: 6.17.0-1018-nvidia (Ubuntu nvidia driver kernel)
  • NVIDIA driver: 580.159.03, CUDA 13.0
  • Container image: vllm/vllm-openai@sha256:f023269abe06db3a1a7cd9e170a0f5bd2b333a19ef9cb99ed8df97a70345bc25 (vLLM v0.21.0)
  • Model: Qwen/Qwen3.6-27B-FP8 (dense, Gated-DeltaNet hybrid attention, 262K native context, FP8 weights)
  • vLLM args:
    --max-model-len 262144 --gpu-memory-utilization 0.60 --port 8000
    --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml
    --max-log-len 200 --max-num-seqs 6

🐛 Describe the bug

Summary

vllm-qwen36 containers running Qwen3.6-27B-FP8 on NVIDIA GB10 (DGX Spark) hosts occasionally enter a stuck CUDA stream state under sustained inference load. The EngineCore process spin-loops at 99% CPU in vllm/v1/worker/gpu_model_runner.py::get_output(), waiting for a GPU forward-pass result that never arrives. /v1/models keeps answering (it is served by the parent API process), but /v1/chat/completions hangs. nvidia-smi reports GPU-Util pinned at 96% with abnormally low power draw (19 W vs 36 W on a healthy peer at the same utilization), which suggests the utilization counter is pinned on a stuck kernel doing no real work.

We have mitigations in place that keep this from being user-visible: a wedge-recovery watchdog auto-restarts the container in ~5 min, and a concurrency cap of --max-num-seqs 6 reduces incidence. This issue is to track the upstream cause so we can either fix it permanently or quantify the residual frequency over a longer window.

Observed wedge thresholds (running count at moment of wedge)

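| Date (UTC)       | Node          | running | KV usage | Trigger pattern                                       |
|------------------|---------------|---------|----------|-------------------------------------------------------|
| 2026-05-28 03:22 | gb10-2        | 12      | unknown  | Sustained load after pool unification                 |
| 2026-05-28 03:56 | gb10-2        | 8       | 7.67%    | ~15 min after first restart, no cap                   |
| 2026-05-28 04:35 | gb10-2 (llm1) | 19      | 18.66%   | Recovery-burst from research-engine retry queue       |
| 2026-05-28 09:51 | gb10-2        | 5       | 3.2%     | Below the --max-num-seqs 6 cap. Prompt-burst pattern. |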

The 09:51 incident is the clearest signal because the cap was already in effect: the wedge fired with the running count below the cap (run=5), which rules out simple concurrency saturation.

Smoking-gun vLLM engine log (09:51 UTC, llm2)

09:46:04  prompt 42  tok/s  gen 35  tok/s  run=4  KV=5.3%   ← healthy
09:46:14  prompt 77  tok/s  gen 30  tok/s  run=4  KV=4.1%
09:46:24  prompt 39  tok/s  gen 33  tok/s  run=4  KV=2.1%
09:46:34  prompt 89  tok/s  gen 28  tok/s  run=3  KV=1.6%
09:46:44  prompt 37  tok/s  gen 20  tok/s  run=2  KV=1.1%
09:46:54  prompt 222 tok/s  gen 19  tok/s  run=4  KV=2.2%   ← prefill burst
09:47:04  prompt 107 tok/s  gen 29  tok/s  run=4  KV=2.4%
09:47:14  prompt 31  tok/s  gen 31  tok/s  run=4  KV=2.5%
09:47:24  prompt 147 tok/s  gen 28  tok/s  run=5  KV=3.2%   ← prefill burst, run→5
09:47:34  prompt 0   tok/s  gen 0   tok/s  run=5  KV=3.2%   ⚠️ WEDGED (zero throughput)

Hypothesis: prefill burst at run=5 (one below the cap) hit a CUDA-stream / hybrid-attention edge case in the Qwen3.6 forward pass on GB10 silicon. Once the stream stalls, subsequent decode steps queue behind it and never advance.
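To make the stall in the log excerpt above easy to spot mechanically, here is a minimal Python sketch that parses lines in that condensed format and returns the first sample with requests still running but zero prompt and generation throughput. The regular expression and the names Sample, parse_line and first_stall are illustrative assumptions. The pattern matches the excerpt as written here, not vLLM's raw log output, and a single zero-throughput sample is only a hint, not proof of a wedge.

```python
import re
from typing import Iterable, NamedTuple, Optional


class Sample(NamedTuple):
    time: str
    prompt_tps: float
    gen_tps: float
    running: int
    kv_pct: float


# Assumption: matches the condensed excerpt format shown in this issue,
# e.g. "09:47:34  prompt 0   tok/s  gen 0   tok/s  run=5  KV=3.2%".
LINE_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+prompt\s+(?P<prompt>[\d.]+)\s+tok/s\s+"
    r"gen\s+(?P<gen>[\d.]+)\s+tok/s\s+run=(?P<run>\d+)\s+KV=(?P<kv>[\d.]+)%"
)


def parse_line(line: str) -> Optional[Sample]:
    """Return a Sample for a matching line, or None for anything else."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return Sample(
        time=m["time"],
        prompt_tps=float(m["prompt"]),
        gen_tps=float(m["gen"]),
        running=int(m["run"]),
        kv_pct=float(m["kv"]),
    )


def first_stall(lines: Iterable[str]) -> Optional[Sample]:
    """Return the first sample with running > 0 and zero throughput."""
    for line in lines:
        sample = parse_line(line)
        if sample is None:
            continue
        if sample.running > 0 and sample.prompt_tps == 0 and sample.gen_tps == 0:
            return sample
    return None
```

Fed the ten lines above, first_stall returns the 09:47:34 sample, the first interval in which run=5 coincides with zero throughput.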

py-spy capture of the wedged state (from a prior incident, same signature)

Thread 77 (active): "MainThread"
    get_output             (vllm/v1/worker/gpu_model_runner.py:274)
    result                 (vllm/v1/executor/uniproc_executor.py:38)
    step_with_batch_queue  (vllm/v1/engine/core.py:525)
    _process_engine_step   (vllm/v1/engine/core.py:1213)
    run_busy_loop          (vllm/v1/engine/core.py:1174)
    run_engine_core        (vllm/v1/engine/core.py:1133)

The engine has submitted work to the GPU executor and is busy-waiting on result(). The GPU never returns.

nvidia-smi signature when wedged

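| Metric             | Healthy peer (llm1) | Wedged node (llm2) |
|--------------------|---------------------|--------------------|
| GPU-Util           | 96%                 | 96% (pinned)       |
| Power              | 36 W                | 19 W ⚠️            |
| Temp               | 67°C                | 55°C ⚠️            |
| VRAM               | 67 GiB              | 67 GiB             |
| Generating tokens? | yes, 30 tok/s       | no, 0 tok/s        |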

GPU-Util at 96% with only 19 W power draw is incompatible with real compute — the utilization counter is pinned on a stuck kernel.
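The high-utilization, low-power signature could be checked automatically. The sketch below is a hedged illustration: the thresholds (90% utilization, 25 W) are assumptions chosen to sit between the 19 W and 36 W readings reported here and would need tuning per host, and the helper names are hypothetical. It relies on nvidia-smi's --query-gpu CSV output. Fields that the driver reports as unavailable would make the float conversion fail.

```python
import subprocess
from typing import Tuple


def parse_query_row(row: str) -> Tuple[float, float, float]:
    """Parse one 'util, power, temp' CSV row (noheader, nounits)."""
    util, power, temp = (field.strip() for field in row.split(","))
    return float(util), float(power), float(temp)


def looks_wedged(
    util_pct: float,
    power_w: float,
    *,
    min_util: float = 90.0,     # assumption: "pinned" means >= 90%
    max_power_w: float = 25.0,  # assumption: between the 19 W and 36 W readings
) -> bool:
    """True when utilization is high but power draw is too low for real compute."""
    return util_pct >= min_util and power_w <= max_power_w


def read_first_gpu() -> Tuple[float, float, float]:
    """Query the first GPU via nvidia-smi (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,power.draw,temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_query_row(out.splitlines()[0])
```

With the values from the table, looks_wedged(96, 19) is true for the wedged node and looks_wedged(96, 36) is false for the healthy peer. The power reading alone is only a heuristic, so it is best combined with a throughput check.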

Reproduction conditions (best guess)

  • Sustained workload of 3–5 concurrent generation requests with intermittent prompt-prefill bursts (100+ tok/s prefill while requests are running)
  • Long-context model (Qwen3.6-27B is 262K context, prefill batches can be tens of thousands of tokens)
  • Wedge frequency: rough estimate ~1 per 6–24 hours of active research-engine batch traffic per node (n=4 events in ~24 hours observed, all on the research pool)
  • No specific request seems to consistently trigger it — the wedge is a tail event in the load distribution, not a poison prompt

Mitigations in place (so this isn't user-visible)

  1. --max-num-seqs 6 cap per node — reduces frequency (eliminates the 8/12/19-running wedges seen pre-cap) but does NOT eliminate it (09:51 wedge fired at run=5)
  2. Auto-restart watchdog (scripts/ops/watchdog/vllm_wedge_recovery.py) — detects a wedge when (running > 0) AND (generation_tokens flat) AND (canary fails 3× over 10s), then runs docker restart vllm-qwen36 and sends a Discord alert. A full recovery cycle takes ~5 min. Cooldown and daily-cap guardrails prevent thrashing
  3. Pool-budget semaphore on caller side keeps in-flight count bounded so a wedged node doesn't cascade backpressure
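The detection rule in mitigation 2 can be expressed as a small pure function. This is a sketch of the stated condition only, not the contents of vllm_wedge_recovery.py: the Snapshot type, the function name and the way samples are collected are assumptions. In practice the running count and the cumulative generation-token counter would come from polling the server's metrics, and canary_results from short test requests.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Snapshot:
    running: int            # requests currently running on the engine
    generation_tokens: int  # cumulative generated-token counter


def is_wedged(
    snapshots: Sequence[Snapshot],
    canary_results: Sequence[bool],
    required_canary_failures: int = 3,
) -> bool:
    """(running > 0) AND (generation_tokens flat) AND (last N canaries failed)."""
    if len(snapshots) < 2 or len(canary_results) < required_canary_failures:
        return False  # not enough evidence yet
    has_running = all(s.running > 0 for s in snapshots)
    tokens_flat = snapshots[-1].generation_tokens == snapshots[0].generation_tokens
    recent = canary_results[-required_canary_failures:]
    canaries_failed = not any(recent)  # True entries mean the canary succeeded
    return has_running and tokens_flat and canaries_failed
```

Requiring all three signals keeps an idle node (running = 0) or a slow but progressing node (counter still rising) from triggering a restart.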

What we want from upstream

  • vLLM: can get_output() be made cancellable or given a timeout, so that a stuck stream surfaces as a request failure instead of an infinite engine spin? Today recovery requires an external watchdog. Even a coarse "if no progress in N seconds, raise and let the API server propagate 500" would let clients fail fast and reduce the blast radius
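As an illustration of the requested behavior, the sketch below polls a completion check against a deadline and raises instead of spinning forever. It is not vLLM code: ForwardPassTimeout and wait_with_deadline are hypothetical names, and whether get_output() can be restructured this way is an open question. On the PyTorch side, a non-blocking check such as torch.cuda.Event.query() could plausibly serve as is_done, since it returns without waiting on the stream.

```python
import time
from typing import Callable


class ForwardPassTimeout(RuntimeError):
    """Raised when the GPU result does not arrive within the deadline."""


def wait_with_deadline(
    is_done: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.001,
) -> None:
    """Poll is_done() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while not is_done():
        if time.monotonic() >= deadline:
            raise ForwardPassTimeout(
                f"no GPU progress within {timeout_s:.1f}s"
            )
        # Sleep briefly so the wait does not pin a CPU core at 99%.
        time.sleep(poll_interval_s)
```

Raising does not unstick the CUDA stream, so the process would likely still need a restart. The benefit is that in-flight requests fail with an error instead of hanging until an external watchdog intervenes.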

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
