vllm - 💡(How to fix) Fix [Bug]: Qwen3.6-27B-FP8 on GB10: get_output() spin-loop on stuck CUDA stream during prefill burst (run<6, KV<5%)

vllm-qwen36 containers running Qwen3.6-27B-FP8 on NVIDIA GB10 (DGX Spark) hosts occasionally enter a stuck CUDA stream state under sustained inference load: the EngineCore process spin-loops at 99% CPU in vllm/v1/worker/gpu_model_runner.py::get_output(), waiting for a GPU forward-pass result that never arrives. /v1/models keeps answering (it is served by the parent API process), but /v1/chat/completions hangs. nvidia-smi reports GPU-Util pinned at 96% with abnormally low power draw (19 W vs 36 W on a healthy peer at the same utilization), i.e. the utilization counter is pinned on a stuck kernel that is doing no real work.

We have mitigations in place that keep this from being user-visible: a wedge-recovery watchdog auto-restarts the container in ~5 min, and a concurrency cap (--max-num-seqs 6) reduces incidence. This issue tracks the upstream cause so that we can either fix it permanently or quantify the residual frequency over a longer window.

Your current environment

Environment

Hardware: NVIDIA GB10 Spark (4 hosts, identical config)
Kernel: 6.17.0-1018-nvidia (Ubuntu nvidia driver kernel)
NVIDIA driver: 580.159.03, CUDA 13.0
Container image: vllm/vllm-openai@sha256:f023269abe06db3a1a7cd9e170a0f5bd2b333a19ef9cb99ed8df97a70345bc25 (vLLM v0.21.0)
Model: Qwen/Qwen3.6-27B-FP8 (dense, Gated-DeltaNet hybrid attention, 262K native context, FP8 weights)

vLLM args:

--max-model-len 262144 --gpu-memory-utilization 0.60 --port 8000
--reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml
--max-log-len 200 --max-num-seqs 6

🐛 Describe the bug

Summary

Observed wedge thresholds (running count at moment of wedge)

| Date (UTC) | Node | running | KV usage | Trigger pattern |
| --- | --- | --- | --- | --- |
| 2026-05-28 03:22 | gb10-2 | 12 | unknown | Sustained load after pool unification |
| 2026-05-28 03:56 | gb10-2 | 8 | 7.67% | ~15 min after first restart, no cap |
| 2026-05-28 04:35 | gb10-2 (llm1) | 19 | 18.66% | Recovery-burst from research-engine retry queue |
| 2026-05-28 09:51 | gb10-2 | 5 | 3.2% | Below the `--max-num-seqs 6` cap. Prompt-burst pattern. |

The 09:51 incident is the most informative data point because the cap was already in effect: the wedge fired with the running count BELOW the cap, which rules out simple concurrency saturation.

Smoking-gun vLLM engine log (09:51 UTC, llm2)

09:46:04  prompt 42  tok/s  gen 35  tok/s  run=4  KV=5.3%   ← healthy
09:46:14  prompt 77  tok/s  gen 30  tok/s  run=4  KV=4.1%
09:46:24  prompt 39  tok/s  gen 33  tok/s  run=4  KV=2.1%
09:46:34  prompt 89  tok/s  gen 28  tok/s  run=3  KV=1.6%
09:46:44  prompt 37  tok/s  gen 20  tok/s  run=2  KV=1.1%
09:46:54  prompt 222 tok/s  gen 19  tok/s  run=4  KV=2.2%   ← prefill burst
09:47:04  prompt 107 tok/s  gen 29  tok/s  run=4  KV=2.4%
09:47:14  prompt 31  tok/s  gen 31  tok/s  run=4  KV=2.5%
09:47:24  prompt 147 tok/s  gen 28  tok/s  run=5  KV=3.2%   ← prefill burst, run→5
09:47:34  prompt 0   tok/s  gen 0   tok/s  run=5  KV=3.2%   ⚠️ WEDGED (zero throughput)

Hypothesis: prefill burst at run=5 (one below the cap) hit a CUDA-stream / hybrid-attention edge case in the Qwen3.6 forward pass on GB10 silicon. Once the stream stalls, subsequent decode steps queue behind it and never advance.

py-spy capture of the wedged state (from a prior incident, same signature)

Thread 77 (active): "MainThread"
    get_output             (vllm/v1/worker/gpu_model_runner.py:274)
    result                 (vllm/v1/executor/uniproc_executor.py:38)
    step_with_batch_queue  (vllm/v1/engine/core.py:525)
    _process_engine_step   (vllm/v1/engine/core.py:1213)
    run_busy_loop          (vllm/v1/engine/core.py:1174)
    run_engine_core        (vllm/v1/engine/core.py:1133)

The engine has submitted work to the GPU executor and is busy-waiting on result(). The GPU never returns.

`nvidia-smi` signature when wedged

| Metric | Healthy peer (llm1) | Wedged node (llm2) |
| --- | --- | --- |
| GPU-Util | 96% | 96% (pinned) |
| Power | 36 W | 19 W ⚠️ |
| Temp | 67°C | 55°C ⚠️ |
| VRAM | 67 GiB | 67 GiB |
| Generating tokens? | yes, 30 tok/s | no, 0 tok/s |

GPU-Util at 96% with only 19 W power draw is incompatible with real compute — the utilization counter is pinned on a stuck kernel.
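The signature above can be turned into a coarse check. The following is a minimal sketch, assuming samples come from `nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv,noheader,nounits`. The names `GpuSample`, `parse_sample` and `looks_stuck`, and both thresholds (90% utilization, 60% of a healthy peer's power), are illustrative assumptions and not part of our tooling or of vLLM.

```python
from typing import NamedTuple


class GpuSample(NamedTuple):
    util_pct: float
    power_w: float


def parse_sample(csv_line: str) -> GpuSample:
    """Parse one CSV line such as '96, 19.00' (utilization %, power in W)."""
    util, power = (field.strip() for field in csv_line.split(","))
    return GpuSample(float(util), float(power))


def looks_stuck(sample: GpuSample, healthy_power_w: float,
                min_util_pct: float = 90.0, power_ratio: float = 0.6) -> bool:
    """High reported utilization with power well below a healthy peer's.

    Both thresholds are hypothetical; tune them against a healthy baseline.
    """
    return (sample.util_pct >= min_util_pct
            and sample.power_w < power_ratio * healthy_power_w)


# Values taken from the table above: wedged node vs. healthy peer (36 W).
wedged = parse_sample("96, 19")
healthy = parse_sample("96, 36")
print(looks_stuck(wedged, healthy_power_w=36.0))   # True
print(looks_stuck(healthy, healthy_power_w=36.0))  # False
```

A check like this is only a hint: low power at high utilization could have other causes, so it is probably best combined with a throughput signal rather than used alone.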

Reproduction conditions (best guess)

- Sustained workload of 3–5 concurrent generation requests with intermittent prompt-prefill bursts (100+ tok/s prefill while requests are running)
- Long-context model (Qwen3.6-27B is 262K context, prefill batches can be tens of thousands of tokens)
- Wedge frequency: rough estimate ~1 per 6–24 hours of active research-engine batch traffic per node (n=4 events in ~24 hours observed, all on the research pool)
- No specific request seems to consistently trigger it; the wedge is a tail event in the load distribution, not a poison prompt

Mitigations in place (so this isn't user-visible)

- `--max-num-seqs 6` cap per node: reduces frequency (eliminates the 8/12/19-running wedges seen pre-cap) but does NOT eliminate it (09:51 wedge fired at run=5)
- Auto-restart watchdog (scripts/ops/watchdog/vllm_wedge_recovery.py): detects wedge via (running > 0) AND (generation_tokens flat) AND (canary fails 3× over 10s), then docker restart vllm-qwen36 + Discord alert. ~5 min full recovery cycle. Cooldown + daily-cap guardrails prevent thrashing
- Pool-budget semaphore on caller side keeps in-flight count bounded so a wedged node doesn't cascade backpressure
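For readers who want to reproduce the watchdog's detection rule, here is a minimal sketch of the three-way condition described above. It is not the contents of vllm_wedge_recovery.py: `EngineSample`, `is_wedged` and the way samples and canary results are collected are hypothetical, and the restart, alerting, cooldown and daily-cap logic is omitted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EngineSample:
    running: int            # requests currently running
    generation_tokens: int  # cumulative generated-token counter


def is_wedged(samples: Sequence[EngineSample],
              canary_ok: Sequence[bool],
              required_canary_failures: int = 3) -> bool:
    """True only if requests are running, no tokens were generated across
    the sample window, and the most recent canary probes all failed."""
    if len(samples) < 2 or len(canary_ok) < required_canary_failures:
        return False  # not enough evidence yet
    busy = all(s.running > 0 for s in samples)
    flat = samples[-1].generation_tokens == samples[0].generation_tokens
    canary_failed = not any(canary_ok[-required_canary_failures:])
    return busy and flat and canary_failed


# Wedged: run=5, token counter flat, three failed canaries.
stuck = [EngineSample(5, 1000), EngineSample(5, 1000), EngineSample(5, 1000)]
print(is_wedged(stuck, canary_ok=[False, False, False]))  # True

# Healthy: token counter still advancing, so no restart.
alive = [EngineSample(5, 1000), EngineSample(5, 1300)]
print(is_wedged(alive, canary_ok=[False, False, False]))  # False
```

Requiring all three signals is a deliberate trade-off: it avoids restarting a node that is merely slow or idle, at the cost of a few extra seconds before a real wedge is declared.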

What we want from upstream

vLLM: can get_output() be made cancellable/timeoutable, so a stuck stream surfaces as a request failure instead of an infinite engine spin? Today it requires an external watchdog to recover. Even a coarse "if no progress in N seconds, raise and let the API server propagate 500" would let clients fail-fast and reduce blast radius
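To make the request concrete, this is roughly the shape of bounded wait we have in mind. It is a sketch under stated assumptions, not a patch against vLLM: `wait_for_result` and `ForwardPassTimeout` are hypothetical names, and `is_ready` stands in for whatever non-blocking completion check the engine could use (for example, something like `torch.cuda.Event.query()`, which we assume could be wired in at this point).

```python
import time
from typing import Callable


class ForwardPassTimeout(RuntimeError):
    """Hypothetical error raised when the GPU result does not arrive in time."""


def wait_for_result(is_ready: Callable[[], bool], timeout_s: float,
                    poll_interval_s: float = 0.001) -> None:
    """Poll a non-blocking readiness check until it succeeds or a deadline passes."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise ForwardPassTimeout(f"no progress after {timeout_s:.2f}s")
        time.sleep(poll_interval_s)  # yield the CPU instead of spinning at 99%


# A result that becomes ready on the third poll returns normally.
polls = iter([False, False, True])
wait_for_result(lambda: next(polls), timeout_s=1.0)

# A result that never becomes ready surfaces as an error instead of a hang.
try:
    wait_for_result(lambda: False, timeout_s=0.05)
except ForwardPassTimeout as exc:
    print("timed out:", exc)
```

Raising here would not un-stick the CUDA stream itself, so a restart would likely still be needed; the point is that the API server could return an error promptly rather than leaving clients hanging.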

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
