vllm - [Bug]: Qwen3.5-397B-A17B-NVFP4 engine hangs (Running≥1, 0 tok/s) under high concurrency on Blackwell GPUs


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
vLLM version: 0.19.1rc1.dev391+g80b18230e (CUDA 13.0)
GPU: NVIDIA B300 × 4 (single node, TP=4, EP=4)
Attention backend: FLASHINFER
Quantization: modelopt (NVFP4)
</details>

🐛 Describe the bug

Under high-concurrency load, the V1 engine silently stops generating tokens for the last in-flight request and never recovers. The API server stays healthy (/metrics keeps returning 200 OK), but no new tokens are produced and the request hangs forever until the job is killed externally.

This is NVFP4-specific. The FP8 build of the same model on the same GPUs never hangs; only the NVFP4 build does. The hang was observed on B200, B300, GB200, and GB300.

Reproducer

python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --trust-remote-code \
    --max-model-len 3072 \
    --no-enable-prefix-caching \
    --language-model-only \
    --async-scheduling \
    --attention-backend FLASHINFER \
    --enable-expert-parallel \
    --quantization modelopt \
    --compilation_config.max_cudagraph_capture_size 2048 \
    --speculative_config.method mtp \
    --speculative_config.num_speculative_tokens 3 \
    --host 0.0.0.0 --port 60000

vllm bench serve \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --dataset-name random \
    --random-input-len 1000 --random-output-len 2000 \
    --num-prompts 1536 --max-concurrency 512 \
    --ignore-eos

Full server logs: server_log.txt

Observed behavior

  1. The server starts normally and serves the first ~1000 requests at full throughput (≈13k tok/s generation, 512 requests running concurrently).
  2. Running drops quickly from 512 → 177 → 1 as requests finish.
  3. The last request's generation throughput goes from ≈1640 tok/s to 0.0 tok/s, but Running: 1 never drops to zero.
  4. From that point on, no new loggers.py or metrics.py lines are emitted; the engine stops reporting. Only GET /metrics 200 OK log lines continue (API server alive, engine frozen).
  5. The request never completes. After 55 minutes the server was killed by the time limit.

Engine log excerpt:

INFO 04-17 23:38:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12921.2 tokens/s, Running: 493 reqs, Waiting: 0 reqs, GPU KV cache usage: 65.1%
INFO 04-17 23:38:11 [metrics.py:101]  SpecDecoding metrics: Mean acceptance length: 3.07, Accepted throughput: 8715.68 tokens/s, Drafted throughput: 12624.90 tokens/s
INFO 04-17 23:38:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11353.3 tokens/s, Running: 177 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.4%
INFO 04-17 23:38:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1639.9 tokens/s, Running:   1 reqs, Waiting: 0 reqs, GPU KV cache usage:  0.1%
INFO 04-17 23:38:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:    0.0 tokens/s, Running:   1 reqs, Waiting: 0 reqs, GPU KV cache usage:  0.1%
# ... from here only `GET /metrics` 200 OK for 55 minutes, no more loggers.py/metrics.py lines ...

From the client side, only 1024 of 1536 requests receive a completion; 512 stay in flight forever until the client is killed externally.
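
The hang signature in the engine log excerpt (Running ≥ 1 with 0.0 tok/s prompt and generation throughput) can be searched for mechanically. The following is a minimal Python sketch, not part of the original report; the function names are hypothetical, and the regular expression assumes the loggers.py line format stays exactly as shown in the excerpt. Because the report also says the log lines stop entirely after the hang, an absence of new status lines is a second signal this sketch does not cover.

```python
import re

# Matches the periodic loggers.py status line as it appears in the excerpt.
STATUS_RE = re.compile(
    r"Avg prompt throughput:\s*([\d.]+) tokens/s, "
    r"Avg generation throughput:\s*([\d.]+) tokens/s, "
    r"Running:\s*(\d+) reqs, Waiting:\s*(\d+) reqs"
)

def parse_status(line):
    """Return (prompt_tok_s, gen_tok_s, running, waiting), or None for other lines."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2)), int(m.group(3)), int(m.group(4))

def find_stall(lines):
    """Return the first status line with running requests but zero throughput."""
    for line in lines:
        status = parse_status(line)
        if status is None:
            continue
        prompt, gen, running, _waiting = status
        if running >= 1 and prompt == 0.0 and gen == 0.0:
            return line
    return None

# Lines copied from the engine log excerpt in this report.
EXCERPT = [
    "INFO 04-17 23:38:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12921.2 tokens/s, Running: 493 reqs, Waiting: 0 reqs, GPU KV cache usage: 65.1%",
    "INFO 04-17 23:38:11 [metrics.py:101]  SpecDecoding metrics: Mean acceptance length: 3.07, Accepted throughput: 8715.68 tokens/s, Drafted throughput: 12624.90 tokens/s",
    "INFO 04-17 23:38:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11353.3 tokens/s, Running: 177 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.4%",
    "INFO 04-17 23:38:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1639.9 tokens/s, Running:   1 reqs, Waiting: 0 reqs, GPU KV cache usage:  0.1%",
    "INFO 04-17 23:38:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:    0.0 tokens/s, Running:   1 reqs, Waiting: 0 reqs, GPU KV cache usage:  0.1%",
]

print(find_stall(EXCERPT))
```

Run against a full server log, the same function can show whether the stall line appears in other runs, for example the FP8 runs that reportedly never hang.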

Notes

  • The APIServer process stays responsive to /metrics and /health, so this is not a crash — it is an engine-core deadlock / stuck-request state.
  • GPU KV cache usage: 0.1% confirms the engine believes only one short sequence is active, yet no forward progress happens.
  • Prefix caching is disabled (--no-enable-prefix-caching), so this is distinct from #37729 which involves prefix caching.
  • MTP speculative decoding is enabled (--speculative_config.method mtp --speculative_config.num_speculative_tokens 3) together with --async-scheduling. It would be worth investigating whether either of these is involved in the deadlock.
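
Since /metrics and /health keep answering 200 OK during the hang, a plain liveness probe will not notice it. One possible watchdog, sketched below in Python, compares two /metrics scrapes and flags the case where requests are still counted as running but the generated-token counter has not advanced. This is an illustration, not something from the original report: the metric names vllm:num_requests_running and vllm:generation_tokens_total are assumptions that should be checked against the /metrics output of the build in use, and the helper names are hypothetical.

```python
import urllib.request

# Assumed metric names; confirm them against this build's /metrics output.
RUNNING_METRIC = "vllm:num_requests_running"
TOKENS_METRIC = "vllm:generation_tokens_total"

def fetch_metrics(url="http://localhost:60000/metrics", timeout=5.0):
    """Download the Prometheus text exposition from the API server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def sum_metric(text, name):
    """Sum all samples of one metric across label sets; None if it is absent.

    Assumes sample lines have no trailing timestamp field.
    """
    total = 0.0
    found = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            total += float(value)
            found = True
    return total if found else None

def is_stalled(prev_text, curr_text):
    """True if requests are running but no tokens were generated between scrapes."""
    running = sum_metric(curr_text, RUNNING_METRIC)
    before = sum_metric(prev_text, TOKENS_METRIC)
    after = sum_metric(curr_text, TOKENS_METRIC)
    if running is None or before is None or after is None:
        return False  # metric missing: cannot decide, so do not raise an alarm
    return running >= 1 and after <= before
```

In use, fetch_metrics would be called twice some interval apart (say, a minute) and the two results passed to is_stalled. The interval is a judgment call: it should be long enough that a healthy engine with running requests would certainly have produced tokens.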


extent analysis

TL;DR

A possible workaround is to disable MTP speculative decoding or async scheduling, since either may be contributing to the engine-core deadlock. Neither has been confirmed as the cause.

Guidance

  • Determine whether MTP speculative decoding (--speculative_config.method mtp) is involved in the deadlock; it is enabled together with async scheduling (--async-scheduling).
  • Disable MTP speculative decoding and async scheduling one at a time, rerunning the benchmark after each change, to see whether the hang disappears.
  • Check whether the reported GPU KV cache usage (0.1% at the time of the hang) is consistent with the single request still counted as running.
  • Check the engine logs for errors or patterns immediately before throughput drops to 0.0 tok/s.
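
The "disable one feature at a time" step can be organized as a small ablation matrix. The Python sketch below builds the four server command lines by dropping flags from the reproducer; it only removes flags that appear in the report and adds none. The variant names and helper are hypothetical. One assumption to verify: omitting --async-scheduling only disables async scheduling if it is off by default in the build under test.

```python
import shlex

# Server flags copied from the reproducer; each tuple is one flag with its value.
SERVER_ARGS = [
    ("--model", "nvidia/Qwen3.5-397B-A17B-NVFP4"),
    ("--tensor-parallel-size", "4"),
    ("--kv-cache-dtype", "fp8_e4m3"),
    ("--trust-remote-code",),
    ("--max-model-len", "3072"),
    ("--no-enable-prefix-caching",),
    ("--language-model-only",),
    ("--async-scheduling",),
    ("--attention-backend", "FLASHINFER"),
    ("--enable-expert-parallel",),
    ("--quantization", "modelopt"),
    ("--compilation_config.max_cudagraph_capture_size", "2048"),
    ("--speculative_config.method", "mtp"),
    ("--speculative_config.num_speculative_tokens", "3"),
    ("--host", "0.0.0.0"),
    ("--port", "60000"),
]

# Variant name -> flag prefixes to drop from the reproducer command.
VARIANTS = {
    "baseline": (),
    "no_mtp": ("--speculative_config.",),
    "no_async": ("--async-scheduling",),
    "no_mtp_no_async": ("--speculative_config.", "--async-scheduling"),
}

def build_command(drop_prefixes=()):
    """Return the server argv with every flag matching a dropped prefix removed."""
    cmd = ["python3", "-m", "vllm.entrypoints.openai.api_server"]
    for group in SERVER_ARGS:
        if any(group[0].startswith(prefix) for prefix in drop_prefixes):
            continue
        cmd.extend(group)
    return cmd

for name, drop in VARIANTS.items():
    print(f"# {name}")
    print(shlex.join(build_command(drop)))
```

Each variant would be started in turn and exercised with the same vllm bench serve command from the reproducer. If only the variants without one of the two features complete all 1536 requests, that feature is the more likely contributor.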

Example

The original report includes no code beyond the reproducer commands; the issue concerns configuration and feature interactions rather than a specific code path.

Notes

The issue seems to be specific to the NVFP4 build and is not observed in the FP8 build. The fact that the API server remains responsive to /metrics and /health requests suggests that the issue is not a crash, but rather an engine-core deadlock.

Recommendation

Disable MTP speculative decoding and async scheduling one at a time and rerun the benchmark. If the hang disappears with one of them off, that feature is likely involved. Neither is confirmed as the cause, so treat this as a starting point for investigation rather than a fix.
