vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-397B-A17B-NVFP4 engine hangs (Running≥1, 0 tok/s) under high concurrency on Blackwell GPUs

StepCodex · 2026-04-20T12:03:59Z


Your current environment

The output of python collect_env.py:

vLLM version: 0.19.1rc1.dev391+g80b18230e (CUDA 13.0)
GPU: NVIDIA B300 × 4 (single node, TP=4, EP=4)
Attention backend: FLASHINFER
Quantization: modelopt (NVFP4)

🐛 Describe the bug

Under high-concurrency load, the V1 engine silently stops generating tokens for the last in-flight request and never recovers. The API server stays healthy (/metrics keeps returning 200 OK), but no new tokens are produced and the request hangs forever until the job is killed externally.

This is NVFP4-specific. The FP8 build of the same model on the same GPUs never hangs; only the NVFP4 build does. The hang was observed on B200, B300, GB200, and GB300.

Reproducer

python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --trust-remote-code \
    --max-model-len 3072 \
    --no-enable-prefix-caching \
    --language-model-only \
    --async-scheduling \
    --attention-backend FLASHINFER \
    --enable-expert-parallel \
    --quantization modelopt \
    --compilation_config.max_cudagraph_capture_size 2048 \
    --speculative_config.method mtp \
    --speculative_config.num_speculative_tokens 3 \
    --host 0.0.0.0 --port 60000

vllm bench serve \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --dataset-name random \
    --random-input-len 1000 --random-output-len 2000 \
    --num-prompts 1536 --max-concurrency 512 \
    --ignore-eos

Full server logs: server_log.txt (https://github.com/user-attachments/files/26895130/server_log.txt)

Observed behavior

1. The server starts normally and serves the first ~1000 requests at full throughput (≈13k tok/s generation, 512 requests running concurrently).
2. Running drops quickly from 512 → 177 → 1 as requests finish.
3. The last request's generation throughput falls from ≈1640 tok/s to 0.0 tok/s, but Running: 1 never drops to zero.
4. From that point on, no new loggers.py or metrics.py lines are emitted; the engine stops reporting. Only GET /metrics 200 OK log spam continues (API server alive, engine frozen).
5. The request never completes. After 55 minutes, the server was killed by the job time limit.

Full engine log excerpt:

INFO 04-17 23:38:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12921.2 tokens/s, Running: 493 reqs, Waiting: 0 reqs, GPU KV cache usage: 65.1%
INFO 04-17 23:38:11 [metrics.py:101]  SpecDecoding metrics: Mean acceptance length: 3.07, Accepted throughput: 8715.68 tokens/s, Drafted throughput: 12624.90 tokens/s
INFO 04-17 23:38:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11353.3 tokens/s, Running: 177 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.4%
INFO 04-17 23:38:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1639.9 tokens/s, Running:   1 reqs, Waiting: 0 reqs, GPU KV cache usage:  0.1%
INFO 04-17 23:38:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:    0.0 tokens/s, Running:   1 reqs, Waiting: 0 reqs, GPU KV cache usage:  0.1%
# ... from here only `GET /metrics` 200 OK for 55 minutes, no more loggers.py/metrics.py lines ...

From the client side, only 1024 of 1536 requests receive a completion; 512 stay in flight forever until the client is killed externally.
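The stall signature in the excerpt above (Running ≥ 1 together with 0.0 tok/s generation) can be searched for mechanically in a saved server log. The following is a minimal sketch, not part of the original report: the regular expression is derived only from the log lines quoted above, and the helper name `find_stall` is hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches the periodic engine stats line format shown in the log excerpt.
STATS_RE = re.compile(
    r"Avg generation throughput:\s*(?P<gen>\d+(?:\.\d+)?) tokens/s, "
    r"Running:\s*(?P<running>\d+) reqs"
)


def find_stall(lines: Iterable[str]) -> Optional[str]:
    """Return the first stats line with Running >= 1 and 0.0 tok/s generation."""
    for line in lines:
        match = STATS_RE.search(line)
        if match is None:
            continue  # not an engine stats line
        running = int(match.group("running"))
        generation = float(match.group("gen"))
        if running >= 1 and generation == 0.0:
            return line
    return None


# Two stats lines copied from the excerpt: the last healthy one and the stalled one.
EXCERPT = [
    "INFO 04-17 23:38:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput: 1639.9 tokens/s, Running:   1 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage:  0.1%",
    "INFO 04-17 23:38:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput:    0.0 tokens/s, Running:   1 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage:  0.1%",
]

if __name__ == "__main__":
    print(find_stall(EXCERPT))
```

A single matching line is only a hint, since a short interval with no generated tokens can also occur in a healthy run. In this report the hang is confirmed by the fact that no further stats lines follow the match.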

Notes

- The APIServer process stays responsive to /metrics and /health, so this is not a crash; it is an engine-core deadlock or stuck-request state.
- GPU KV cache usage: 0.1% confirms the engine believes only one short sequence is active, yet no forward progress happens.
- Prefix caching is disabled (--no-enable-prefix-caching), so this is distinct from #37729, which involves prefix caching.
- MTP speculative decoding is enabled (--speculative_config.method mtp --speculative_config.num_speculative_tokens 3) together with --async-scheduling. It would be worth investigating whether either of these is involved in the deadlock.
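Because /health and /metrics keep answering during the hang, an HTTP status check alone cannot detect it. One possible approach is to compare two scrapes of /metrics and flag the case where requests are running but the generated-token counter has not advanced. The sketch below is not from the report and rests on two assumptions that should be verified against a real /metrics response: that the metric names `vllm:num_requests_running` and `vllm:generation_tokens_total` are exposed by the build in use, and that these values stop changing when the engine freezes. The function names are hypothetical.

```python
import re
from typing import Dict

# Assumed Prometheus metric names; confirm them against your own /metrics output.
RUNNING = "vllm:num_requests_running"
GENERATED = "vllm:generation_tokens_total"

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r"^(?P<name>[A-Za-z_:][\w:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)")


def parse_metrics(text: str) -> Dict[str, float]:
    """Sum sample values per metric name from Prometheus text exposition."""
    totals: Dict[str, float] = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue  # ignore lines whose value is not numeric
        name = match.group("name")
        totals[name] = totals.get(name, 0.0) + value
    return totals


def looks_stalled(previous_scrape: str, current_scrape: str) -> bool:
    """True if requests are running but no tokens were generated between scrapes."""
    before = parse_metrics(previous_scrape)
    after = parse_metrics(current_scrape)
    still_running = after.get(RUNNING, 0.0) >= 1
    no_progress = after.get(GENERATED, 0.0) <= before.get(GENERATED, 0.0)
    return still_running and no_progress
```

In use, the two arguments would be the response bodies of consecutive GET /metrics calls taken some seconds apart. The interval should be long enough that a healthy engine with a running request would certainly have produced tokens.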


extent analysis

TL;DR

No root cause is confirmed. A reasonable first step is to disable MTP speculative decoding and async scheduling, one at a time, since either may be contributing to the engine-core deadlock.

Guidance

- Investigate the role of MTP speculative decoding (--speculative_config.method mtp) in the deadlock, since it is enabled together with async scheduling (--async-scheduling).
- Disable MTP speculative decoding and async scheduling separately, rerunning the benchmark each time, to see whether the hang disappears.
- The logs show GPU KV cache usage at 0.1% when the hang occurs, which points away from cache exhaustion; confirm that this figure matches the single remaining request.
- Check the engine logs just before the last loggers.py line for errors or patterns that may indicate the cause of the deadlock.
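The A/B comparison suggested above can be laid out as four server command lines that differ only in the MTP and async-scheduling flags. This is a sketch under stated assumptions, not a confirmed fix: every flag is copied from the reproducer, and a feature is "disabled" here simply by omitting its flag. Whether omission actually turns the feature off depends on the defaults of the vLLM build in use, which the report does not state.

```python
import shlex
from itertools import product
from typing import Dict, List

# Flags copied from the reproducer that stay fixed across all variants.
BASE: List[str] = [
    "python3", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "nvidia/Qwen3.5-397B-A17B-NVFP4",
    "--tensor-parallel-size", "4",
    "--kv-cache-dtype", "fp8_e4m3",
    "--trust-remote-code",
    "--max-model-len", "3072",
    "--no-enable-prefix-caching",
    "--language-model-only",
    "--attention-backend", "FLASHINFER",
    "--enable-expert-parallel",
    "--quantization", "modelopt",
    "--compilation_config.max_cudagraph_capture_size", "2048",
    "--host", "0.0.0.0", "--port", "60000",
]

# The two features under suspicion, exactly as written in the reproducer.
MTP: List[str] = [
    "--speculative_config.method", "mtp",
    "--speculative_config.num_speculative_tokens", "3",
]
ASYNC: List[str] = ["--async-scheduling"]


def variants() -> Dict[str, List[str]]:
    """Build one argv per on/off combination of MTP and async scheduling."""
    result: Dict[str, List[str]] = {}
    for use_mtp, use_async in product([True, False], repeat=2):
        name = f"mtp={'on' if use_mtp else 'off'},async={'on' if use_async else 'off'}"
        # Assumption: leaving a flag out disables the feature in this build.
        result[name] = BASE + (MTP if use_mtp else []) + (ASYNC if use_async else [])
    return result


if __name__ == "__main__":
    for name, argv in variants().items():
        print(f"# {name}\n{shlex.join(argv)}\n")
```

Running the same vllm bench serve command from the reproducer against each variant shows which combination, if any, still hangs. The variant with both features on is the original configuration and serves as the control.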

Example

No code fix is known; the issue appears to stem from configuration and feature interactions rather than an identified code defect.

Notes

The issue seems to be specific to the NVFP4 build and is not observed in the FP8 build. The fact that the API server remains responsive to /metrics and /health requests suggests that the issue is not a crash, but rather an engine-core deadlock.

Recommendation

As a workaround, disable MTP speculative decoding and async scheduling one at a time and rerun the benchmark. If the hang disappears in one configuration, that feature is implicated. This is a starting point for investigation based on the information in the report, not a confirmed fix.
