vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-35B-A3B-FP8 inference output terminates unexpectedly, logs show normal but request hangs [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#36736Fetched 2026-04-08 00:35:11
View on GitHub
Comments
1
Participants
2
Timeline
3
Reactions
0
Timeline (top)
subscribed ×2commented ×1

Fix Action

Fix / Workaround

Question: Any suggestions for workarounds or fixes for this issue?

Code Example

python3 -m vllm.entrypoints.openai.api_server \
 --model /home/ragnarokchan/models/Qwen3.5-35B-A3B-FP8 \
 --served-model-name Qwen3.5-35B-A3B-FP8 \
 --trust-remote-code \
 --gpu-memory-utilization 0.85 \
 --host 0.0.0.0 \
 --port 8000 \
 --tensor-parallel-size 2 \
 --enable-chunked-prefill \
 --max-num-seqs 16 \
 --max-model-len 65536 \
 --tool-call-parser qwen3_coder \
 --enable-auto-tool-choice \
 --calculate-kv-scales \
 --reasoning-parser qwen3

---

(APIServer pid=58580) INFO 03-11 11:13:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 269.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:13:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 182.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:48 [loggers.py:259] Engine 000: Avg prompt throughput: 425.2 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 148.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.4%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56267 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:14:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
RAW_BUFFERClick to expand / collapse

Environment:

  • vLLM version: 0.17+ (CUDA 130)
  • Model: Qwen/Qwen3.5-35B-A3B-FP8
  • GPU: RTX 5090D × 2
  • Open WebUI version: 0.8.10
  • Launch command:
python3 -m vllm.entrypoints.openai.api_server \
 --model /home/ragnarokchan/models/Qwen3.5-35B-A3B-FP8 \
 --served-model-name Qwen3.5-35B-A3B-FP8 \
 --trust-remote-code \
 --gpu-memory-utilization 0.85 \
 --host 0.0.0.0 \
 --port 8000 \
 --tensor-parallel-size 2 \
 --enable-chunked-prefill \
 --max-num-seqs 16 \
 --max-model-len 65536 \
 --tool-call-parser qwen3_coder \
 --enable-auto-tool-choice \
 --calculate-kv-scales \
 --reasoning-parser qwen3

Bug Description: When using Open WebUI to call vLLM for inference, the output suddenly terminates during generation. Logs show everything is normal, request status shows 200 OK, but the client hangs and cannot get the complete output.

The vLLM service itself does not crash. Re-sending the prompt (with priority) or opening a new chat can continue inference, but the same issue occurs again quickly.

Steps to Reproduce:

  1. Start vLLM service (configuration as above)
  2. Send a chat request via Open WebUI
  3. Model starts generating output, but stops mid-way
  4. Client cannot get complete response, request appears successful but content is truncated

Logs:

(APIServer pid=58580) INFO 03-11 11:13:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 269.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:13:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 182.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:48 [loggers.py:259] Engine 000: Avg prompt throughput: 425.2 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 148.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.4%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56267 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:14:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Question: Any suggestions for workarounds or fixes for this issue?

extent analysis

Fix Plan

To address the issue of truncated output during inference, we'll focus on adjusting the configuration and implementing a workaround to ensure complete responses are received by the client.

Configuration Adjustments

  1. Increase --max-num-seqs: Try increasing the maximum number of sequences to a higher value (e.g., 32 or 64) to allow for longer generations.
  2. Adjust --gpu-memory-utilization: Lower the GPU memory utilization to prevent overloading the GPU, which might cause truncation (e.g., set it to 0.7 or 0.8).
  3. Enable Streaming: If the model supports it, enable streaming to receive outputs in chunks, which can help mitigate truncation issues.

Code Workaround

Implement a retry mechanism on the client-side to resend the request if the response is truncated. This can be done by checking the response length against an expected minimum length or by looking for a specific ending token that indicates completion.

Example Retry Mechanism (Python)

import requests

def send_request(prompt, max_retries=3):
    url = "http://0.0.0.0:8000/v1/chat/completions"
    params = {"prompt": prompt}
    retries = 0
    while retries < max_retries:
        response = requests.post(url, json=params)
        if response.status_code == 200 and len(response.text) > 100:  # Adjust the length check as needed
            return response.text
        retries += 1
    return None

# Example usage
prompt = "Your prompt here"
response = send_request(prompt)
if response:
    print(response)
else:
    print("Failed to get a complete response.")

Verification

  • Monitor the logs for any changes in behavior after applying the configuration adjustments and implementing the retry mechanism.
  • Test with various prompts to ensure that complete responses are consistently received.
  • Adjust the retry mechanism and configuration settings as needed based on the outcomes of these tests.

Extra Tips

  • Regularly update the vLLM and Open WebUI to the latest versions to ensure you have the latest fixes and improvements.
  • Consider implementing a more sophisticated method to detect truncated responses, such as checking for specific tokens or patterns that indicate the end of a generation.
  • If the issue persists, explore increasing the resources (e.g., GPU memory, model parallelism) allocated to the vLLM service.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING