vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-35B-A3B-FP8 inference output terminates unexpectedly, logs show normal but request hangs [1 comments, 2 participants]

Code Example

python3 -m vllm.entrypoints.openai.api_server \
 --model /home/ragnarokchan/models/Qwen3.5-35B-A3B-FP8 \
 --served-model-name Qwen3.5-35B-A3B-FP8 \
 --trust-remote-code \
 --gpu-memory-utilization 0.85 \
 --host 0.0.0.0 \
 --port 8000 \
 --tensor-parallel-size 2 \
 --enable-chunked-prefill \
 --max-num-seqs 16 \
 --max-model-len 65536 \
 --tool-call-parser qwen3_coder \
 --enable-auto-tool-choice \
 --calculate-kv-scales \
 --reasoning-parser qwen3

---

(APIServer pid=58580) INFO 03-11 11:13:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 269.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:13:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 182.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:48 [loggers.py:259] Engine 000: Avg prompt throughput: 425.2 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 148.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.4%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56267 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:14:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Environment:

vLLM version: 0.17+ (CUDA 130)
Model: Qwen/Qwen3.5-35B-A3B-FP8
GPU: RTX 5090D × 2
Open WebUI version: 0.8.10
Launch command:

python3 -m vllm.entrypoints.openai.api_server \
 --model /home/ragnarokchan/models/Qwen3.5-35B-A3B-FP8 \
 --served-model-name Qwen3.5-35B-A3B-FP8 \
 --trust-remote-code \
 --gpu-memory-utilization 0.85 \
 --host 0.0.0.0 \
 --port 8000 \
 --tensor-parallel-size 2 \
 --enable-chunked-prefill \
 --max-num-seqs 16 \
 --max-model-len 65536 \
 --tool-call-parser qwen3_coder \
 --enable-auto-tool-choice \
 --calculate-kv-scales \
 --reasoning-parser qwen3

Bug Description: When using Open WebUI to call vLLM for inference, the output suddenly terminates during generation. Logs show everything is normal, request status shows 200 OK, but the client hangs and cannot get the complete output.

The vLLM service itself does not crash. Re-sending the prompt (with priority) or opening a new chat can continue inference, but the same issue occurs again quickly.

Steps to Reproduce:

Start vLLM service (configuration as above)
Send a chat request via Open WebUI
Model starts generating output, but stops mid-way
Client cannot get complete response, request appears successful but content is truncated

Logs:

(APIServer pid=58580) INFO 03-11 11:13:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 269.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:13:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 182.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:48 [loggers.py:259] Engine 000: Avg prompt throughput: 425.2 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 148.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.4%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56267 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:14:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Question: Any suggestions for workarounds or fixes for this issue?

extent analysis

Fix Plan

To address the issue of truncated output during inference, we'll focus on adjusting the configuration and implementing a workaround to ensure complete responses are received by the client.

Configuration Adjustments

Increase --max-num-seqs: Try increasing the maximum number of sequences to a higher value (e.g., 32 or 64) to allow for longer generations.
Adjust --gpu-memory-utilization: Lower the GPU memory utilization to prevent overloading the GPU, which might cause truncation (e.g., set it to 0.7 or 0.8).
Enable Streaming: If the model supports it, enable streaming to receive outputs in chunks, which can help mitigate truncation issues.

Code Workaround

Implement a retry mechanism on the client-side to resend the request if the response is truncated. This can be done by checking the response length against an expected minimum length or by looking for a specific ending token that indicates completion.

Example Retry Mechanism (Python)

import requests

def send_request(prompt, max_retries=3):
    url = "http://0.0.0.0:8000/v1/chat/completions"
    params = {"prompt": prompt}
    retries = 0
    while retries < max_retries:
        response = requests.post(url, json=params)
        if response.status_code == 200 and len(response.text) > 100:  # Adjust the length check as needed
            return response.text
        retries += 1
    return None

# Example usage
prompt = "Your prompt here"
response = send_request(prompt)
if response:
    print(response)
else:
    print("Failed to get a complete response.")

Verification

Monitor the logs for any changes in behavior after applying the configuration adjustments and implementing the retry mechanism.
Test with various prompts to ensure that complete responses are consistently received.
Adjust the retry mechanism and configuration settings as needed based on the outcomes of these tests.

Extra Tips

Regularly update the vLLM and Open WebUI to the latest versions to ensure you have the latest fixes and improvements.
Consider implementing a more sophisticated method to detect truncated responses, such as checking for specific tokens or patterns that indicate the end of a generation.
If the issue persists, explore increasing the resources (e.g., GPU memory, model parallelism) allocated to the vLLM service.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-35B-A3B-FP8 inference output terminates unexpectedly, logs show normal but request hangs [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fix / Workaround

Code Example

extent analysis

Fix Plan

Configuration Adjustments

Code Workaround

Verification

Extra Tips

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-35B-A3B-FP8 inference output terminates unexpectedly, logs show normal but request hangs [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fix / Workaround

Code Example

extent analysis

Fix Plan

Configuration Adjustments

Code Workaround

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING