vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-9B answer !!!!!!!!! [6 comments, 4 participants]

GitHub stats
vllm-project/vllm#38077 · Fetched 2026-04-08 01:26:46
Comments: 6 · Participants: 4 · Timeline events: 15 · Reactions: 2
Timeline (top): commented ×6, mentioned ×4, subscribed ×4, labeled ×1

Code Example

Test request (the prompt reads "Explain quantum entanglement in one sentence."):

curl http://10.90.248.78:32001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-9B",
    "messages": [
      {"role": "user", "content": "请用一句话解释量子纠缠。"}
    ],
    "max_tokens": 100
  }'

Response:

{"id":"chatcmpl-81c1c9a994b78c85","object":"chat.completion","created":1774420563,"model":"Qwen3.5-9B","choices":[{"index":0,"message":{"role":"assistant","content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
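To read a response like this at a glance, a minimal Python sketch can pull out the fields that matter for triage. The `summarize` helper name is hypothetical, and `SAMPLE` is a trimmed copy of the response from the report (the content is shortened to ten characters and unused fields are dropped); only the standard library is used.

```python
import json

# Trimmed copy of the reported response; content shortened for brevity.
SAMPLE = (
    '{"choices":[{"index":0,"message":{"role":"assistant",'
    '"content":"!!!!!!!!!!"},"finish_reason":"length"}],'
    '"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100}}'
)


def summarize(raw: str) -> dict:
    """Extract the fields most useful for triage from a chat completion body."""
    body = json.loads(raw)
    choice = body["choices"][0]
    content = choice["message"]["content"] or ""
    return {
        "finish_reason": choice["finish_reason"],
        "completion_tokens": body["usage"]["completion_tokens"],
        "distinct_chars": sorted(set(content)),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A finish_reason of "length" together with a single distinct character means the model spent its whole token budget on the same symbol.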


Your current environment

My environment (output of cat vllm-a10.yaml):

# vllm-a10.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: vllm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-5-9b
  namespace: vllm
  labels:
    app: vllm-qwen3-5-9b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen3-5-9b
  template:
    metadata:
      labels:
        app: vllm-qwen3-5-9b
    spec:
      restartPolicy: Always
      nodeName: adctrain2
      containers:
        - name: vllm
          image: docker.xuanyuan.run/vllm/vllm-openai:v0.18.0
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - /data/Qwen3.5-9B
            - --served-model-name
            - Qwen3.5-9B
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --tensor-parallel-size
            - "4"
            - --dtype
            - auto
            - --max-model-len
            - "32768"
            - --gpu-memory-utilization
            - "0.85"
            - --trust-remote-code
            - --enable-auto-tool-choice
            - --reasoning-parser
            - qwen3
            - --tool-call-parser
            - qwen3_coder
            - --enable-prefix-caching
            - --attention-backend
            - auto
            - --kv-cache-dtype
            - auto
          env:
            - name: VLLM_LOGGING_LEVEL
              value: "INFO"
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: TRANSFORMERS_OFFLINE
              value: "1"
            - name: PYTORCH_CUDA_ALLOC_CONF
              value: "expandable_segments:True"  # reduce GPU memory fragmentation
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              nvidia.com/gpu: "4"
              cpu: "32"
              memory: "48Gi"
            limits:
              nvidia.com/gpu: "4"
              cpu: "64"
              memory: "96Gi"
          volumeMounts:
            - name: model-storage
              mountPath: /data
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          securityContext:
            privileged: true
      volumes:
        - name: model-storage
          hostPath:
            path: /DATA/vllm/model
            type: Directory
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen3-5-9b-service
  namespace: vllm
spec:
  selector:
    app: vllm-qwen3-5-9b
  ports:
    - name: http
      protocol: TCP
      port: 8000
      targetPort: 8000
      nodePort: 32000
  type: NodePort


🐛 Describe the bug

With the deployment defined in vllm-a10.yaml, the test request to /v1/chat/completions (prompt: "Explain quantum entanglement in one sentence.", max_tokens 100) returns a completion whose content is nothing but exclamation marks. The response reports 16 prompt tokens and 100 completion tokens, with finish_reason "length".


extent analysis

Fix Plan

The model returns a run of exclamation marks instead of a meaningful answer, and the response ends with finish_reason "length" after using all 100 allowed tokens. The following steps can help narrow down the cause:

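  • Check that the model is loaded and served as expected. In particular, the Service in vllm-a10.yaml exposes nodePort 32000, while the test request targets port 32001, so confirm that the request reaches this deployment.
  • Verify that the input prompt is correctly formatted and sent to the model.
  • Raise max_tokens only as a diagnostic. The response already spent all 100 tokens on "!" (finish_reason is "length"), so a larger limit by itself is unlikely to change the output.
  • Check the vLLM server logs for errors or warnings that could point to the cause.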
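One quick way to start checking that the request reaches the intended server is to compare the port in the test URL with the nodePort declared in the Service. This is a minimal sketch: `port_matches` is a hypothetical helper, and both values are copied from the report.

```python
from urllib.parse import urlsplit

REQUEST_URL = "http://10.90.248.78:32001/v1/chat/completions"  # from the test command
SERVICE_NODE_PORT = 32000  # nodePort of the Service in vllm-a10.yaml


def port_matches(url: str, node_port: int) -> bool:
    """Return True when the URL targets the given NodePort."""
    return urlsplit(url).port == node_port


if __name__ == "__main__":
    if not port_matches(REQUEST_URL, SERVICE_NODE_PORT):
        requested = urlsplit(REQUEST_URL).port
        print(f"request port {requested} differs from nodePort {SERVICE_NODE_PORT}")
```

A mismatch does not prove that the request reached a different server (another Service or a port-forward could map 32001), but it is worth ruling out before looking at the model itself.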

Here is an example of how to modify the curl command to increase the max_tokens parameter:

curl http://10.90.248.78:32001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-9B",
    "messages": [
      {"role": "user", "content": "请用一句话解释量子纠缠。"}
    ],
    "max_tokens": 200
  }'
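The same request can be sent from Python, which makes it easier to vary max_tokens or the prompt in a loop. This is a minimal sketch using only the standard library; `build_request` and `send` are hypothetical helper names, and the host, port, and model name are taken from the report.

```python
import json
import urllib.request

BASE_URL = "http://10.90.248.78:32001"  # host and port from the report


def build_request(prompt: str, max_tokens: int) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": "Qwen3.5-9B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prompt from the report: "Explain quantum entanglement in one sentence."
    reply = send(build_request("请用一句话解释量子纠缠。", max_tokens=200))
    print(reply["choices"][0]["message"]["content"])
```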

Additionally, you can try changing the --max-model-len value in vllm-a10.yaml. The fragment below raises it from 32768 to 65536, although nothing in the report ties the symptom to context length:

command:
  - python3
  - -m
  - vllm.entrypoints.openai.api_server
  - --model
  - /data/Qwen3.5-9B
  - --served-model-name
  - Qwen3.5-9B
  - --host
  - "0.0.0.0"
  - --port
  - "8000"
  - --tensor-parallel-size
  - "4"
  - --dtype
  - auto
  - --max-model-len
  - "65536"

Verification

To verify that the fix worked, send the same prompt again and check that the response is a meaningful sentence rather than a run of exclamation marks.
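This check can be automated with a small heuristic. The sketch below is an assumption about what "meaningful" should exclude, not a definitive test: `is_degenerate` is a hypothetical helper that flags empty output and longer output built from at most two distinct characters.

```python
def is_degenerate(text: str, min_length: int = 10, max_distinct: int = 2) -> bool:
    """Heuristic: flag empty output, or long output made of very few distinct characters."""
    stripped = text.strip()
    if not stripped:
        return True
    # Short replies such as "ok" are not flagged; only longer repetitive runs are.
    return len(stripped) >= min_length and len(set(stripped)) <= max_distinct


if __name__ == "__main__":
    print(is_degenerate("!" * 100))
    print(is_degenerate("Quantum entanglement links the states of two particles."))
```

Passing the content field of each response through this function gives a quick pass/fail signal when repeating the request after a configuration change.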

Extra Tips

  • Make sure to check the model's documentation and configuration options to ensure that you are using the correct parameters and settings.
  • If the issue persists, raise the log verbosity to get more detail, for example by changing VLLM_LOGGING_LEVEL in the deployment from "INFO" to "DEBUG".
  • You can also try to use a different model or a different input prompt to see if the issue is specific to this particular model or prompt.
