vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-9B answer !!!!!!!!! [6 comments, 4 participants]

vllm • 2026-03-25 06:59:23


vllm-project/vllm#38077•Fetched 2026-04-08 01:26:46


Code Example

Test request (the prompt asks: "Explain quantum entanglement in one sentence."):

curl http://10.90.248.78:32001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-9B",
    "messages": [
      {"role": "user", "content": "请用一句话解释量子纠缠。"}
    ],
    "max_tokens": 100
  }'

Response:

{"id":"chatcmpl-81c1c9a994b78c85","object":"chat.completion","created":1774420563,"model":"Qwen3.5-9B","choices":[{"index":0,"message":{"role":"assistant","content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":16,"total_tokens":116,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
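To make the symptom easier to spot in a script, the response can be summarized programmatically. The following is a minimal sketch, assuming an OpenAI-style chat completion body like the one above; the function name `summarize_completion` is hypothetical, and the sample string is a shortened stand-in for the real response (content cut to 8 characters), not the full output.

```python
import json
from collections import Counter


def summarize_completion(raw: str) -> dict:
    """Summarize the first choice of an OpenAI-style chat completion response."""
    body = json.loads(raw)
    choice = body["choices"][0]
    content = choice["message"]["content"] or ""
    counts = Counter(content)  # character histogram of the answer text
    return {
        "finish_reason": choice["finish_reason"],
        "completion_tokens": body["usage"]["completion_tokens"],
        "distinct_chars": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Shortened stand-in for the response above (hypothetical sample, content truncated).
sample = (
    '{"choices":[{"message":{"content":"!!!!!!!!"},"finish_reason":"length"}],'
    '"usage":{"completion_tokens":100}}'
)
summary = summarize_completion(sample)
print(summary)
```

A single distinct character combined with finish_reason "length" is the signature reported in this issue: the model kept emitting the same character until it ran out of token budget.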


Your current environment

My environment (cat vllm-a10.yaml):

# vllm-a10.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vllm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-5-9b
  namespace: vllm
  labels:
    app: vllm-qwen3-5-9b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen3-5-9b
  template:
    metadata:
      labels:
        app: vllm-qwen3-5-9b
    spec:
      restartPolicy: Always
      nodeName: adctrain2
      containers:
        - name: vllm
          image: docker.xuanyuan.run/vllm/vllm-openai:v0.18.0
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - /data/Qwen3.5-9B
            - --served-model-name
            - Qwen3.5-9B
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --tensor-parallel-size
            - "4"
            - --dtype
            - auto
            - --max-model-len
            - "32768"
            - --gpu-memory-utilization
            - "0.85"
            - --trust-remote-code
            - --enable-auto-tool-choice
            - --reasoning-parser
            - qwen3
            - --tool-call-parser
            - qwen3_coder
            - --enable-prefix-caching
            - --attention-backend
            - auto
            - --kv-cache-dtype
            - auto
          env:
            - name: VLLM_LOGGING_LEVEL
              value: "INFO"
            - name: HF_HUB_OFFLINE
              value: "1"
            - name: TRANSFORMERS_OFFLINE
              value: "1"
            - name: PYTORCH_CUDA_ALLOC_CONF
              value: "expandable_segments:True"  # reduce GPU memory fragmentation
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              nvidia.com/gpu: "4"
              cpu: "32"
              memory: "48Gi"
            limits:
              nvidia.com/gpu: "4"
              cpu: "64"
              memory: "96Gi"
          volumeMounts:
            - name: model-storage
              mountPath: /data
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          securityContext:
            privileged: true
      volumes:
        - name: model-storage
          hostPath:
            path: /DATA/vllm/model
            type: Directory
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen3-5-9b-service
  namespace: vllm
spec:
  selector:
    app: vllm-qwen3-5-9b
  ports:
    - name: http
      protocol: TCP
      port: 8000
      targetPort: 8000
      nodePort: 32000
  type: NodePort


🐛 Describe the bug

With the vllm-a10.yaml deployment and the test request shown above, the model answers with nothing but exclamation marks. The response reports 16 prompt tokens and 100 completion tokens, and generation stops with finish_reason "length".


extent analysis

Fix Plan

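The model's response is a run of exclamation marks instead of a meaningful answer, and it stops only because it reaches the token limit (finish_reason is "length" after 100 completion tokens). That pattern suggests the problem lies in how the model is loaded or served rather than in the request itself. Try the following steps:

Check the model configuration and confirm that the weights under /data/Qwen3.5-9B are complete and load without warnings.
Verify that the input prompt is correctly formatted and reaches the model (the response above reports 16 prompt tokens).
Resend the request with a larger max_tokens value to see whether the output stays degenerate. This is a diagnostic rather than a fix: a larger limit by itself only permits a longer response.
Check the vLLM server logs for error messages or warnings that could indicate the cause of the issue.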

Here is an example of how to modify the curl command to increase the max_tokens parameter:

curl http://10.90.248.78:32001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-9B",
    "messages": [
      {"role": "user", "content": "请用一句话解释量子纠缠。"}
    ],
    "max_tokens": 200
  }'

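You can also raise the --max-model-len value in vllm-a10.yaml. Note that the request above uses only 116 tokens in total, far below the configured 32768, so this is unlikely to change the output on its own, and a larger value may increase the memory needed for the KV cache. If you try it, keep the remaining flags from the original command; the excerpt below shows only the flags up to --max-model-len: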

command:
  - python3
  - -m
  - vllm.entrypoints.openai.api_server
  - --model
  - /data/Qwen3.5-9B
  - --served-model-name
  - Qwen3.5-9B
  - --host
  - "0.0.0.0"
  - --port
  - "8000"
  - --tensor-parallel-size
  - "4"
  - --dtype
  - auto
  - --max-model-len
  - "65536"
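As a quick sanity check on whether context length is plausibly involved, the token counts from the response can be compared with the configured limit. This is a small sketch using only numbers that appear in the issue (the usage block of the response and the original --max-model-len of 32768); the function name `context_headroom` is hypothetical.

```python
def context_headroom(prompt_tokens: int, completion_tokens: int, max_model_len: int) -> int:
    """Return how many tokens remain below the context limit for one request."""
    return max_model_len - (prompt_tokens + completion_tokens)


# Values taken from the reported response and the original deployment flags.
prompt_tokens = 16
completion_tokens = 100
max_model_len = 32768

total = prompt_tokens + completion_tokens
headroom = context_headroom(prompt_tokens, completion_tokens, max_model_len)
print(f"used {total} of {max_model_len} tokens, headroom {headroom}")
```

With this much headroom, the context limit is not what ends the generation; the request's own max_tokens value is.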

Verification

To verify that the fix worked, send the same prompt again and check that the response is a meaningful answer rather than a series of exclamation marks.

Extra Tips

Check the model's documentation and the vLLM configuration options to confirm that the parameters and settings match this model.
If the issue persists, raise the logging level to get more detailed messages; the deployment above sets VLLM_LOGGING_LEVEL to "INFO", which can be changed to "DEBUG".
Try a different model or a different input prompt to see whether the issue is specific to this particular model or prompt.
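The last tip can be scripted. The sketch below sends several prompts and flags answers that are empty or consist of one repeated character. It assumes the OpenAI-compatible /v1/chat/completions endpoint used in this issue; the helper names `build_payload`, `post_chat`, and `probe` are hypothetical, and the transport is passed in as a callable so the logic can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable


def build_payload(prompt: str, model: str = "Qwen3.5-9B", max_tokens: int = 100) -> dict:
    """Build a chat completion request body with a single user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_chat(url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def probe(prompts: Iterable[str], send: Callable[[dict], dict]) -> Dict[str, bool]:
    """Map each prompt to True if its answer is empty or one repeated character."""
    results = {}
    for prompt in prompts:
        body = send(build_payload(prompt))
        content = body["choices"][0]["message"]["content"] or ""
        results[prompt] = len(set(content)) <= 1
    return results
```

Against the deployment in this issue it could be called as `probe(["Hello", "What is 2 + 2?"], lambda p: post_chat("http://10.90.248.78:32001/v1/chat/completions", p))`. If every prompt is flagged, the issue is unlikely to be specific to one prompt.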
