vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-35B-A3B-FP8 model outputs all exclamation points [3 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#38527Fetched 2026-04-08 01:53:31
View on GitHub
Comments
3
Participants
3
Timeline
6
Reactions
0
Author
Timeline (top)
commented ×3subscribed ×2labeled ×1
RAW_BUFFERClick to expand / collapse

Your current environment

model download from modelscope

vllm:0.18.0 ( pip install vllm ) pytorch:2.10.0+cu128 model path:/home/pc/qwen-models/qwen3_5_35B_A3B_FP8 model : Qwen/Qwen3.5-35B-A3B-FP8 nvidia-smi: NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 gpu: nvidia rtx pro 6000 blackwell workstation 96G * 2 nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2025 NVIDIA Corporation Built on Tue_Dec_16_07:23:41_PM_PST_2025 Cuda compilation tools, release 13.1, V13.1.115 Build cuda_13.1.r13.1/compiler.37061995_0

start script:

#!/bin/bash

export VLLM_RPC_TIMEOUT=300 export NCCL_IB_DISABLE=1 export NCCL_P2P_DISABLE=0 export NCCL_DEBUG=INFO export CUDA_DEVICE_ORDER=PCI_BUS_ID export OMP_NUM_THREADS=56 export TRITON_AUTOTUNE_KCACHE_LIMIT=1000000 VENV_PATH="/home/pc/qwen-models/qwen3_5_venv/bin/activate" MODEL_PATH="/home/pc/qwen-models/qwen3_5_35B_A3B" MODEL_NAME="qwen3.5-flash" REASONING_PARSER="qwen3" TOOL_CALL_PARSER="qwen3_coder" MAX_MODEL_LEN="262144" TP_SIZE=2 GPU_UTIL=0.5

HOST="0.0.0.0" PORT=16688 SWAP_SPACE=16 API_KEY="sk-av4b3456925d4380ac44h4674952y72c" MAX_NUM_SEQS="128" LOG_DIR="/home/pc/qwen-models/log/vllm/llm" PID_FILE="/home/pc/qwen-models/qwen_vllm_service.pid"

source $VENV_PATH nohup python -m vllm.entrypoints.openai.api_server --model "$MODEL_PATH" --host "$HOST" --port "$PORT" --tensor-parallel-size "$TP_SIZE" --gpu-memory-utilization "$GPU_UTIL" --api-key "$API_KEY" --served-model-name "$MODEL_NAME" --dtype auto --default-chat-template-kwargs '{"enable_thinking": true}' --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder --max_num_seqs "$MAX_NUM_SEQS" --max-model-len "$MAX_MODEL_LEN" --disable-custom-all-reduce --language-model-only --trust-remote-code --uvicorn-log-level debug
$LOG_DIR/service.log 2>&1 &

🐛 Describe the bug

Qwen3.5-35B-A3B-FP8 model outputs all exclamation points

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the issue of the Qwen3.5-35B-A3B-FP8 model outputting all exclamation points, we will focus on adjusting the model configuration and environment settings.

  • Model Configuration: Ensure that the model is correctly configured for the task at hand. This might involve checking the default-chat-template-kwargs to ensure it's set appropriately for the desired output.
  • Environment Settings: Verify that the environment variables and the script used to start the model server are correctly set. This includes checking VLLM_RPC_TIMEOUT, NCCL_IB_DISABLE, NCCL_P2P_DISABLE, NCCL_DEBUG, and other relevant settings.

Code Changes

Here's an example of how you might adjust the default-chat-template-kwargs in your start script to potentially resolve the issue:

--default-chat-template-kwargs '{"enable_thinking": false, "output_filter": "none"}'

Additionally, ensure that the reasoning-parser and tool-call-parser are correctly set for your model:

--reasoning-parser qwen3
--tool-call-parser qwen3_coder

Temporary Workaround

If the issue persists, try temporarily disabling --trust-remote-code to see if it affects the output:

--no-trust-remote-code

Verification

To verify that the fix worked, start the model server with the adjusted settings and test the output. If the model still outputs all exclamation points, further investigation into the model's training data or the specific task configuration may be necessary.

Extra Tips

  • Ensure your CUDA and PyTorch versions are compatible with the model requirements.
  • Check the model's documentation for any specific requirements or recommendations for running the Qwen3.5-35B-A3B-FP8 model.
  • If issues persist, consider reaching out to the model developers or the VLLM community for further assistance.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Qwen3.5-35B-A3B-FP8 model outputs all exclamation points [3 comments, 3 participants]