vllm - ✅(Solved) Fix [Bug]: Qwen3.6 streaming chat completions emit final answer in delta.reasoning and leave delta.content empty even with enable_thinking=false [1 pull requests, 1 participants]

Q: Expected behavior

With `chat_template_kwargs.enable_thinking=false`, streaming output should not emit answer tokens in `delta.reasoning`. The final answer should be emitted in `delta.content`, matching the non-streaming behavior.

vllm · 2026-04-24 15:33:10

GitHub stats

vllm-project/vllm#40816 • Fetched 2026-04-25 06:03:49

Author

xy3xy3

Participants

xy3xy3

Timeline (top)

cross-referenced ×1, labeled ×1

Error Message

The final answer (12) is emitted entirely via delta.reasoning instead of delta.content, even though thinking is explicitly disabled.

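Root Cause

The streaming path relied on res.prompt_token_ids to decide whether the prompt had already closed the reasoning block, but some streaming RequestOutput chunks do not carry prompt_token_ids, so answer tokens were routed into delta.reasoning. This breaks OpenAI-compatible streaming clients that only read delta.content, because they see "reasoning only" and never receive the final answer in the content channel.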

Fix Action

Fix proposed

Proposed fix: PR "Fix Qwen3 streaming content routing" (https://github.com/vllm-project/vllm/pull/40820). At fetch time the PR was open and not yet merged.

PR fix notes

PR #40820: Fix Qwen3 streaming content routing

Repository: vllm-project/vllm
Author: xy3xy3
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40820

Description (problem / solution / changelog)

Purpose

Fix a Qwen3 streaming routing bug in the OpenAI-compatible /v1/chat/completions endpoint when --reasoning-parser qwen3 is enabled and chat_template_kwargs.enable_thinking=false.

This PR is related to https://github.com/vllm-project/vllm/issues/40816.

Before this change:

- Non-streaming requests correctly returned the answer in message.content
- Streaming requests could incorrectly emit the answer in choices[0].delta.reasoning
- OpenAI-compatible streaming clients that only read delta.content would miss the final answer

Root cause:

- The streaming path relied on res.prompt_token_ids to determine whether the prompt had already ended the reasoning block
- For Qwen3 with enable_thinking=false, the rendered prompt already contains the empty reasoning terminator
- Some streaming RequestOutput chunks do not carry prompt_token_ids, so the answer tokens could be misrouted into delta.reasoning
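
The failure mode described above can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not vLLM's actual code: the `Chunk` class, the `route_per_chunk` function, and the `END_OF_REASONING` token id are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical token id standing in for the end-of-reasoning marker.
END_OF_REASONING = 999


@dataclass
class Chunk:
    """Simplified stand-in for one streaming RequestOutput chunk."""
    text: str
    # Assumption mirrored from the PR notes: some chunks carry no prompt ids.
    prompt_token_ids: Optional[Sequence[int]] = None


def route_per_chunk(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Problematic pattern: decide the output field from each chunk's own prompt ids."""
    routed = []
    for chunk in chunks:
        prompt_ended_reasoning = (
            chunk.prompt_token_ids is not None
            and END_OF_REASONING in chunk.prompt_token_ids
        )
        field = "content" if prompt_ended_reasoning else "reasoning"
        routed.append((field, chunk.text))
    return routed


# The rendered prompt already closed the reasoning block, but these chunks
# arrive without prompt_token_ids, so the check never succeeds.
print(route_per_chunk([Chunk("1"), Chunk("2")]))
# -> [('reasoning', '1'), ('reasoning', '2')]
```

The sketch only models the prompt-side check. A real reasoning parser also watches the generated tokens for the end-of-reasoning marker, which never appears here because the prompt already contains it.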

Fix:

- Capture rendered prompt_token_ids before streaming starts
- Pass them into chat_completion_stream_generator
- Initialize prompt_is_reasoning_end_arr from those prompt tokens up front
- Only fall back to res.prompt_token_ids when needed
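
The four steps above can be sketched roughly as follows. This is an illustrative model only: `route_stream`, `prompt_ends_reasoning`, the tuple-based chunk shape, and the `END_OF_REASONING` id are assumptions made for the example. Only the name `prompt_is_reasoning_end_arr` is taken from the PR description, and the real change in serving.py may differ in detail.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical token id standing in for the end-of-reasoning marker.
END_OF_REASONING = 999

# (choice_index, text, prompt_token_ids or None): a simplified chunk shape.
StreamChunk = Tuple[int, str, Optional[Sequence[int]]]


def prompt_ends_reasoning(prompt_token_ids: Sequence[int]) -> bool:
    """True if the prompt already contains the end-of-reasoning marker."""
    return END_OF_REASONING in prompt_token_ids


def route_stream(
    chunks: List[StreamChunk],
    rendered_prompt_token_ids: Optional[Sequence[int]] = None,
    num_choices: int = 1,
) -> List[Tuple[str, str]]:
    # Initialize the per-choice state up front from the rendered prompt.
    if rendered_prompt_token_ids is not None:
        initial = prompt_ends_reasoning(rendered_prompt_token_ids)
        prompt_is_reasoning_end_arr: List[Optional[bool]] = [initial] * num_choices
    else:
        prompt_is_reasoning_end_arr = [None] * num_choices

    routed = []
    for index, text, chunk_prompt_ids in chunks:
        # Fall back to the chunk's own prompt ids only if the state is unknown.
        if prompt_is_reasoning_end_arr[index] is None and chunk_prompt_ids is not None:
            prompt_is_reasoning_end_arr[index] = prompt_ends_reasoning(chunk_prompt_ids)
        field = "content" if prompt_is_reasoning_end_arr[index] else "reasoning"
        routed.append((field, text))
    return routed


chunks = [(0, "1", None), (0, "2", None)]
print(route_stream(chunks, rendered_prompt_token_ids=[10, 20, END_OF_REASONING]))
# -> [('content', '1'), ('content', '2')]
```

Because the state is computed once before the loop, chunks that arrive without prompt ids no longer change the routing decision.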

This makes streaming behavior consistent with non-streaming behavior for Qwen3/Qwen3.5 requests with thinking disabled.

Test Plan

Code-level regression coverage:

pytest -q tests/entrypoints/openai/chat_completion/test_thinking_token_budget.py \
  -k streaming_with_thinking_disabled_stays_in_content

Manual validation against a running container:

curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"qwen3.6-35b-nvfp4",
    "messages":[{"role":"user","content":"Which is larger, 4 or 12? Output exactly one token: 4 or 12."}],
    "temperature":0.1,
    "max_tokens":16,
    "chat_template_kwargs":{"enable_thinking":false}
  }'

curl -N -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"qwen3.6-35b-nvfp4",
    "messages":[{"role":"user","content":"Which is larger, 4 or 12? Output exactly one token: 4 or 12."}],
    "stream":true,
    "temperature":0.1,
    "max_tokens":16,
    "chat_template_kwargs":{"enable_thinking":false},
    "stream_options":{"include_usage":true}
  }'
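
For scripted checks, the two request bodies can also be built in Python. The sketch below copies the field values from the curl commands; the helper name `build_payload` is hypothetical. It only constructs and serializes the JSON bodies and does not send anything.

```python
import json

PROMPT = "Which is larger, 4 or 12? Output exactly one token: 4 or 12."


def build_payload(stream: bool) -> dict:
    """Build the chat completion request body used in the manual validation."""
    payload = {
        "model": "qwen3.6-35b-nvfp4",
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 0.1,
        "max_tokens": 16,
        "chat_template_kwargs": {"enable_thinking": False},
    }
    if stream:
        # The streaming request differs only in these two fields.
        payload["stream"] = True
        payload["stream_options"] = {"include_usage": True}
    return payload


non_streaming = build_payload(stream=False)
streaming = build_payload(stream=True)

# Serialized body, ready to POST to /v1/chat/completions with any HTTP client.
body = json.dumps(streaming)

print(sorted(set(streaming) - set(non_streaming)))
# -> ['stream', 'stream_options']
```

Keeping both bodies in one helper makes it explicit that the only difference between the passing and failing case is the streaming flag and its options.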

Test Result

Environment used for manual verification:

Model: qwen3.6-35b-nvfp4
Server args include:
- --reasoning-parser qwen3
- --default-chat-template-kwargs '{"enable_thinking": false}'

Before fix:

Non-streaming response returned message.content: "12"
Streaming response emitted:
- delta.reasoning: "1"
- delta.reasoning: "2"
- no usable delta.content

After fix:

Non-streaming response returns:

{
  "choices": [
    {
      "message": {
        "content": "12",
        "reasoning": null
      }
    }
  ]
}

Streaming response emits:

data: {"choices":[{"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":"1"},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":"2"},"finish_reason":null}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]

Observed result:

Answer tokens now stay in delta.content
No delta.reasoning is emitted for the disabled-thinking request

Documentation update:

No documentation update required for this bug fix

Changed files

tests/entrypoints/openai/chat_completion/test_thinking_token_budget.py (modified, +46/-0)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +18/-4)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>


Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (aarch64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.17.0-1008-nvidia-aarch64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   :
GPU models and configuration : GPU 0: NVIDIA GB10
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.0
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.2rc1.dev134+gfe9c3d6c5 (git sha: fe9c3d6c5)
vLLM Build Flags:
  CUDA Archs: 8.7 8.9 9.0 10.0+PTX 12.0 12.1; ROCm: Disabled; XPU: Disabled

==============================
     Environment Variables
==============================
VLLM_ENABLE_CUDA_COMPATIBILITY=0
VLLM_LOGGING_LEVEL=INFO
CUDA_VERSION=13.0.1
VLLM_USAGE_SOURCE=production-docker-image
NVIDIA_VISIBLE_DEVICES=all

</details>

🐛 Describe the bug

When using the OpenAI-compatible /v1/chat/completions endpoint with a Qwen3-family model and --reasoning-parser qwen3, non-streaming responses behave correctly, but streaming responses may emit the final answer in choices[0].delta.reasoning while choices[0].delta.content stays empty for the entire stream.

I can reproduce this even when explicitly disabling thinking with chat_template_kwargs.enable_thinking=false.

In other words:

- stream=false + enable_thinking=false -> normal message.content
- stream=true + enable_thinking=false -> answer tokens appear in delta.reasoning, no delta.content

This breaks OpenAI-compatible streaming clients that only read delta.content, because they see "reasoning only" and never receive the final answer in the content channel.
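
To make the client-side impact concrete, here is a minimal sketch of a consumer that separates the two delta fields. The `collect_stream` helper is hypothetical, and the sample lines are the reported stream trimmed to the fields that matter here (ids, timestamps, and the usage chunk are omitted).

```python
import json
from typing import Iterable, Tuple


def collect_stream(sse_lines: Iterable[str]) -> Tuple[str, str]:
    """Accumulate delta.content and delta.reasoning from SSE 'data:' lines."""
    content, reasoning = [], []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        for choice in event.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content.append(delta["content"])
            if delta.get("reasoning"):
                reasoning.append(delta["reasoning"])
    return "".join(content), "".join(reasoning)


# Trimmed version of the stream reported in this issue.
reported = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"reasoning":"1"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"reasoning":"2"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]

content, reasoning = collect_stream(reported)
print(repr(content), repr(reasoning))
# -> '' '12'
```

A client that displays only the first value ends up with an empty answer, which matches the symptom described above.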

Server startup

The model I used is Qwen3.6-35B-A3B-NVFP4 (https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4).

I am serving the model with:

vllm serve /model \
  --served-model-name qwen3.6-35b-nvfp4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.92 \
  --max-num-batched-tokens 16384 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --attention-backend FLASH_ATTN \
  --default-chat-template-kwargs '{"enable_thinking": false}'

Container image:

vllm/vllm-openai:cu130-nightly-aarch64

Minimal reproduction

Case 1: non-streaming works

curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"qwen3.6-35b-nvfp4",
    "messages":[{"role":"user","content":"Which is larger, 4 or 12? Output exactly one token: 4 or 12."}],
    "temperature":0.1,
    "max_tokens":16,
    "chat_template_kwargs":{"enable_thinking":false}
  }'

Observed response:

{
  "id":"chatcmpl-a448ad50fd1bff2b",
  "object":"chat.completion",
  "created":1777044630,
  "model":"qwen3.6-35b-nvfp4",
  "choices":[
    {
      "index":0,
      "message":{
        "role":"assistant",
        "content":"12",
        "reasoning":null
      },
      "finish_reason":"stop"
    }
  ],
  "usage":{"prompt_tokens":35,"total_tokens":38,"completion_tokens":3}
}

Case 2: streaming routes the answer into `delta.reasoning`

curl -N -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"qwen3.6-35b-nvfp4",
    "messages":[{"role":"user","content":"Which is larger, 4 or 12? Output exactly one token: 4 or 12."}],
    "stream":true,
    "temperature":0.1,
    "max_tokens":16,
    "chat_template_kwargs":{"enable_thinking":false},
    "stream_options":{"include_usage":true}
  }'

Observed stream:

data: {"id":"chatcmpl-890423d350192a92","object":"chat.completion.chunk","created":1777044630,"model":"qwen3.6-35b-nvfp4","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-890423d350192a92","object":"chat.completion.chunk","created":1777044630,"model":"qwen3.6-35b-nvfp4","choices":[{"index":0,"delta":{"reasoning":"1"},"finish_reason":null}]}

data: {"id":"chatcmpl-890423d350192a92","object":"chat.completion.chunk","created":1777044630,"model":"qwen3.6-35b-nvfp4","choices":[{"index":0,"delta":{"reasoning":"2"},"finish_reason":null}]}

data: {"id":"chatcmpl-890423d350192a92","object":"chat.completion.chunk","created":1777044630,"model":"qwen3.6-35b-nvfp4","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"id":"chatcmpl-890423d350192a92","object":"chat.completion.chunk","created":1777044630,"model":"qwen3.6-35b-nvfp4","choices":[],"usage":{"prompt_tokens":35,"total_tokens":38,"completion_tokens":3}}

data: [DONE]

Observed behavior

The final answer (12) is emitted entirely via delta.reasoning instead of delta.content, even though thinking is explicitly disabled.

Expected behavior

With chat_template_kwargs.enable_thinking=false, streaming output should not emit answer tokens in delta.reasoning. The final answer should be emitted in delta.content, matching the non-streaming behavior.
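
One way to express this expectation as a check is to merge the streamed deltas and compare them with the non-streaming message. The sketch below assumes already-parsed chunk dictionaries and a single choice; `accumulate_deltas` and `matches_non_streaming` are hypothetical helper names, not part of vLLM or its test suite.

```python
from typing import Dict, List, Optional


def accumulate_deltas(chunks: List[dict]) -> Dict[str, Optional[str]]:
    """Merge parsed chat.completion.chunk objects into one message-like dict."""
    message: Dict[str, Optional[str]] = {"content": "", "reasoning": None}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                message["content"] = (message["content"] or "") + delta["content"]
            if delta.get("reasoning"):
                message["reasoning"] = (message["reasoning"] or "") + delta["reasoning"]
    return message


def matches_non_streaming(chunks: List[dict], message: dict) -> bool:
    """True if the streamed deltas add up to the non-streaming message."""
    streamed = accumulate_deltas(chunks)
    return (
        streamed["content"] == (message.get("content") or "")
        and streamed["reasoning"] == message.get("reasoning")
    )


non_streaming_message = {"role": "assistant", "content": "12", "reasoning": None}

observed_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]},
    {"choices": [{"index": 0, "delta": {"reasoning": "1"}}]},
    {"choices": [{"index": 0, "delta": {"reasoning": "2"}}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]

print(matches_non_streaming(observed_chunks, non_streaming_message))
# -> False
```

With the chunks from the reported stream the check fails. Replacing the two `reasoning` deltas with `content` deltas, as expected here, makes it pass.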

Additional notes

I can also reproduce similar behavior from a third-party OpenAI-compatible client: it receives SSE chunks containing delta.reasoning only, then a final stop, with no usable delta.content.

This seems to be either:

- a regression in the streaming path for Qwen3ReasoningParser, or
- an incomplete separation between reasoning and final content when streaming.

If helpful, I can also provide DEBUG logs (VLLM_LOGGING_LEVEL=DEBUG) or test another stable release to help narrow down whether this is a regression.

extent analysis

TL;DR

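With --reasoning-parser qwen3 and enable_thinking=false, the streaming path routes answer tokens into delta.reasoning instead of delta.content, while non-streaming responses are correct. PR #40820 (open and not merged at fetch time) addresses this by initializing the reasoning-end state from the rendered prompt's token IDs before streaming starts.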

Guidance

- Confirm the symptom by sending the same request with stream=false and stream=true, then comparing message.content with the accumulated delta.content.
- Check that the server runs with --reasoning-parser qwen3 and that thinking is disabled, either per request via chat_template_kwargs or server-wide via --default-chat-template-kwargs; this is the combination reported in the issue.
- Set VLLM_LOGGING_LEVEL=DEBUG to see how delta.reasoning and delta.content are populated, as the reporter offered to do.
- Test other models and a stable release to determine whether the behavior is specific to the Qwen3 family and whether it is a regression.

Example

The reproduction commands and the observed stream are given under "Minimal reproduction" in the issue body. The PR adds regression coverage in tests/entrypoints/openai/chat_completion/test_thinking_token_budget.py, selectable with -k streaming_with_thinking_disabled_stays_in_content.

Notes

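The reporter suspected either a regression in the streaming path for Qwen3ReasoningParser or an incomplete separation between reasoning and final content. The PR description narrows this down: the streaming path relied on res.prompt_token_ids, which some streaming RequestOutput chunks do not carry. Whether the behavior is a regression relative to an earlier release is not established in the issue.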

Recommendation

Follow PR #40820 and pick up the change once it is merged. Until then, non-streaming requests return the answer in message.content as expected, so clients that can tolerate it may use stream=false for requests with thinking disabled.
