llamaIndex - [Bug]: from_openai_message misses vLLM-served Qwen3 reasoning field (uses 'reasoning' instead of 'reasoning_content')

llamaIndex | 2026-05-07 05:37:48

Bug Description

When using LlamaIndex's OpenAI-compatible LLM client against a vLLM server (>=0.20.x) serving Qwen3-family reasoning models, the model's reasoning trace is silently dropped because from_openai_message only checks for the reasoning_content field, while current vLLM exposes it as reasoning.

As a result, ThinkingBlock is never appended to the assistant message, and the entire chain-of-thought produced by the model becomes invisible to downstream LlamaIndex components (workflows, agents, evaluators, etc.).

Affected Versions

llama-index-llms-openai: latest (verified against main as of this report)
llama-index-core: latest
vLLM: 0.20.1+cu129 (also reproducible on other vLLM 0.20.x builds)
Model: Qwen/Qwen3.6-35B-A3B (also affects Qwen3, Qwen3.5 family with --reasoning-parser qwen3)

Root Cause

In llama-index-integrations/llms/llama-index-llms-openai/llama_index/llms/openai/utils.py, from_openai_message only inspects reasoning_content:

reasoning_content = getattr(openai_message, "reasoning_content", None)
if isinstance(reasoning_content, str) and reasoning_content:
    blocks.append(ThinkingBlock(content=reasoning_content))

This matches the OpenAI/DeepSeek convention. However, vLLM's OpenAI-compatible server uses a different field name:

| Layer | OpenAI / DeepSeek convention | vLLM Qwen3 convention |
| --- | --- | --- |
| Server response | `message.reasoning_content` | `message.reasoning` |
| LlamaIndex `from_openai_message` | Reads as `ThinkingBlock` | Misses entirely |

vLLM's official documentation confirms reasoning is the field name in current versions: https://docs.vllm.ai/en/latest/features/reasoning_outputs/

response = client.chat.completions.create(model=model, messages=messages)
reasoning = response.choices[0].message.reasoning  # vLLM uses `reasoning`
content = response.choices[0].message.content

Expected Behavior

from_openai_message should construct a ThinkingBlock from message.reasoning when reasoning_content is absent, so that vLLM-served reasoning models work out of the box.

Actual Behavior

The reasoning trace from vLLM's response is silently discarded. Downstream code that relies on ThinkingBlock (workflow visualization, reasoning-aware evaluators, agentic loops that re-feed thinking, etc.) never receives the reasoning content even though it was generated by the model and present in the raw response.

Proposed Fix

Add a fallback in from_openai_message:

# Current
reasoning_content = getattr(openai_message, "reasoning_content", None)
if isinstance(reasoning_content, str) and reasoning_content:
    blocks.append(ThinkingBlock(content=reasoning_content))

# Proposed
reasoning_content = getattr(openai_message, "reasoning_content", None)
if not reasoning_content:
    # vLLM's OpenAI-compatible server uses `reasoning` instead of `reasoning_content`
    reasoning_content = getattr(openai_message, "reasoning", None)
if isinstance(reasoning_content, str) and reasoning_content:
    blocks.append(ThinkingBlock(content=reasoning_content))

The same fallback is needed in the streaming path (stream_chat / astream_chat) where the delta object similarly carries reasoning instead of reasoning_content for vLLM-served Qwen3 models.
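
For the streaming path, a minimal sketch of the same fallback applied per delta might look like the following. It is not LlamaIndex's actual streaming code: `reasoning_from_delta` and `accumulate_reasoning` are hypothetical helper names, and the deltas are modelled with `SimpleNamespace` stand-ins rather than real OpenAI SDK chunk objects.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def reasoning_from_delta(delta: Any) -> Optional[str]:
    """Return the reasoning text carried by one streamed delta, if any.

    Hypothetical helper: prefers `reasoning_content`, then falls back to
    `reasoning` (the field name vLLM uses according to this report).
    """
    for field in ("reasoning_content", "reasoning"):
        value = getattr(delta, field, None)
        if isinstance(value, str) and value:
            return value
    return None


def accumulate_reasoning(deltas: Iterable[Any]) -> str:
    """Concatenate the reasoning fragments found across a stream of deltas."""
    parts = []
    for delta in deltas:
        fragment = reasoning_from_delta(delta)
        if fragment is not None:
            parts.append(fragment)
    return "".join(parts)


# Stand-in deltas shaped like a vLLM stream: reasoning first, then content.
vllm_style_stream = [
    SimpleNamespace(reasoning="9.8 is 9.80, ", content=None),
    SimpleNamespace(reasoning="and 9.80 > 9.11.", content=None),
    SimpleNamespace(reasoning=None, content="9.8 is greater."),
]
streamed_reasoning = accumulate_reasoning(vllm_style_stream)
```

Checking both field names on every delta keeps the behavior unchanged for servers that already send `reasoning_content`.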

I'm happy to open a PR with this change plus a test fixture if maintainers agree on the approach.
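
A test fixture for this change could be table-driven. The sketch below is an assumption about its shape, not the test that would land in the repository: `pick_reasoning` is a hypothetical stand-in that mirrors the proposed fallback logic, and `SimpleNamespace` objects stand in for OpenAI SDK message objects so the cases run without LlamaIndex installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def pick_reasoning(openai_message: Any) -> Optional[str]:
    """Hypothetical stand-in mirroring the proposed fallback logic."""
    reasoning_content = getattr(openai_message, "reasoning_content", None)
    if not reasoning_content:
        # Fall back to the field name vLLM uses.
        reasoning_content = getattr(openai_message, "reasoning", None)
    if isinstance(reasoning_content, str) and reasoning_content:
        return reasoning_content
    return None


# (description, stand-in message, expected reasoning text)
CASES = [
    ("reasoning_content only", SimpleNamespace(reasoning_content="a"), "a"),
    ("reasoning only", SimpleNamespace(reasoning="b"), "b"),
    ("both present", SimpleNamespace(reasoning_content="a", reasoning="b"), "a"),
    ("empty reasoning_content", SimpleNamespace(reasoning_content="", reasoning="b"), "b"),
    ("neither field", SimpleNamespace(content="hi"), None),
    ("non-string reasoning", SimpleNamespace(reasoning={"text": "b"}), None),
]

# Run every case and collect the names of any that do not match.
results = {name: pick_reasoning(message) for name, message, _ in CASES}
failures = [name for name, _, expected in CASES if results[name] != expected]
```

The "both present" case pins down the precedence: an existing `reasoning_content` value is never overridden by `reasoning`.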

Why This Matters

vLLM is one of the most widely used inference engines for self-hosted LLMs.
Qwen3 / Qwen3.5 / Qwen3.6 are widely used open-weight reasoning models supported on vLLM.
The combination of LlamaIndex, vLLM, and Qwen3 is a very common stack.
The bug is silent: users assume the model isn't reasoning, when in fact the trace is being discarded at the conversion layer.
The workaround requires subclassing OpenAILike or monkey-patching from_openai_message, which is fragile across LlamaIndex upgrades.
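
Short of subclassing or monkey-patching, the trace can also be recovered after the call from the raw response, as the reproduction script in this report does. The following is a minimal sketch under that assumption: `recover_reasoning` is a hypothetical helper, it assumes `response.raw` exposes `choices[0].message` as attributes, and the demo uses `SimpleNamespace` stand-ins instead of a live server.

```python
from types import SimpleNamespace
from typing import Any, Optional


def recover_reasoning(chat_response: Any) -> Optional[str]:
    """Read the reasoning trace from the raw response attached to a chat response.

    Hypothetical helper: checks `reasoning_content` first, then `reasoning`.
    Returns None when no raw response, choice, or reasoning text is available.
    """
    raw = getattr(chat_response, "raw", None)
    choices = getattr(raw, "choices", None) or []
    if not choices:
        return None
    message = getattr(choices[0], "message", None)
    for field in ("reasoning_content", "reasoning"):
        value = getattr(message, field, None)
        if isinstance(value, str) and value:
            return value
    return None


# Stand-in for a chat response whose raw payload came from vLLM.
fake_response = SimpleNamespace(
    raw=SimpleNamespace(
        choices=[
            SimpleNamespace(
                message=SimpleNamespace(
                    content="9.8 is greater.",
                    reasoning="Compare 9.80 and 9.11.",
                )
            )
        ]
    )
)
recovered = recover_reasoning(fake_response)
```

This only helps code that holds the response object directly; components that consume `message.blocks` still miss the trace until the conversion layer is fixed.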

Related

vLLM Reasoning Outputs docs (showing message.reasoning): https://docs.vllm.ai/en/latest/features/reasoning_outputs/
Existing ThinkingBlock support PR (added reasoning_content handling): https://github.com/run-llama/llama_index/pull/19919
Qwen3 reasoning parser source: https://docs.vllm.ai/en/latest/api/vllm/reasoning/qwen3_reasoning_parser/

Environment

OS: Ubuntu 22.04 (Docker container)
Python: 3.13
llama-index-core: 0.14.21
llama-index-llms-openai: 0.7.7
llama-index-llms-openai-like: 0.7.2
vllm: 0.20.1+cu129
GPU: B200

Version

0.14.21

Steps to Reproduce

1. Start a vLLM server with Qwen3.6 and the qwen3 reasoning parser

docker run -d --name vllm-qwen3-6-35b \
  --runtime nvidia --gpus all \
  -p 9008:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.20.1 \
  --model Qwen/Qwen3.6-35B-A3B \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --max-model-len 262144

2. Verify vLLM emits `reasoning` (not `reasoning_content`)

curl -s http://localhost:9008/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Which is greater, 9.11 or 9.8?"}],
    "max_tokens": 2048,
    "stream": false
  }' | jq '.choices[0].message | keys'

Output:

["content", "reasoning", "role", "tool_calls", ...]

Note the field is reasoning, not reasoning_content.
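
The same check can be done from Python on a saved response body. This is a minimal sketch: `reasoning_fields` is a hypothetical helper, and `sample_body` is a hand-written payload shaped like the output above, not captured server output.

```python
import json
from typing import List

# Field names discussed in this report, in order of precedence.
REASONING_FIELDS = ("reasoning_content", "reasoning")


def reasoning_fields(response_body: str) -> List[str]:
    """List which known reasoning fields are populated on the first choice."""
    message = json.loads(response_body)["choices"][0]["message"]
    return [name for name in REASONING_FIELDS if message.get(name)]


# Hand-written example payload (an assumption, not real server output).
sample_body = json.dumps(
    {
        "choices": [
            {
                "message": {
                    "role": "assistant",
                    "content": "9.8 is greater.",
                    "reasoning": "Compare 9.80 and 9.11.",
                    "tool_calls": [],
                }
            }
        ]
    }
)
present = reasoning_fields(sample_body)
```

A result containing only `reasoning` corresponds to the case this report describes, where the current conversion finds nothing to put in a `ThinkingBlock`.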

3. Call via LlamaIndex and observe missing ThinkingBlock

from llama_index.llms.openai_like import OpenAILike
from llama_index.core.base.llms.types import ChatMessage, ThinkingBlock, TextBlock

llm = OpenAILike(
    model="Qwen/Qwen3.6-35B-A3B",
    api_base="http://localhost:9008/v1",
    api_key="EMPTY",
    is_chat_model=True,
)

response = llm.chat([
    ChatMessage(role="user", content="Which is greater, 9.11 or 9.8?")
])

print("Blocks:", [type(b).__name__ for b in response.message.blocks])
# Actual:   ['TextBlock']
# Expected: ['ThinkingBlock', 'TextBlock']

thinking_blocks = [b for b in response.message.blocks if isinstance(b, ThinkingBlock)]
print("ThinkingBlocks found:", len(thinking_blocks))
# Actual:   0
# Expected: 1

# Workaround: inspect the raw response — the reasoning IS there, just not extracted
raw_msg = response.raw.choices[0].message
print("Raw reasoning field present:", bool(getattr(raw_msg, "reasoning", None)))
# True — the data is in the raw response but lost during conversion
