llamaIndex - 💡(How to fix) Fix [Bug]: from_openai_message misses vLLM-served Qwen3 reasoning field (uses 'reasoning' instead of 'reasoning_content')


Bug Description

When using LlamaIndex's OpenAI-compatible LLM client against a vLLM server (>=0.20.x) serving Qwen3-family reasoning models, the model's reasoning trace is silently dropped because from_openai_message only checks for the reasoning_content field, while current vLLM exposes it as reasoning.

As a result, ThinkingBlock is never appended to the assistant message, and the entire chain-of-thought produced by the model becomes invisible to downstream LlamaIndex components (workflows, agents, evaluators, etc.).

Affected Versions

  • llama-index-llms-openai: latest (verified against main as of this report)
  • llama-index-core: latest
  • vLLM: 0.20.1+cu129 (also reproducible on other vLLM 0.20.x builds)
  • Model: Qwen/Qwen3.6-35B-A3B (also affects Qwen3, Qwen3.5 family with --reasoning-parser qwen3)

Root Cause

In llama-index-integrations/llms/llama-index-llms-openai/llama_index/llms/openai/utils.py, from_openai_message only inspects reasoning_content:

reasoning_content = getattr(openai_message, "reasoning_content", None)
if isinstance(reasoning_content, str) and reasoning_content:
    blocks.append(ThinkingBlock(content=reasoning_content))

This matches the OpenAI/DeepSeek convention. However, vLLM's OpenAI-compatible server uses a different field name:

| Layer | OpenAI / DeepSeek convention | vLLM Qwen3 convention |
| --- | --- | --- |
| Server response | message.reasoning_content | message.reasoning |
| LlamaIndex from_openai_message | Reads as ThinkingBlock | Misses entirely |

vLLM's official documentation confirms reasoning is the field name in current versions: https://docs.vllm.ai/en/latest/features/reasoning_outputs/

response = client.chat.completions.create(model=model, messages=messages)
reasoning = response.choices[0].message.reasoning  # vLLM uses `reasoning`
content = response.choices[0].message.content

Expected Behavior

from_openai_message should construct a ThinkingBlock from message.reasoning when reasoning_content is absent, so that vLLM-served reasoning models work out of the box.

Actual Behavior

The reasoning trace from vLLM's response is silently discarded. Downstream code that relies on ThinkingBlock (workflow visualization, reasoning-aware evaluators, agentic loops that re-feed thinking, etc.) never receives the reasoning content even though it was generated by the model and present in the raw response.

Proposed Fix

Add a fallback in from_openai_message:

# Current
reasoning_content = getattr(openai_message, "reasoning_content", None)
if isinstance(reasoning_content, str) and reasoning_content:
    blocks.append(ThinkingBlock(content=reasoning_content))

# Proposed
reasoning_content = getattr(openai_message, "reasoning_content", None)
if not reasoning_content:
    # vLLM's OpenAI-compatible server uses `reasoning` instead of `reasoning_content`
    reasoning_content = getattr(openai_message, "reasoning", None)
if isinstance(reasoning_content, str) and reasoning_content:
    blocks.append(ThinkingBlock(content=reasoning_content))
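
The fallback can be exercised in isolation, without a running server. The following is a minimal sketch rather than the actual LlamaIndex code: `extract_reasoning` is a hypothetical helper name, and `SimpleNamespace` objects stand in for the OpenAI SDK message object. It checks the two field names in order, which is slightly more permissive than the proposed patch when `reasoning_content` holds a non-string value.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_reasoning(message: Any) -> Optional[str]:
    """Return the reasoning text carried by a chat message, or None.

    Hypothetical helper: tries `reasoning_content` first, then falls
    back to `reasoning` (the field name vLLM emits per this report).
    """
    for field in ("reasoning_content", "reasoning"):
        value = getattr(message, field, None)
        if isinstance(value, str) and value:
            return value
    return None


# Stand-ins for SDK message objects (assumed shapes, illustration only).
deepseek_style = SimpleNamespace(content="9.8", reasoning_content="Compare 9.11 and 9.80.")
vllm_style = SimpleNamespace(content="9.8", reasoning="Compare 9.11 and 9.80.")
no_reasoning = SimpleNamespace(content="9.8")

for msg in (deepseek_style, vllm_style, no_reasoning):
    print(extract_reasoning(msg))
```

A test fixture for the real change could follow the same pattern: one message per field name, plus one message with neither field.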

The same fallback is needed in the streaming path (stream_chat / astream_chat) where the delta object similarly carries reasoning instead of reasoning_content for vLLM-served Qwen3 models.
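
For the streaming case, the same idea applies per delta. This is a sketch under assumptions: `accumulate_reasoning` is a hypothetical name, the delta shape is imitated with `SimpleNamespace`, and the real `stream_chat` / `astream_chat` code paths are not reproduced here.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def accumulate_reasoning(deltas: Iterable[Any]) -> str:
    """Concatenate reasoning fragments from streamed delta objects.

    Hypothetical helper: each delta may carry `reasoning_content` or
    `reasoning`; deltas with neither field are skipped.
    """
    parts = []
    for delta in deltas:
        fragment = getattr(delta, "reasoning_content", None)
        if not fragment:
            # Fall back to the field name vLLM uses.
            fragment = getattr(delta, "reasoning", None)
        if isinstance(fragment, str) and fragment:
            parts.append(fragment)
    return "".join(parts)


# Assumed shape of vLLM-style streamed deltas, for illustration only.
stream = [
    SimpleNamespace(reasoning="9.11 is 9.110; ", content=None),
    SimpleNamespace(reasoning="9.8 is 9.800.", content=None),
    SimpleNamespace(reasoning=None, content="9.8 is greater."),
]

print(accumulate_reasoning(stream))
```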

I'm happy to open a PR with this change plus a test fixture if maintainers agree on the approach.

Why This Matters

  • vLLM is one of the most widely used inference engines for self-hosted LLMs
  • Qwen3 / Qwen3.5 / Qwen3.6 are the canonical open-source reasoning models supported on vLLM
  • This combination (LlamaIndex + vLLM + Qwen3) is a very common stack
  • The bug is silent: users assume the model isn't reasoning, when in fact the trace is being discarded at the conversion layer
  • Workaround requires subclassing OpenAILike or monkey-patching from_openai_message, which is fragile across LlamaIndex upgrades

Environment

  • OS: Ubuntu 22.04 (Docker container)
  • Python: 3.13
  • llama-index-core: 0.14.21
  • llama-index-llms-openai: 0.7.7
  • llama-index-llms-openai-like: 0.7.2
  • vllm: 0.20.1+cu129
  • GPU: B200

Version

0.14.21

Steps to Reproduce

1. Start a vLLM server with Qwen3.6 and the qwen3 reasoning parser

docker run -d --name vllm-qwen3-6-35b \
  --runtime nvidia --gpus all \
  -p 9008:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.20.1 \
  --model Qwen/Qwen3.6-35B-A3B \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --max-model-len 262144

2. Verify vLLM emits reasoning (not reasoning_content)

curl -s http://localhost:9008/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Which is greater, 9.11 or 9.8?"}],
    "max_tokens": 2048,
    "stream": false
  }' | jq '.choices[0].message | keys'

Output:

["content", "reasoning", "role", "tool_calls", ...]

Note the field is reasoning, not reasoning_content.
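
The same check can be scripted against a saved response body. The snippet below is a rough sketch: `reasoning_field_name` is a hypothetical helper, and the sample payload is a trimmed, made-up body shaped like the output above, not a captured server response.

```python
import json
from typing import Optional


def reasoning_field_name(payload: dict) -> Optional[str]:
    """Return the key under which the first choice carries reasoning text."""
    message = payload["choices"][0]["message"]
    for key in ("reasoning_content", "reasoning"):
        value = message.get(key)
        if isinstance(value, str) and value:
            return key
    return None


# Made-up response body for illustration only.
sample = json.loads(
    '{"choices": [{"message": {"role": "assistant", '
    '"content": "9.8", "reasoning": "9.80 > 9.11", "tool_calls": []}}]}'
)

print(reasoning_field_name(sample))  # reasoning
```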

3. Call via LlamaIndex and observe missing ThinkingBlock

from llama_index.llms.openai_like import OpenAILike
# Block and message types live in llama_index.core.base.llms.types
from llama_index.core.base.llms.types import ChatMessage, ThinkingBlock, TextBlock

llm = OpenAILike(
    model="Qwen/Qwen3.6-35B-A3B",
    api_base="http://localhost:9008/v1",
    api_key="EMPTY",
    is_chat_model=True,
)

response = llm.chat([
    ChatMessage(role="user", content="Which is greater, 9.11 or 9.8?")
])

print("Blocks:", [type(b).__name__ for b in response.message.blocks])
# Actual:   ['TextBlock']
# Expected: ['ThinkingBlock', 'TextBlock']

thinking_blocks = [b for b in response.message.blocks if isinstance(b, ThinkingBlock)]
print("ThinkingBlocks found:", len(thinking_blocks))
# Actual:   0
# Expected: 1

# Workaround: inspect the raw response — the reasoning IS there, just not extracted
raw_msg = response.raw.choices[0].message
print("Raw reasoning field present:", bool(getattr(raw_msg, "reasoning", None)))
# True — the data is in the raw response but lost during conversion
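
Until a fix lands, the raw-response check can be wrapped in a small helper so that calling code does not depend on `from_openai_message`. This is a sketch, assuming `response.raw` has the `raw.choices[0].message` shape used in the script above; `reasoning_from_raw` is a hypothetical name, and the fake response exists only so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Any, Optional


def reasoning_from_raw(response: Any) -> Optional[str]:
    """Read reasoning text from `response.raw`, bypassing block conversion.

    Hypothetical helper; returns None when the raw payload is missing
    or carries no reasoning under either field name.
    """
    raw = getattr(response, "raw", None)
    choices = getattr(raw, "choices", None) or []
    if not choices:
        return None
    message = choices[0].message
    for field in ("reasoning_content", "reasoning"):
        value = getattr(message, field, None)
        if isinstance(value, str) and value:
            return value
    return None


# Stand-in for a chat response wrapping a vLLM completion (assumed shape).
fake_response = SimpleNamespace(
    raw=SimpleNamespace(
        choices=[
            SimpleNamespace(
                message=SimpleNamespace(content="9.8", reasoning="9.80 > 9.11")
            )
        ]
    )
)

print(reasoning_from_raw(fake_response))  # 9.80 > 9.11
```

Reading from the raw payload avoids subclassing or monkey-patching, at the cost of doing the extraction at every call site.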
