openclaw - 💡(How to fix) Fix Support reasoning-field outputs and visible final-answer handling for native reasoning models [2 comments, 2 participants]

jmystaki-create · 2026-04-29T03:03:09Z

[openclaw] Native reasoning models can return the useful answer in message.reasoning while message.content is null or incomplete. This makes benchmark scores a… Native reasoning models can return the useful answer in `message.reasoning` while `message.content` is null or incomplete. This makes benchmark scores and OpenClaw compatibility misleading unless the harness and runtime explicitly support reasoning fields and/or enforce visible final answers. This is a feature request for OpenClaw compatibility and `oc llm` benchmark/reporting support for reasoning-aware model outputs. It is not a Nemotron bug and not a vLLM bug. # Support reasoning-field outputs and visible final-answer handling for native reasoning models ## Summary Native reasoning models can return the useful answer in `message.reasoning` while `message.content` is null or incomplete. This makes benchmark scores and OpenClaw compatibility misleading unless the harness and runtime explicitly support reasoning fields and/or enforce visible final answers. This is a feature request for OpenClaw compatibility and `oc llm` benchmark/reporting support for reasoning-aware model outputs. It is not a Nemotron bug and not a vLLM bug. ## Background / context Evidence comes from the Nemotron Super GB10 benchmark investigation under: `/root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/` Known context: - Model tested: `nemotron-3-super` - Underlying model: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` - Runtime: vLLM OpenAI-compatible endpoint - Hardware: Dell GB10 Pro Max - Benchmark: OpenClaw local `oc llm` non-smoke benchmark - Lane A: strict OpenClaw-compatible mode - Lane B: native reasoning mode ## Problem Native reasoning mode can improve model quality substantially, but OpenClaw currently appears to be at risk of underusing or mishandling these outputs if it only consumes visible `message.content`. Observed issue: - Corrected Lane B quality is high: **197/210 = 93.81%** - But visible-content-only scoring drops to **151/210 = 71.90%** - This means if OpenClaw only consumes visible content, it may treat many correct native-reasoning responses as unusable. - Some tasks returned `content:null` while the reasoning field had the useful answer. - Coding and OpenClaw realworld visible-content-only scoring were especially poor in the audit, showing that native reasoning output shape is not reliably compatible with ordinary content-only consumers. ## Evidence from Nemotron Lane B Verified from the local corrected metrics and audit files: - Lane A baseline: **184/210 = 87.62%** - Original Lane B: **188/210 = 89.52%** - Corrected final-answer Lane B: **197/210 = 93.81%** - Visible-content-only Lane B: **151/210 = 71.90%** Per-pack visible-content-only scores: - coding: **0/25 = 0.00%** - longcontext: **17/20 = 85.00%** - openclaw_realworld: **0/20 = 0.00%** - reasoning: **134/145 = 92.41%** Audit interpretation: Native reasoning is higher quality after corrected final-answer extraction, but OpenClaw usability without reasoning-field support is poor because native reasoning frequently returned `content:null` or no final answer in visible content. ## Desired behavior OpenClaw and `oc llm` should explicitly support reasoning-aware response handling rather than silently treating `content:null` as an ordinary empty/failed model response when a useful `message.reasoning` field exists. The system should be able to answer separate questions: - Is the model response high quality if reasoning fields are allowed? - Is the response visible-content compatible? - Is the response strict-output compatible? - Is the response safe for tool/JSON tasks? - Did extraction come from `message.content`, `message.reasoning`, or a fallback path? ## Proposed design ### 1. Reasoning field ingestion - Detect and record `message.reasoning` when present. - Allow selected benchmarks/modes to extract final answers from `message.reasoning` when `message.content` is null or incomplete. - Make this opt-in per benchmark lane or model profile so ordinary content-only scoring remains available. ### 2. Visible final answer enforcement - Add an OpenClaw model/profile option to require a final answer in visible `message.content`. - Support request-level flags such as: ```json { "extra_body": { "chat_template_kwargs": { "force_nonempty_content": true } } } ``` - Report whether visible content was non-empty for each task. ### 3. Extraction source reporting Store these fields in `oc llm` result output: - raw `message.content` - raw `message.reasoning` - extraction source: `content`, `reasoning`, or `fallback` - extraction method - ambiguity flag - final extracted answer Include these in JSON and markdown reports. ### 4. Model profile support Add model profile fields such as: - `supports_reasoning_field: true` - `requires_visible_content: true/false` - `final_answer_extraction: content_first | reasoning_allowed | content_only` - `force_nonempty_content: true/false` - `native_reasoning_ena

openclaw2026-04-29 03:03:09

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#74021•Fetched 2026-04-30 06:29:42

View on GitHub

Comments

Participants

Timeline

Reactions

Author

jmystaki-create

Participants

clawsweeper[bot]

jmystaki-create

Timeline (top)

commented ×2cross-referenced ×2mentioned ×1subscribed ×1

Native reasoning models can return the useful answer in message.reasoning while message.content is null or incomplete. This makes benchmark scores and OpenClaw compatibility misleading unless the harness and runtime explicitly support reasoning fields and/or enforce visible final answers.

This is a feature request for OpenClaw compatibility and oc llm benchmark/reporting support for reasoning-aware model outputs. It is not a Nemotron bug and not a vLLM bug.

Error Message

Reports should clearly warn when a mode is high quality but poor content compatibility.

Reports clearly warn when a mode is high quality but poor content compatibility.

Root Cause

Native reasoning is higher quality after corrected final-answer extraction, but OpenClaw usability without reasoning-field support is poor because native reasoning frequently returned content:null or no final answer in visible content.

Code Example

{
  "extra_body": {
    "chat_template_kwargs": {
      "force_nonempty_content": true
    }
  }
}

RAW_BUFFERClick to expand / collapse

Support reasoning-field outputs and visible final-answer handling for native reasoning models

Summary

This is a feature request for OpenClaw compatibility and oc llm benchmark/reporting support for reasoning-aware model outputs. It is not a Nemotron bug and not a vLLM bug.

Background / context

Evidence comes from the Nemotron Super GB10 benchmark investigation under:

/root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/

Known context:

Model tested: nemotron-3-super
Underlying model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Runtime: vLLM OpenAI-compatible endpoint
Hardware: Dell GB10 Pro Max
Benchmark: OpenClaw local oc llm non-smoke benchmark
Lane A: strict OpenClaw-compatible mode
Lane B: native reasoning mode

Problem

Native reasoning mode can improve model quality substantially, but OpenClaw currently appears to be at risk of underusing or mishandling these outputs if it only consumes visible message.content.

Observed issue:

Corrected Lane B quality is high: 197/210 = 93.81%
But visible-content-only scoring drops to 151/210 = 71.90%
This means if OpenClaw only consumes visible content, it may treat many correct native-reasoning responses as unusable.
Some tasks returned content:null while the reasoning field had the useful answer.
Coding and OpenClaw realworld visible-content-only scoring were especially poor in the audit, showing that native reasoning output shape is not reliably compatible with ordinary content-only consumers.

Evidence from Nemotron Lane B

Verified from the local corrected metrics and audit files:

Lane A baseline: 184/210 = 87.62%
Original Lane B: 188/210 = 89.52%
Corrected final-answer Lane B: 197/210 = 93.81%
Visible-content-only Lane B: 151/210 = 71.90%

Per-pack visible-content-only scores:

coding: 0/25 = 0.00%
longcontext: 17/20 = 85.00%
openclaw_realworld: 0/20 = 0.00%
reasoning: 134/145 = 92.41%

Audit interpretation:

Desired behavior

OpenClaw and oc llm should explicitly support reasoning-aware response handling rather than silently treating content:null as an ordinary empty/failed model response when a useful message.reasoning field exists.

The system should be able to answer separate questions:

Is the model response high quality if reasoning fields are allowed?
Is the response visible-content compatible?
Is the response strict-output compatible?
Is the response safe for tool/JSON tasks?
Did extraction come from message.content, message.reasoning, or a fallback path?

Proposed design

1. Reasoning field ingestion

Detect and record message.reasoning when present.
Allow selected benchmarks/modes to extract final answers from message.reasoning when message.content is null or incomplete.
Make this opt-in per benchmark lane or model profile so ordinary content-only scoring remains available.

2. Visible final answer enforcement

Add an OpenClaw model/profile option to require a final answer in visible message.content.
Support request-level flags such as:

{
  "extra_body": {
    "chat_template_kwargs": {
      "force_nonempty_content": true
    }
  }
}

Report whether visible content was non-empty for each task.

3. Extraction source reporting

Store these fields in oc llm result output:

raw message.content
raw message.reasoning
extraction source: content, reasoning, or fallback
extraction method
ambiguity flag
final extracted answer

Include these in JSON and markdown reports.

4. Model profile support

Add model profile fields such as:

supports_reasoning_field: true
requires_visible_content: true/false
final_answer_extraction: content_first | reasoning_allowed | content_only
force_nonempty_content: true/false
native_reasoning_enabled: true/false

5. Reporting

In oc llm reports, include:

content-null count
reasoning-field-present count
reasoning-only answer count
visible-final-answer count
extraction-source breakdown
compatibility warning when quality score is high but visible-content-only score is low

Suggested scoring/reporting modes

Add separate scoring modes:

quality_score: may use reasoning field if configured
visible_content_score: only uses visible content
openclaw_compat_score: tests whether the result is usable by normal OpenClaw message handling
strict_content_score: requires exact visible output
reasoning_aware_score: allows final extraction from reasoning

The same Lane B dataset should show both:

corrected reasoning-aware quality around 93.81%
visible-content-only compatibility around 71.90%

Reports should clearly warn when a mode is high quality but poor content compatibility.

Operational routing implications

OpenClaw should be able to route native-reasoning models differently:

strict mode for normal tool/JSON/content tasks
native reasoning mode for hard reasoning tasks
fallback mode only when latency budget allows
disable native reasoning for tool/JSON tasks unless explicitly configured

This is important because a native reasoning model may be strong for hard reasoning but unreliable for visible-content-only or strict structured-output tasks.

Acceptance criteria

oc llm can report both reasoning-aware and visible-content-only scores.
oc llm records whether each answer came from message.content or message.reasoning.
OpenClaw can optionally use message.reasoning for final-answer extraction when configured.
OpenClaw can optionally require visible non-empty content and report failures when content is null.
The same Lane B dataset shows both:
- corrected reasoning-aware quality around 197/210 = 93.81%
- visible-content-only compatibility around 151/210 = 71.90%
Reports clearly warn when a mode is high quality but poor content compatibility.
The system does not silently treat content:null as equivalent to an empty/failed model response when a reasoning field exists.

Attachments / local evidence paths

Source metrics: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_metrics_20260429_004524.json
Corrected metrics: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_corrected_metrics_20260429_005736.json
Scorer audit report: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/B_native_reasoning_scorer_audit_20260429_005736.md
A/B comparison report: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/A_vs_B_native_reasoning_comparison_20260429_004524.md
Project status: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/PROJECT_STATUS.md

Relationship to scorer bug issue

This is related to, but separate from, the scorer bug filed as:

https://github.com/openclaw/openclaw/issues/74017

The scorer bug is: correct final extracted answers were marked wrong when ambiguity_flag=true.

This feature request is: OpenClaw should explicitly support and report native reasoning output shapes, including cases where the useful answer is in message.reasoning while message.content is null or incomplete.

extent analysis

TL;DR

To address the issue of OpenClaw underutilizing native reasoning model outputs, support for reasoning fields and visible final answer handling should be added to OpenClaw and oc llm benchmarking.

Guidance

Detect and record message.reasoning: Modify OpenClaw to detect and record message.reasoning when present, allowing for the extraction of final answers from this field when message.content is null or incomplete.
Add opt-in benchmark modes: Introduce opt-in modes for benchmarks to extract final answers from message.reasoning, ensuring ordinary content-only scoring remains available.
Enforce visible final answers: Implement an option to require a final answer in visible message.content and report whether visible content was non-empty for each task.
Report extraction sources: Store and report the extraction source (content, reasoning, or fallback) and method in oc llm result outputs.
Update model profiles: Add fields to model profiles to indicate support for reasoning fields, requirement for visible content, and final answer extraction methods.

Example

{
  "extra_body": {
    "chat_template_kwargs": {
      "force_nonempty_content": true
    }
  }
}

This example shows how to require a non-empty message.content for a specific task.

Notes

The implementation should ensure backward compatibility with existing content-only scoring and allow for separate scoring modes (e.g., quality_score, visible_content_score) to accommodate different use cases.

Recommendation

Apply the proposed design changes to support reasoning-aware response handling and visible final answer enforcement, ensuring OpenClaw can effectively utilize native reasoning model outputs.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#tool integration #LLM response #prompt template #agent execution #callback error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.