openclaw - 💡(How to fix) Fix Support reasoning-field outputs and visible final-answer handling for native reasoning models [2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#74021Fetched 2026-04-30 06:29:42
View on GitHub
Comments
2
Participants
2
Timeline
6
Reactions
0
Timeline (top)
commented ×2cross-referenced ×2mentioned ×1subscribed ×1

Native reasoning models can return the useful answer in message.reasoning while message.content is null or incomplete. This makes benchmark scores and OpenClaw compatibility misleading unless the harness and runtime explicitly support reasoning fields and/or enforce visible final answers.

This is a feature request for OpenClaw compatibility and oc llm benchmark/reporting support for reasoning-aware model outputs. It is not a Nemotron bug and not a vLLM bug.

Error Message

Reports should clearly warn when a mode is high quality but poor content compatibility.

  • Reports clearly warn when a mode is high quality but poor content compatibility.

Root Cause

Native reasoning is higher quality after corrected final-answer extraction, but OpenClaw usability without reasoning-field support is poor because native reasoning frequently returned content:null or no final answer in visible content.

Code Example

{
  "extra_body": {
    "chat_template_kwargs": {
      "force_nonempty_content": true
    }
  }
}
RAW_BUFFERClick to expand / collapse

Support reasoning-field outputs and visible final-answer handling for native reasoning models

Summary

Native reasoning models can return the useful answer in message.reasoning while message.content is null or incomplete. This makes benchmark scores and OpenClaw compatibility misleading unless the harness and runtime explicitly support reasoning fields and/or enforce visible final answers.

This is a feature request for OpenClaw compatibility and oc llm benchmark/reporting support for reasoning-aware model outputs. It is not a Nemotron bug and not a vLLM bug.

Background / context

Evidence comes from the Nemotron Super GB10 benchmark investigation under:

/root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/

Known context:

  • Model tested: nemotron-3-super
  • Underlying model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
  • Runtime: vLLM OpenAI-compatible endpoint
  • Hardware: Dell GB10 Pro Max
  • Benchmark: OpenClaw local oc llm non-smoke benchmark
  • Lane A: strict OpenClaw-compatible mode
  • Lane B: native reasoning mode

Problem

Native reasoning mode can improve model quality substantially, but OpenClaw currently appears to be at risk of underusing or mishandling these outputs if it only consumes visible message.content.

Observed issue:

  • Corrected Lane B quality is high: 197/210 = 93.81%
  • But visible-content-only scoring drops to 151/210 = 71.90%
  • This means if OpenClaw only consumes visible content, it may treat many correct native-reasoning responses as unusable.
  • Some tasks returned content:null while the reasoning field had the useful answer.
  • Coding and OpenClaw realworld visible-content-only scoring were especially poor in the audit, showing that native reasoning output shape is not reliably compatible with ordinary content-only consumers.

Evidence from Nemotron Lane B

Verified from the local corrected metrics and audit files:

  • Lane A baseline: 184/210 = 87.62%
  • Original Lane B: 188/210 = 89.52%
  • Corrected final-answer Lane B: 197/210 = 93.81%
  • Visible-content-only Lane B: 151/210 = 71.90%

Per-pack visible-content-only scores:

  • coding: 0/25 = 0.00%
  • longcontext: 17/20 = 85.00%
  • openclaw_realworld: 0/20 = 0.00%
  • reasoning: 134/145 = 92.41%

Audit interpretation:

Native reasoning is higher quality after corrected final-answer extraction, but OpenClaw usability without reasoning-field support is poor because native reasoning frequently returned content:null or no final answer in visible content.

Desired behavior

OpenClaw and oc llm should explicitly support reasoning-aware response handling rather than silently treating content:null as an ordinary empty/failed model response when a useful message.reasoning field exists.

The system should be able to answer separate questions:

  • Is the model response high quality if reasoning fields are allowed?
  • Is the response visible-content compatible?
  • Is the response strict-output compatible?
  • Is the response safe for tool/JSON tasks?
  • Did extraction come from message.content, message.reasoning, or a fallback path?

Proposed design

1. Reasoning field ingestion

  • Detect and record message.reasoning when present.
  • Allow selected benchmarks/modes to extract final answers from message.reasoning when message.content is null or incomplete.
  • Make this opt-in per benchmark lane or model profile so ordinary content-only scoring remains available.

2. Visible final answer enforcement

  • Add an OpenClaw model/profile option to require a final answer in visible message.content.
  • Support request-level flags such as:
{
  "extra_body": {
    "chat_template_kwargs": {
      "force_nonempty_content": true
    }
  }
}
  • Report whether visible content was non-empty for each task.

3. Extraction source reporting

Store these fields in oc llm result output:

  • raw message.content
  • raw message.reasoning
  • extraction source: content, reasoning, or fallback
  • extraction method
  • ambiguity flag
  • final extracted answer

Include these in JSON and markdown reports.

4. Model profile support

Add model profile fields such as:

  • supports_reasoning_field: true
  • requires_visible_content: true/false
  • final_answer_extraction: content_first | reasoning_allowed | content_only
  • force_nonempty_content: true/false
  • native_reasoning_enabled: true/false

5. Reporting

In oc llm reports, include:

  • content-null count
  • reasoning-field-present count
  • reasoning-only answer count
  • visible-final-answer count
  • extraction-source breakdown
  • compatibility warning when quality score is high but visible-content-only score is low

Suggested scoring/reporting modes

Add separate scoring modes:

  • quality_score: may use reasoning field if configured
  • visible_content_score: only uses visible content
  • openclaw_compat_score: tests whether the result is usable by normal OpenClaw message handling
  • strict_content_score: requires exact visible output
  • reasoning_aware_score: allows final extraction from reasoning

The same Lane B dataset should show both:

  • corrected reasoning-aware quality around 93.81%
  • visible-content-only compatibility around 71.90%

Reports should clearly warn when a mode is high quality but poor content compatibility.

Operational routing implications

OpenClaw should be able to route native-reasoning models differently:

  • strict mode for normal tool/JSON/content tasks
  • native reasoning mode for hard reasoning tasks
  • fallback mode only when latency budget allows
  • disable native reasoning for tool/JSON tasks unless explicitly configured

This is important because a native reasoning model may be strong for hard reasoning but unreliable for visible-content-only or strict structured-output tasks.

Acceptance criteria

  • oc llm can report both reasoning-aware and visible-content-only scores.
  • oc llm records whether each answer came from message.content or message.reasoning.
  • OpenClaw can optionally use message.reasoning for final-answer extraction when configured.
  • OpenClaw can optionally require visible non-empty content and report failures when content is null.
  • The same Lane B dataset shows both:
    • corrected reasoning-aware quality around 197/210 = 93.81%
    • visible-content-only compatibility around 151/210 = 71.90%
  • Reports clearly warn when a mode is high quality but poor content compatibility.
  • The system does not silently treat content:null as equivalent to an empty/failed model response when a reasoning field exists.

Attachments / local evidence paths

  • Source metrics: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_metrics_20260429_004524.json
  • Corrected metrics: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/results/B_native_reasoning_corrected_metrics_20260429_005736.json
  • Scorer audit report: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/B_native_reasoning_scorer_audit_20260429_005736.md
  • A/B comparison report: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/reports/A_vs_B_native_reasoning_comparison_20260429_004524.md
  • Project status: /root/.openclaw/workspace/benchmarking/nemotron_config_matrix_2026_04/PROJECT_STATUS.md

Relationship to scorer bug issue

This is related to, but separate from, the scorer bug filed as:

https://github.com/openclaw/openclaw/issues/74017

The scorer bug is: correct final extracted answers were marked wrong when ambiguity_flag=true.

This feature request is: OpenClaw should explicitly support and report native reasoning output shapes, including cases where the useful answer is in message.reasoning while message.content is null or incomplete.

extent analysis

TL;DR

To address the issue of OpenClaw underutilizing native reasoning model outputs, support for reasoning fields and visible final answer handling should be added to OpenClaw and oc llm benchmarking.

Guidance

  1. Detect and record message.reasoning: Modify OpenClaw to detect and record message.reasoning when present, allowing for the extraction of final answers from this field when message.content is null or incomplete.
  2. Add opt-in benchmark modes: Introduce opt-in modes for benchmarks to extract final answers from message.reasoning, ensuring ordinary content-only scoring remains available.
  3. Enforce visible final answers: Implement an option to require a final answer in visible message.content and report whether visible content was non-empty for each task.
  4. Report extraction sources: Store and report the extraction source (content, reasoning, or fallback) and method in oc llm result outputs.
  5. Update model profiles: Add fields to model profiles to indicate support for reasoning fields, requirement for visible content, and final answer extraction methods.

Example

{
  "extra_body": {
    "chat_template_kwargs": {
      "force_nonempty_content": true
    }
  }
}

This example shows how to require a non-empty message.content for a specific task.

Notes

The implementation should ensure backward compatibility with existing content-only scoring and allow for separate scoring modes (e.g., quality_score, visible_content_score) to accommodate different use cases.

Recommendation

Apply the proposed design changes to support reasoning-aware response handling and visible final answer enforcement, ensuring OpenClaw can effectively utilize native reasoning model outputs.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix Support reasoning-field outputs and visible final-answer handling for native reasoning models [2 comments, 2 participants]