openclaw - ✅(Solved) Fix [GPT 5.4 v3 — PR 1/6] Add mandatory tool-use categories for GPT-5 models [1 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#66346Fetched 2026-04-15 06:26:35
View on GitHub
Comments
0
Participants
1
Timeline
4
Reactions
0
Participants
Timeline (top)
cross-referenced ×3referenced ×1

Fix Action

Fixed

PR fix notes

PR #66371: feat(openai): add mandatory tool-use categories for GPT-5 models [v3 1/6]

Description (problem / solution / changelog)

GPT 5.4 Enhancement v3 — PR 1/6

Tracking: #66345 | Issue: #66346 Priority: P0 — CRITICAL | Gap closure: ~35%

Problem

GPT 5.4 on OpenClaw answers factual, computational, and system-state questions from training data instead of calling tools. This is the #1 root cause of the behavioral gap vs Hermes Agent.

               User: "What time is it?"
          ┌─────────────┴──────────────┐
          │                            │
    ┌─────▼──────┐              ┌──────▼──────┐
    │   Hermes    │              │  OpenClaw    │
    │             │              │  (before)    │
    │ Has:        │              │              │
    │ "NEVER      │              │ No mandatory │
    │  answer     │              │ tool-use     │
    │  from       │              │ categories   │
    │  memory"    │              │              │
    │ + 8 categ.  │              │              │
    └─────┬───────┘              └──────┬───────┘
          │                             │
    ┌─────▼──────┐              ┌───────▼──────┐
    │ Calls       │              │ Answers from  │
    │ `date` tool │              │ training data │
    │ ✅ CORRECT   │              │ ❌ STALE       │
    └─────────────┘              └──────────────┘

Changes

extensions/openai/prompt-overlay.ts:

  • Added OPENAI_GPT5_TOOL_ENFORCEMENT constant with 8 mandatory tool-use categories:
    • Arithmetic/math, hashes/encodings, timestamps, system state, file contents, git, current facts, network checks
  • Appended to stablePrefix in resolveOpenAISystemPromptContribution

Hermes Reference

agent/prompt_builder.py lines 207-218 — <mandatory_tool_use> block within OPENAI_MODEL_EXECUTION_GUIDANCE.

Verification

Test GPT 5.4 — all should call tools, not answer from memory:

  • "What time is it?" → date
  • "What's the sha256 of 'hello'?" → echo -n hello | sha256sum
  • "How much free disk space?" → df -h
  • "What branch am I on?" → git branch --show-current
  • "What's 2^64?" → terminal/code execution

Note

Currently placed in stablePrefix. PR 3 (#66348) will promote tool_enforcement to a first-class sectionOverrides key, at which point this moves from stablePrefix to sectionOverrides.tool_enforcement.

Changed files

  • extensions/openai/index.test.ts (modified, +19/-7)
  • extensions/openai/prompt-overlay.ts (modified, +25/-1)

Code Example

User: "What time is it?"
          ┌─────────────┴──────────────┐
          │                            │
    ┌─────▼──────┐              ┌──────▼──────┐
Hermes    │              │  OpenClaw    │             │              │              │
Has:        │              │ Missing:    │ "NEVER      │              │ No mandatory │
    │  answer     │              │ tool-use     │
from       │              │ categories   │
    │  memory"    │              │              │
+ 7 categ.                │              │
    └─────┬───────┘              └──────┬───────┘
          │                             │
    ┌─────▼──────┐              ┌───────▼──────┐
Calls       │              │ Answers from`date` tool │              │ training data │
    │ → 2026-04   │              │ → maybe wrong │
    │ ✅           │              │ ❌             │
    └─────────────┘              └──────────────┘

---

OPENAI_MODEL_EXECUTION_GUIDANCE = (
    "<mandatory_tool_use>\n"
    "NEVER answer these from memory or mental computation — ALWAYS use a tool:\n"
    "- Arithmetic, math, calculations → use terminal or execute_code\n"
    "- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)\n"
    "- Current time, date, timezone → use terminal (e.g. date)\n"
    "- System state: OS, CPU, memory, disk, ports, processes → use terminal\n"
    "- File contents, sizes, line counts → use read_file, search_files, or terminal\n"
    "- Git history, branches, diffs → use terminal\n"
    "- Current facts (weather, news, versions) → use web_search\n"
    "</mandatory_tool_use>"
)

---

export const OPENAI_GPT5_TOOL_ENFORCEMENT = `## Mandatory Tool Use

NEVER answer these from memory or mental computation — ALWAYS use a tool:
- Arithmetic, math, calculations → use terminal or code execution
- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)
- Current time, date, timezone → use terminal (e.g. date)
- System state: OS, CPU, memory, disk, ports, processes → use terminal
- File contents, sizes, line counts → use read or search tools, or terminal
- Git history, branches, diffs, status → use terminal
- Current facts (weather, news, package versions) → use web search
- Network checks (port open, DNS, connectivity) → use terminal

Your training data is stale. The execution environment may differ from what you expect. Always ground answers in live tool output.`;

---

return {
  stablePrefix: [OPENAI_GPT5_OUTPUT_CONTRACT, OPENAI_GPT5_TOOL_CALL_STYLE].join("\n\n"),
  sectionOverrides: {
    execution_bias: OPENAI_GPT5_EXECUTION_BIAS,
    tool_enforcement: OPENAI_GPT5_TOOL_ENFORCEMENT,  // ← NEW
    ...(params.mode === "friendly" ? { interaction_style: OPENAI_FRIENDLY_PROMPT_OVERLAY } : {}),
  },
};
RAW_BUFFERClick to expand / collapse

Parent: #66345 — GPT 5.4 Enhancement v3: Hermes Parity Sprint

Priority: P0 — CRITICAL Estimated gap closure: ~35%

Problem

GPT 5.4 on OpenClaw answers factual, computational, and system-state questions from training data instead of calling tools. This produces hallucinated or stale answers and breaks the agentic flow — the model "thinks" it's done when it hasn't actually verified anything.

Hermes Agent solves this with an explicit mandatory tool-use list that GPT-5 receives in its system prompt. OpenClaw has no equivalent.

               User: "What time is it?"
          ┌─────────────┴──────────────┐
          │                            │
    ┌─────▼──────┐              ┌──────▼──────┐
    │   Hermes    │              │  OpenClaw    │
    │             │              │              │
    │ Has:        │              │ Missing:     │
    │ "NEVER      │              │ No mandatory │
    │  answer     │              │ tool-use     │
    │  from       │              │ categories   │
    │  memory"    │              │              │
    │ + 7 categ.  │              │              │
    └─────┬───────┘              └──────┬───────┘
          │                             │
    ┌─────▼──────┐              ┌───────▼──────┐
    │ Calls       │              │ Answers from  │
    │ `date` tool │              │ training data │
    │ → 2026-04   │              │ → maybe wrong │
    │ ✅           │              │ ❌             │
    └─────────────┘              └──────────────┘

Hermes Reference

agent/prompt_builder.py lines 207-218:

OPENAI_MODEL_EXECUTION_GUIDANCE = (
    "<mandatory_tool_use>\n"
    "NEVER answer these from memory or mental computation — ALWAYS use a tool:\n"
    "- Arithmetic, math, calculations → use terminal or execute_code\n"
    "- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)\n"
    "- Current time, date, timezone → use terminal (e.g. date)\n"
    "- System state: OS, CPU, memory, disk, ports, processes → use terminal\n"
    "- File contents, sizes, line counts → use read_file, search_files, or terminal\n"
    "- Git history, branches, diffs → use terminal\n"
    "- Current facts (weather, news, versions) → use web_search\n"
    "</mandatory_tool_use>"
)

Proposed Implementation

File: extensions/openai/prompt-overlay.ts

Add new constant:

export const OPENAI_GPT5_TOOL_ENFORCEMENT = `## Mandatory Tool Use

NEVER answer these from memory or mental computation — ALWAYS use a tool:
- Arithmetic, math, calculations → use terminal or code execution
- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)
- Current time, date, timezone → use terminal (e.g. date)
- System state: OS, CPU, memory, disk, ports, processes → use terminal
- File contents, sizes, line counts → use read or search tools, or terminal
- Git history, branches, diffs, status → use terminal
- Current facts (weather, news, package versions) → use web search
- Network checks (port open, DNS, connectivity) → use terminal

Your training data is stale. The execution environment may differ from what you expect. Always ground answers in live tool output.`;

Wire into resolveOpenAISystemPromptContribution:

return {
  stablePrefix: [OPENAI_GPT5_OUTPUT_CONTRACT, OPENAI_GPT5_TOOL_CALL_STYLE].join("\n\n"),
  sectionOverrides: {
    execution_bias: OPENAI_GPT5_EXECUTION_BIAS,
    tool_enforcement: OPENAI_GPT5_TOOL_ENFORCEMENT,  // ← NEW
    ...(params.mode === "friendly" ? { interaction_style: OPENAI_FRIENDLY_PROMPT_OVERLAY } : {}),
  },
};

Note: This uses sectionOverrides.tool_enforcement which requires PR 3 to land first. If shipping independently, place in stablePrefix instead.

File: extensions/openai/index.test.ts

Add test verifying GPT-5 models receive tool enforcement, non-GPT-5 models do not.

Verification

Test GPT 5.4 with these prompts (all should call tools, not answer from memory):

  • "What time is it?" → calls terminal date
  • "What's the sha256 of 'hello'?" → calls terminal echo -n hello | sha256sum
  • "How much free disk space?" → calls terminal df -h
  • "What branch am I on?" → calls terminal git branch --show-current
  • "What's 2^64?" → calls terminal or code execution

Impact

This is the single highest-ROI change. It closes ~35% of the behavioral gap between OpenClaw and Hermes for GPT 5.4 by eliminating the most common failure mode: answering from stale training data.

extent analysis

TL;DR

To fix the issue of GPT 5.4 answering questions from training data instead of calling tools, add a mandatory tool-use list to the system prompt.

Guidance

  • Implement the proposed OPENAI_GPT5_TOOL_ENFORCEMENT constant in extensions/openai/prompt-overlay.ts to define the mandatory tool-use list.
  • Wire the OPENAI_GPT5_TOOL_ENFORCEMENT into resolveOpenAISystemPromptContribution using sectionOverrides.tool_enforcement.
  • Verify the fix by testing GPT 5.4 with prompts that should call tools, such as "What time is it?" or "What's the sha256 of 'hello'?".
  • Ensure that the test cases cover various scenarios, including arithmetic, system state, and file contents.

Example

export const OPENAI_GPT5_TOOL_ENFORCEMENT = `## Mandatory Tool Use

NEVER answer these from memory or mental computation — ALWAYS use a tool:
- Arithmetic, math, calculations → use terminal or code execution
- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)
- Current time, date, timezone → use terminal (e.g. date)
- System state: OS, CPU, memory, disk, ports, processes → use terminal
- File contents, sizes, line counts → use read or search tools, or terminal
- Git history, branches, diffs, status → use terminal
- Current facts (weather, news, package versions) → use web search
- Network checks (port open, DNS, connectivity) → use terminal

Your training data is stale. The execution environment may differ from what you expect. Always ground answers in live tool output.`;

Notes

The proposed implementation requires PR 3 to land first. If shipping independently, the OPENAI_GPT5_TOOL_ENFORCEMENT should be placed in stablePrefix instead of sectionOverrides.

Recommendation

Apply the proposed workaround by implementing the OPENAI_GPT5_TOOL_ENFORCEMENT constant and wiring it into resolveOpenAISystemPromptContribution. This should fix the issue of GPT 5.4 answering questions from training data instead of calling tools.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix [GPT 5.4 v3 — PR 1/6] Add mandatory tool-use categories for GPT-5 models [1 pull requests, 1 participants]