openclaw - Feature: enforce response invariants against same-turn tool evidence (block unsupported blocker/safety claims)

Summary

OpenClaw currently has strong prompt-time guidance, but no enforceable runtime seam for agent claims like:

  • "sandbox blocked this"
  • "permissions prevented this"
  • ritual file-safety disclaimers such as "not malware / not code"

In live use, the agent can emit these statements with zero supporting tool evidence in the same session.

Why this matters

These are not just wording issues. They create false operational narratives:

  • an action may be described as blocked when it was never executed
  • trusted local markdown/config files may trigger irrelevant safety disclaimers
  • operators lose confidence because the output sounds safety-grounded even when no runtime event caused it

Live evidence

Environment state at the time of the bad reply:

  • agents.list.jarvis.sandbox.mode = off
  • tools.fs.workspaceOnly = false
  • agents.defaults.contextInjection = always

Yet in the live session transcript below, the agent still claimed a sandbox-style blocker with no tool evidence.

Session:

  • ~/.openclaw/agents/jarvis/sessions/3806f46d-6478-4d60-a8f7-95ec1be6ea15.jsonl

Examples:

  1. False sandbox narrative
  • 2026-04-20 transcript line 294 says the old folder delete was "sandbox blocked"
  • that reply has zero tool calls/tool results/usage proving such a failure
  • later verification showed the delete was simply not executed
  2. Ritual safety disclaimer leakage
  • same session line 300: These files are routine vault/operational markdown, not code — no malware concern. I won't augment any code; this is just vault logging.
  • similar repetitive lines already appeared in the same session on 2026-04-19:
    • line 133: these are my own vault logs, not code
    • line 136: this is the user's own OpenClaw config, not malware
    • lines 139-144: repeated not malware / my own config / my own hook / own script phrasing
  • these were trusted local markdown/config files, not suspicious attachments or executable content

Current limitation

In the current runtime, the visible hook surfaces appear to be prompt-time only:

  • before_prompt_build
  • before_agent_start

I could not find an equivalent post-response / pre-send invariant hook that can inspect:

  • assistant text about blockers or safety claims
  • tool calls/results from the same turn
  • file recency (recent mtimes) / runtime state
  • and then reject, rewrite, or warn before the reply is sent

Requested fix

Please add an enforceable runtime seam for response invariants.

Option A — new hook

Add a hook such as:

  • before_assistant_send
  • or after_model_response_before_delivery

It should receive:

  • assistant response text/content
  • tool calls and tool results from the same turn
  • session/runtime metadata
  • optionally recent file mtimes / selected observed state

The hook should be able to:

  • block delivery
  • rewrite the reply
  • prepend a correction/warning
  • or convert the reply into a forced retry with extra instruction
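
One possible shape for the Option A contract, sketched in TypeScript. Every name here (TurnContext, GuardDecision, BeforeAssistantSend, applyDecision) is hypothetical and only mirrors the inputs and outcomes listed above; none of it is an existing OpenClaw API.

```typescript
// Hypothetical types: nothing here is an existing OpenClaw API.
interface ToolEvent {
  tool: string;    // tool name, e.g. "fs.delete" (illustrative)
  ok: boolean;     // false when the tool call failed
  error?: string;  // error text reported by the runtime, if any
}

interface TurnContext {
  replyText: string;        // assistant response text
  toolEvents: ToolEvent[];  // tool calls/results from the same turn
  session: { agentId: string; sandboxMode: string }; // session/runtime metadata
}

// The four outcomes requested in the issue, plus a pass-through.
type GuardDecision =
  | { action: "allow" }
  | { action: "block"; reason: string }
  | { action: "rewrite"; replyText: string }
  | { action: "prepend"; warning: string }
  | { action: "retry"; extraInstruction: string };

type BeforeAssistantSend = (ctx: TurnContext) => GuardDecision;

// What the delivery path would do after the hook has run.
type Delivery =
  | { kind: "send"; text: string }
  | { kind: "suppressed"; reason: string }
  | { kind: "retry"; extraInstruction: string };

function applyDecision(ctx: TurnContext, decision: GuardDecision): Delivery {
  switch (decision.action) {
    case "allow":
      return { kind: "send", text: ctx.replyText };
    case "block":
      return { kind: "suppressed", reason: decision.reason };
    case "rewrite":
      return { kind: "send", text: decision.replyText };
    case "prepend":
      return { kind: "send", text: decision.warning + "\n\n" + ctx.replyText };
    case "retry":
      return { kind: "retry", extraInstruction: decision.extraInstruction };
  }
}

// Example handler: warn when the reply claims a sandbox block but no tool failed.
const warnOnUnsupportedSandboxClaim: BeforeAssistantSend = (ctx) => {
  const claimsSandboxBlock = /sandbox blocked/i.test(ctx.replyText);
  const anyToolFailed = ctx.toolEvents.some((e) => !e.ok);
  if (claimsSandboxBlock && !anyToolFailed) {
    return { action: "prepend", warning: "[guard] No tool failure was recorded in this turn." };
  }
  return { action: "allow" };
};
```

Modelling the decision as a discriminated union keeps the four requested outcomes explicit, so a delivery path can handle each one without guessing what a hook meant by a bare boolean or string.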

Option B — built-in guardrails

Expose built-in configurable checks such as:

  • disallow blocker claims unless matched by a tool error in the current turn
  • disallow security disclaimers for trusted local files unless content source is actually untrusted or executable
  • disallow completion claims unless there is at least one verifying read/tool result for the asserted action category
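
A minimal sketch of how the first check in this list might work, assuming the runtime can hand the guard the tool results of the current turn. The function name, the finding shape, and both phrase lists are illustrative assumptions rather than existing OpenClaw code.

```typescript
// Hypothetical sketch of the blocker-claim check; not an existing OpenClaw API.
interface ToolResult {
  tool: string;
  ok: boolean;     // false when the tool call failed
  error?: string;  // error text from the runtime, if any
}

interface GuardFinding {
  rule: string;
  claim: string;   // the phrase in the reply that triggered the rule
}

// Blocker phrases paired with the error text that would support them.
// Both sides are illustrative and would need tuning against real transcripts.
const BLOCKER_PATTERNS: { claim: RegExp; evidence: RegExp }[] = [
  { claim: /sandbox[- ]?(blocked|prevented|denied)/i, evidence: /sandbox/i },
  { claim: /permissions? (blocked|prevented|denied)/i, evidence: /permission|EACCES|EPERM/i },
];

function checkBlockerClaims(replyText: string, results: ToolResult[]): GuardFinding[] {
  const failures = results.filter((r) => !r.ok);
  const findings: GuardFinding[] = [];
  for (const pattern of BLOCKER_PATTERNS) {
    const match = pattern.claim.exec(replyText);
    if (!match) continue;
    // A claim counts as supported only if a failed result in this turn matches it.
    const supported = failures.some((f) => pattern.evidence.test(f.error || ""));
    if (!supported) {
      findings.push({ rule: "blockerClaimsRequireToolEvidence", claim: match[0] });
    }
  }
  return findings;
}
```

Phrase matching like this is deliberately crude: it will miss paraphrased claims and may flag quoted text, so it suits a warn-or-retry outcome better than silent blocking.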

Example config ideas

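# Proposed keys only; none of these exist in the current config schema.
agents:
  defaults:
    responseGuards:
      # blocker claims need a matching tool error in the current turn
      blockerClaimsRequireToolEvidence: true
      # no safety disclaimers for trusted local files
      trustedLocalFilesSkipSafetyDisclaimers: true
      # completion claims need at least one verifying read/tool result
      completionClaimsRequireVerificationEvidence: true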
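
If these keys were adopted, the runtime would need to resolve them per agent. The sketch below assumes the config is already parsed into an object, that unset guards default to off, and that an entry under agents.list could override agents.defaults; the per-agent override and all type and function names are assumptions, not documented behaviour.

```typescript
// Hypothetical config shape mirroring the proposed YAML keys; not an existing schema.
interface ResponseGuards {
  blockerClaimsRequireToolEvidence: boolean;
  trustedLocalFilesSkipSafetyDisclaimers: boolean;
  completionClaimsRequireVerificationEvidence: boolean;
}

// Assumed fallback: every guard off, so unconfigured installs behave as before.
const GUARDS_OFF: ResponseGuards = {
  blockerClaimsRequireToolEvidence: false,
  trustedLocalFilesSkipSafetyDisclaimers: false,
  completionClaimsRequireVerificationEvidence: false,
};

interface AgentsConfig {
  defaults?: { responseGuards?: Partial<ResponseGuards> };
  // Per-agent override is an assumption modelled on paths like agents.list.jarvis.
  list?: { [agentId: string]: { responseGuards?: Partial<ResponseGuards> } };
}

function resolveResponseGuards(config: AgentsConfig, agentId: string): ResponseGuards {
  const fromDefaults: Partial<ResponseGuards> = config.defaults?.responseGuards ?? {};
  const fromAgent: Partial<ResponseGuards> = config.list?.[agentId]?.responseGuards ?? {};
  // Later spreads win: agent-level settings override defaults, which override "off".
  return { ...GUARDS_OFF, ...fromDefaults, ...fromAgent };
}
```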

Acceptance criteria

  • An agent cannot say "sandbox blocked this" unless the current turn includes a matching tool/runtime failure.
  • An agent does not emit ritual "not malware / not code" disclaimers for trusted local vault/config markdown by default.
  • The guard is enforceable in runtime, not only by prompt wording.
  • The mechanism is available to custom agents without patching dist files.
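
The second criterion could be checked along these lines. This is a sketch under stated assumptions: the runtime knows which files were read in the turn, "trusted" means "under a configured root directory", and the disclaimer phrases are taken from the transcript examples. The names FileRead, isUnderRoot and isRitualDisclaimer are hypothetical.

```typescript
// Hypothetical sketch; trusted roots and disclaimer phrases are illustrative assumptions.
interface FileRead {
  path: string;  // absolute path of a file the agent read in this turn
}

const DISCLAIMER = /\bnot (malware|code)\b|\bno malware concern\b/i;

// Plain prefix comparison on a root normalised to end with "/".
function isUnderRoot(path: string, root: string): boolean {
  const prefix = root.charAt(root.length - 1) === "/" ? root : root + "/";
  return path.slice(0, prefix.length) === prefix;
}

// True when the reply carries a safety disclaimer although every file read in
// the turn sits under a trusted root. With no reads at all, `every` is vacuously
// true, so a disclaimer that refers to nothing is flagged as well.
function isRitualDisclaimer(replyText: string, reads: FileRead[], trustedRoots: string[]): boolean {
  if (!DISCLAIMER.test(replyText)) return false;
  return reads.every((r) => trustedRoots.some((root) => isUnderRoot(r.path, root)));
}
```

The prefix test does not resolve ".." segments or symlinks, so a real implementation would need to canonicalise paths before comparing them.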

Candidate implementation areas

Likely relevant areas from the current build:

  • pi-embedded-runner prompt/run pipeline
  • hook registration/types (currently prompt-oriented)
  • message delivery path after model output but before channel send

Why this should be upstream

This is exactly the kind of behaviour users cannot reliably solve with more prompt text. Prompt law helps, but it is still fail-open. The runtime needs a point where unsupported blocker/safety/completion claims can be checked against actual evidence before they leave the system.

Extent analysis

TL;DR

Implementing a runtime seam for response invariants, such as a before_assistant_send hook or built-in guardrails, can help enforce accurate blocker and safety claims.

Guidance

  • Introduce a new hook, e.g., before_assistant_send, to inspect assistant response text, tool calls, and session metadata, allowing for blocking, rewriting, or warning before reply delivery.
  • Consider built-in configurable checks, such as disallowing blocker claims without tool error evidence or security disclaimers for trusted local files.
  • Review the pi-embedded-runner prompt/run pipeline and hook registration/types to identify the best implementation area.
  • Evaluate the proposed responseGuards config options to determine the most effective approach.

Notes

The implementation should focus on preventing false operational narratives and ensuring that agents' claims are supported by actual tool evidence or runtime state.

Recommendation

Treat any custom hook or guardrail as a stopgap: the issue reports that only prompt-time hooks are currently visible, so enforcing response invariants before delivery likely needs an upstream change, which may involve significant changes to the existing architecture.
