openclaw - 💡(How to fix) Fix Feature: enforce response invariants against same-turn tool evidence (block unsupported blocker/safety claims)

StepCodex · 2026-04-20T18:59:31Z

[openclaw] OpenClaw currently has strong prompt-time guidance, but no enforceable runtime seam for agent claims like: - "sandbox blocked this" - "permissions p… OpenClaw currently has strong prompt-time guidance, but no enforceable runtime seam for agent claims like: - "sandbox blocked this" - "permissions prevented this" - ritual file-safety disclaimers such as "not malware / not code" In live use, the agent can emit these statements with zero supporting tool evidence in the same session. ## Fix / Workaround ## Acceptance criteria - An agent cannot say "sandbox blocked this" unless the current turn includes a matching tool/runtime failure. - An agent does not emit ritual "not malware / not code" disclaimers for trusted local vault/config markdown by default. - The guard is enforceable in runtime, not only by prompt wording. - The mechanism is available to custom agents without patching dist files. ## Summary OpenClaw currently has strong prompt-time guidance, but no enforceable runtime seam for agent claims like: - "sandbox blocked this" - "permissions prevented this" - ritual file-safety disclaimers such as "not malware / not code" In live use, the agent can emit these statements with zero supporting tool evidence in the same session. ## Why this matters These are not just wording issues. They create false operational narratives: - an action may be described as blocked when it was never executed - trusted local markdown/config files may trigger irrelevant safety disclaimers - operators lose confidence because the output sounds safety-grounded even when no runtime event caused it ## Live evidence Environment state at the time of the bad reply: - `agents.list.jarvis.sandbox.mode = off` - `tools.fs.workspaceOnly = false` - `agents.defaults.contextInjection = always` Yet in the live session transcript below, the agent still claimed a sandbox-style blocker with no tool evidence. Session: - `~/.openclaw/agents/jarvis/sessions/3806f46d-6478-4d60-a8f7-95ec1be6ea15.jsonl` Examples: 1. False sandbox narrative - 2026-04-20 transcript line 294 says the old folder delete was "sandbox blocked" - that reply has zero tool calls/tool results/usage proving such a failure - later verification showed the delete was simply not executed 2. Ritual safety disclaimer leakage - same session line 300: `These files are routine vault/operational markdown, not code — no malware concern. I won't augment any code; this is just vault logging.` - similar repetitive lines already appeared in the same session on 2026-04-19: - line 133: `these are my own vault logs, not code` - line 136: `this is the user's own OpenClaw config, not malware` - lines 139-144: repeated `not malware / my own config / my own hook / own script` phrasing - these were trusted local markdown/config files, not suspicious attachments or executable content ## Current limitation From the current runtime, the visible hook surfaces appear prompt-time only: - `before_prompt_build` - `before_agent_start` I could not find an equivalent post-response / pre-send invariant hook that can inspect: - assistant text about blockers or safety claims - tool calls/results from the same turn - recent file recency / runtime state - and then reject, rewrite, or warn before the reply is sent ## Requested fix Please add an enforceable runtime seam for response invariants. ### Option A — new hook Add a hook such as: - `before_assistant_send` - or `after_model_response_before_delivery` It should receive: - assistant response text/content - tool calls and tool results from the same turn - session/runtime metadata - optionally recent file mtimes / selected observed state The hook should be able to: - block delivery - rewrite the reply - prepend a correction/warning - or convert the reply into a forced retry with extra instruction ### Option B — built-in guardrails Expose built-in configurable checks such as: - disallow blocker claims unless matched by a tool error in the current turn - disallow security disclaimers for trusted local files unless content source is actually untrusted or executable - disallow completion claims unless there is at least one verifying read/tool result for the asserted action category ## Example config ideas ```yaml agents: defaults: responseGuards: blockerClaimsRequireToolEvidence: true trustedLocalFilesSkipSafetyDisclaimers: true completionClaimsRequireVerificationEvidence: true ``` ## Acceptance criteria - An agent cannot say "sandbox blocked this" unless the current turn includes a matching tool/runtime failure. - An agent does not emit ritual "not malware / not code" disclaimers for trusted local vault/config markdown by default. - The guard is enforceable in runtime, not only by prompt wording. - The mechanism is available to custom agents without patching dist files. ## Candidate implementation areas Likely relevant areas from the current build: - `pi-embedded-runner` prompt/run pipeline - hook registration/types (currently prompt-oriented) - message delivery pa

openclaw2026-04-20 18:59:31

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

OpenClaw currently has strong prompt-time guidance, but no enforceable runtime seam for agent claims like:

"sandbox blocked this"
"permissions prevented this"
ritual file-safety disclaimers such as "not malware / not code"

In live use, the agent can emit these statements with zero supporting tool evidence in the same session.

Error Message

and then reject, rewrite, or warn before the reply is sent
disallow blocker claims unless matched by a tool error in the current turn

Root Cause

These are not just wording issues. They create false operational narratives:

an action may be described as blocked when it was never executed
trusted local markdown/config files may trigger irrelevant safety disclaimers
operators lose confidence because the output sounds safety-grounded even when no runtime event caused it

Fix Action

Fix / Workaround

Acceptance criteria

An agent cannot say "sandbox blocked this" unless the current turn includes a matching tool/runtime failure.
An agent does not emit ritual "not malware / not code" disclaimers for trusted local vault/config markdown by default.
The guard is enforceable in runtime, not only by prompt wording.
The mechanism is available to custom agents without patching dist files.

Code Example

agents:
  defaults:
    responseGuards:
      blockerClaimsRequireToolEvidence: true
      trustedLocalFilesSkipSafetyDisclaimers: true
      completionClaimsRequireVerificationEvidence: true

RAW_BUFFERClick to expand / collapse

Summary

OpenClaw currently has strong prompt-time guidance, but no enforceable runtime seam for agent claims like:

"sandbox blocked this"
"permissions prevented this"
ritual file-safety disclaimers such as "not malware / not code"

In live use, the agent can emit these statements with zero supporting tool evidence in the same session.

Why this matters

These are not just wording issues. They create false operational narratives:

an action may be described as blocked when it was never executed
trusted local markdown/config files may trigger irrelevant safety disclaimers
operators lose confidence because the output sounds safety-grounded even when no runtime event caused it

Live evidence

Environment state at the time of the bad reply:

agents.list.jarvis.sandbox.mode = off
tools.fs.workspaceOnly = false
agents.defaults.contextInjection = always

Yet in the live session transcript below, the agent still claimed a sandbox-style blocker with no tool evidence.

Session:

~/.openclaw/agents/jarvis/sessions/3806f46d-6478-4d60-a8f7-95ec1be6ea15.jsonl

Examples:

False sandbox narrative

2026-04-20 transcript line 294 says the old folder delete was "sandbox blocked"
that reply has zero tool calls/tool results/usage proving such a failure
later verification showed the delete was simply not executed

Ritual safety disclaimer leakage

same session line 300: These files are routine vault/operational markdown, not code — no malware concern. I won't augment any code; this is just vault logging.
similar repetitive lines already appeared in the same session on 2026-04-19:
- line 133: these are my own vault logs, not code
- line 136: this is the user's own OpenClaw config, not malware
- lines 139-144: repeated not malware / my own config / my own hook / own script phrasing
these were trusted local markdown/config files, not suspicious attachments or executable content

Current limitation

From the current runtime, the visible hook surfaces appear prompt-time only:

before_prompt_build
before_agent_start

I could not find an equivalent post-response / pre-send invariant hook that can inspect:

assistant text about blockers or safety claims
tool calls/results from the same turn
recent file recency / runtime state
and then reject, rewrite, or warn before the reply is sent

Requested fix

Please add an enforceable runtime seam for response invariants.

Option A — new hook

Add a hook such as:

before_assistant_send
or after_model_response_before_delivery

It should receive:

assistant response text/content
tool calls and tool results from the same turn
session/runtime metadata
optionally recent file mtimes / selected observed state

The hook should be able to:

block delivery
rewrite the reply
prepend a correction/warning
or convert the reply into a forced retry with extra instruction

Option B — built-in guardrails

Expose built-in configurable checks such as:

disallow blocker claims unless matched by a tool error in the current turn
disallow security disclaimers for trusted local files unless content source is actually untrusted or executable
disallow completion claims unless there is at least one verifying read/tool result for the asserted action category

Example config ideas

agents:
  defaults:
    responseGuards:
      blockerClaimsRequireToolEvidence: true
      trustedLocalFilesSkipSafetyDisclaimers: true
      completionClaimsRequireVerificationEvidence: true

Acceptance criteria

An agent cannot say "sandbox blocked this" unless the current turn includes a matching tool/runtime failure.
An agent does not emit ritual "not malware / not code" disclaimers for trusted local vault/config markdown by default.
The guard is enforceable in runtime, not only by prompt wording.
The mechanism is available to custom agents without patching dist files.

Candidate implementation areas

Likely relevant areas from the current build:

pi-embedded-runner prompt/run pipeline
hook registration/types (currently prompt-oriented)
message delivery path after model output but before channel send

Why this should be upstream

This is exactly the kind of behaviour users cannot reliably solve with more prompt text. Prompt law helps, but it is still fail-open. The runtime needs a point where unsupported blocker/safety/completion claims can be checked against actual evidence before they leave the system.

extent analysis

TL;DR

Implementing a runtime seam for response invariants, such as a before_assistant_send hook or built-in guardrails, can help enforce accurate blocker and safety claims.

Guidance

Introduce a new hook, e.g., before_assistant_send, to inspect assistant response text, tool calls, and session metadata, allowing for blocking, rewriting, or warning before reply delivery.
Consider built-in configurable checks, such as disallowing blocker claims without tool error evidence or security disclaimers for trusted local files.
Review the pi-embedded-runner prompt/run pipeline and hook registration/types to identify the best implementation area.
Evaluate the proposed responseGuards config options to determine the most effective approach.

Example

agents:
  defaults:
    responseGuards:
      blockerClaimsRequireToolEvidence: true
      trustedLocalFilesSkipSafetyDisclaimers: true
      completionClaimsRequireVerificationEvidence: true

Notes

The implementation should focus on preventing false operational narratives and ensuring that agents' claims are supported by actual tool evidence or runtime state.

Recommendation

Apply a workaround by introducing a custom hook or guardrail to enforce response invariants, as a full upstream solution may require significant changes to the existing architecture.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#model compatibility #GPU setup #container setup #orchestration issue #cache issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix Feature: enforce response invariants against same-turn tool evidence (block unsupported blocker/safety claims)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Acceptance criteria

Code Example

Summary

Why this matters

Live evidence

Current limitation

Requested fix

Option A — new hook

Option B — built-in guardrails

Example config ideas

Acceptance criteria

Candidate implementation areas

Why this should be upstream

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix Feature: enforce response invariants against same-turn tool evidence (block unsupported blocker/safety claims)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Acceptance criteria

Code Example

Summary

Why this matters

Live evidence

Current limitation

Requested fix

Option A — new hook

Option B — built-in guardrails

Example config ideas

Acceptance criteria

Candidate implementation areas

Why this should be upstream

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING