openclaw - ✅(Solved) Fix Heartbeat runs defeat their own cache-keep-alive purpose via tool-set and system-prompt divergence [1 pull requests, 2 comments, 2 participants]

brianthinks · 2026-04-23T01:01:46Z



GitHub stats

openclaw/openclaw#70417 • Fetched 2026-04-23 07:25:02

Author: brianthinks
Participants: brianthinks, MaiHHConnect
Timeline (top): commented ×2, cross-referenced ×2

When a heartbeat fires on an agent that is also handling real user conversation turns (e.g. a Telegram-bound agent), the heartbeat's request and the conversation's request form two separate prompt-cache chains on Anthropic's side. The heartbeat therefore fails to refresh the conversation chain's TTL, which is the entire point of running heartbeats with cacheRetention: "long".

Two distinct divergences cause this, captured and measured on a production instance:

1. Tools array differs: heartbeat runs strip 4 tools (gateway, cron, nodes, whatsapp_login) from the request's tools array.
2. System prompt differs: heartbeat runs render the Runtime: line and several channel-specific sections differently from normal runs (6+ persistent divergences).

Either divergence alone is enough to break cache alignment; together they guarantee two parallel chains that never share entries.

Error Message

Stop filtering the tools array for heartbeat runs. All tools stay in the request, which preserves the tools-prefix cache hash across run kinds. Add a runtime check at the tool-dispatch layer: if a heartbeat run attempts to invoke a tool in the "heartbeat-denied" set, reject it with a structured tool_result error (is_error: true) rather than executing it.

Root Cause

Both divergences trace to the same design decision: heartbeat runs are synthesized as a pseudo-channel (channel=heartbeat, capabilities=none) rather than inheriting the channel context of the session they fire within.

Tools array divergence happens because heartbeat runs strip tools in group:automation + group:nodes + whatsapp_login, presumably as a safety measure against heartbeats triggering privileged operations.
System prompt divergence happens because channel=heartbeat cascades through several conditional sections (buildMessagingSection, reaction guidance, authorized senders, inbound-meta example, Runtime line).

Fix Action

Fix / Workaround

PR fix notes

PR #70602: fix(heartbeat): keep full tool array during heartbeat runs

Repository: openclaw/openclaw
Author: chinar-amrutkar
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/70602

Description (problem / solution / changelog)

Summary

Fixes heartbeat runs defeating their own cache-keep-alive purpose by restoring the tools-prefix cache alignment with conversation turns.

Problem

Heartbeat runs (where senderIsOwner=false) were filtering out owner-restricted tools (gateway, cron, nodes) entirely from the tools array, causing the tools-prefix hash to differ from conversation runs. This broke Anthropic's prompt cache — every heartbeat paid full price for a new cache chain instead of refreshing the conversation's TTL.

Solution

Instead of removing owner-only tools during heartbeat, keep them in the tool list with runtime guards that throw if invoked. This preserves the tools-prefix cache hash while maintaining the safety property that heartbeat runs cannot actually use privileged tools.

Changes

src/agents/tool-policy.ts — Added optional keepOwnerTools flag to applyOwnerOnlyToolPolicy. When true, owner-only tools are kept (wrapped with guard) instead of being removed.
src/agents/pi-embedded-runner/effective-tool-policy.ts — Detect heartbeat runs via bootstrapContextRunKind === "heartbeat" and pass keepOwnerTools: true.
src/agents/pi-embedded-runner/run/attempt.ts — Forward bootstrapContextRunKind to applyFinalEffectiveToolPolicy.
src/agents/tool-policy.test.ts — Added regression test for keepOwnerTools behavior.
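
The keep-and-guard behavior described in the changes above could look roughly like the following. This is a minimal sketch, not the PR's code: the call shape `applyOwnerOnlyToolPolicy(tools, senderIsOwner, { keepOwnerTools })` comes from the PR notes, while the `AgentTool` type, the `ownerOnly` flag, and `wrapWithOwnerGuard` are hypothetical names introduced for illustration.

```typescript
// Hypothetical, simplified tool shape; the real openclaw types are not shown in the source.
interface AgentTool {
  name: string;
  ownerOnly?: boolean;
  execute: (args: unknown) => Promise<string>;
}

interface OwnerOnlyPolicyOptions {
  // When true, owner-only tools stay in the list but are wrapped with a guard.
  keepOwnerTools?: boolean;
}

// Swap execute() for a function that always rejects. Name and flags are untouched,
// so anything serialized from them into the request stays the same.
function wrapWithOwnerGuard(tool: AgentTool): AgentTool {
  return {
    ...tool,
    execute: async () => {
      throw new Error(`Tool "${tool.name}" is restricted to owner senders.`);
    },
  };
}

function applyOwnerOnlyToolPolicy(
  tools: AgentTool[],
  senderIsOwner: boolean,
  options: OwnerOnlyPolicyOptions = {},
): AgentTool[] {
  if (senderIsOwner) return tools;
  if (options.keepOwnerTools) {
    // Heartbeat path: same tool list, privileged tools made inert.
    return tools.map((t) => (t.ownerOnly ? wrapWithOwnerGuard(t) : t));
  }
  // Default non-owner path: privileged tools are dropped, as before.
  return tools.filter((t) => !t.ownerOnly);
}

const tools: AgentTool[] = [
  { name: "exec", execute: async () => "ok" },
  { name: "cron", ownerOnly: true, execute: async () => "scheduled" },
];

const kept = applyOwnerOnlyToolPolicy(tools, false, { keepOwnerTools: true });
const removed = applyOwnerOnlyToolPolicy(tools, false);
console.log(kept.map((t) => t.name).join(", "));    // exec, cron
console.log(removed.map((t) => t.name).join(", ")); // exec
```

The point of the wrapper is that the list length and tool names stay stable across run kinds, while the privileged call path is still closed.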

Testing

applyOwnerOnlyToolPolicy(tools, false, { keepOwnerTools: true }) → owner tools are kept (wrapped), not removed
applyOwnerOnlyToolPolicy(tools, false) → unchanged (owner tools removed for non-owner, as before)

Related

Closes #70417
Related: #70418 (orthogonal cache-warmer proposal, independent of heartbeats)

Changed files

src/agents/pi-embedded-runner/effective-tool-policy.ts (modified, +2/-0)
src/agents/pi-embedded-runner/run/attempt.ts (modified, +1/-0)
src/agents/tool-policy.test.ts (modified, +13/-0)
src/agents/tool-policy.ts (modified, +15/-5)

Heartbeat runs defeat their own cache-keep-alive purpose via tool-set and system-prompt divergence

Summary

Two distinct divergences cause this, captured and measured on a production instance:

Tools array differs — heartbeat runs strip 4 tools (gateway, cron, nodes, whatsapp_login) from the request's tools array.
System prompt differs — heartbeat runs render the Runtime: line and several channel-specific sections differently from normal runs (6+ persistent divergences).

Either divergence alone is enough to break cache alignment; together they guarantee two parallel chains that never share entries.

Evidence

Captured via mitmproxy on api.anthropic.com/v1/messages over ~4 days, multiple agents, Opus 4.7 with cacheRetention: "long".

Divergence 1 — tools array

Same agent, same day, same session:

| Request kind | `tools` array length | Missing vs other |
|---|---|---|
| Conversation turn (last user msg from Telegram) | 29 tools | (none) |
| Heartbeat turn (last user msg starts with `"Read HEARTBEAT.md..."`) | 25 tools | drops `gateway`, `cron`, `nodes`, `whatsapp_login` |

Also observed: even shared tools have different definitions between the two modes. Example: exec tool definition is ~415 tokens in conversation runs, ~392 tokens in heartbeat runs. Something in the tool description/schema embeds runtime context (channel, elevated state, capabilities) that varies per run kind, so the same-named tool hashes differently.
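
Both observations above (missing tools, and same-named tools with different definitions) can be checked mechanically on two captured request bodies. The sketch below assumes the `tools` array has already been parsed out of each request; `ToolDef`, `diffTools`, and the sample data are illustrative names and values, not part of openclaw.

```typescript
// Minimal shape of one entry in a request's `tools` array.
interface ToolDef {
  name: string;
  description?: string;
  input_schema: unknown;
}

interface ToolDiff {
  onlyInA: string[];
  onlyInB: string[];
  changed: string[]; // same name, different serialized definition
}

// Map tool name -> serialized definition. JSON.stringify keeps key order as parsed,
// so a difference here reflects a difference in the captured payload.
function indexByName(tools: ToolDef[]): Record<string, string> {
  const out: Record<string, string> = {};
  for (const t of tools) out[t.name] = JSON.stringify(t);
  return out;
}

function diffTools(a: ToolDef[], b: ToolDef[]): ToolDiff {
  const mapA = indexByName(a);
  const mapB = indexByName(b);
  const namesA = Object.keys(mapA);
  const namesB = Object.keys(mapB);
  return {
    onlyInA: namesA.filter((n) => !(n in mapB)),
    onlyInB: namesB.filter((n) => !(n in mapA)),
    changed: namesA.filter((n) => n in mapB && mapA[n] !== mapB[n]),
  };
}

// Illustrative data only: three conversation tools, one heartbeat tool with drifted text.
const conversationTools: ToolDef[] = [
  { name: "exec", description: "Run a command (channel=telegram)", input_schema: {} },
  { name: "gateway", input_schema: {} },
  { name: "cron", input_schema: {} },
];
const heartbeatTools: ToolDef[] = [
  { name: "exec", description: "Run a command (channel=heartbeat)", input_schema: {} },
];

const diff = diffTools(conversationTools, heartbeatTools);
console.log("only in conversation:", diff.onlyInA.join(", "));
console.log("changed definitions:", diff.changed.join(", "));
```

An empty diff in all three fields is the condition the tools prefix needs for both run kinds to share cache entries.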

Divergence 2 — system prompt

On two consecutive main runs (one user turn, one heartbeat) in the same session, the system prompt differs in 6 persistent places (excluding the one-shot post-compaction block which is timing-specific):

@@ exec approval wording — varies by channel
- "rely on native approval card/buttons when they appear and do not also send plain chat /approve instructions."
+ "include the concrete /approve command from tool output as plain chat text for the user, and do not ask for a different or rotated code."

@@ Authorized Senders section — present in conversation, absent in heartbeat
- ## Authorized Senders
- Authorized senders: <redacted>. These senders are allowlisted; do not assume they are the owner.

@@ Reactions section — Telegram-specific, absent in heartbeat
- ## Reactions
- Reactions are enabled for Telegram in MINIMAL mode...

@@ Inline buttons — different text per mode
- Inline buttons supported. Use `action=send` with `buttons=[[{text,callback_data,style?}]]`...
+ Inline buttons not enabled for heartbeat. If you need them, ask to set heartbeat.capabilities.inlineButtons...

@@ inbound_meta schema example — channel field varies
-  "account_id": "default", "channel": "telegram", "provider": "telegram", "surface": "telegram",
+  "channel": "heartbeat", "provider": "heartbeat",

@@ Runtime line
- Runtime: agent=main | ... channel=telegram | capabilities=inlinebuttons | thinking=high
+ Runtime: agent=main | ... channel=heartbeat | capabilities=none | thinking=high

Cache-behavior consequence

Captured flow-level observations on the same agent across a day:

Heartbeat turns never observe cache_read > 0 on the conversation chain's entries (always cold-start or TTL-expired from the heartbeat chain).
Every heartbeat pays cache_creation_input_tokens for re-writing the heartbeat chain's stable prefix (~95K–285K tokens depending on point in session).
Conversely, conversation turns never benefit from the heartbeat's cache writes — heartbeat-written entries age out unused every 55 minutes.

Direct cost: at cacheRetention: "long" pricing (2× input = $10/M on Opus 4.7), every heartbeat that could have hit the conversation's cache but didn't costs an extra ~$0.50–$3 depending on session depth. Over 24 heartbeats/day that's roughly $12–70/day of avoidable cache-write spend per agent, for no functional benefit.
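
The arithmetic behind these figures can be reproduced with a few lines. This sketch uses only numbers stated above ($10 per million tokens for long-retention cache writes, a 95K to 285K token prefix, 24 heartbeats per day); the function names are illustrative. It computes the gross write cost and ignores the much smaller cache-read charge a hit would have incurred instead, so it is an upper-end estimate of the avoidable spend.

```typescript
// From the text above: long-retention cache writes at 2x input = $10 per million tokens.
const CACHE_WRITE_USD_PER_MTOK = 10;

// Cost of re-writing a stable prefix of the given size once.
function cacheWriteCostUsd(prefixTokens: number): number {
  return (prefixTokens / 1_000_000) * CACHE_WRITE_USD_PER_MTOK;
}

// Cost per day if every heartbeat re-writes the prefix.
function dailyCostUsd(prefixTokens: number, heartbeatsPerDay: number): number {
  return cacheWriteCostUsd(prefixTokens) * heartbeatsPerDay;
}

for (const tokens of [95_000, 285_000]) {
  const perBeat = cacheWriteCostUsd(tokens).toFixed(2);
  const perDay = dailyCostUsd(tokens, 24).toFixed(2);
  console.log(`${tokens} tokens: $${perBeat} per heartbeat, $${perDay} per day`);
}
```

With these inputs the per-heartbeat cost lands at about $0.95 to $2.85 and the daily cost at about $23 to $68, which falls inside the rounded ranges quoted above.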

Root cause

Tools array divergence happens because heartbeat runs strip tools in group:automation + group:nodes + whatsapp_login, presumably as a safety measure against heartbeats triggering privileged operations.
System prompt divergence happens because channel=heartbeat cascades through several conditional sections (buildMessagingSection, reaction guidance, authorized senders, inbound-meta example, Runtime line).

Proposed fix

Two complementary changes, each independently valuable:

Fix A: Heartbeats keep the full tool array; privilege enforcement moves to runtime

{
  "type": "tool_result",
  "tool_use_id": "...",
  "content": "This tool is not available during heartbeat runs. If you need to schedule work or restart the gateway, defer until the next user turn.",
  "is_error": true
}

The model adapts naturally (typically replies HEARTBEAT_OK after a rejection instead of retrying the tool).

Bonus — add a small note to the dynamic section of the system prompt (below OPENCLAW_CACHE_BOUNDARY, so it doesn't defeat cache) listing the denied tools explicitly. Stops the model from trying in the first place.
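
A dispatch-layer check for Fix A might be sketched as follows. Everything here is an assumption made for illustration: `RunKind`, `ToolUse`, `HEARTBEAT_DENIED`, and `dispatchToolUse` are hypothetical names, and the deny set simply lists the four tools the capture showed being stripped. Only the shape and wording of the returned `tool_result` are taken from the issue.

```typescript
type RunKind = "conversation" | "heartbeat";

interface ToolUse {
  id: string;
  name: string;
  input: unknown;
}

interface ToolResult {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
}

// Hypothetical deny set: the four tools currently stripped from heartbeat requests.
const HEARTBEAT_DENIED = ["gateway", "cron", "nodes", "whatsapp_login"];

const DENIED_MESSAGE =
  "This tool is not available during heartbeat runs. If you need to schedule work " +
  "or restart the gateway, defer until the next user turn.";

// The tool stays in the request's `tools` array; the restriction is enforced here,
// at the moment the model actually tries to call it.
function dispatchToolUse(
  runKind: RunKind,
  call: ToolUse,
  run: (call: ToolUse) => string,
): ToolResult {
  if (runKind === "heartbeat" && HEARTBEAT_DENIED.indexOf(call.name) !== -1) {
    return { type: "tool_result", tool_use_id: call.id, content: DENIED_MESSAGE, is_error: true };
  }
  return { type: "tool_result", tool_use_id: call.id, content: run(call) };
}

const call: ToolUse = { id: "toolu_example", name: "cron", input: {} };
const denied = dispatchToolUse("heartbeat", call, () => "scheduled");
const allowed = dispatchToolUse("conversation", call, () => "scheduled");
console.log(denied.is_error, allowed.content);
```

Returning an error result instead of throwing keeps the turn well-formed, so the model can read the message and reply normally.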

Fix B: Heartbeats inherit the containing session's channel context

When a heartbeat fires within a session bound to a specific channel (telegram/discord/slack/etc.), build the system prompt as if it were a normal run on that channel. Runtime line, Reactions section, inline-buttons wording, inbound-meta example, and exec-approval wording all follow the session's real channel, not a synthetic heartbeat placeholder.

The only thing that distinguishes a heartbeat from a normal turn is the user message text ("Read HEARTBEAT.md...") — the rest of the prompt is identical.
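
One way to picture Fix B is a resolver that decides which channel context the prompt builder sees. The sketch below is hypothetical throughout: `ChannelContext`, `heartbeatPromptChannel`, and `renderRuntimeFragment` are invented for illustration and only mimic the `channel=... | capabilities=...` fragment of the Runtime line quoted earlier.

```typescript
interface ChannelContext {
  channel: string;
  capabilities: string[];
}

// The synthetic placeholder that heartbeat runs currently render with.
const HEARTBEAT_PLACEHOLDER: ChannelContext = { channel: "heartbeat", capabilities: [] };

// Fix B: a heartbeat inside a channel-bound session renders as that channel.
// The placeholder is used only when the session has no bound channel at all.
function heartbeatPromptChannel(sessionChannel?: ChannelContext): ChannelContext {
  return sessionChannel !== undefined ? sessionChannel : HEARTBEAT_PLACEHOLDER;
}

// Simplified stand-in for the channel-dependent part of the Runtime line.
function renderRuntimeFragment(ctx: ChannelContext): string {
  const caps = ctx.capabilities.length > 0 ? ctx.capabilities.join(",") : "none";
  return `channel=${ctx.channel} | capabilities=${caps}`;
}

const telegram: ChannelContext = { channel: "telegram", capabilities: ["inlinebuttons"] };

const conversationLine = renderRuntimeFragment(telegram);
const heartbeatLine = renderRuntimeFragment(heartbeatPromptChannel(telegram));
console.log(conversationLine === heartbeatLine); // true
```

Because every channel-conditional section would read from the same resolved context, one decision point covers the Runtime line and the other divergent sections together.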

Optional: Fix C — make shared tool schemas deterministic

Even with A and B, individual tool definitions show byte-level drift (the exec 415-vs-392 observation above). Root-cause and remove any run-kind-dependent templating from tool descriptions. Runtime state belongs in the system prompt, not in tool schemas.

Expected impact after fix

With A + B applied, a heartbeat's request prefix becomes byte-identical to the preceding conversation turn's prefix (up to the heartbeat's user message). Anthropic's 20-block lookback finds the conversation's cached entries → cache_read hits → TTL refreshes → heartbeat's original purpose (keep the conversation cache warm across idle periods) actually works.

Reproducer

Configure an openclaw agent with heartbeat.every: "55m" on Telegram (or any channel) with cacheRetention: "long".
Enable mitmproxy capture of api.anthropic.com traffic.
Drive a conversation with at least one user turn, wait for the next heartbeat.
Inspect the two requests: compare tools array by name-list and compare the system prompt text byte-for-byte.
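
For the last step, a small helper that reports where two system prompts first differ makes the comparison quicker than reading a full diff. This is a sketch under the assumption that each system prompt has been extracted as a single string; `firstDivergence` is an illustrative name, and offsets are UTF-16 code units rather than bytes (two strings are identical in one encoding exactly when they are identical in the other).

```typescript
// Returns the index of the first differing code unit, or -1 if the strings are identical.
// If one string is a strict prefix of the other, returns the shorter length.
function firstDivergence(a: string, b: string): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a.charCodeAt(i) !== b.charCodeAt(i)) return i;
  }
  return a.length === b.length ? -1 : shared;
}

// Sample strings adapted from the Runtime line divergence captured in the issue.
const conversationPrompt = "Runtime: agent=main | channel=telegram | capabilities=inlinebuttons";
const heartbeatPrompt = "Runtime: agent=main | channel=heartbeat | capabilities=none";

const at = firstDivergence(conversationPrompt, heartbeatPrompt);
if (at === -1) {
  console.log("system prompts are identical");
} else {
  console.log("shared prefix:", JSON.stringify(conversationPrompt.slice(0, at)));
  console.log("conversation continues:", JSON.stringify(conversationPrompt.slice(at, at + 20)));
  console.log("heartbeat continues:", JSON.stringify(heartbeatPrompt.slice(at, at + 20)));
}
```

Everything before the reported offset is the prefix the two requests can share; after Fixes A and B, the helper should return -1 for the system prompt.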

Environment

openclaw 2026.4.20-beta.2
Claude Opus 4.7
cacheRetention: "long" (1h TTL)
Telegram channel binding

Scope of this issue

This issue is for the problem description and design alignment. Happy to follow up with two separate PRs (Fix A and Fix B) once the direction is agreed. Also happy to simplify to just Fix A if that's preferred as a first step — it alone recovers the tools-prefix cache and is the least architecturally invasive change.

Reported with diagnostic assistance from Claude.

extent analysis

TL;DR

Heartbeat runs currently defeat their own cache-keep-alive purpose. The proposed remedy is two complementary changes: (A) stop filtering the tools array for heartbeat runs and enforce privileges with a runtime check instead, and (B) make heartbeats inherit the containing session's channel context.

Guidance

Identify and address the two divergences causing the issue: tools array difference and system prompt difference.
Implement Fix A: stop filtering the tools array for heartbeat runs and add a runtime check for privilege enforcement.
Implement Fix B: make heartbeats inherit the containing session's channel context.
Verify the fix by inspecting the requests and comparing the tools array and system prompt text byte-for-byte.

Example

{
  "type": "tool_result",
  "tool_use_id": "...",
  "content": "This tool is not available during heartbeat runs. If you need to schedule work or restart the gateway, defer until the next user turn.",
  "is_error": true
}

This example shows the structured tool_result error that can be used to reject tool invocations during heartbeat runs.

Notes

The proposed fixes (Fix A and Fix B) are complementary and can be applied independently. Fix A alone can recover the tools-prefix cache, while Fix B ensures that heartbeats inherit the containing session's channel context.

Recommendation

Apply both Fix A and Fix B to fully address the issue and ensure that heartbeats keep the conversation cache warm across idle periods. This will help reduce avoidable cache-write spend per agent and improve the overall efficiency of the system.
