openclaw - ✅ (Solved) Fix: system prompt assembled differently across code paths (chat/heartbeat/announce), causing continuous Anthropic cache invalidation [1 pull request, 2 comments, 1 participant]

openclaw/openclaw#63030 · Fetched 2026-04-09 07:59:20
Comments: 2 · Participants: 1 · Timeline events: 4 (commented ×2, cross-referenced ×1, renamed ×1) · Reactions: 0

The system prompt volatile suffix is assembled in a different order depending on which code path triggers a turn — normal chat, heartbeat, and ACP announce each produce different byte sequences for the same session. Since Anthropic prompt caching requires byte-identical prefixes, every path transition causes a full cache re-write of the entire context.

This is silently hemorrhaging money for anyone using heartbeats + chat on Anthropic models. Heartbeats warm one cache key, then the first user message writes a completely different one. Every. Single. Time.

PR fix notes

PR #63096: fix(gateway): stabilize inter-session completion wake prompts

Description (problem / solution / changelog)

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: inter-session task_completion wakes sent through the gateway agent path did not rebuild the session-backed Inbound Context / Group Chat Context suffix that normal chat turns include.
  • Why it matters: the volatile system-prompt suffix changed on ACP/subagent completion notifications, which busted Anthropic prompt cache reuse and caused expensive cache rewrites.
  • What changed: src/gateway/server-methods/agent.ts now synthesizes persisted session context for inter-session completion wakes when no explicit extraSystemPrompt is provided, and preserves explicit caller-provided prompt context when it is present.
  • What did NOT change (scope boundary): no gateway protocol/schema changes, no changes to the normal inbound chat prompt path, and no changes to internal event formatting beyond using the existing session metadata to rebuild prompt context.
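
The fallback described under "What changed" can be sketched as follows. This is a minimal illustration, not the code from src/gateway/server-methods/agent.ts: the `SessionEntry` and `WakeRequest` shapes, the helper names `buildSessionContext` and `resolveExtraSystemPrompt`, and the `## ...` section headings are all assumptions made for the example.

```typescript
// Hypothetical shape of persisted session metadata (field names are assumptions).
interface SessionEntry {
  inboundContext?: string;
  groupChatContext?: string;
}

// Hypothetical, simplified view of a gateway agent request.
interface WakeRequest {
  provenanceKind: "inter_session" | "chat";
  extraSystemPrompt?: string;
}

// Rebuild the session-derived suffix sections in one fixed order
// (Group Chat Context before Inbound Context, as in the normal chat path).
function buildSessionContext(entry: SessionEntry): string | undefined {
  const sections: string[] = [];
  if (entry.groupChatContext) {
    sections.push(`## Group Chat Context\n${entry.groupChatContext}`);
  }
  if (entry.inboundContext) {
    sections.push(`## Inbound Context\n${entry.inboundContext}`);
  }
  return sections.length > 0 ? sections.join("\n\n") : undefined;
}

// Explicit caller-provided context wins; otherwise an inter-session wake
// falls back to context rebuilt from the persisted session entry.
function resolveExtraSystemPrompt(
  req: WakeRequest,
  entry: SessionEntry | undefined,
): string | undefined {
  if (req.extraSystemPrompt !== undefined) return req.extraSystemPrompt;
  if (req.provenanceKind === "inter_session" && entry) {
    return buildSessionContext(entry);
  }
  // Other paths are assumed to build their context elsewhere.
  return undefined;
}
```

The two branches correspond to the two behaviours the regression test is meant to lock in: rebuild when `extraSystemPrompt` is omitted, preserve it when it is present.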

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #63030
  • Related #43148
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: the completion-wake path entered through gateway agent with inputProvenance.kind="inter_session" and task_completion internal events, but unlike the normal chat path it did not reconstruct the session-derived extraSystemPrompt sections that contain Inbound Context and Group Chat Context.
  • Missing detection / guardrail: there was no gateway-seam regression test asserting that inter-session completion wakes preserve the same session-derived prompt context as ordinary chat turns.
  • Contributing context (if known): these sections live below the cache boundary and above ## Runtime, so missing them changes the system prompt digest even when session state and transcript history are otherwise unchanged.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-methods/agent.test.ts
  • Scenario the test should lock in: an inter-session task_completion wake for an existing session should rebuild persisted inbound/group prompt context when extraSystemPrompt is omitted, and should preserve explicit extraSystemPrompt when one is provided.
  • Why this is the smallest reliable guardrail: the bug lives at the gateway agent ingress seam, where session metadata, provenance, and internal events are combined before the run is dispatched.
  • Existing test that already covers this (if any): the test `does not create task rows for inter-session completion wakes` covers adjacent routing behavior, but not prompt-context reconstruction.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • ACP/subagent completion notifications routed through the gateway now reuse the same persisted session context shaping as normal chat turns.
  • For affected Anthropic sessions, completion wakes should stop causing unnecessary prompt-cache invalidation from missing session-derived suffix sections.
  • No config, CLI, or protocol behavior changed.

Diagram (if applicable)

Before:
[child task completes]
  -> [gateway agent wake with inter_session provenance]
  -> [internal events only]
  -> [system prompt misses inbound/group context]
  -> [systemDigest changes] -> [prompt cache bust]

After:
[child task completes]
  -> [gateway agent wake with inter_session provenance]
  -> [rebuild persisted inbound/group context from session entry]
  -> [system prompt matches normal session shaping]
  -> [stable systemDigest] -> [prompt cache reused]

## Changed files

- `src/gateway/server-methods/agent.test.ts` (modified, +203/-0)
- `src/gateway/server-methods/agent.ts` (modified, +76/-2)


Bug type

Behavior bug (incorrect output/state without crash)

Three affected code paths (verified via diagnostics.cacheTrace)

All three paths target the same session but produce different systemDigest values:

| Code path | When it fires | systemDigest (first 16 chars) | Volatile suffix order |
| --- | --- | --- | --- |
| Normal chat | User sends a message | cb8a82a10654fa98 | HEARTBEAT.md → Group Chat Context → Inbound Context → Runtime |
| Heartbeat | Every N minutes | 2d44ab1ce72b8ae0 | Different ordering of volatile sections |
| ACP announce | Background task completes | 3132b2a94c36e91d | HEARTBEAT.md → Runtime → (missing sections) |

The static prefix (tools, skills, workspace files) is byte-identical across all three — divergence starts in the volatile suffix below OPENCLAW_CACHE_BOUNDARY.
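
Because cache reuse depends on a shared prefix, it can help to measure exactly where two assembled prompts stop matching. The sketch below does that for the chat and ACP announce orderings from the table. The placeholder static prefix, the plain-text boundary marker, the newline separator, and the helper names are assumptions for illustration; the comparison counts UTF-16 code units rather than bytes, which is equivalent for ASCII-only text like this.

```typescript
const STATIC_PREFIX = "tools + skills + workspace files"; // placeholder content
const BOUNDARY = "OPENCLAW_CACHE_BOUNDARY"; // marker format is an assumption

// Join the static prefix, the boundary marker, and the volatile sections.
function assemblePrompt(volatileSections: string[]): string {
  return [STATIC_PREFIX, BOUNDARY, ...volatileSections].join("\n");
}

// Number of leading code units that two prompts have in common.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Orderings taken from the table above.
const chatPrompt = assemblePrompt([
  "HEARTBEAT.md",
  "Group Chat Context",
  "Inbound Context",
  "Runtime",
]);
const announcePrompt = assemblePrompt(["HEARTBEAT.md", "Runtime"]);

// Everything up to and including the first volatile section is shared;
// the prompts diverge as soon as the second section differs.
const sharedLength = sharedPrefixLength(chatPrompt, announcePrompt);
const divergesInVolatileSuffix =
  sharedLength > STATIC_PREFIX.length + 1 + BOUNDARY.length;
```

In this toy setup `divergesInVolatileSuffix` is true, mirroring the observation that the static prefix stays identical and only the volatile suffix differs.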

Real-world cost impact

Heartbeat cache war (the expensive one)

With heartbeats pointed at the Discord channel session (as recommended for cache warming):

Overnight (no user messages, just heartbeats):
  10:27 heartbeat  → sysDigest=2d44ab1c → cache WRITE ~110k tokens
  11:22 heartbeat  → sysDigest=2d44ab1c → cache READ (warm from last heartbeat)  
  12:17 heartbeat  → sysDigest=2d44ab1c → cache READ
  ...pattern continues, heartbeats stay warm with each other...

Morning (user sends first message):
  09:29 user chat  → sysDigest=cb8a82a1 → cache WRITE ~110k tokens (BUST — different prefix than heartbeat)
  09:31 user chat  → sysDigest=cb8a82a1 → cache READ (warm now)

Then the cycle repeats: next heartbeat busts the chat cache, next chat message busts the heartbeat cache. Every transition between heartbeat and chat is a full context re-write.

On Claude Opus 4.6 with ~110k context, each bust costs $0.69 in cache writes (110k × $6.25/MTok). With heartbeats every 55 minutes and intermittent chat, this compounds to $5-15/day per agent in pure waste.
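
The per-bust figure can be checked with a few lines of arithmetic. The token count and the $6.25/MTok cache-write price are the values quoted in the issue; the function and constant names are made up for this sketch.

```typescript
const CACHE_WRITE_USD_PER_MTOK = 6.25; // price quoted in the issue
const CONTEXT_TOKENS = 110000; // approximate context size quoted in the issue

// Cost of writing `tokens` tokens to the cache at a per-million-token price.
function cacheWriteCostUsd(tokens: number, usdPerMTok: number): number {
  return (tokens / 1_000_000) * usdPerMTok;
}

// About 0.6875 USD, which the issue rounds to $0.69.
const costPerBust = cacheWriteCostUsd(CONTEXT_TOKENS, CACHE_WRITE_USD_PER_MTOK);

// How many full cache re-writes per day a given daily waste corresponds to.
function bustsPerDay(dailyWasteUsd: number): number {
  return dailyWasteUsd / costPerBust;
}

// The $5-15/day range implies roughly 7 to 22 busts per day per agent.
const lowBusts = Math.round(bustsPerDay(5));
const highBusts = Math.round(bustsPerDay(15));
```

For comparison, a 55-minute heartbeat interval yields about 26 heartbeats per day, so the quoted range is consistent with a sizeable share of heartbeats landing next to a chat turn.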

ACP announce cache bust

Each ACP task completion notification produces yet another different system prompt, causing ~10k cache write tokens. In coding workflows with frequent Codex spawns, this adds $0.50-2.00/day per agent.

Multiplied across agents

With 4 Anthropic leadership agents (Opus), the overnight heartbeat cache war alone was burning $20-60/day in unnecessary cache writes. We had to disable heartbeats entirely as a workaround.

Steps to reproduce

  1. Configure an agent with cacheRetention: "long" on any Anthropic model
  2. Set heartbeat.session to point at the agent's Discord channel session (the session where chat happens)
  3. Enable diagnostics.cacheTrace
  4. Let a heartbeat fire, then send a chat message
  5. Compare systemDigest between the heartbeat turn and the chat turn
  6. Observe: different digests, full cache re-write on every path transition

Cache trace evidence

From /logs/cache-trace.jsonl on a real production deployment:

# Heartbeats (consistent with each other, but different from chat):
2026-04-08T10:27:37 | run=8461b3f1 | sysDigest=2d44ab1ce72b8ae0 | msgs=63-64
2026-04-08T11:22:37 | run=bad90290 | sysDigest=2d44ab1ce72b8ae0 | msgs=63-66
2026-04-08T12:17:37 | run=041ec4cd | sysDigest=2d44ab1ce72b8ae0 | msgs=63-68
2026-04-08T13:12:37 | run=cd72beab | sysDigest=2d44ab1ce72b8ae0 | msgs=63-70
2026-04-08T14:07:37 | run=97cdc066 | sysDigest=2d44ab1ce72b8ae0 | msgs=63-72

# User chat (different digest):
2026-04-08T17:24:35 | run=52cdb6bc | sysDigest=cb8a82a10654fa98 | msgs=65-110
2026-04-08T17:38:06 | run=5f76d3d0 | sysDigest=cb8a82a10654fa98 | msgs=111-128

# ACP announce (yet another different digest from earlier testing):
2026-04-08T07:49:xx | run=announce  | sysDigest=3132b2a94c36e91d | msgs=21
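
A small script can count how often the digest changes between consecutive runs in a trace like the one above. This sketch parses the pipe-delimited summary format shown here, not the raw cache-trace.jsonl schema (which is not shown in the issue); the type and function names are assumptions.

```typescript
interface TraceRow {
  timestamp: string;
  run: string;
  sysDigest: string;
}

// Parse one "timestamp | key=value | key=value" line; skip comments and blanks.
function parseTraceLine(line: string): TraceRow | undefined {
  const trimmed = line.trim();
  if (trimmed === "" || trimmed.startsWith("#")) return undefined;
  const [timestamp, ...fields] = trimmed.split("|").map((part) => part.trim());
  const values = new Map<string, string>();
  for (const field of fields) {
    const eq = field.indexOf("=");
    if (eq > 0) values.set(field.slice(0, eq), field.slice(eq + 1));
  }
  const run = values.get("run");
  const sysDigest = values.get("sysDigest");
  if (!timestamp || !run || !sysDigest) return undefined;
  return { timestamp, run, sysDigest };
}

// Count digest changes between consecutive runs, ordered by timestamp.
// Timestamps are compared as plain strings, which works for ISO-style values.
function countDigestTransitions(lines: string[]): number {
  const rows = lines
    .map(parseTraceLine)
    .filter((row): row is TraceRow => row !== undefined)
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
  let transitions = 0;
  let previousDigest: string | undefined;
  for (const row of rows) {
    if (previousDigest !== undefined && row.sysDigest !== previousDigest) {
      transitions++;
    }
    previousDigest = row.sysDigest;
  }
  return transitions;
}
```

Each counted transition corresponds to a turn whose system prompt differs from the previous turn's, which is where a full cache re-write would be expected.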

Expected behavior

All code paths for the same session should produce a byte-identical system prompt. The volatile suffix sections below OPENCLAW_CACHE_BOUNDARY must be assembled in the same deterministic order regardless of whether the turn was triggered by chat, heartbeat, or ACP announce.

Proposed fixes (any would work)

  1. Normalize volatile section ordering across all code paths — sort sections deterministically before assembly
  2. Separate volatile context into its own message block — keep the system prompt stable, put per-turn metadata in a separate developer message (as suggested in #43148)
  3. Add notifyPolicy parameter to sessions_spawn — as a workaround for ACP, let callers suppress announce notifications at spawn time (the openclaw tasks notify silent CLI exists but has a race condition since the task completes before the policy can be set)
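
Option 2 can be illustrated with a provider-neutral sketch: the system prompt carries only stable content, and per-turn metadata travels as a trailing message. The `PromptMessage` and `TurnRequest` types, the `developer` role, and `buildTurn` are hypothetical; how such a role maps onto a particular provider's API is left open here.

```typescript
type Role = "developer" | "user" | "assistant";

interface PromptMessage {
  role: Role;
  content: string;
}

interface TurnRequest {
  system: string;
  messages: PromptMessage[];
}

// Keep the system prompt stable and append volatile context as its own
// message, after the existing transcript, so earlier content is unchanged.
function buildTurn(
  stableSystem: string,
  transcript: PromptMessage[],
  volatileSections: string[],
): TurnRequest {
  const messages = [...transcript]; // copy, so the caller's array is untouched
  if (volatileSections.length > 0) {
    messages.push({ role: "developer", content: volatileSections.join("\n\n") });
  }
  return { system: stableSystem, messages };
}
```

With this layout, a heartbeat turn and a chat turn for the same session produce the same `system` string and the same leading messages, and differ only in the final volatile message.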

Current workarounds

  • Heartbeats: Disabled for all Anthropic agents (loses cache warming and liveness monitoring)
  • ACP announces: Using PTY background exec instead of sessions_spawn (loses task tracking and completion notifications)
  • Both workarounds degrade the agent experience to avoid the cost penalty

Related

  • #43148 — Reports system prompt instability across heartbeat/chat paths (same root cause, still open)

Environment

  • OpenClaw: latest (2026.4.8, 9ece252)
  • OS: macOS Darwin 25.2.0 (arm64)
  • Models: anthropic/claude-opus-4-6, anthropic/claude-sonnet-4-6
  • Config: cacheRetention: "long", heartbeat every 55m targeting Discord channel session
  • Auth: Claude MAX (OAuth token) hitting api.anthropic.com

Severity

Critical cost impact — silently wastes significant money for any Anthropic user with heartbeats enabled (the default). The longer the context and the more agents you run, the worse it gets. Most users won't notice until they check their bill.

extent analysis

TL;DR

Normalize the volatile section ordering across all code paths to ensure a byte-identical system prompt for the same session.

Guidance

  1. Identify the volatile sections: Determine which sections are causing the divergence in the system prompt across different code paths.
  2. Sort sections deterministically: Implement a sorting mechanism to ensure that the volatile sections are assembled in the same order regardless of the code path.
  3. Verify the fix: Use the diagnostics.cacheTrace to verify that the system prompt is now byte-identical across all code paths for the same session.
  4. Monitor cost impact: Track the cost savings after implementing the fix to ensure that the waste in cache writes is eliminated.

Example

The exact change depends on how the system prompt is assembled, which is not shown in the issue. The general approach is to sort the volatile sections into one fixed order before assembling the system prompt, so that every code path emits them identically.
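
A minimal sketch of deterministic ordering, under stated assumptions: the canonical order follows the normal chat path from the issue's table, while the `VolatileSection` type, the `## name` heading format, and `assembleVolatileSuffix` are invented for illustration and do not reflect OpenClaw's actual assembly code.

```typescript
// Canonical order, taken from the normal chat path in the issue's table.
const SECTION_ORDER: readonly string[] = [
  "HEARTBEAT.md",
  "Group Chat Context",
  "Inbound Context",
  "Runtime",
];

interface VolatileSection {
  name: string;
  body: string;
}

// Position in the canonical order; unknown sections sort after known ones.
function sectionRank(name: string): number {
  const index = SECTION_ORDER.indexOf(name);
  return index === -1 ? SECTION_ORDER.length : index;
}

// Sort by canonical rank, then by name, so any input order gives one output.
function assembleVolatileSuffix(sections: VolatileSection[]): string {
  return [...sections]
    .sort((a, b) => {
      const byRank = sectionRank(a.name) - sectionRank(b.name);
      if (byRank !== 0) return byRank;
      return a.name < b.name ? -1 : a.name > b.name ? 1 : 0;
    })
    .map((section) => `## ${section.name}\n${section.body}`)
    .join("\n\n");
}
```

Sorting only removes ordering differences. A path that omits sections entirely, as the PR notes describe for inter-session completion wakes, still needs those sections rebuilt before the outputs can match.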

Notes

The proposed fixes, such as normalizing volatile section ordering or separating volatile context into its own message block, should be explored and tested to determine the most effective solution.

Recommendation

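Normalize volatile section ordering across all code paths. This is a fix rather than a workaround: it targets the divergence described in the issue and should eliminate the wasted cache writes. For ACP/subagent completion wakes, PR #63096 takes a complementary route and rebuilds the session-derived Inbound Context / Group Chat Context sections that the wake path was omitting.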
