openclaw - ✅(Solved) Fix heartbeat: isolatedSession: true silently reuses the same transcript file across every run [4 pull requests, 1 participant]

GitHub stats
openclaw/openclaw#64795 (fetched 2026-04-12 13:26:43)
Comments: 0, Participants: 1, Timeline: 5, Reactions: 0
Timeline (top): cross-referenced ×4, referenced ×1

agents.defaults.heartbeat.isolatedSession: true is documented as producing a fresh session (with a new sessionId and an empty transcript) on every heartbeat run, but in practice it only rolls the sessionId in the store entry — the persisted sessionFile path is preserved via a spread, so every run keeps appending to the same physical transcript file forever. Over time the file accumulates the full history of every prior heartbeat, and each new run sees all of it in its in-context window, which is the exact opposite of isolation.

This is provable from the code alone without any production evidence.

Root Cause

Because the entry always has a sessionFile after the first-ever run, the if (candidate) branch in resolveSessionFilePath is taken and the function never falls through to computing a fresh path from sessionId.
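The branch behavior described above can be illustrated with a minimal sketch. It is a simplified stand-in, not the real resolver: resolveTranscriptPath, its parameters, and the path layout are assumptions made for this example.

```typescript
// Hypothetical, simplified model of the path resolver described above.
// The real resolveSessionFilePath has a different signature and more checks.
interface SessionEntry {
  sessionId: string;
  sessionFile?: string;
}

function resolveTranscriptPath(
  sessionsDir: string,
  sessionId: string,
  entry?: SessionEntry,
): string {
  // A persisted path, when present, wins over anything derived from sessionId.
  const candidate = entry?.sessionFile?.trim();
  if (candidate) {
    return candidate;
  }
  // Only reached when no path was persisted: derive a fresh one.
  return `${sessionsDir}/${sessionId}.jsonl`;
}

// Run 1: nothing persisted yet, so the path comes from the sessionId.
const firstPath = resolveTranscriptPath("/sessions", "uuid-A");

// Run 2: sessionId rolled to uuid-B, but the entry still carries run 1's path.
const secondPath = resolveTranscriptPath("/sessions", "uuid-B", {
  sessionId: "uuid-B",
  sessionFile: firstPath,
});

// Same run with the stale path cleared: the resolver falls through.
const clearedPath = resolveTranscriptPath("/sessions", "uuid-B", {
  sessionId: "uuid-B",
});
```

In this model secondPath equals firstPath even though the sessionId changed, which is the reuse the issue reports; clearedPath is derived from uuid-B.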

Fix Action

Fix / Workaround

  1. isolatedSession: true silently does nothing after the first run. Every existing deployment that relies on it has been running without isolation.
  2. lightContext: true's documented token savings are optimistic by ~20x. The ~100K → ~2-5K figure only holds on the first run; every subsequent run incurs the full accumulated transcript.
  3. Model behavior drifts toward the reinforced pattern. Any acknowledgment, summary, or tool-call sequence from an early run gets few-shot-learned by later runs and becomes sticky — even after config or prompt changes intended to alter the behavior.
  4. Context-overflow cliff. On a long-running deployment the file eventually exceeds the model's context window. Compaction on a transcript that is mostly tool-call noise hits the "compaction-safeguard: no real conversation messages to summarize" path and writes only a boundary marker, giving little real relief.
  5. No user-visible workaround. There is no chat/CLI command that resets a non-current session, so affected users have to rm the file by hand from the pod's filesystem.

PR fix notes

PR #64797: fix(agents): clear sessionFile when rolling a fresh isolated session

Description (problem / solution / changelog)

Summary

  • Problem: resolveCronSession preserves the existing entry's sessionFile via ...entry whenever a new sessionId is generated (forceNew or stale). resolveSessionFilePath prefers entry.sessionFile over sessionId when deciding where to write, so every new session keeps appending to the same physical transcript file forever.
  • Why it matters: For heartbeats configured with isolatedSession: true (which is how the docs describe the "fresh session per run" behavior), the transcript file grew unbounded across every run, poisoning each new run with the in-context history of all prior runs — the exact opposite of isolation. lightContext: true's documented ~100K→~2–5K token savings silently regressed as the file grew.
  • What changed: One-line addition in src/cron/isolated-agent/session.ts: sessionFile: undefined in the existing isNewSession cleanup block (next to lastChannel, lastTo, lastThreadId, deliveryContext). The resolver now falls through to resolveSessionTranscriptPathInDir(sessionId, …) and produces a new file named after the new sessionId.
  • What did NOT change: Only the isNewSession cleanup block is touched. Non-forceNew reuse of fresh sessions continues to preserve sessionFile as before. Delivery routing clears, the ...entry spread of overrides, and every other field stays intact.
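A rough sketch of the cleanup block may help here. It assumes a much smaller entry shape than the real SessionEntry; resolveSessionSketch, ResolveOptions, and the injected newSessionId are hypothetical names used only to keep the example deterministic.

```typescript
// Simplified stand-in for the session entry; the real type has more fields.
interface SessionEntry {
  sessionId: string;
  sessionFile?: string;
  lastChannel?: string;
  lastTo?: string;
}

interface ResolveOptions {
  forceNew: boolean;
  isFresh: boolean; // stand-in for the freshness evaluation
  newSessionId: string; // injected instead of generating a UUID
}

function resolveSessionSketch(prev: SessionEntry | undefined, opts: ResolveOptions) {
  const isNewSession = opts.forceNew || !prev || !opts.isFresh;
  const sessionId = isNewSession || !prev ? opts.newSessionId : prev.sessionId;
  const sessionEntry: SessionEntry = {
    ...prev, // carries every old field forward, including sessionFile
    sessionId,
    // Cleanup block: applied only when a new session is rolled.
    ...(isNewSession
      ? { lastChannel: undefined, lastTo: undefined, sessionFile: undefined }
      : {}),
  };
  return { isNewSession, sessionEntry };
}

const stored: SessionEntry = {
  sessionId: "uuid-A",
  sessionFile: "/sessions/uuid-A.jsonl",
  lastChannel: "telegram",
};

const forced = resolveSessionSketch(stored, { forceNew: true, isFresh: true, newSessionId: "uuid-B" });
const stale = resolveSessionSketch(stored, { forceNew: false, isFresh: false, newSessionId: "uuid-C" });
const reused = resolveSessionSketch(stored, { forceNew: false, isFresh: true, newSessionId: "uuid-D" });
```

The three calls mirror the forceNew, stale, and fresh-reuse paths: only the reuse path keeps the stored sessionFile. Dropping sessionFile: undefined from the cleanup object reproduces the pre-fix behavior in this model.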

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration
  • Memory / storage

Linked Issue/PR

  • Closes #64795
  • Related #64196
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: The isNewSession cleanup block in resolveCronSession clears delivery-routing fields but not sessionFile. Since the returned entry is built by spreading the old entry first, the stale sessionFile is always carried forward, and downstream resolveSessionFilePath / resolveAndPersistSessionFile both prefer a persisted sessionFile over recomputing from sessionId. Net effect: the logical session rolls but the physical transcript file never rotates, defeating isolatedSession: true.
  • Missing detection / guardrail: No regression test existed for rotating sessionFile when forceNew or stale reset generates a new sessionId. Existing tests cover delivery-routing clears (lastChannel, deliveryContext) but not filesystem path isolation.
  • Contributing context: sessionFile was (correctly) added to the SessionEntry persistence layer to keep the transcript path stable across reuse of fresh sessions. The isNewSession branch inherited the same preservation semantics without carving out forceNew/stale rotation.

Regression Test Plan

  • Coverage level: Unit test (pure, no filesystem)
  • Target test file: src/cron/isolated-agent/session.test.ts
  • Scenarios the test locks in:
    1. clears sessionFile when forceNew is true — asserts result.sessionEntry.sessionFile is undefined after forceNew: true against an entry with a populated sessionFile.
    2. clears sessionFile when session is stale — same assertion on the stale-freshness branch (no forceNew, but evaluateSessionFreshness returns { fresh: false }).
    3. preserves sessionFile when reusing fresh session — asserts that a reused fresh session still carries its sessionFile unchanged, so this fix doesn't regress the normal reuse path.
  • Why: these three cases cover every branch of resolveCronSession where the entry's sessionFile matters, and they're the smallest pure-unit guardrails that would have caught the original bug.

User-visible / Behavior Changes

For agents with isolatedSession: true, each heartbeat run will now correctly write to a new transcript file named after the current run's sessionId. The old frozen transcript file will be orphaned on disk and cleaned up by the session reaper on its next pass (or can be removed manually).

Diagram

Before:
[heartbeat t1] -> resolveCronSession(forceNew:true)
                  -> sessionId = uuid-A
                  -> entry = {...oldEntry, sessionFile: oldPath, sessionId: uuid-A}
                  -> transcript writer appends to oldPath

[heartbeat t2] -> resolveCronSession(forceNew:true)
                  -> sessionId = uuid-B (new!)
                  -> entry = {...t1Entry, sessionFile: oldPath, sessionId: uuid-B}
                  -> transcript writer appends to oldPath (same file!)

...many runs...
                  -> file contains all historical messages
                  -> model sees all prior HEARTBEAT replies as in-context history
                  -> few-shot-learns the pattern, reinforces it forever

After:
[heartbeat t1] -> resolveCronSession(forceNew:true)
                  -> sessionId = uuid-A
                  -> entry = {...oldEntry, sessionFile: undefined, sessionId: uuid-A}
                  -> resolver falls through to resolveSessionTranscriptPathInDir
                  -> transcript writer creates/appends uuid-A.jsonl

[heartbeat t2] -> resolveCronSession(forceNew:true)
                  -> sessionId = uuid-B
                  -> entry = {...t1Entry, sessionFile: undefined, sessionId: uuid-B}
                  -> resolver falls through again
                  -> transcript writer creates/appends uuid-B.jsonl  (fresh file!)

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

No risk.

Repro + Verification

Environment

  • OS: Linux (registry.xlab.now/clankertron:2026.4.10 on Kubernetes, Talos 1.11.2)
  • Runtime/container: Node.js via the openclaw container image (upstream base ghcr.io/openclaw/openclaw)
  • Model/provider: llamacpp-deep/qwen3.5-35b-a3b (backed by llama.cpp HTTP server, ctx_size=65536)
  • Integration/channel: Telegram heartbeat target, 15-minute interval
  • Relevant config:
    agents: {
      defaults: {
        heartbeat: {
          every: "15m",
          isolatedSession: true,
          lightContext: true,      // or false — both reproduce
          session: "main",         // default
          target: "telegram",
        }
      }
    }

Steps (without this fix)

  1. Run a heartbeat agent with isolatedSession: true for any length of time (multiple heartbeats).
  2. cat /data/agents/main/sessions/sessions.json and inspect the agent:main:main:heartbeat entry.
  3. Note that sessionId rolls on every run but sessionFile is stable.
  4. Inspect the transcript file at the path in sessionFile.
  5. Observe that grep -c '"type":"session"' <file> returns 1 — only one session record even though many runs have occurred.
  6. Observe that the file grows monotonically and accumulates the history of every heartbeat run.

Expected

Each forceNew heartbeat should produce a new transcript file (or at least a new session record in a rotated file), with no in-context history from prior runs.

Actual (before fix)

The same transcript file is reused across every run. The model sees ~100K tokens of accumulated prior-run history on every heartbeat. "Fresh session per run" is not achieved for any run past the first.

Actual (with this fix)

After the first heartbeat post-fix, the store entry's sessionFile is cleared on each forceNew run. resolveSessionFilePath falls through and allocates a new <sessionId>.jsonl file. Each run's transcript is isolated.

Evidence

Forensic data from a live deployment

sessions.json for agent:main:main:heartbeat on a live cluster (five hours after the pod restarted):

{
  "sessionId": "8f939e30-e4a3-479f-9eeb-2b21d4aaf57b",
  "sessionFile": "/data/agents/main/sessions/34db8152-a3ac-4e7a-8c4a-9a38d9525339.jsonl",
  "updatedAt": 1775908689677,
  "heartbeatIsolatedBaseSessionKey": "agent:main:main"
}

File on disk:

| metric | value |
| --- | --- |
| filename stem | 34db8152-a3ac-4e7a-8c4a-9a38d9525339 |
| transcript header session id | cbc883fc-6486-40d2-b0d1-6a102861f5df (first run ever, 2026-04-10T22:09:51Z) |
| distinct session records in file | 1 |
| lines | 433 |
| HEARTBEAT_OK ack tokens | 101 |
| oldest message | 2026-04-10T22:09:51Z |
| newest message | 2026-04-11T11:58:09Z |

Three distinct UUIDs observable for a single logical session:

  • 34db8152… — filename (set once, never rotated)
  • cbc883fc… — transcript-file session id (from the very first run)
  • 8f939e30… — current sessionId in the store (latest forceNew)

Log correlation (different sessionIds, same sessionFile)

From context-overflow diagnostics in the same deployment:

07:28  sessionId=da41fec7-997f-4a56-b5b5-a56cb3a11c28  sessionFile=…/34db8152-…jsonl
10:43  sessionId=f198e48b-84e6-4695-aabc-c4fed74d7cd1  sessionFile=…/34db8152-…jsonl
10:59  sessionKey=agent:main:main:heartbeat            sessionFile=…/34db8152-…jsonl  messages=97

4+ distinct sessionId values observed across 5 hours on the same sessionKey and the same sessionFile. The other 13 non-heartbeat session entries in the same store all have their filename stem matching their sessionId — only the heartbeat entry shows the mismatch, confirming the bug is scoped to the forceNew branch.
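The stem-versus-sessionId comparison used in this check can be sketched as a small helper. It assumes the store is a map from session key to entry and that transcripts end in .jsonl; findMismatchedKeys and transcriptStem are hypothetical names, and the second sample entry is made up for contrast.

```typescript
interface StoreEntry {
  sessionId: string;
  sessionFile?: string;
}

// Filename without directory and without the ".jsonl" extension.
function transcriptStem(filePath: string): string {
  const base = filePath.slice(filePath.lastIndexOf("/") + 1);
  return base.endsWith(".jsonl") ? base.slice(0, -".jsonl".length) : base;
}

// Keys whose persisted transcript filename does not match the current sessionId.
function findMismatchedKeys(sessionStore: { [key: string]: StoreEntry }): string[] {
  return Object.entries(sessionStore)
    .filter(([, item]) =>
      item.sessionFile !== undefined && transcriptStem(item.sessionFile) !== item.sessionId)
    .map(([key]) => key);
}

const sampleStore: { [key: string]: StoreEntry } = {
  // Values copied from the sessions.json excerpt above.
  "agent:main:main:heartbeat": {
    sessionId: "8f939e30-e4a3-479f-9eeb-2b21d4aaf57b",
    sessionFile: "/data/agents/main/sessions/34db8152-a3ac-4e7a-8c4a-9a38d9525339.jsonl",
  },
  // Hypothetical healthy entry, invented for this example.
  "agent:main:main": {
    sessionId: "00000000-0000-4000-8000-000000000001",
    sessionFile: "/data/agents/main/sessions/00000000-0000-4000-8000-000000000001.jsonl",
  },
};

const mismatched = findMismatchedKeys(sampleStore);
```

On this sample only the heartbeat key is reported, matching the pattern described for the live store.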

Provable from code alone

The bug is also provable purely by code walk:

  • resolveSessionFilePath(sessionId, entry) at src/config/sessions/paths.ts:263 prefers entry.sessionFile over the sessionId-derived path.
  • resolveAndPersistSessionFile at src/config/sessions/session-file.ts:17-27 only consults fallbackSessionFile when !baseEntry.sessionFile.
  • The return entry from resolveCronSession is built by spreading ...entry (which carries sessionFile) and the isNewSession cleanup block does not include sessionFile.
  • Therefore, after the first run of any session that ever goes through resolveCronSession, the stale sessionFile is returned on every subsequent forceNew/stale rotation.

Human Verification (required)

  • Verified scenarios:
    • Read every caller of resolveSessionFilePath and resolveAndPersistSessionFile in src/ to confirm no downstream path re-derives the file from the new sessionId after resolveCronSession returns.
    • Read the heartbeat-runner flow from resolveCronSession through saveSessionStore and the downstream transcript writer to confirm no intermediate reset exists.
    • Dumped sessions.json from a live cluster and checked all 14 entries — 13 non-heartbeat sessions have correctly matching sessionId↔filename, only the heartbeat entry has the mismatch, confirming the bug is scoped to forceNew only.
    • Verified the file header's session id (cbc883fc) differs from the store's current sessionId (8f939e30) and the filename stem (34db8152) — three distinct UUIDs for one logical session, exactly as the code predicts.
    • Verified that the same sessionFile appears in log lines with four different sessionId values over 5 hours.
    • Verified that sessionFile?: string in SessionEntry type allows the undefined assignment; TypeScript is satisfied.
  • Edge cases checked:
    • First-ever run (no existing entry) — entry is undefined, spread is a no-op, sessionFile is undefined by default, resolver computes from sessionId. Unchanged by this fix.
    • Reused fresh session (!forceNew && fresh) — the isNewSession cleanup block is skipped, sessionFile is preserved via the spread. Unchanged and covered by the new preserves sessionFile when reusing fresh session test.
    • Stale rotation (!forceNew && !fresh) — now also clears sessionFile (was broken before). New test covers this.
    • Non-heartbeat forceNew callers (webhook cron runs via the same resolveCronSession) — they get the same fix, which is also the documented behavior for sessionTarget: "isolated".
  • What I did NOT verify:
    • Actual vitest run of the new tests. This machine has no node/pnpm available, so the tests were written by hand against the existing session.test.ts conventions (same resolveWithStoredEntry helper, same mock shape). I am relying on CI to confirm the suite passes. Targeted test file: src/cron/isolated-agent/session.test.ts.
    • End-to-end live reproduction of the fix against our cluster — that requires a rebuild of the downstream clankertron image, which I have not done as part of this PR.

Compatibility / Migration

  • Backward compatible? Yes — the fix only changes behavior when isNewSession is true, which is exactly when the current behavior is wrong. No existing working path is altered.
  • Config/env changes? No
  • Migration needed? No for configuration, but operators of affected deployments will have a stale transcript file on disk that is no longer referenced. The session reaper should clean these up on its next pass. Manual cleanup is also safe (rm the orphaned file once the sessions store no longer references it).

Risks and Mitigations

  • Risk: Orphaned transcript files for existing deployments where the stale sessionFile path gets dropped from the store entry on the first post-fix run.
    • Mitigation: The files are only orphaned on disk; the session reaper (disk-budget maintenance) will reclaim them on its next pass. No user-visible regression. Operators can also manually delete the old transcript file — it has no operational value after the fix lands.
  • Risk: Any code path that still reads a session entry's sessionFile and expects it to be non-empty may see undefined on the first post-fix turn before the downstream resolveAndPersistSessionFile runs.
    • Mitigation: SessionEntry.sessionFile is already declared optional (sessionFile?: string) in src/config/sessions/types.ts. All call sites I inspected use optional chaining or explicit null-checks. No TypeScript errors surfaced from the change.

AI-assisted

  • Drafted with Claude Code (Claude Opus 4.6, 1M context)
  • Lightly tested — the tests were written but NOT run locally due to a missing Node toolchain on the authoring machine. Relying on CI for the targeted test run.
  • I understand what the code does — one-line addition to an existing conditional-clear block, plus three targeted regression tests.

🤖 Generated with Claude Code

Changed files

  • src/cron/isolated-agent/session.test.ts (modified, +58/-0)
  • src/cron/isolated-agent/session.ts (modified, +7/-0)

PR #64808: fix(agents): archive rotated heartbeat transcript on isolatedSession rotation

Description (problem / solution / changelog)

Summary

  • Problem: Even with the sessionFile clear from #64797, when an isolatedSession: true heartbeat rotates to a new session the PRIOR transcript file at the old path becomes orphaned — referenced by nothing in the session store. The only mechanism that reaps it today is enforceSessionDiskBudget in config/sessions/disk-budget.ts, which runs only when the budget is exceeded. On a deployment with a 15-minute heartbeat interval, orphaned transcripts accumulate to hundreds of MB before any cleanup happens.
  • Why it matters: Without immediate archival, operators see unbounded disk growth in the agent sessions directory even though each logical session is now correctly isolated. The file count and disk footprint still grow monotonically per heartbeat tick.
  • What changed: In src/infra/heartbeat-runner.ts, capture the prior entry's (sessionId, sessionFile) pair at the isolated-session key before the store update, feed it into a new resetSessionFiles map, and archive via archiveRemovedSessionTranscripts with reason: "reset" (rename to <file>.reset.<ts>, cleaned up later by cleanupArchivedSessionTranscripts after its retention window). Existing suffix-collapse case is split into a deletedSessionFiles map so it continues to use reason: "deleted" — the two cases have different semantics and should get different retention classes.
  • What did NOT change (scope boundary): No changes to resolveCronSession or session.ts (that's #64797). No changes to archiveRemovedSessionTranscripts, archiveSessionTranscripts, or cleanupArchivedSessionTranscripts — reusing the existing battle-tested archival path. No new retention knobs — uses the existing maintenance.resetArchiveRetentionMs for rotation archives. The runSessionKey = isolatedSessionKey; assignment and everything downstream of the saveSessionStore call is untouched.
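The capture-before-update step could look roughly like the sketch below. The map shape (prior sessionId to prior sessionFile) and the rotateAndCollect helper are assumptions for illustration; the actual archival call is left out.

```typescript
interface StoreEntry {
  sessionId: string;
  sessionFile?: string;
}
type SessionStore = { [key: string]: StoreEntry };

// Writes `next` at `key` and returns the transcripts that the rotation
// left behind, keyed by the prior sessionId.
function rotateAndCollect(
  sessionStore: SessionStore,
  key: string,
  next: StoreEntry,
  isNewSession: boolean,
): Map<string, string> {
  const resetSessionFiles = new Map<string, string>();
  // Capture BEFORE the assignment below, otherwise the old pair is lost.
  const prior: StoreEntry | undefined = sessionStore[key];
  if (isNewSession && prior?.sessionFile && prior.sessionId !== next.sessionId) {
    resetSessionFiles.set(prior.sessionId, prior.sessionFile);
  }
  sessionStore[key] = next;
  return resetSessionFiles;
}

const demoStore: SessionStore = {
  "agent:main:main:heartbeat": { sessionId: "sid-A", sessionFile: "sessions/sid-A.jsonl" },
};
const leftBehind = rotateAndCollect(
  demoStore,
  "agent:main:main:heartbeat",
  { sessionId: "sid-B" }, // sessionFile already cleared by the rotation
  true,
);
```

Here leftBehind holds the sid-A pair, which is what would be handed to the archival step with reason "reset". A first-ever run or a reused session yields an empty map.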

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration
  • Memory / storage

Linked Issue/PR

  • Closes #64795 (partially — the sessionFile-clear fix is #64797; this PR addresses the orphan-accumulation follow-on)
  • Related #64797 — this PR depends on #64797 landing first. Without the sessionFile clear, the heartbeat runner sees priorEntry.sessionFile still matching the inherited path after resolveCronSession returns, and archiving it would leave the store entry pointing at a renamed file. The dependency is purely ordering: this PR's logic is correct once #64797 is in.
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: resolveCronSession rotates to a new sessionId on every forceNew: true heartbeat run, and once #64797 lands it also rotates to a new sessionFile. The prior transcript file at the old path is then referenced by no store entry — the runtime has no code path that proactively archives it. The only cleanup today is disk-budget enforcement, which is a last-resort mechanism, not an incremental one.
  • Missing detection / guardrail: There's no test that verifies "on rotation, the prior transcript is cleaned up". Existing heartbeat tests verify session-key stability and delivery routing but not file lifecycle.
  • Contributing context: The existing archiveRemovedSessionTranscripts call in heartbeat-runner.ts:841 already handled the stale :heartbeat:heartbeat suffix-collapse case with reason: "deleted". Extending it to also archive rotation files is a natural extension of that pattern — the archival primitive is the same, only the input set and the semantics (reset vs deleted) differ.

Regression Test Plan

  • Coverage level: Unit test (sandbox-based, exercises the full heartbeat runner with a temp session store)
  • Target test file: src/infra/heartbeat-runner.isolated-key-stability.test.ts
  • New test: archives the prior transcript file as .reset when rotating to a fresh isolated session
  • Scenario the test locks in:
    1. Seed a session store with an isolatedSessionKey entry whose sessionFile points at an existing transcript on disk
    2. Run runHeartbeatOnce
    3. Assert the prior transcript file has been renamed to <id>.jsonl.reset.<ts> (no longer at the original path)
    4. Assert the store entry still exists at the same key but with a different sessionId
    5. Assert the new sessionFile (if defined) is different from the old one
  • Why: this is the smallest reliable end-to-end test that locks in the archive-on-rotation contract and catches regressions if any path in heartbeat-runner.ts bypasses the rotation archival.

User-visible / Behavior Changes

On deployments with isolatedSession: true, rotated transcript files now get immediately archived as <file>.reset.<ts> instead of sitting on disk indefinitely. The archival is reversible within the configured retention window (maintenance.resetArchiveRetentionMs). Operators who debugged prior heartbeat runs by reading the stable transcript file will instead find the archived rotations with timestamped suffixes, in chronological order.

Diagram

Before #64797:
[heartbeat t1] → transcript appended to sessions/foo.jsonl
[heartbeat t2] → same file, sessionId rolls in store but file keeps growing
[...many runs...] → one file, 100+ HEARTBEAT_OK replies poisoning each new run

After #64797 only:
[heartbeat t1] → sessions/sid-A.jsonl (fresh)
[heartbeat t2] → sessions/sid-B.jsonl (fresh, A is orphaned)
[heartbeat t3] → sessions/sid-C.jsonl (fresh, A+B orphaned)
[...many runs...] → N orphaned files, cleaned up eventually by disk-budget sweeper

After this PR (stacked on #64797):
[heartbeat t1] → sessions/sid-A.jsonl
[heartbeat t2] → sessions/sid-B.jsonl + sid-A.jsonl.reset.<t2-ts>
[heartbeat t3] → sessions/sid-C.jsonl + sid-B.jsonl.reset.<t3-ts>
                 (sid-A archive eventually cleaned by cleanupArchivedSessionTranscripts
                  after resetArchiveRetentionMs)

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

No risk. The archival path is the same one already used by the suffix-collapse case and by the cron reaper — the change here is only when it runs and what input it's given. restrictToStoreDir: true is preserved.

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node.js via the openclaw container image
  • Model/provider: any
  • Integration/channel: any heartbeat target
  • Relevant config:
    agents: {
      defaults: {
        heartbeat: {
          every: "15m",
          isolatedSession: true,
        }
      }
    }

Steps (after #64797 lands, without this PR)

  1. Run a heartbeat agent with isolatedSession: true for any number of heartbeats (>2).
  2. ls /data/agents/main/sessions/*.jsonl — observe a growing number of orphaned transcript files, one per prior heartbeat run, none referenced by the sessions.json store.
  3. Wait for enforceSessionDiskBudget, which runs only once the budget is exceeded. Depending on maxDiskBytes, that can take hours, days, or weeks.

Expected

Each rotation should immediately archive the prior transcript so disk usage stays bounded by the retention window, not by the disk budget.

Actual (without this PR)

Orphaned files accumulate. On a 15-minute interval deployment that is ~100 orphaned files per day (4 runs/hour × 24 hours = 96), each ranging from hundreds of bytes to kilobytes.

Actual (with this PR)

Each rotation immediately renames the prior file to <file>.reset.<ts>. After maintenance.resetArchiveRetentionMs elapses, the archive is cleaned up by cleanupArchivedSessionTranscripts on the next maintenance sweep.
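The rename-then-expire lifecycle can be sketched as pure string and time logic. The real format of <ts> is not shown here, so epoch milliseconds are an assumption, and toResetArchiveName, archivedAtMs, and isArchiveExpired are hypothetical helpers rather than the project's functions.

```typescript
const RESET_MARKER = ".reset.";

// Assumed naming: <file>.reset.<epoch-ms>. The real <ts> format may differ.
function toResetArchiveName(transcriptPath: string, nowMs: number): string {
  return `${transcriptPath}${RESET_MARKER}${nowMs}`;
}

// Extracts the archive timestamp, or undefined for names without the suffix.
function archivedAtMs(archivePath: string): number | undefined {
  const idx = archivePath.lastIndexOf(RESET_MARKER);
  if (idx < 0) return undefined;
  const raw = archivePath.slice(idx + RESET_MARKER.length);
  if (!/^\d+$/.test(raw)) return undefined;
  return Number(raw);
}

// True once the archive is older than the retention window.
function isArchiveExpired(archivePath: string, nowMs: number, retentionMs: number): boolean {
  const at = archivedAtMs(archivePath);
  return at !== undefined && nowMs - at > retentionMs;
}

const archived = toResetArchiveName("sessions/sid-A.jsonl", 1000);
const stillRetained = isArchiveExpired(archived, 1005, 10);
const pastRetention = isArchiveExpired(archived, 1011, 10);
// A transcript that was never renamed is not considered by this check at all.
const plainFile = isArchiveExpired("sessions/sid-A.jsonl", 999999, 10);
```

In this model a file only becomes eligible for retention-based cleanup once it carries the suffix, which is why renaming at rotation time bounds disk usage by the retention window.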

Evidence

  • New test archives the prior transcript file as .reset when rotating to a fresh isolated session asserts the end-to-end behavior: prior transcript gone from its original path, archived file present at <id>.jsonl.reset.<ts>, store entry rotated to a new sessionId.
  • Existing test coverage (heartbeat-runner.isolated-key-stability.test.ts) verifies the suffix-collapse case still works with its own deletedSessionFiles map and reason: "deleted".

Human Verification (required)

  • Verified scenarios:
    • Traced cronSession.store[isolatedSessionKey] before the store[isolatedSessionKey] = cronSession.sessionEntry assignment. At that point, the old entry with old sessionId and old sessionFile is still there, so priorEntryAtKey captures exactly the values we want to archive.
    • Verified that referencedSessionIds (computed from Object.values(cronSession.store) AFTER the new entry is assigned) contains the NEW sessionId but not the OLD one, so archiveRemovedSessionTranscripts will not skip the archival for the old file.
    • Confirmed that the existing suffix-collapse case continues to use reason: "deleted" and is gated on staleIsolatedSessionKey being set, completely independent of the new rotation logic.
    • Read archiveSessionTranscriptsDetailed to confirm restrictToStoreDir: true constrains archival to the agent sessions dir via path.relative containment check — no risk of touching unrelated files.
  • Edge cases checked:
    • First-ever heartbeat run (no prior entry): priorEntryAtKey is undefined, no addition to resetSessionFiles, nothing to archive. Works.
    • Prior entry exists but has no sessionFile set: conditional priorEntryAtKey.sessionFile check skips the add. Nothing archived. Works.
    • Both staleIsolatedSessionKey AND isNewSession trigger on the same run: both maps get populated, both archival calls run, each uses its own reason. The two maps are disjoint by construction (different sessionIds).
    • Reused fresh session (isNewSession === false): the new conditional is skipped, no rotation archival happens, sessionFile continues to be used by the same entry. Works (the reuse path doesn't rotate the transcript).
  • What I did NOT verify:
    • Actual pnpm vitest run of the new test. This machine has no Node toolchain available, so I wrote the test by hand against the existing withTempHeartbeatSandbox convention and trust the harness. Targeted test file: src/infra/heartbeat-runner.isolated-key-stability.test.ts.
    • End-to-end verification on a live cluster. The cluster behavior can be observed after both PRs land and the image is rebuilt.

Compatibility / Migration

  • Backward compatible? Yes — the new behavior only kicks in when isNewSession === true, which already existed but had no archival. No existing working path is altered.
  • Config/env changes? No — reuses existing maintenance.resetArchiveRetentionMs / maintenance.pruneAfterMs knobs.
  • Migration needed? For existing deployments, orphaned transcript files from before both PRs land will still sit on disk until the first maintenance sweep after the upgrade. They're safe to rm manually if disk pressure is urgent. Everything from the first post-upgrade heartbeat onwards rotates cleanly.

Risks and Mitigations

  • Risk: If #64797 does not land before this PR, priorEntryAtKey.sessionFile is still inherited by the new entry via ...entry spread, and archiveRemovedSessionTranscripts would rename a file that the new entry is about to write to. On the next write, the store entry's sessionFile would point at a file that no longer exists at that path.
    • Mitigation: This PR is explicitly documented as depending on #64797. Do not merge this PR before #64797. The branch is stacked on top of the #64797 branch for exactly this reason.
  • Risk: reason: "reset" uses maintenance.resetArchiveRetentionMs for cleanup, which may be shorter than pruneAfterMs. Operators who relied on longer retention for debugging may find archives gone sooner.
    • Mitigation: "reset" is semantically correct for rotation (the entry persists, only the transcript rolls). Operators who want longer retention can tune maintenance.resetArchiveRetentionMs directly — it's an existing knob with existing docs.

AI-assisted

  • Drafted with Claude Code (Claude Opus 4.6, 1M context)
  • Lightly tested — the test was written but NOT run locally due to a missing Node toolchain on the authoring machine. Relying on CI for the targeted test run.
  • I understand what the code does — ~15 lines of additions to heartbeat-runner.ts plus a ~55-line end-to-end test. The archival primitive is unchanged; only the input set and the reason classification are new.

🤖 Generated with Claude Code

Changed files

  • src/cron/isolated-agent/session.test.ts (modified, +58/-0)
  • src/cron/isolated-agent/session.ts (modified, +7/-0)
  • src/infra/heartbeat-runner.isolated-key-stability.test.ts (modified, +73/-0)
  • src/infra/heartbeat-runner.ts (modified, +52/-12)

PR #64832: fix(agents): archive orphaned isolated-session transcripts after rotation

Description (problem / solution / changelog)

Summary

Complement to #65203. That PR cleared sessionFile on rotation so each isolatedSession: true run writes to a fresh path — but the prior transcript file at the old path is now orphaned: nothing in the store references it, and cleanupArchivedSessionTranscripts only scans for the .reset.<ts> suffix. Orphans accumulate forever.

This PR renames each prior transcript to <file>.reset.<ts> as part of the same rotation transaction, so it re-enters the retention window.

Change

Two helpers in src/cron/isolated-agent/session.ts:

  • capturePriorIsolatedEntryForArchival — snapshot prior (sessionId, sessionFile) before persist.
  • archivePriorIsolatedEntryAfterRotation — rename to <file>.reset.<ts> after persist, with reason: "reset" (honors maintenance.resetArchiveRetentionMs). Uses the existing archiveRemovedSessionTranscripts primitive.

Wired into both forceNew: true call sites:

  • src/infra/heartbeat-runner.ts — heartbeat rotation path. Also splits the existing :heartbeat:heartbeat suffix-collapse archival into its own call with reason: "deleted" so the two archival paths get their correct retention classes (pruneAfterMs vs resetArchiveRetentionMs).
  • src/cron/isolated-agent/run-session-state.ts — cron createPersistCronSessionEntry closure with a once-flag (persist is called multiple times per run: pre-run, skills refresh, finalize). Covers the cron: prefix case (runSessionKey = ...:run:<id>), where the session-reaper only archives the run-key entry on retention — not the prior agentSessionKey transcript. Without this path the cron orphan is never archived.
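The once-flag around a persist closure that is called several times per run might be sketched as follows. This is a synchronous simplification with hypothetical names (makePersistWithOnceArchival, archivePrior); setting the flag before the archival attempt, so that a failed attempt is not retried, is an assumption.

```typescript
// Wraps a persist function so that archival of the prior transcript
// happens after the first persist only, however often persist is called.
function makePersistWithOnceArchival(
  persist: () => void,
  archivePrior: () => void,
): () => void {
  let archived = false;
  return () => {
    persist();
    if (archived) return;
    // Flag is set before the attempt so later calls never retry.
    archived = true;
    archivePrior();
  };
}

let persistCalls = 0;
let archiveCalls = 0;
const persistEntry = makePersistWithOnceArchival(
  () => { persistCalls += 1; },
  () => { archiveCalls += 1; },
);

// Pre-run, skills refresh, finalize: three persists in one run.
persistEntry();
persistEntry();
persistEntry();
```

After the three calls, persist has run three times and archival once.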

Tests lock in: capture timing (BEFORE persist), once-flag, cron-run-key path, archival failure handling, and a heartbeat-runner sandbox test that seeds a real prior transcript and asserts the rename.

Forensic evidence

xlab.now deployment running clankertron:2026.4.10 (pre-#65203) with a manual sessionFile clear as workaround: five heartbeat runs over seven hours each wrote to a fresh transcript, but five orphaned transcripts accumulated on disk (300 KB – 1.3 MB each). cleanupArchivedSessionTranscripts cannot see them — they don't carry the .reset.* suffix. At ~4 runs/hour × ~400 KB this is roughly 38 MB/day of unsweepable growth per deployment. This PR closes that gap at the rotation boundary.
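The growth estimate above is plain arithmetic; spelled out with the stated figures (and 1 MB taken as 1000 KB, an assumption about the rounding used):

```typescript
const runsPerHour = 4; // 15-minute heartbeat interval
const avgTranscriptKB = 400; // approximate size per orphaned transcript

const runsPerDay = runsPerHour * 24; // 96 runs
const growthKBPerDay = runsPerDay * avgTranscriptKB; // 38400 KB
const growthMBPerDay = growthKBPerDay / 1000; // 38.4 MB, i.e. roughly 38 MB/day
```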

Scope boundary

  • No changes to archiveRemovedSessionTranscripts, cleanupArchivedSessionTranscripts, or any archival primitive — reuses existing machinery from a new call site.
  • No new retention knobs.
  • No changes to the pure-reuse (!isNewSession) path.
  • restrictToStoreDir: true preserved throughout.

Safety

  • Archival is wrapped in try/catch at each call site; failure logs a warning but does not fail the run.
  • The referencedSessionIds safety guard inside archiveRemovedSessionTranscripts prevents archiving any sessionId still pointed at by another store entry. Callers compute this set from the post-update store.
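
The two safety properties can be illustrated together. This is a sketch under stated assumptions: `selectArchivable` and `archiveSafely` are hypothetical names, and the real guard lives inside archiveRemovedSessionTranscripts with a signature this page does not show.

```typescript
interface StoreEntry {
  sessionId: string;
  sessionFile?: string;
}

interface ArchiveCandidate {
  sessionId: string;
  sessionFile: string;
}

// Drop any candidate whose sessionId is still referenced by the POST-update store.
function selectArchivable(
  candidates: ArchiveCandidate[],
  postUpdateStore: Record<string, StoreEntry>,
): ArchiveCandidate[] {
  const referenced = new Set(Object.values(postUpdateStore).map((e) => e.sessionId));
  return candidates.filter((c) => !referenced.has(c.sessionId));
}

// Archive each safe candidate; a failure becomes a warning instead of failing the run.
function archiveSafely(
  candidates: ArchiveCandidate[],
  postUpdateStore: Record<string, StoreEntry>,
  archive: (c: ArchiveCandidate) => void,
): string[] {
  const warnings: string[] = [];
  for (const c of selectArchivable(candidates, postUpdateStore)) {
    try {
      archive(c);
    } catch (err) {
      warnings.push(`archival failed for ${c.sessionFile}: ${String(err)}`);
    }
  }
  return warnings;
}
```

Computing the referenced set from the post-update store is the important detail: computed from the pre-update store, the prior sessionId would still look referenced and would never be archived.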

Prior state of this PR

Originally scoped as the root-cause fix for sessionFile persistence. #65203 landed that root-cause fix independently on 2026-04-12 (the sessionFile: undefined line in resolveCronSession). This PR has been rebased on top of #65203 and narrowed to the archival flow only — the diff no longer contains the sessionFile: undefined change, only the orphan-cleanup machinery that neither #65203 nor any current code path provides.

🤖 Generated with Claude Code

Changed files

  • src/cron/isolated-agent/run-session-state.test.ts (added, +345/-0)
  • src/cron/isolated-agent/run-session-state.ts (modified, +42/-1)
  • src/cron/isolated-agent/run.test-harness.ts (modified, +5/-0)
  • src/cron/isolated-agent/run.ts (modified, +6/-0)
  • src/cron/isolated-agent/session.test.ts (modified, +225/-1)
  • src/cron/isolated-agent/session.ts (modified, +65/-0)
  • src/infra/heartbeat-runner.isolated-key-stability.test.ts (modified, +115/-0)
  • src/infra/heartbeat-runner.ts (modified, +42/-7)

PR #64873: fix(cron): clear sessionFile on forceNew so isolated runs don't share transcripts

Description (problem / solution / changelog)

Summary

  • Add sessionFile: undefined to the isNewSession cleanup block in resolveCronSession so that forced-new isolated runs don't inherit and keep writing to the previous transcript file.
  • Add three targeted tests in src/cron/isolated-agent/session.test.ts covering forceNew, stale-session, and fresh-reuse paths.

Fixes #64795.

Root cause

resolveCronSession in src/cron/isolated-agent/session.ts rolls a new sessionId when forceNew: true (or when the stored session is stale), but it builds the returned entry by spreading the previous entry first:

const sessionEntry: SessionEntry = {
  ...entry,                 // ← spreads sessionFile from prior entry
  sessionId,                // ← overridden with the new uuid
  updatedAt: params.nowMs,
  systemSent,
  ...(isNewSession && {
    lastChannel: undefined,
    lastTo: undefined,
    lastAccountId: undefined,
    lastThreadId: undefined,
    deliveryContext: undefined,
    // sessionFile was NOT here — it survives into the new entry
  }),
};

Downstream, resolveSessionFilePath in src/config/sessions/paths.ts:263 prefers a persisted entry.sessionFile over recomputing a fresh path from sessionId:

export function resolveSessionFilePath(
  sessionId: string,
  entry?: { sessionFile?: string },
  opts?: SessionFilePathOptions,
): string {
  const sessionsDir = resolveSessionsDir(opts);
  const candidate = entry?.sessionFile?.trim();
  if (candidate) {
    try {
      return resolvePathWithinSessionsDir(sessionsDir, candidate, { agentId: opts?.agentId });
    } catch { /* … */ }
  }
  return resolveSessionTranscriptPathInDir(sessionId, sessionsDir);
}

So the returned entry ends up with a new sessionId but the old sessionFile. Every forced-new run appends to the same physical transcript file as the previous run, indefinitely.
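
The interaction between the spread and the path resolver can be reproduced in isolation. The sketch below uses simplified stand-ins (`Entry`, `rollEntry`, `resolvePath` are illustrative names, not the repo's functions) and assumes transcripts are named `<sessionId>.jsonl`, which matches the paths shown in this report but is not confirmed here as the general rule.

```typescript
interface Entry {
  sessionId: string;
  sessionFile?: string;
  lastChannel?: string;
  updatedAt: number;
}

// Rolls to a new sessionId. With clearSessionFile = false this mirrors the buggy
// behavior: the spread copies the prior sessionFile and nothing overrides it.
function rollEntry(prev: Entry, newId: string, nowMs: number, clearSessionFile: boolean): Entry {
  return {
    ...prev, // carries sessionFile forward
    sessionId: newId,
    updatedAt: nowMs,
    lastChannel: undefined, // delivery routing is cleared on a new session
    ...(clearSessionFile && { sessionFile: undefined }),
  };
}

// Simplified resolver: a persisted sessionFile wins over a path derived from sessionId.
function resolvePath(entry: Entry, sessionsDir: string): string {
  const candidate = entry.sessionFile?.trim();
  return candidate ? candidate : `${sessionsDir}/${entry.sessionId}.jsonl`;
}
```

With a prior entry of `{ sessionId: "a", sessionFile: "/s/a.jsonl" }`, rolling to id `"b"` without the clear still resolves to `/s/a.jsonl`; with the clear it resolves to `/s/b.jsonl`.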

Impact

This defeats two documented features:

  1. agents.defaults.heartbeat.isolatedSession: true. docs/gateway/heartbeat.md promises "each heartbeat runs in a fresh session with no prior conversation history" and quotes ~100K tokens down to ~2-5K per run. Neither holds while the stale sessionFile survives: every heartbeat turn reads the full accumulated transcript on context load.

  2. Cron sessionTarget: "isolated" — isolated cron runs take the same forceNew: true path via resolveCronSession, so the same transcript pollution affects cron jobs configured for full isolation.

The silent failure mode is particularly painful because it looks like heartbeat isolatedSession is working (new sessionId, new store entry, no forceNew warnings) while the underlying transcript file continues to accumulate every prior run's history.

Fix

One field added to the existing isNewSession cleanup block, with a comment explaining the downstream interaction with resolveSessionFilePath:

     ...(isNewSession && {
       lastChannel: undefined,
       lastTo: undefined,
       lastAccountId: undefined,
       lastThreadId: undefined,
       deliveryContext: undefined,
+      sessionFile: undefined,
     }),

With sessionFile cleared on the isNewSession branch, resolveSessionFilePath falls through candidate = entry?.sessionFile?.trim() (which is now undefined) and computes the path from the new sessionId via resolveSessionTranscriptPathInDir. This is the exact intent of the existing cleanup block: strip anything that leaked from the prior session into a fresh one.

Test coverage

Added to the existing describe("session reuse for webhooks/cron") block in src/cron/isolated-agent/session.test.ts:

  • clears sessionFile when forceNew is true — covers the heartbeat isolatedSession and isolated cron paths.
  • clears sessionFile when session is stale — covers the freshness-expiry path for direct-style cron/webhook sessions.
  • preserves sessionFile when reusing a fresh session — locks in the negative: reuse must keep the transcript, because the whole point of reuse is that the transcript keeps accumulating.
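
The three cases can be sketched against a simplified resolver. This is not the test file's code: `resolveSketch` is a hypothetical reduction of resolveCronSession, the freshness rule (`maxAgeMs`) is an assumption, and the id generator is injected so the results are deterministic.

```typescript
interface CronEntry {
  sessionId: string;
  sessionFile?: string;
  updatedAt: number;
}

interface ResolveParams {
  entry?: CronEntry;
  nowMs: number;
  forceNew?: boolean;
  maxAgeMs: number; // ASSUMPTION: hypothetical freshness window
  newId: () => string; // injected so the sketch is deterministic
}

function resolveSketch(p: ResolveParams): CronEntry {
  const prev = p.entry;
  const fresh = prev !== undefined && p.nowMs - prev.updatedAt <= p.maxAgeMs;
  const isNewSession = p.forceNew === true || !fresh;
  return {
    ...prev,
    sessionId: !isNewSession && prev ? prev.sessionId : p.newId(),
    updatedAt: p.nowMs,
    ...(isNewSession && { sessionFile: undefined }), // the one-line fix
  };
}

// Runs the forceNew, stale, and fresh-reuse cases against the same prior entry.
function regressionCases(): { forced: CronEntry; stale: CronEntry; reused: CronEntry } {
  const prior: CronEntry = { sessionId: "old", sessionFile: "old.jsonl", updatedAt: 1_000 };
  const shared = { entry: prior, maxAgeMs: 60_000, newId: () => "new" };
  return {
    forced: resolveSketch({ ...shared, nowMs: 2_000, forceNew: true }),
    stale: resolveSketch({ ...shared, nowMs: 1_000_000 }),
    reused: resolveSketch({ ...shared, nowMs: 2_000 }),
  };
}
```

The first two cases should come back with a new sessionId and no sessionFile; the third should keep both the prior sessionId and the prior sessionFile.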

Verification

  • npx vitest run src/cron/isolated-agent/session.test.ts: 13/13 pass (the 3 new ones included).
  • npx vitest run src/cron/isolated-agent/session.test.ts src/config/sessions/sessions.test.ts src/config/sessions/store.lock.test.ts src/config/sessions/transcript.test.ts: 41/41 pass across the four related test files, no regressions in transcript/store code.

What this does NOT change

  • No changes to heartbeat-runner.ts or its forceNew: true call — the fix is purely in the session-roll helper so every caller benefits (isolated cron, heartbeat, webhook).
  • No changes to resolveSessionFilePath — it still honors a persisted sessionFile when present, which is the correct behavior for non-isolated session reuse.
  • No config schema or docs changes — the documented behavior is now actually what the code does.

🤖 Generated with Claude Code

Changed files

  • src/cron/isolated-agent/session.test.ts (modified, +51/-0)
  • src/cron/isolated-agent/session.ts (modified, +5/-0)


Summary

agents.defaults.heartbeat.isolatedSession: true is documented as producing a fresh session (with a new sessionId and an empty transcript) on every heartbeat run, but in practice it only rolls the sessionId in the store entry — the persisted sessionFile path is preserved via a spread, so every run keeps appending to the same physical transcript file forever. Over time the file accumulates the full history of every prior heartbeat, and each new run sees all of it in its in-context window, which is the exact opposite of isolation.

This is provable from the code alone without any production evidence.

Docs intent

Both docs/gateway/heartbeat.md and docs/gateway/configuration-reference.md describe the same behavior. Direct quotes from the repo:

  • docs/gateway/heartbeat.md:42: "isolatedSession: true, // optional: fresh session each run (no conversation history)"
  • docs/gateway/heartbeat.md:227: "isolatedSession: when true, each heartbeat runs in a fresh session with no prior conversation history. Uses the same isolation pattern as cron sessionTarget: "isolated". Dramatically reduces per-heartbeat token cost. Combine with lightContext: true for maximum savings. Delivery routing still uses the main session context."
  • docs/gateway/heartbeat.md:441: "Use isolatedSession: true to avoid sending full conversation history (~100K tokens down to ~2-5K per run)."
  • docs/gateway/configuration-reference.md:1240: "when true, each heartbeat runs in a fresh session with no prior conversation history. Same isolation pattern as cron sessionTarget: "isolated". Reduces per-heartbeat token cost from ~100K to ~2-5K tokens."

The ~100K → ~2–5K promise only makes sense if each run actually starts from an empty transcript.

Code walk — why the bug is provable without runtime evidence

Step 1: heartbeat-runner.ts passes forceNew: true unconditionally

src/infra/heartbeat-runner.ts:808-824:

if (useIsolatedSession) {
  const cronSession = resolveCronSession({
    cfg,
    sessionKey: isolatedSessionKey,
    agentId,
    nowMs: startedAt,
    forceNew: true,
  });

Where useIsolatedSession = heartbeat?.isolatedSession === true. So when our config sets isolatedSession: true, forceNew is always true.

Step 2: resolveCronSession generates a new sessionId but preserves the old sessionFile via spread

src/cron/isolated-agent/session.ts (abbreviated to the relevant branch):

if (!params.forceNew && entry?.sessionId) {
  // reuse-or-roll logic
} else {
  // No existing session or forced new
  sessionId = crypto.randomUUID();       // ✓ new id
  isNewSession = true;
  systemSent = false;
}


const sessionEntry: SessionEntry = {
  // Preserve existing per-session overrides even when rolling to a new sessionId.
  ...entry,                // ← this carries sessionFile forward
  sessionId,               // ← overridden
  updatedAt: params.nowMs,
  systemSent,
  // When starting a fresh session (forceNew / isolated), clear delivery routing…
  ...(isNewSession && {
    lastChannel: undefined,
    lastTo: undefined,
    lastAccountId: undefined,
    lastThreadId: undefined,
    deliveryContext: undefined,
    // sessionFile is intentionally NOT in this clear list
  }),
};

The isNewSession cleanup block clears delivery-routing state but not sessionFile. The returned entry has a new sessionId and the OLD sessionFile.

Step 3: resolveSessionFilePath prefers persisted sessionFile over recomputing from sessionId

src/config/sessions/paths.ts:263:

export function resolveSessionFilePath(
  sessionId: string,
  entry?: { sessionFile?: string },
  opts?: SessionFilePathOptions,
): string {
  const sessionsDir = resolveSessionsDir(opts);
  const candidate = entry?.sessionFile?.trim();
  if (candidate) {
    try {
      return resolvePathWithinSessionsDir(sessionsDir, candidate, { agentId: opts?.agentId });
    } catch {
      // Keep handlers alive when persisted metadata is stale/corrupt.
    }
  }
  return resolveSessionTranscriptPathInDir(sessionId, sessionsDir);
}

Because the entry always has a sessionFile after the first-ever run, the if (candidate) branch is taken and the function never falls through to computing a fresh path from sessionId.

Step 4: resolveAndPersistSessionFile fallback is also gated on !baseEntry.sessionFile

src/config/sessions/session-file.ts:17-27:

const baseEntry = params.sessionEntry ?? sessionStore[sessionKey] ?? { sessionId, updatedAt: Date.now() };
const fallbackSessionFile = params.fallbackSessionFile?.trim();
const entryForResolve =
  !baseEntry.sessionFile && fallbackSessionFile
    ? { ...baseEntry, sessionFile: fallbackSessionFile }
    : baseEntry;
const sessionFile = resolveSessionFilePath(sessionId, entryForResolve, {
  agentId: params.agentId,
  sessionsDir: params.sessionsDir,
});

fallbackSessionFile is only used when !baseEntry.sessionFile. Same gate — stale sessionFile wins.
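
The gate in Step 4 can be isolated into a few lines to show why a fallback never rescues a rolled session. `pickEntryForResolve` is an illustrative name for the conditional quoted above, with the entry type reduced to the two fields that matter.

```typescript
interface BaseEntry {
  sessionId: string;
  sessionFile?: string;
}

// Mirrors the quoted conditional: the fallback is applied only when the
// base entry has no sessionFile of its own.
function pickEntryForResolve(base: BaseEntry, fallbackSessionFile?: string): BaseEntry {
  const fallback = fallbackSessionFile?.trim();
  return !base.sessionFile && fallback ? { ...base, sessionFile: fallback } : base;
}
```

For a rolled entry such as `{ sessionId: "new", sessionFile: "old.jsonl" }`, passing `"new.jsonl"` as the fallback changes nothing: the stale `old.jsonl` is returned untouched.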

Step 5: heartbeat-runner.ts' own cleanup does not rotate the current file

src/infra/heartbeat-runner.ts:825-861 — the archiveRemovedSessionTranscripts call only processes files from removedSessionFiles, which is populated exclusively from staleIsolatedSessionKey (the separate :heartbeat:heartbeat suffix-collapse case). It never touches cronSession.sessionEntry.sessionFile even though that file has, from the runner's perspective, been logically "rolled". So the old transcript stays on disk and stays in the store entry.

End-to-end consequence

After the first run that ever creates the heartbeat session entry:

  • sessionId rolls on every invocation ✓
  • sessionFile is frozen forever ✗
  • The transcript writer appends every new run's messages to the same physical file
  • The transcript reader loads the whole file on every run, so the model sees all prior runs as in-context history
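
The end-to-end effect of the four points above can be simulated with an in-memory transcript store. This is a toy model, not the runner: `simulateHeartbeats` is a hypothetical function, each run writes exactly one message, and the `<sessionId>.jsonl` naming is an assumption.

```typescript
interface HbEntry {
  sessionId: string;
  sessionFile?: string;
}

// Returns, for each run, how many prior messages that run sees as history.
function simulateHeartbeats(runs: number, clearSessionFile: boolean): number[] {
  const files = new Map<string, string[]>(); // path -> transcript lines
  let entry: HbEntry | undefined;
  const historySizes: number[] = [];
  for (let i = 1; i <= runs; i++) {
    const sessionId = `run-${i}`; // forceNew: every run rolls the id
    const rolled: HbEntry = {
      ...entry,
      sessionId,
      ...(clearSessionFile && { sessionFile: undefined }),
    };
    const path = rolled.sessionFile ?? `${sessionId}.jsonl`;
    const transcript = files.get(path) ?? [];
    historySizes.push(transcript.length); // what the reader would load
    transcript.push(`heartbeat ${i}`); // what the writer appends
    files.set(path, transcript);
    entry = { sessionId, sessionFile: path }; // persisted for the next run
  }
  return historySizes;
}
```

Without the clear, four runs see 0, 1, 2, and 3 prior messages; with it, every run sees 0.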

"Fresh session per run" does not hold for any run past the first. lightContext: true's documented ~100K → ~2–5K token savings silently regresses as the file grows.

Forensic evidence from a live deployment

This was caught on a production heartbeat agent running 15-minute ticks with isolatedSession: true, lightContext: true.

sessions.json entry for agent:main:main:heartbeat:

{
  "sessionId": "f198e48b-84e6-4695-aabc-c4fed74d7cd1",
  "sessionFile": "/data/agents/main/sessions/34db8152-a3ac-4e7a-8c4a-9a38d9525339.jsonl",
  "updatedAt": 1775904199350
}

Three different UUIDs are observable across one logical session:

  • 34db8152… — the filename (set when the file was first created, never rotated)
  • cbc883fc-6486-40d2-b0d1-6a102861f5df — the session id written into the transcript header (original first run, timestamp: "2026-04-10T22:09:51.937Z")
  • f198e48b… (the sessionId currently in the store, generated by the most recent forceNew)
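
A mismatch like this can be looked for mechanically. The sketch below flags store entries whose transcript filename does not match their sessionId. It assumes transcripts are named `<sessionId>.jsonl` and that sessions.json parses to a key-to-entry map; both are inferred from the sample above rather than confirmed, and a mismatch is only a hint, since sessions that legitimately reuse a transcript may also show one.

```typescript
interface StoredSession {
  sessionId: string;
  sessionFile?: string;
}

// Extracts the id portion of a transcript path, e.g. "/a/b/<id>.jsonl" -> "<id>".
function transcriptIdFromPath(sessionFile: string): string {
  const baseName = sessionFile.split("/").pop() ?? sessionFile;
  return baseName.replace(/\.jsonl$/, "");
}

// Returns the store keys whose sessionFile points at a different id than sessionId.
function findMismatchedEntries(store: Record<string, StoredSession>): string[] {
  return Object.entries(store)
    .filter(
      ([, e]) => e.sessionFile !== undefined && transcriptIdFromPath(e.sessionFile) !== e.sessionId,
    )
    .map(([key]) => key);
}
```

Run against the entry shown above, this would report the `agent:main:main:heartbeat` key.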

File stats after ~13 hours of 15-minute heartbeats:

metric                                                  value
lines                                                   390
size                                                    ~1.05 MB
distinct session records in the file                    1
occurrences of the acknowledgment sentinel string       87
oldest transcript entry                                 2026-04-10T22:09:51Z
newest transcript entry                                 2026-04-11T10:43:04Z

The model began few-shot-learning from its own ~87 prior responses on every run. Thinking traces reference the accumulated precedents as if they were protocol ("per the protocol", "the standard response") — it is following patterns set by a polluted transcript rather than the heartbeat prompt.

Impact

  1. isolatedSession: true silently does nothing after the first run. Every existing deployment that relies on it has been running without isolation.
  2. lightContext: true's documented token savings are optimistic by ~20x. The ~100K → ~2-5K figure only holds on the first run; every subsequent run incurs the full accumulated transcript.
  3. Model behavior drifts toward the reinforced pattern. Any acknowledgment, summary, or tool-call sequence from an early run gets few-shot-learned by later runs and becomes sticky — even after config or prompt changes intended to alter the behavior.
  4. Context-overflow cliff. On a long-running deployment the file eventually exceeds the model's context window. Compaction on a transcript that is mostly tool-call noise fires "compaction-safeguard: no real conversation messages to summarize" and only writes a boundary marker, giving little real relief.
  5. No user-visible workaround. There is no chat/CLI command that resets a non-current session, so affected users have to rm the file by hand from the pod's filesystem.

Suggested fix

Add sessionFile: undefined to the isNewSession cleanup block in src/cron/isolated-agent/session.ts, right next to the existing delivery-routing clears. When the entry is returned with sessionFile undefined, resolveSessionFilePath correctly falls through to resolveSessionTranscriptPathInDir(sessionId, …) and a fresh transcript file is created for the new sessionId.

A PR with the one-line fix plus three regression tests (clears sessionFile when forceNew is true, clears sessionFile when session is stale, preserves sessionFile when reusing fresh session) is incoming from alexander-applyinnovations/openclaw:fix/heartbeat-isolated-session-file-rotation.

Related

  • #64196 — the llama.cpp overflow detection fix that ended up masking how severe the transcript accumulation is; without that fix the same deployment was wedging silently on raw 400s instead of compacting into noise.

AI-assisted

  • Drafted with Claude Code (Claude Opus 4.6, 1M context), reviewed and verified by the author.
  • Code walk reviewed line-by-line against main at the time of filing.

extent analysis

TL;DR

The issue can be fixed by adding sessionFile: undefined to the isNewSession cleanup block in src/cron/isolated-agent/session.ts to ensure a fresh transcript file is created for each new session.

Guidance

  • Review the src/cron/isolated-agent/session.ts file and add sessionFile: undefined to the isNewSession cleanup block to fix the issue.
  • Verify that the fix works by checking that a new transcript file is created for each new session and that the old transcript file is not appended to.
  • Test the fix with regression tests, such as clears sessionFile when forceNew is true, clears sessionFile when session is stale, and preserves sessionFile when reusing fresh session.
  • Consider reviewing related issues, such as #64196, to ensure that the fix does not introduce any new problems.

Example

...(isNewSession && {
  // ... existing delivery-routing clears (lastChannel, lastTo, etc.) ...
  sessionFile: undefined, // add this line to fix the issue
}),

Notes

  • The fix is specific to the src/cron/isolated-agent/session.ts file and may not apply to other parts of the codebase.
  • The issue is caused by the sessionFile not being cleared when a new session is created, resulting in the transcript file being appended to instead of rotated.

Recommendation

Apply the suggested fix by adding sessionFile: undefined to the isNewSession cleanup block in src/cron/isolated-agent/session.ts. This fix is specific to the issue described and should resolve the problem of transcript files not being rotated correctly.
