openclaw: session-file-repair writes unbounded `.bak-*` snapshots per repair invocation (2 GB / 24h on a single stuck session) [2 pull requests]

Fix Action

Fixed

Summary

repairSessionFileIfNeeded writes a <sessionFile>.bak-<pid>-<timestamp> snapshot every time it runs on a file that needs repair, with no rotation, no max-count, no TTL, and no deduplication. When a session has a persistently malformed JSONL line and any caller keeps re-invoking repair (spawn-attempt, compaction), the backup directory grows without bound.

In the field this produced 2.1 GB in ~/.openclaw/agents/operations/sessions/ from 2,180 backup files for one stuck session.

Reproduction (field observation)

Single orphan session 059a9c5f-eed4-4826-95ce-f32c820f5784 in the operations agent:

  • Session created 2026-05-05; session.ended emitted 2026-05-07 (terminated cleanly with status: success per trajectory).
  • Between 2026-05-10 18:56 and 2026-05-11 20:05 (~25 hours, three days after the session ended), the directory accumulated 2,180 *.jsonl.bak-<pid>-<ts> files, each ~1.8 MB.
  • File-name PIDs show only two process generations (1220 and 2640) producing >1000 backups each — i.e. repair fired thousands of times within a single gateway process, not just on gateway restart.
  • Session contents are repeated [heartbeat poll] turns with provider error: Your authentication token has been invalidated (openai-codex / gpt-5.4).

Sample filenames:

059a9c5f-…jsonl.bak-1220-1778414172476  (1.9 MB)
059a9c5f-…jsonl.bak-1220-1778414187559  (1.9 MB)
…  (2,178 more) …
059a9c5f-…jsonl.bak-2640-1778421172468  (1.9 MB)
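
The per-PID observation above can be checked mechanically from a directory listing. The following is a minimal sketch, assuming backup names follow the `<sessionFile>.bak-<pid>-<timestampMs>` pattern described in this issue; `parseBackupName` and `countBackupsByPid` are hypothetical helper names, not functions from the OpenClaw source, and `session.jsonl` is a placeholder file name.

```typescript
// Hypothetical helpers (not part of OpenClaw): group backup file names of the
// form "<sessionFile>.bak-<pid>-<timestampMs>" by the PID embedded in the name.
interface BackupName {
  sessionFile: string;
  pid: number;
  timestampMs: number;
}

// Greedy prefix so that only the final ".bak-<pid>-<ts>" suffix is parsed.
const BACKUP_NAME = /^(.+)\.bak-(\d+)-(\d+)$/;

function parseBackupName(fileName: string): BackupName | null {
  const match = BACKUP_NAME.exec(fileName);
  if (!match) return null;
  return {
    sessionFile: match[1],
    pid: Number(match[2]),
    timestampMs: Number(match[3]),
  };
}

function countBackupsByPid(fileNames: string[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const fileName of fileNames) {
    const parsed = parseBackupName(fileName);
    if (!parsed) continue; // not a backup file, e.g. the live session file
    counts.set(parsed.pid, (counts.get(parsed.pid) ?? 0) + 1);
  }
  return counts;
}

// Usage with placeholder names; real input would come from a directory listing.
const counts = countBackupsByPid([
  "session.jsonl.bak-1220-1778414172476",
  "session.jsonl.bak-1220-1778414187559",
  "session.jsonl.bak-2640-1778421172468",
  "session.jsonl",
]);
for (const [pid, count] of counts) {
  console.log(`pid ${pid}: ${count} backup(s)`);
}
```

A small number of distinct PIDs with a large count each points at repeated repair within one process rather than at restarts.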

Root cause

src/agents/session-file-repair.ts (around L401 on main):

const cleaned = `${entries.map((entry) => JSON.stringify(entry)).join("\n")}\n`;
const backupPath = `${sessionFile}.bak-${process.pid}-${Date.now()}`;
try {
  const stat = await fs.stat(sessionFile).catch(() => null);
  await fs.writeFile(backupPath, content, "utf-8");      // unconditional write, no caps
  await replaceFileAtomic({});
}  // excerpt: the rest of the try statement is not shown in the issue

The function returns early, before the backup write, only when none of the repair conditions is triggered (all counts are zero). If even a single malformed line persists across repair calls (because new corrupt entries keep being appended, or because the dropping logic doesn't remove the offending entry), every subsequent call writes another full-content backup.

Callers that re-invoke repair on the same session:

  • src/agents/pi-embedded-runner/run/attempt.ts:1328 — each spawn-attempt
  • src/agents/pi-embedded-runner/compact.ts:807 — each compaction

Suggested fix

Either approach (or both) prevents the unbounded growth:

  1. Delete the backup after successful atomic repair (after replaceFileAtomic) — the snapshot has already served its safety purpose. This is what PR #77945 proposes.
  2. Cap per-session backup count: before writing the new backup, glob existing ${sessionFile}.bak-*, sort by mtime, and unlink all but the newest N (e.g. N=3).
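
Approach 1 above could look roughly like the following. This is a sketch under stated assumptions, not the code from PR #77945: `repairWithTransientBackup` is a hypothetical name, and `replaceFileAtomicSketch` is a stand-in (temp file plus rename) for the repository's `replaceFileAtomic`, whose real signature is not shown in this issue.

```typescript
import { promises as fs } from "node:fs";

// Hypothetical stand-in for the repo's replaceFileAtomic (real signature not
// shown in the issue): write to a temp file, then rename it over the target.
async function replaceFileAtomicSketch(target: string, data: string): Promise<void> {
  const tmpPath = `${target}.tmp-${process.pid}-${Date.now()}`;
  await fs.writeFile(tmpPath, data, "utf-8");
  await fs.rename(tmpPath, target);
}

// Hypothetical repair step: snapshot the original, replace the session file,
// then drop the snapshot once the replacement has succeeded.
async function repairWithTransientBackup(
  sessionFile: string,
  originalContent: string,
  cleanedContent: string,
): Promise<void> {
  const backupPath = `${sessionFile}.bak-${process.pid}-${Date.now()}`;
  await fs.writeFile(backupPath, originalContent, "utf-8");

  // If the replace throws, the backup stays on disk for manual recovery.
  await replaceFileAtomicSketch(sessionFile, cleanedContent);

  // Best-effort cleanup: a failed delete should not fail the repair itself.
  await fs.rm(backupPath, { force: true }).catch(() => undefined);
}
```

With this shape a backup only outlives the call when the replace step fails, so a repeatedly repaired session leaves no trail of snapshots behind.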

Either fix is local to session-file-repair.ts.
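
Approach 2 could be sketched as a small pruning helper. The name `pruneSessionBackups` and its `keep` parameter are hypothetical and not taken from the OpenClaw source; the sketch assumes backups live next to the session file and share the `<sessionFile>.bak-` prefix.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Hypothetical helper (not from the OpenClaw source): delete the oldest
// "<sessionFile>.bak-*" siblings so that at most `keep` of them remain.
// Returns the paths it removed.
async function pruneSessionBackups(sessionFile: string, keep: number): Promise<string[]> {
  const dir = path.dirname(sessionFile);
  const prefix = `${path.basename(sessionFile)}.bak-`;
  const names = (await fs.readdir(dir)).filter((name) => name.startsWith(prefix));

  const stats = await Promise.all(
    names.map(async (name) => {
      const file = path.join(dir, name);
      // A backup may disappear concurrently; treat that as "nothing to prune".
      const stat = await fs.stat(file).catch(() => null);
      return stat ? { file, mtimeMs: stat.mtimeMs } : null;
    }),
  );
  const backups = stats.filter(
    (entry): entry is { file: string; mtimeMs: number } => entry !== null,
  );

  // Newest first; ties are broken by name so the order is deterministic.
  backups.sort((a, b) => b.mtimeMs - a.mtimeMs || b.file.localeCompare(a.file));

  const stale = backups.slice(Math.max(0, keep)).map((entry) => entry.file);
  await Promise.all(stale.map((file) => fs.rm(file, { force: true })));
  return stale;
}
```

Calling `pruneSessionBackups(sessionFile, 2)` just before writing a new snapshot would leave at most three backups on disk afterwards, which matches the N=3 example.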

Related

  • PR #77945 (open, by @tynamite, 2026-05-05) — implements approach 1. CI currently failing on "Real behavior proof"; no issue currently links it. This issue can serve as the tracking issue for #77945.
  • #63998 (open) — different cause (large-transcript doomloop on crash-restart) but same symptom family (unbounded session-state growth).
  • #77228 — referenced in isStructurallyInvalidMessageEntry doc comments; also about JSONL corruption surviving repair.

Secondary observation (likely separate bug)

The orphan session emitted session.ended on 2026-05-07, yet spawn-attempt / compaction continued to invoke repair on it 3-4 days later. That suggests stale-session reachability from the registry / heartbeat loop even after terminal state. Not addressed by #77945. Flagged here for visibility; can be split into a separate issue if useful.

Environment

  • OpenClaw: v2026.4.24 (release tag) + 172 plan-mode commits (rebase of #70071 work)
  • Affected directory: ~/.openclaw/agents/operations/sessions/
  • macOS, Node 22+
