openclaw - ✅(Solved) Fix session.maintenance has no size cap for transcript .jsonl files — unbounded growth causes gateway CPU 100% [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#66360 · Fetched 2026-04-15 06:26:24
Comments: 0 · Participants: 1 · Timeline: 2 (cross-referenced ×1, referenced ×1) · Reactions: 0

session.maintenance controls (rotateBytes, maxDiskBytes, maxEntries) apply to sessions.json (the session index file) but have no effect on individual transcript .jsonl files. These files can grow without bound, eventually driving the gateway to 100% CPU and making it unresponsive.

Root Cause

rotateBytes rotates sessions.json (the index), not transcript .jsonl files. There is no mechanism to cap or rotate individual transcript files. For long-running topic-bound sessions (group chats, forum topics), these files grow indefinitely because:

  1. Every message, tool call, and tool result is appended in full
  2. Large tool results (e.g. sessions_list, file reads) are not truncated before write
  3. No maintenance hook checks transcript file sizes

This is distinct from issue #18572, which concerns a race condition in sessions.json rotation.

Fix Action

Workaround

The reporter runs a nightly cron job that scans all agent session directories and archives .jsonl files exceeding 10 MB.
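
The issue does not include the workaround script itself. A minimal sketch of what such a scan might look like follows, assuming transcripts live as .jsonl files somewhere under a single root directory. The function names, the `.archived.<timestamp>` suffix, and the directory layout are all hypothetical and not taken from OpenClaw.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// 10 MB, the threshold named in the workaround.
const LIMIT_BYTES = 10 * 1024 * 1024;

// Recursively collect .jsonl files under `root` that are larger than `limitBytes`.
export async function findOversizedTranscripts(
  root: string,
  limitBytes: number = LIMIT_BYTES,
): Promise<string[]> {
  const oversized: string[] = [];
  for (const entry of await fs.readdir(root, { withFileTypes: true })) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) {
      oversized.push(...(await findOversizedTranscripts(full, limitBytes)));
    } else if (entry.isFile() && entry.name.endsWith(".jsonl")) {
      const { size } = await fs.stat(full);
      if (size > limitBytes) oversized.push(full);
    }
  }
  return oversized;
}

// Move each oversized transcript aside. The archive suffix is an arbitrary
// choice for this sketch, not an OpenClaw convention.
export async function archiveOversizedTranscripts(
  root: string,
  limitBytes: number = LIMIT_BYTES,
): Promise<string[]> {
  const archived: string[] = [];
  for (const file of await findOversizedTranscripts(root, limitBytes)) {
    const target = `${file}.archived.${Date.now()}`;
    await fs.rename(file, target);
    archived.push(target);
  }
  return archived;
}
```

The root directory is a parameter because the issue does not say where session directories live. The issue's own recovery step was to archive the file and restart the gateway, so whether a running gateway tolerates a transcript being moved underneath it is not established here.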

PR fix notes

PR #66546: feat(sessions): add transcriptRotateBytes and transcriptMaxLines to cap .jsonl growth

Description (problem / solution / changelog)

Summary

  • Problem: session.maintenance controls (rotateBytes, maxDiskBytes) only affect sessions.json, not individual transcript .jsonl files. These files grow unbounded in long-lived group chats, reaching 100MB+ and causing gateway CPU 100%.
  • Why it matters: Users who configure rotateBytes reasonably expect transcript files to be capped, but .jsonl files are completely uncontrolled — a silent critical failure.
  • What changed: Added transcriptRotateBytes and transcriptMaxLines config options. When enabled, transcript rotation archives oversized .jsonl files as .jsonl.bak.<timestamp> and writes a replacement with the session header + last N lines. Old backups are pruned to keep the 3 most recent per transcript.
  • What did NOT change (scope boundary): No changes to existing rotateBytes, maxDiskBytes, pruneAfter, or enforceSessionDiskBudget logic. New fields default to null (disabled), so behavior is unchanged without explicit configuration.

Design Decisions

Why transcriptRotateBytes instead of reusing rotateBytes?

Three approaches were considered:

| Approach | Description |
| --- | --- |
| A. Reuse rotateBytes | Apply the same threshold to both sessions.json and .jsonl |
| B. New independent fields | Add transcriptRotateBytes + transcriptMaxLines |
| C. Global default + override | rotateBytes as default, transcriptRotateBytes as override |

Chose B because:

  1. Different thresholds: sessions.json is a structured index — 10MB is already large. .jsonl is a message stream — 10MB may be just a few thousand lines, and users may want 50MB before rotation. A shared threshold forces a compromise.
  2. Different control dimensions: .jsonl rotation needs transcriptMaxLines (tail lines to keep for context window), which sessions.json does not need.
  3. Backward compatibility: Reusing rotateBytes would make rotateBytes: "10mb" suddenly affect all .jsonl files — a behavioral change. Independent fields default to null (disabled).
  4. Simplicity: Approach C adds mental overhead (priority rules between two fields) for little gain.

Why check only the active session's transcript on the hot path?

The original implementation called rotateTranscriptFiles() on every saveSessionStoreUnlocked(), which does a full directory walk (stat every .jsonl file). For a gateway with 100+ sessions, this means 100+ stat() calls per message.

Current design: the hot path (saveSessionStore with activeSessionKey) only stats the single active transcript. Bulk rotation (full directory walk) is gated behind bulkTranscriptRotation, which is only set by explicit maintenance entry points (sessions-cleanup). Runtime call sites without activeSessionKey (e.g. heartbeat-runner) skip transcript rotation entirely.

When both bulkTranscriptRotation and activeSessionKey are provided (e.g. sessions cleanup --active-key), the bulk path takes priority so all oversized transcripts are covered, not just the active one.
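
The priority rule described above can be pictured as a small decision function. This is only an illustration of the stated behavior: the option names activeSessionKey and bulkTranscriptRotation come from the PR description, while `planTranscriptRotation`, `SaveOptions`, and `RotationPlan` are hypothetical names invented for the sketch.

```typescript
// Options as described in the PR text; the real option type may carry more fields.
interface SaveOptions {
  activeSessionKey?: string;
  bulkTranscriptRotation?: boolean;
}

// Hypothetical result type: which transcripts a save should consider for rotation.
type RotationPlan =
  | { kind: "bulk" }                        // walk the whole sessions directory
  | { kind: "single"; sessionKey: string }  // stat only the active transcript
  | { kind: "skip" };                       // no transcript rotation on this save

export function planTranscriptRotation(opts: SaveOptions): RotationPlan {
  // Bulk wins when both are set, so cleanup covers inactive transcripts too.
  if (opts.bulkTranscriptRotation) return { kind: "bulk" };
  // Hot path: a single stat() for the session that is being written.
  if (opts.activeSessionKey) return { kind: "single", sessionKey: opts.activeSessionKey };
  // Call sites without a key (e.g. heartbeat-runner) skip rotation entirely.
  return { kind: "skip" };
}
```

Checking the bulk flag first is what prevents the earlier bug where a cleanup run with an active key rotated only that one transcript.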

Why transcript rotation does not execute under mode: "warn"

By design, session maintenance uses a dual-mode system: "warn" (default) and "enforce".

  • warn mode (safe-by-default): Logs warnings for prune/cap thresholds but does not execute any destructive operations. This prevents accidental data loss when users upgrade and gain new maintenance features.
  • enforce mode: Executes all maintenance operations — pruning, capping, transcript rotation, and reset archiving.

This mirrors the existing pattern for maxSessions pruning and maxSessionEntries capping, which also require mode: "enforce" to take effect. Transcript rotation follows the same safe-by-default principle.

Known gap: The warn branch currently does not log a warning when transcript rotation thresholds are exceeded. This is a candidate for a follow-up enhancement.

Concurrency safety: rotation rollback

The rename → write replacement sequence has a TOCTOU window where a concurrent transcript append can recreate the file. The implementation handles this with layered defenses:

  1. O_EXCL write: If a concurrent writer recreated the file, the replacement write throws EEXIST and is skipped. The archive is safe regardless.
  2. weCreatedReplacement flag: Tracks whether we created the replacement file via O_EXCL. On rollback:
    • If true (open succeeded, writeFile failed): safe to unlink the partial file before restoring from archive.
    • If false (failure before open): the file at transcriptPath, if any, belongs to a concurrent writer and must not be deleted.
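
A rough sketch of the layered defense described above, using only Node's fs.promises API. It is not the PR's code: `rotateWithRollback` and its return values are hypothetical, and header/tail extraction is replaced by a pre-built `replacement` string to keep the example short.

```typescript
import { promises as fs } from "node:fs";

export async function rotateWithRollback(
  transcriptPath: string,
  replacement: string,
): Promise<"rotated" | "skipped-replacement"> {
  const archivePath = `${transcriptPath}.bak.${Date.now()}`;
  await fs.rename(transcriptPath, archivePath); // original content is now safe in the archive

  let weCreatedReplacement = false;
  try {
    // "wx" opens with O_CREAT | O_EXCL: it fails with EEXIST if the path already exists.
    const handle = await fs.open(transcriptPath, "wx", 0o600);
    weCreatedReplacement = true;
    try {
      await handle.writeFile(replacement, "utf8");
    } finally {
      await handle.close();
    }
    return "rotated";
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") {
      // A concurrent writer recreated the transcript. Keep its file and the archive.
      return "skipped-replacement";
    }
    if (weCreatedReplacement) {
      // The partial file is ours, so it is safe to remove before restoring.
      await fs.unlink(transcriptPath).catch(() => undefined);
    }
    // Restore the original. Simplification: this rename would overwrite a file that
    // appeared in the meantime; how the PR handles that case is not described.
    await fs.rename(archivePath, transcriptPath);
    throw err;
  }
}
```

The flag is set only after open succeeds, so any failure before that point leaves whatever exists at transcriptPath untouched by the unlink step.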

Self-Review Checklist

Issues found and fixed during development and bot review:

| # | Severity | Issue | Found by | Fix |
| --- | --- | --- | --- | --- |
| 1 | 🔴 Critical | Session header dropped on rotation (replacement wrote only tail lines) | Codex P1 | Replacement now reads header from archive; falls back to buildDefaultSessionHeader() |
| 2 | 🔴 Critical | writeFile failure left partial replacement file, blocked rollback | Codex P1 | weCreatedReplacement flag tracks file ownership; only unlink when we created it |
| 3 | 🔴 Critical | Bulk rotation triggered on every saveSessionStore without activeSessionKey (heartbeat, etc.) | Codex P1 | bulkTranscriptRotation opt-in flag, only set by sessions-cleanup |
| 4 | 🔴 Critical | bulkTranscriptRotation + activeSessionKey mutually exclusive, so cleanup skipped inactive transcripts | Codex P2 | bulkTranscriptRotation now takes priority over activeSessionKey |
| 5 | 🔴 Critical | Unconditional unlink on rollback could delete concurrent writer's valid file | Codex P1 | Only unlink when weCreatedReplacement === true |
| 6 | 🟡 Important | Replacement file permissions widened from 0o600 to default umask | Codex P2 | Explicit 0o600 mode in fs.promises.open() |
| 7 | 🟡 Important | readLastNLines used Array.shift(), which is O(n) per call | Greptile P2 | Replaced with circular buffer (readHeaderAndTailLines), O(1) per iteration |
| 8 | 🟡 Important | readHeaderAndTailLines stream not explicitly closed on early exit | Manual review | try/finally with rl.close() + stream.destroy() |
| 9 | 💡 Minor | maxBackups = 3 hardcoded in two places | Manual review | Extracted MAX_ROTATION_BACKUPS constant |
| 10 | 💡 Minor | transcriptRotateBytes: 0 or negative passed through | Manual review | resolveTranscriptRotateBytes() returns null for <= 0 |
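
Items 7 and 8 in the table describe a circular buffer and explicit stream cleanup. The sketch below shows one way those two ideas fit together. It is not the PR's readHeaderAndTailLines: the name `readHeaderAndTail` is hypothetical, and treating the first line of the file as the session header is an assumption made for illustration.

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

export async function readHeaderAndTail(
  filePath: string,
  maxLines: number,
): Promise<{ header: string | null; tail: string[] }> {
  const capacity = Math.max(maxLines, 0);
  const ring: string[] = new Array(capacity); // fixed-size circular buffer
  let count = 0;                              // non-header lines seen so far
  let header: string | null = null;

  const stream = createReadStream(filePath, { encoding: "utf8" });
  const rl = createInterface({ input: stream, crlfDelay: Infinity });
  try {
    for await (const line of rl) {
      if (header === null) {
        header = line; // assumption: the header is the first line
        continue;
      }
      // Overwrite the oldest slot instead of shifting the array: O(1) per line.
      if (capacity > 0) ring[count % capacity] = line;
      count++;
    }
  } finally {
    // Release the file descriptor even if iteration throws or exits early.
    rl.close();
    stream.destroy();
  }

  // Unroll the ring into chronological order.
  const kept = Math.min(count, capacity);
  const tail: string[] = [];
  for (let i = count - kept; i < count; i++) tail.push(ring[i % capacity]);
  return { header, tail };
}
```

Memory use is bounded by maxLines rather than by file size, which matters for the 100MB+ transcripts reported in the issue.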

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #66360
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: rotateBytes only targets sessions.json (the session index). No code path checks or limits the size of individual transcript .jsonl files. For long-running topic-bound sessions (Telegram forum topics, group chats), every message/tool call/tool result is appended in full with no cap.
  • Missing detection / guardrail: No file-size check on transcript files during session store saves or maintenance sweeps.
  • Contributing context (if known): The sessions.json rotation was added in an earlier release; transcript files were assumed to be bounded by pruneAfter session expiry, but topic-bound sessions are never pruned because they remain active.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/config/sessions/store-maintenance.transcript-rotation.test.ts (50 tests)
  • Scenario the test should lock in: Transcript file exceeding transcriptRotateBytes is archived and replaced with header + tail lines; rollback restores original on write failure (including partial write with weCreatedReplacement tracking); EEXIST path marks rotation as succeeded; bulk rotation covers all transcripts; bulkTranscriptRotation takes priority over activeSessionKey.
  • Why this is the smallest reliable guardrail: Unit tests with fs mocks cover all rotation/rollback paths without requiring a running gateway.
  • Existing test that already covers this (if any): None — this is a new feature.
  • Known test gap: The EEXIST test verifies the code path does not error, but does not simulate a true concurrent writer recreating the file between rename and open. The O_EXCL defense is verified structurally, not via a race simulation.

User-visible / Behavior Changes

  • Two new optional config fields under session.maintenance:
    • transcriptRotateBytes (string | number | null): Rotate transcript .jsonl files exceeding this size. Example: "10mb". Default: null (disabled).
    • transcriptMaxLines (number | null): Number of most recent lines to keep after rotation. Default: null (archive entire file, replacement contains only the session header).
  • Rotation creates .jsonl.bak.<timestamp> archives, auto-pruned to 3 most recent per transcript.
  • Requires session.maintenance.mode: "enforce" to take effect (consistent with existing maintenance behavior).
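
To make the "string | number | null" contract concrete, here is a hypothetical normalizer. It is not the PR's resolveTranscriptRotateBytes(): the function name `resolveRotateBytes`, the accepted unit suffixes, and the 1 MB = 1024 × 1024 bytes convention are assumptions. The only behavior taken from the PR text is that null and values <= 0 resolve to null (disabled).

```typescript
// Assumed binary units; OpenClaw's own size parser may use different rules.
const UNITS: Record<string, number> = {
  b: 1,
  kb: 1024,
  mb: 1024 ** 2,
  gb: 1024 ** 3,
};

export function resolveRotateBytes(
  value: string | number | null | undefined,
): number | null {
  if (value == null) return null; // unset means rotation is disabled

  let bytes: number;
  if (typeof value === "number") {
    bytes = value;
  } else {
    // Accept forms such as "10mb", "512 KB", or a bare "2048".
    const match = /^\s*(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)?\s*$/i.exec(value);
    if (!match) return null; // unparseable input is treated as disabled in this sketch
    bytes = Number(match[1]) * UNITS[(match[2] ?? "b").toLowerCase()];
  }

  // Zero, negative, NaN, and infinite values all resolve to "disabled".
  return Number.isFinite(bytes) && bytes > 0 ? Math.floor(bytes) : null;
}
```

Under these assumptions, "10mb" resolves to 10485760 bytes, while 0, -1, and null all resolve to null.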

Diagram (if applicable)

Hot path (per message write):
  saveSessionStore(store, path, { activeSessionKey })
    → stat(active transcript)
    → if size > transcriptRotateBytes:
        rename(transcript → .bak)
        → open(O_EXCL) → writeFile(header + tail)
        → cleanup old .bak files (keep 3)
        on write failure:
          → if weCreatedReplacement: unlink partial file
          → rename(.bak → transcript)  // restore original

Bulk path (sessions-cleanup CLI only):
  saveSessionStore(store, path, { bulkTranscriptRotation: true })
    → readdir(sessions dir)
    → for each .jsonl: stat → rotate if oversized

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Darwin 25.4.0 (arm64, Apple Silicon), Docker (ubuntu:24.04)
  • Runtime/container: Node v25.8.1
  • Model/provider: N/A (infrastructure change)
  • Integration/channel (if any): Telegram forum topics (long-lived sessions)
  • Relevant config (redacted): transcriptRotateBytes: "10mb", transcriptMaxLines: 500

Steps

  1. Create a session entry with a 15MB .jsonl transcript (42,714 lines)
  2. Configure transcriptRotateBytes: "10mb" and transcriptMaxLines: 500
  3. Run openclaw sessions cleanup --enforce

Expected

  • topic-2.jsonl reduced to ~180KB (500 lines + session header)
  • Old data archived as topic-2.jsonl.bak.<timestamp>

Actual

  • Confirmed: 15MB → 180KB, 42,714 → 500 lines, archive created

Evidence

  • Trace/log snippets
  • Failing test/log before + passing after
  • Screenshot/recording
  • Perf numbers (if relevant)

Docker verification: 15MB topic-2.jsonl rotated to 180KB with .bak archive preserved.

Human Verification (required)

  • Verified scenarios:
    • pnpm build passes
    • pnpm check (lint + format + import cycles) passes
    • pnpm test -- src/config/sessions/store-maintenance.transcript-rotation.test.ts (50 tests) passes
    • pnpm test -- src/config/sessions/store.pruning.integration.test.ts (12 tests) passes
    • pnpm test -- src/config/schema.base.generated.test.ts (5 tests, schema drift) passes
  • Edge cases checked: null/disabled defaults (no rotation), 0/negative values (treated as disabled), file under threshold (no-op), multiple .jsonl files in same directory, single-file hot path (rotateTranscriptFile) vs full walk (rotateTranscriptFiles), session header preservation after rotation, 0o600 file permissions, O_EXCL concurrent-write detection (structural, not race-simulated), partial-write rollback with weCreatedReplacement tracking, bulkTranscriptRotation priority over activeSessionKey
  • What I did not verify: Full gateway integration test with live channels; extremely large files (>1GB); Windows file locking behavior; true concurrent-writer race simulation for EEXIST path
  • Note: activeSessionKey was added to most runtime updateSessionStore call sites that have the key in scope. A few call sites (model-selection.ts, session-reset-model.ts, one path in commands-session-store.ts) do not pass it because the session key is not readily available or the update is fire-and-forget. These paths fall back to the pre-existing behavior (no transcript rotation on that write), which is safe.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? Yes (two new optional fields, default null — no behavior change without explicit opt-in)
  • Migration needed? No

Risks and Mitigations

  • Risk: Rotation during active write could lose in-flight messages
    • Mitigation: O_EXCL write detects concurrent recreation and skips replacement. weCreatedReplacement flag ensures rollback only deletes files we created, not concurrent writers'. The session store lock serializes saveSessionStore calls.
  • Risk: .bak files accumulate on disk
    • Mitigation: Automatic pruning keeps only the 3 most recent backups per transcript file (MAX_ROTATION_BACKUPS)
  • Risk: Non-active sessions with large transcripts are not rotated on the hot path
    • Mitigation: bulkTranscriptRotation flag enables full-directory sweep from sessions-cleanup CLI. Takes priority over activeSessionKey so cleanup always covers all transcripts.
  • Risk: transcriptRotateBytes: 0 or negative values could cause unexpected behavior
    • Mitigation: resolveTranscriptRotateBytes() returns null for <= 0, and the hot path guard checks != null && > 0
  • Risk: A few updateSessionStore call sites do not pass activeSessionKey
    • Mitigation: These paths skip hot-path transcript rotation (equivalent to pre-existing behavior). Oversized transcripts from these paths are caught by bulk rotation via sessions-cleanup.
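
The backup-pruning mitigation above (keep the 3 most recent archives per transcript) could look roughly like the following. The constant name MAX_ROTATION_BACKUPS appears in the PR description; `pruneBackups` is a hypothetical name, and the sketch assumes the `<timestamp>` suffix is a numeric epoch value, which the PR text does not specify.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

const MAX_ROTATION_BACKUPS = 3;

// Delete all but the `keep` newest `<transcript>.bak.<timestamp>` files.
// Returns the names of the files that were removed.
export async function pruneBackups(
  transcriptPath: string,
  keep: number = MAX_ROTATION_BACKUPS,
): Promise<string[]> {
  const dir = path.dirname(transcriptPath);
  const prefix = `${path.basename(transcriptPath)}.bak.`;

  const backups = (await fs.readdir(dir))
    .filter((name) => name.startsWith(prefix))
    // Assumption: the suffix parses as a number (e.g. epoch milliseconds).
    .map((name) => ({ name, stamp: Number(name.slice(prefix.length)) }))
    .filter((backup) => Number.isFinite(backup.stamp))
    .sort((a, b) => b.stamp - a.stamp); // newest first

  const removed: string[] = [];
  for (const { name } of backups.slice(keep)) {
    await fs.unlink(path.join(dir, name));
    removed.push(name);
  }
  return removed;
}
```

Matching on the full transcript basename keeps the pruning scoped to one transcript, so backups of topic-2.jsonl never count against the quota of topic-8.jsonl.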

AI-Assisted

  • This PR was developed with AI assistance (CodeBuddy)
  • Testing level: Fully tested (50 unit tests + 12 integration tests + 5 schema drift tests)
  • I understand what the code does

Changed files

  • docs/.generated/config-baseline.sha256 (modified, +3/-3)
  • src/auto-reply/reply/abort-cutoff.runtime.ts (modified, +13/-9)
  • src/auto-reply/reply/abort.ts (modified, +18/-14)
  • src/auto-reply/reply/agent-runner-execution.ts (modified, +32/-20)
  • src/auto-reply/reply/agent-runner-session-reset.ts (modified, +7/-3)
  • src/auto-reply/reply/body.ts (modified, +15/-11)
  • src/auto-reply/reply/commands-acp/lifecycle.ts (modified, +15/-11)
  • src/auto-reply/reply/commands-session-store.ts (modified, +21/-13)
  • src/auto-reply/reply/directive-handling.impl.ts (modified, +7/-3)
  • src/auto-reply/reply/directive-handling.persist.ts (modified, +7/-3)
  • src/auto-reply/reply/get-reply-run.ts (modified, +7/-3)
  • src/auto-reply/reply/session-updates.ts (modified, +17/-9)
  • src/commands/sessions-cleanup.ts (modified, +1/-0)
  • src/config/schema.base.generated.ts (modified, +31/-0)
  • src/config/schema.help.ts (modified, +4/-0)
  • src/config/schema.labels.ts (modified, +2/-0)
  • src/config/sessions/runtime-types.ts (modified, +2/-0)
  • src/config/sessions/store-maintenance.transcript-rotation.test.ts (added, +955/-0)
  • src/config/sessions/store-maintenance.ts (modified, +348/-3)
  • src/config/sessions/store.pruning.integration.test.ts (modified, +125/-0)
  • src/config/sessions/store.ts (modified, +40/-1)
  • src/config/types.base.ts (modified, +14/-0)
  • src/config/zod-schema.session-maintenance-extensions.test.ts (modified, +85/-0)
  • src/config/zod-schema.session.ts (modified, +15/-0)
  • src/gateway/server-methods/agent.ts (modified, +14/-10)
  • src/gateway/server-methods/sessions.ts (modified, +138/-98)
  • src/gateway/server-node-events.ts (modified, +26/-22)
  • src/gateway/session-compaction-checkpoints.ts (modified, +18/-14)
  • src/gateway/session-reset-service.ts (modified, +107/-103)
  • src/gateway/sessions-resolve.test.ts (modified, +5/-1)
  • src/gateway/sessions-resolve.ts (modified, +10/-6)

Code Example: the reporter's current session.maintenance config

{
  "pruneAfter": "30d",
  "maxEntries": 500,
  "rotateBytes": "10mb",
  "maxDiskBytes": "500mb",
  "highWaterBytes": "400mb",
  "mode": "enforce"
}
Environment

  • OpenClaw version: v2026.4.12
  • Node: v25.8.1
  • OS: Darwin 25.4.0 (arm64, Apple Silicon)
  • Affected session types: topic-bound group chat sessions (e.g. Telegram forum topics)

Observed behavior

With the above config, individual .jsonl transcript files grew to:

  • daily-devops topic-2.jsonl → 113 MB (25,927 lines)
  • dev-tl session.jsonl → 266 MB
  • dev-tl session.jsonl → 56 MB
  • daily-collector topic-8.jsonl → 36 MB

The rotateBytes: "10mb" setting had zero effect on any of these files.

Impact

When daily-devops/topic-2.jsonl reached 113 MB, gateway entered CPU 100% and became completely unresponsive. Manual intervention was required: archive the file and restart gateway.

Expected behavior

session.maintenance should include controls for transcript .jsonl files, such as:

  • transcriptRotateBytes: rotate (archive) a transcript when it exceeds a size threshold
  • transcriptMaxLines: cap lines per transcript file
  • Or: apply rotateBytes to transcripts as well, not just sessions.json

Related

  • Issue #18572: sessions.json rotation race condition (different issue, same config surface)

extent analysis

TL;DR

Implement a maintenance mechanism to control the size of individual transcript .jsonl files, such as rotating or capping them, to prevent unbounded growth and gateway unresponsiveness.

Guidance

  • Review the current session.maintenance configuration and consider adding custom logic to rotate or cap transcript .jsonl files based on size or line count thresholds.
  • Implement a nightly cron job, as described in the workaround, to scan and archive .jsonl files exceeding a certain size threshold (e.g., 10 MB).
  • Investigate modifying the session.maintenance configuration to include controls for transcript .jsonl files, such as transcriptRotateBytes or transcriptMaxLines.
  • Consider truncating large tool results before writing them to transcript files to prevent excessive growth.

Example

// potential new configuration options
{
  "transcriptRotateBytes": "10mb",
  "transcriptMaxLines": 10000
}

Note: This example is illustrative. PR #66546 uses these two field names under session.maintenance, with both defaulting to null (disabled); the values shown here are arbitrary.

Notes

The nightly cron workaround may not be sufficient for all use cases, and a more robust solution may be required. The session.maintenance options mentioned in the expected behavior section were not implemented when the issue was filed; PR #66546 proposes transcriptRotateBytes and transcriptMaxLines to add them.

Recommendation

Apply the workaround by implementing a nightly cron job to scan and archive .jsonl files exceeding a certain size threshold, as this provides a temporary solution to prevent gateway unresponsiveness. A more permanent solution would require modifying the session.maintenance configuration or implementing custom logic to control transcript file growth.
