openclaw - ✅ (Solved) fix: synchronous session transcript reads block Gateway event loop (WS handshake timeouts, Telegram unresponsive) [2 pull requests, 4 comments, 4 participants]

GitHub stats
openclaw/openclaw#75656 · Fetched 2026-05-02 05:32:13
Comments: 4 · Participants: 4 · Timeline events: 12 · Reactions: 2
Timeline (top): cross-referenced ×6, commented ×4, closed ×1, subscribed ×1

The OpenClaw Gateway event loop is severely blocked by synchronous fs.readFileSync() calls during agent session preparation. This manifests as WebSocket preauth handshake timeouts (same symptom as #74135, different root cause) and Telegram becoming completely unresponsive for minutes at a time.

The #74135 fix (fix(gateway): refresh model catalog off request path) addressed model-catalog blocking, but the session-transcript read path is a separate, still-unfixed blocker.

Error Message

[ws] handshake timeout conn=... handshakeMs >> 10000
[telegram] connect error: gateway request timeout for connect

Root Cause

In session-utils.fs-BgGqlqA-.js, readSessionMessages() uses synchronous fs.readFileSync(file).split(...) followed by JSON parsing of entire transcript files. It is called during agent prompt-token estimation / preflight compaction and from multiple gateway methods (chat.history, session preview, session event sequencing).

With large or accumulated session transcripts, this blocks the Node.js event loop for tens of seconds, preventing any WebSocket handshakes, Telegram API calls, or RPCs from being processed.
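
To make the contrast with a whole-file synchronous read concrete, here is a minimal sketch of a streaming, asynchronous JSONL reader. The function name readSessionMessagesAsync and its return shape are assumptions made for illustration; they are not taken from the OpenClaw source, and skipping malformed lines is an assumed policy.

```typescript
import * as fs from "node:fs";
import { createInterface } from "node:readline";

// Hypothetical async counterpart to readSessionMessages(); the name and
// return type are illustrative assumptions, not the project's actual API.
async function readSessionMessagesAsync(file: string): Promise<unknown[]> {
  const messages: unknown[] = [];
  const lines = createInterface({
    input: fs.createReadStream(file, { encoding: "utf8" }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  // Lines are parsed as chunks arrive, so other callbacks (WebSocket
  // handshakes, RPCs) can run between chunks instead of waiting for the file.
  for await (const line of lines) {
    if (line.trim() === "") continue; // skip blank lines
    try {
      messages.push(JSON.parse(line));
    } catch {
      // Assumed policy: skip a malformed line rather than fail the whole read.
    }
  }
  return messages;
}
```

Each JSON.parse call is still synchronous, but it now covers one line at a time rather than the whole transcript, which bounds the length of any single stall.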

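Fix / Workaround

The issue is closed. Two pull requests are linked below: PR #75875 (fix(gateway): async session transcript IO) and PR #75672, which extends openclaw sessions cleanup to remove archived transcript files. Before either landed, the reporter manually deleted the .deleted.* and .reset.* files, which shrank the session store from 116 MB to 36 MB.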

PR fix notes

PR #75672: feat(sessions): clean up archived transcript files on cleanup (#75658)

Description (problem / solution / changelog)

Fixes #75658.

Problem

openclaw sessions cleanup only manages entries in sessions.json (the index). The on-disk archive artefacts that the gateway leaves behind during normal use accumulate indefinitely:

  • <sessionId>.jsonl.deleted.<ts> (explicitly-deleted sessions)
  • <sessionId>.jsonl.reset.<ts> (post-/reset snapshots)
  • <sessionId>.checkpoint.<uuid>.jsonl for sessions that are no longer in the index (orphan compaction checkpoints)

The reporter saw 63 MB of archived files vs 36 MB of live state on a busy Pi install. openclaw sessions cleanup --all-agents --dry-run reported Would prune missing transcripts: 0 despite 86 archive files.

This compounds the synchronous-fs.readFileSync event-loop blocking discussed in #75656 — the bigger the directory, the worse the wedges.

Why the gateway already had the helper

cleanupArchivedSessionTranscripts() (in src/gateway/session-archive.runtime.ts) already implements the timestamp-based pruning logic. The gateway calls it inline during normal maintenance (src/config/sessions/store.ts lines 347–363). The CLI just never wired it up.

Fix

Add planArchivedSessionFileCleanup() to the sessions cleanup command:

  1. Scan the directory containing the session store.
  2. .deleted.<ts> older than maintenance.pruneAfterMs → remove.
  3. .reset.<ts> older than maintenance.resetArchiveRetentionMs → remove (skipped when retention is null, matching existing runtime behaviour — operators opt in via sessionMaintenance.resetArchiveRetention).
  4. <sessionId>.checkpoint.<uuid>.jsonl whose sessionId is no longer in the live index → remove.
  5. Honours --dry-run (preview) and --enforce (apply).
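
The steps above can be sketched as a pure planning function. This is an illustration only: the PR text does not show the real signature of planArchivedSessionFileCleanup(), so every type and name below is hypothetical, and the archive timestamp is assumed to have been parsed from the file name already.

```typescript
// Hypothetical shapes; not the PR's actual types.
interface ArchiveEntry {
  name: string;         // file name inside the session store directory
  archivedAtMs: number; // timestamp already parsed from the name (parsing not shown)
  bytes: number;
}

interface RetentionConfig {
  pruneAfterMs: number;
  resetArchiveRetentionMs: number | null; // null keeps every .reset file
}

type Reason = "deleted" | "reset" | "orphan-checkpoint";

interface CleanupPlan {
  remove: { name: string; reason: Reason; bytes: number }[];
  scanned: number;
}

function planArchiveCleanup(
  entries: ArchiveEntry[],
  liveSessionIds: Set<string>,
  config: RetentionConfig,
  nowMs: number,
): CleanupPlan {
  const plan: CleanupPlan = { remove: [], scanned: 0 };
  for (const entry of entries) {
    let reason: Reason | null = null;
    const checkpoint = /^(.+)\.checkpoint\.[^.]+\.jsonl$/.exec(entry.name);
    if (/\.jsonl\.deleted\./.test(entry.name)) {
      if (nowMs - entry.archivedAtMs > config.pruneAfterMs) reason = "deleted";
    } else if (/\.jsonl\.reset\./.test(entry.name)) {
      const keepMs = config.resetArchiveRetentionMs;
      if (keepMs !== null && nowMs - entry.archivedAtMs > keepMs) reason = "reset";
    } else if (checkpoint) {
      const sessionId = checkpoint[1] ?? "";
      if (!liveSessionIds.has(sessionId)) reason = "orphan-checkpoint";
    } else {
      continue; // not an archive artefact (for example sessions.json)
    }
    plan.scanned += 1;
    if (reason) plan.remove.push({ name: entry.name, reason, bytes: entry.bytes });
  }
  return plan;
}
```

Keeping the plan separate from the deletion step is what makes a dry run cheap: a preview prints the plan, and an apply step unlinks the same entries.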

The plan is surfaced in:

  • The dry-run summary log: Would remove archived transcripts: N (deleted=, reset=, orphan-checkpoint=, B bytes) (or Archived transcripts scanned: N (none past retention) when nothing qualifies).
  • A new archivedFiles field on SessionCleanupSummary (and therefore the JSON output).

Tests

New sessions-cleanup.archived-files.test.ts covers four cases:

  • dry-run reports candidates without deleting any files
  • apply removes .deleted, .reset, and orphan checkpoints, while preserving in-window archives, live checkpoints, the unrelated sessions.json, and non-session files
  • resetArchiveRetentionMs: null preserves all .reset files but .deleted files are still removed via pruneAfterMs
  • a missing store directory returns an empty plan instead of throwing

The logic was first validated with a standalone fs-based simulator before being mirrored into the test file. All cases pass.

The pre-existing sessions-cleanup.test.ts continues to work — it already supplies resetArchiveRetentionMs in the mocked maintenance config and the new helper handles missing on-disk directories silently.

Risk

Medium-low. The helper:

  • Uses the existing archive-timestamp parser (parseSessionArchiveTimestamp) so it cannot mis-identify non-archive files.
  • Only acts inside the directory of the configured store.
  • Honours both retention thresholds that the gateway already enforces via cleanupArchivedSessionTranscripts.
  • Treats the store directory not existing as a no-op.
  • Treats unlink failures as best-effort (matches cleanupArchivedSessionTranscripts).

The directory walk is fs.readdirSync + fs.statSync per file. Pi/ARM users with hundreds of MB of archive files will see a one-time pause during cleanup; this matches the gateway-side maintenance path that is already on the same code shape.
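
For comparison, a non-blocking version of such a walk might look like the sketch below. It is not part of the PR; scanDirectory is a hypothetical helper that only sums file sizes in one directory, using fs.promises.readdir and fs.promises.stat so the event loop stays free between calls.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper, not from the PR: counts regular files and their
// total size in a single directory without any synchronous fs calls.
async function scanDirectory(dir: string): Promise<{ files: number; bytes: number }> {
  let names: string[];
  try {
    names = await fs.promises.readdir(dir);
  } catch (err) {
    // A missing directory is treated as empty, mirroring the no-op
    // behaviour described for the cleanup helper.
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return { files: 0, bytes: 0 };
    throw err;
  }
  let files = 0;
  let bytes = 0;
  for (const name of names) {
    const stat = await fs.promises.stat(path.join(dir, name));
    if (!stat.isFile()) continue; // ignore subdirectories and other entries
    files += 1;
    bytes += stat.size;
  }
  return { files, bytes };
}
```

A one-shot CLI command gains little from this, since nothing else is waiting on its event loop; the async form matters mainly for the gateway-side maintenance path that shares the process with WebSocket and Telegram traffic.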

Changed files

  • CHANGELOG.md (modified, +2/-0)
  • docs/cli/sessions.md (modified, +25/-0)
  • src/commands/sessions-cleanup.archived-files.test.ts (added, +180/-0)
  • src/commands/sessions-cleanup.ts (modified, +224/-2)

PR #75875: fix(gateway): async session transcript IO

Description (problem / solution / changelog)

(No description)

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/codex/src/app-server/transcript-mirror.test.ts (modified, +79/-0)
  • extensions/codex/src/app-server/transcript-mirror.ts (modified, +212/-3)
  • src/agents/main-session-restart-recovery.ts (modified, +6/-2)
  • src/agents/subagent-orphan-recovery.test.ts (modified, +3/-3)
  • src/agents/subagent-orphan-recovery.ts (modified, +6/-2)
  • src/agents/tools/embedded-gateway-stub.runtime.ts (modified, +2/-2)
  • src/agents/tools/embedded-gateway-stub.test.ts (modified, +4/-4)
  • src/agents/tools/embedded-gateway-stub.ts (modified, +13/-5)
  • src/agents/tools/sessions-list-tool.ts (modified, +12/-25)
  • src/auto-reply/reply/agent-runner-memory.ts (modified, +6/-6)
  • src/auto-reply/reply/session-fork.runtime.test.ts (modified, +1/-1)
  • src/auto-reply/reply/session-fork.runtime.ts (modified, +5/-5)
  • src/config/sessions/transcript-append.ts (added, +229/-0)
  • src/config/sessions/transcript.test.ts (modified, +121/-0)
  • src/config/sessions/transcript.ts (modified, +14/-5)
  • src/gateway/managed-image-attachments.test.ts (modified, +1/-1)
  • src/gateway/managed-image-attachments.ts (modified, +2/-2)
  • src/gateway/server-methods/artifacts.test.ts (modified, +4/-4)
  • src/gateway/server-methods/artifacts.ts (modified, +14/-12)
  • src/gateway/server-methods/chat-transcript-inject.ts (modified, +10/-97)
  • src/gateway/server-methods/chat.inject.parentid.test.ts (modified, +4/-4)
  • src/gateway/server-methods/chat.ts (modified, +43/-36)
  • src/gateway/server-methods/server-methods.test.ts (modified, +4/-5)
  • src/gateway/server-methods/sessions.ts (modified, +8/-7)
  • src/gateway/server-session-events.ts (modified, +65/-51)
  • src/gateway/session-history-state.ts (modified, +44/-0)
  • src/gateway/session-reset-service.ts (modified, +5/-5)
  • src/gateway/session-transcript-index.fs.ts (added, +247/-0)
  • src/gateway/session-utils.fs.test.ts (modified, +212/-0)
  • src/gateway/session-utils.fs.ts (modified, +437/-4)
  • src/gateway/session-utils.ts (modified, +57/-31)
  • src/gateway/sessions-history-http.revocation.test.ts (modified, +2/-2)
  • src/gateway/sessions-history-http.ts (modified, +5/-5)
  • src/status/status-message.ts (modified, +3/-2)
  • src/tui/embedded-backend.test.ts (modified, +2/-2)
  • src/tui/embedded-backend.ts (modified, +7/-5)


Summary

The OpenClaw Gateway event loop is severely blocked by synchronous fs.readFileSync() calls during agent session preparation. This manifests as WebSocket preauth handshake timeouts (same symptom as #74135, different root cause) and Telegram becoming completely unresponsive for minutes at a time.

The #74135 fix (fix(gateway): refresh model catalog off request path) addressed model-catalog blocking, but the session-transcript read path is a separate, still-unfixed blocker.

Environment

  • OS: Raspberry Pi, Linux 6.12.75+rpt-rpi-v8, arm64/aarch64
  • Node.js: v24.14.1
  • OpenClaw: 2026.4.29 (a448042)
  • Gateway bound on LAN, port 18789
  • memorySearch: enabled (memory-core plugin, dreaming enabled)

Root cause

In session-utils.fs-BgGqlqA-.js, readSessionMessages() uses synchronous fs.readFileSync(file).split(...) followed by JSON parsing of entire transcript files. It is called during agent prompt-token estimation / preflight compaction and from multiple gateway methods (chat.history, session preview, session event sequencing).

With large or accumulated session transcripts, this blocks the Node.js event loop for tens of seconds, preventing any WebSocket handshakes, Telegram API calls, or RPCs from being processed.

Evidence

Agent prep stage trace (from gateway logs)

[agent/embedded] [trace:embedded-run] prep stages:
  runId=63650e03 sessionId=24fb8e6e phase=stream-ready
  totalMs=125276
  stages=
    workspace-sandbox:224ms,
    skills:16ms,
    core-plugin-tools:40499ms,   ← 40 seconds
    bootstrap-context:4571ms,
    bundle-tools:6077ms,
    system-prompt:31392ms,       ← 31 seconds
    session-resource-loader:10390ms,
    agent-session:19ms,
    stream-setup:32088ms         ← 32 seconds

Total agent preparation: 125 seconds before any model call.

Second run (same session, different trigger)

[agent/embedded] [trace:embedded-run] startup stages:
  phase=attempt-dispatch totalMs=83348
  stages=
    model-resolution:42519ms,   ← 42 seconds
    auth:21119ms,
    attempt-dispatch:19703ms

Event loop diagnostics

[diagnostic] liveness warning:
  reasons=event_loop_delay,event_loop_utilization,cpu
  interval=93s
  eventLoopDelayP99Ms=699.4
  eventLoopDelayMaxMs=69860.3   ← 69.8 second max delay
  eventLoopUtilization=0.914
  cpuCoreRatio=0.932

[diagnostic] liveness warning:
  reasons=event_loop_delay,event_loop_utilization,cpu
  interval=32s
  eventLoopDelayP99Ms=18555.6
  eventLoopDelayMaxMs=18555.6   ← 18.5 second delay
  eventLoopUtilization=1
  cpuCoreRatio=1.05
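
Delay figures like these can be reproduced in isolation with Node's perf_hooks.monitorEventLoopDelay(). The sketch below is not OpenClaw's diagnostic code; measureMaxDelayMs and busyWait are illustrative names, and the busy loop merely stands in for a large synchronous transcript read.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Stand-in for a large synchronous read: spins for `ms` milliseconds.
function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // block the event loop on purpose
  }
}

// Runs `work` while sampling event-loop delay and returns the worst
// observed delay in milliseconds. Illustrative helper, not project code.
async function measureMaxDelayMs(work: () => void): Promise<number> {
  const histogram = monitorEventLoopDelay({ resolution: 10 }); // sample every 10 ms
  histogram.enable();
  await sleep(30); // let the sampler start ticking
  work();          // synchronous work stalls the sampler's timer
  await sleep(30); // let the delayed tick be recorded
  histogram.disable();
  return histogram.max / 1e6; // histogram values are in nanoseconds
}
```

Calling measureMaxDelayMs(() => busyWait(200)) reports a maximum on the order of the blocked interval, which is the same relationship the liveness warnings show between long synchronous work and eventLoopDelayMaxMs.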

Resulting symptoms

  • WebSocket handshake timeouts (same as #74135 symptom):
    [ws] handshake timeout conn=... handshakeMs >> 10000
    [telegram] connect error: gateway request timeout for connect
  • Telegram completely unresponsive during blocking windows:
    [telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
    [telegram] sendVoice failed: Network request for 'sendVoice' failed!
    [telegram] sendMessage failed: Network request for 'sendMessage' failed!
    [telegram] message processing failed
    typing TTL reached (2m); stopping typing indicator
  • openclaw logs --follow drops mid-stream (also observed in #74135)
  • openclaw tui disconnects with "gateway not reachable"

Session store size correlation

The blocking duration correlates directly with session store size. Before cleanup:

main/sessions/:        116 MB
large JSONL files:     14 files > 1MB
trajectory files:      21 files
deleted/reset files:   86 files (not GC'd by `sessions cleanup`)
sessions.json index:   722 KB (read synchronously)

After manual deletion of .deleted.* and .reset.* files: 116 MB → 36 MB.

Note: openclaw sessions cleanup --dry-run reported 0 files to remove despite 86 physical .deleted.*/.reset.* files on disk — the cleanup command only manages the index, not the physical files.

Suggested fix direction

  1. Convert readSessionMessages() to async I/O (fs.promises.readFile)
  2. Avoid full-file reads where partial/streaming reads suffice
  3. Cache token counts to avoid repeated full transcript reads during preflight compaction
  4. Have sessions cleanup also remove physical .deleted.*/.reset.* files (separate issue filed)
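
Item 3 could be approached with a small cache keyed on a cheap file fingerprint. The sketch below is one possible shape under stated assumptions: TokenCountCache, the size-plus-mtime fingerprint, and the compute callback are all hypothetical and do not come from the OpenClaw codebase.

```typescript
// Hypothetical fingerprint: a transcript is assumed unchanged when both
// its size and modification time match the cached values.
interface FileFingerprint {
  size: number;
  mtimeMs: number;
}

class TokenCountCache {
  private entries = new Map<string, { fingerprint: FileFingerprint; tokens: number }>();

  // Returns the cached count when the file looks unchanged; otherwise
  // runs `compute` (the expensive full read) and stores the result.
  async get(
    file: string,
    fingerprint: FileFingerprint,
    compute: () => Promise<number>,
  ): Promise<number> {
    const hit = this.entries.get(file);
    if (
      hit &&
      hit.fingerprint.size === fingerprint.size &&
      hit.fingerprint.mtimeMs === fingerprint.mtimeMs
    ) {
      return hit.tokens;
    }
    const tokens = await compute();
    this.entries.set(file, { fingerprint, tokens });
    return tokens;
  }
}
```

The caller would obtain the fingerprint from a single stat call, which is far cheaper than re-reading and re-parsing the transcript on every preflight check.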

Related

  • #74135 — same WS timeout symptom, different root cause (model catalog, now fixed)

extent analysis

TL;DR

Convert readSessionMessages() to use asynchronous I/O to prevent blocking the Node.js event loop.

Guidance

  • Identify and replace synchronous fs.readFileSync() calls with asynchronous fs.promises.readFile() to prevent event loop blocking.
  • Consider implementing partial or streaming reads where full file reads are not necessary to reduce the load on the event loop.
  • Implement caching for token counts to avoid repeated full transcript reads during preflight compaction.
  • Enhance the sessions cleanup command to remove physical .deleted.* and .reset.* files in addition to managing the index.
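
As an illustration of the partial-read point, the sketch below reads only the last part of a JSONL file, which may be enough for callers that need recent messages rather than the full history. readTailLines is a hypothetical helper, not an existing OpenClaw function, and whether any caller can work from a tail alone is an assumption.

```typescript
import * as fs from "node:fs";

// Hypothetical helper: returns the complete lines found in the last
// `maxBytes` of a file, without reading anything before that window.
async function readTailLines(file: string, maxBytes: number): Promise<string[]> {
  const handle = await fs.promises.open(file, "r");
  try {
    const { size } = await handle.stat();
    const start = Math.max(0, size - maxBytes);
    const buffer = Buffer.alloc(size - start);
    const { bytesRead } = await handle.read(buffer, 0, buffer.length, start);
    const lines = buffer.toString("utf8", 0, bytesRead).split("\n");
    // When the window starts mid-file, the first piece may be a partial
    // line, so it is dropped rather than parsed.
    if (start > 0) lines.shift();
    return lines.filter((line) => line.trim() !== "");
  } finally {
    await handle.close();
  }
}
```

The cost of a call is bounded by maxBytes rather than by the size of the transcript, so a multi-megabyte file no longer implies a multi-megabyte read.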

Example

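// Before: the read blocks the event loop until the whole file is loaded.
// ("utf8" is needed so the result is a string; splitting on "\n" assumes
// newline-delimited JSONL.)
const sessionData = fs.readFileSync(file, "utf8").split("\n");

// After: the event loop stays free while the file is being read.
const sessionText = await fs.promises.readFile(file, "utf8");
const sessionDataArray = sessionText.split("\n");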

Notes

The suggested direction converts synchronous I/O to asynchronous I/O and reduces how much of each file is read. Asynchronous reads only free the event loop while the file is being read; splitting and JSON-parsing a large transcript is still synchronous CPU work, so the partial-read and caching suggestions matter as much as the switch to fs.promises.readFile().

Recommendation

Apply the suggested fix direction, starting with converting readSessionMessages() to use asynchronous I/O, to address the event loop blocking issue and improve the application's responsiveness.
