openclaw - ✅(Solved) Fix pi-trajectory-flush: 50MB trajectory file blocks event loop for 25+ minutes after flush timeout [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#77124 (fetched 2026-05-05 05:51:51)
Comments: 1
Participants: 2
Timeline events: 7
Reactions: 2
Timeline (top): mentioned ×2, subscribed ×2, closed ×1, commented ×1

When a session accumulates a large trajectory file (50MB+), pi-trajectory-flush exceeds its 10s timeout. After timeout, the cleanup continues running in the background but the event loop remains 100% saturated for 25+ minutes, making the gateway completely unresponsive to new messages.

Fix / Workaround

  1. Short-term workaround (proposed in the issue, not confirmed as shipped): expose an OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS env var so users can raise the cleanup timeout for large sessions.

PR fix notes

PR #77154: fix: bound trajectory runtime flush

Description (problem / solution / changelog)

Summary

  • Replace #77133 with a bounded trajectory-runtime fix for #77124.
  • Bound runtime trajectory payload shaping before redaction/stringify, including tool definitions built before recordEvent.
  • Stop live capture once the runtime sidecar write budget is reached, reserve room for a trace.truncated marker, and keep queued file writes from growing beyond the same budget while yielding before sidecar appends.
  • Split live capture and export limits: capture stops at 10 MiB, while export keeps accepting existing runtime sidecars up to 50 MiB.
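The capture-side bound in the bullets above can be pictured roughly as follows. This is a minimal sketch, not the code from PR #77154: `BoundedCapture`, `tryAppend`, and the marker's JSON shape are hypothetical. Only the 10 MiB capture limit and the `trace.truncated` marker name come from the PR summary. The PR also bounds payloads before stringifying them, which this sketch does not attempt.

```typescript
// Hypothetical sketch of a byte-budgeted capture; not the actual OpenClaw implementation.
const CAPTURE_LIMIT_BYTES = 10 * 1024 * 1024; // 10 MiB live-capture limit from the PR summary

// Assumed marker shape; the PR summary only names the marker "trace.truncated".
const TRUNCATED_MARKER = JSON.stringify({ type: "trace.truncated" }) + "\n";

const encoder = new TextEncoder();
const byteLength = (text: string): number => encoder.encode(text).length;

class BoundedCapture {
  private usedBytes = 0;
  private stopped = false;
  readonly lines: string[] = [];

  constructor(private readonly limitBytes: number = CAPTURE_LIMIT_BYTES) {}

  /** Returns true if the event was recorded, false once capture has stopped. */
  tryAppend(event: unknown): boolean {
    if (this.stopped) return false;
    const line = JSON.stringify(event) + "\n";
    const markerBytes = byteLength(TRUNCATED_MARKER);
    // Reserve room for the marker so it still fits inside the budget.
    if (this.usedBytes + byteLength(line) + markerBytes > this.limitBytes) {
      this.lines.push(TRUNCATED_MARKER);
      this.usedBytes += markerBytes;
      this.stopped = true;
      return false;
    }
    this.lines.push(line);
    this.usedBytes += byteLength(line);
    return true;
  }

  get bytes(): number {
    return this.usedBytes;
  }
}
```

Once the budget is hit, later events are dropped cheaply instead of being queued, so the amount of work left for the end-of-turn flush stays bounded.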

Closes #77124.

Verification

  • pnpm test src/agents/queued-file-writer.test.ts src/trajectory/runtime.test.ts src/trajectory/export.test.ts
  • Testbox pnpm check:changed: tbx_01kqs0dbsp3yy5k9cxgq5vc8jw, exit 0

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/tools/trajectory.md (modified, +1/-1)
  • src/agents/queued-file-writer.test.ts (modified, +12/-0)
  • src/agents/queued-file-writer.ts (modified, +27/-3)
  • src/auto-reply/reply/followup-delivery.test.ts (modified, +5/-1)
  • src/trajectory/export.test.ts (modified, +2/-2)
  • src/trajectory/paths.ts (modified, +1/-0)
  • src/trajectory/runtime.test.ts (modified, +41/-3)
  • src/trajectory/runtime.ts (modified, +165/-25)



Environment

  • OpenClaw: 2026.5.2 (installed via npm)
  • OS: macOS 15.4, Apple Silicon (arm64)
  • Node: v22
  • Gateway: local, port 18789
  • Model: deepseek/deepseek-v4-flash

Root Cause Analysis

The chain of failures

  1. Session accumulates massive trajectory: A Feishu session processed a wiki reorganization task involving 6,743 files. Each tool output (file lists, directory structures) was recorded as a trajectory event, resulting in a 51MB trajectory file with 749 events.

  2. Flush exceeds timeout: At turn end, pi-trajectory-flush tries to drain the queued file writer. With 50MB+ of pending writes, it exceeds the hardcoded 10s timeout in runAgentCleanupStep.

  3. Timeout doesn't abort the flush: The Promise.race in runAgentCleanupStep only logs a warning — the underlying trajectoryRecorder.flush() promise continues running indefinitely.

  4. Event loop saturation: The safeJsonStringify serialization + async file write chain blocks the Node.js event loop at 100% utilization, with P99 delays reaching 34,728ms.

  5. Gateway unresponsive: New messages arrive as queued instead of immediate. The session lane remains occupied by cleanup maintenance. Total downtime: 25+ minutes until forced restart (SIGKILL required).
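Step 3 is the least intuitive link in the chain, so here is a small illustration of why a Promise.race timeout does not stop the losing promise. It is a sketch under assumptions: `slowFlush` stands in for trajectoryRecorder.flush(), and `raceWithTimeout` is a hypothetical stand-in for the pattern described, not the actual runAgentCleanupStep code.

```typescript
// Illustrative only: slowFlush is a stand-in for trajectoryRecorder.flush().
function slowFlush(durationMs: number, log: string[]): Promise<void> {
  return new Promise((resolve) => {
    setTimeout(() => {
      log.push("flush finished");
      resolve();
    }, durationMs);
  });
}

// Hypothetical helper mirroring the "race against a timer, then log" pattern.
async function raceWithTimeout(
  work: Promise<void>,
  timeoutMs: number,
  log: string[],
): Promise<"done" | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const outcome = await Promise.race([work.then(() => "done" as const), timeout]);
  clearTimeout(timer);
  // The race only decides who is reported first; it has no way to cancel `work`.
  if (outcome === "timeout") log.push("cleanup timed out");
  return outcome;
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  const flush = slowFlush(50, log);
  await raceWithTimeout(flush, 10, log);
  await flush; // the "timed out" flush still runs to completion
  return log; // ["cleanup timed out", "flush finished"]
}
```

The caller moves on after the warning, but the flush keeps consuming the event loop until it finishes on its own.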

Code paths involved

  • runAgentCleanupStep (attempt.tool-run-context-B2TarhD3.js:440): Hardcoded 10s timeout, no abort mechanism
  • QueuedFileWriter.flush() (runtime-qu4g1jFz.js): Drains entire promise chain, no backpressure
  • safeJsonStringify (safe-json-DCDclho7.js:80): Synchronous serialization of large event objects
  • createTrajectoryRuntimeRecorder (runtime-qu4g1jFz.js:143): maxFileBytes=52428800 (50MB cap exists but doesn't prevent large files)

Key diagnostic logs

agent cleanup timed out: runId=ffdf596f-... sessionId=425f129b-... step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1 active=0 waiting=0 queued=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1 active=0 waiting=0 queued=1  (25 min later, still stalled)
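For readers who want to reproduce comparable numbers, the following sketch shows one way to sample event loop delay and utilization with Node's `node:perf_hooks` module. It is an assumption that OpenClaw's liveness warning is computed along these lines; `blockFor` and `sampleLiveness` are hypothetical names used only for this illustration.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Simulates synchronous work (such as serializing a very large object).
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy wait: nothing else can run on the event loop during this time
  }
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function sampleLiveness(
  blockMs: number,
): Promise<{ delayP99Ms: number; utilization: number }> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  const before = performance.eventLoopUtilization();

  await sleep(20); // let the monitor record a few baseline samples
  blockFor(blockMs); // the stall being measured
  await sleep(20); // give the delayed sample time to be recorded

  histogram.disable();
  const delta = performance.eventLoopUtilization(before);
  return {
    delayP99Ms: histogram.percentile(99) / 1e6, // histogram values are in nanoseconds
    utilization: delta.utilization, // 0 = idle, 1 = fully busy
  };
}
```

A utilization of 1 with a P99 delay in the tens of seconds, as in the logs above, means the loop spent essentially the whole sampling window in long synchronous stretches.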

Session data scale

agent/main/sessions/: 220MB total
  8f42a5aa-*.trajectory.jsonl: 51MB (749 events)
  7695a95f-*.trajectory.jsonl: 24MB (572 events)
  425f129b-*.trajectory.jsonl: 17MB (480 events)

Related issues

  • #75839 — Same flush timeout, different perspective
  • #76340 — Event loop regression tracking
  • #77115 — Stuck session ghost with similar event loop symptoms
  • #76421 — Gateway timeout after event loop stall

Proposed solutions

  1. Make cleanup abortable: Pass an AbortSignal to runAgentCleanupStep so the flush can be stopped after timeout, rather than continuing in the background.

  2. Streaming/batched writes for trajectory: Replace per-event appendFile with a WriteStream that buffers writes and yields the event loop between batches.

  3. Dynamic timeout: Scale cleanup timeout based on pending queue size (e.g., 10s base + 1s per 100 queued events).

  4. Trajectory rotation: Start a new trajectory file when the current one exceeds N MB (e.g., 10MB), preventing any single file from growing too large.

  5. Short-term workaround: Expose OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS env var to allow users to increase the timeout for large sessions.
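Proposals 1 and 3 can be sketched together. This is a hedged illustration, not OpenClaw code: `cleanupTimeoutMs`, `runAbortableCleanupStep`, and `drainQueue` are hypothetical names, and the 10s base plus 1s per 100 queued events comes from the issue's example. Note that aborting is cooperative: it only helps if the step checks the signal between units of work.

```typescript
// Proposal 3 (sketch): 10s base plus 1s per full 100 queued events.
function cleanupTimeoutMs(queuedEvents: number, baseMs = 10_000, perHundredMs = 1_000): number {
  return baseMs + Math.floor(queuedEvents / 100) * perHundredMs;
}

// Proposal 1 (sketch): the step receives a signal and must check it itself.
type CleanupStep = (signal: AbortSignal) => Promise<void>;

async function runAbortableCleanupStep(
  step: CleanupStep,
  timeoutMs: number,
): Promise<"done" | "aborted"> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await step(controller.signal);
    return controller.signal.aborted ? "aborted" : "done";
  } finally {
    clearTimeout(timer);
  }
}

// Example step: drains a queue one item at a time and stops once aborted.
async function drainQueue(queue: string[], written: string[], signal: AbortSignal): Promise<void> {
  while (queue.length > 0 && !signal.aborted) {
    written.push(queue.shift() as string);
    // Stand-in for an asynchronous file append.
    await new Promise((resolve) => setTimeout(resolve, 5));
  }
}
```

With the issue's figures, `cleanupTimeoutMs(749)` gives 17000 ms for the 749-event session. Items left in the queue after an abort still have to be dropped or written later, which is a policy decision this sketch leaves open.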

extent analysis

TL;DR

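PR #77154 closed this issue by stopping live trajectory capture at 10 MiB and keeping queued file writes within the same budget. The issue itself proposed making the cleanup abortable and raising or scaling the timeout. Because the timeout only logs a warning and the flush keeps running, a longer timeout by itself would delay the warning without reducing the event loop saturation.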

Guidance

  • Consider implementing an abort mechanism for the runAgentCleanupStep function to stop the flush after a timeout, rather than letting it continue in the background.
  • Explore using a WriteStream with buffering to replace per-event appendFile calls, allowing the event loop to yield between batches.
  • Evaluate the proposed solutions, such as making cleanup abortable, using streaming/batched writes, or implementing dynamic timeouts, to determine the best approach for your specific use case.
  • As a short-term workaround, consider exposing the OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS environment variable to allow users to increase the timeout for large sessions.
  • Review related issues (#75839, #76340, #77115, #76421) to ensure that the chosen solution addresses the root cause of the problem.
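The WriteStream suggestion could look roughly like this. It is a minimal sketch under assumptions: `writeLinesBatched` and the batch size are hypothetical, and the merged PR took a different route (bounding capture) rather than this one. The Node APIs used (`createWriteStream`, `events.once`, `stream/promises.finished`, `timers/promises.setImmediate`) are standard.

```typescript
import { createWriteStream, mkdtempSync, readFileSync } from "node:fs";
import { once } from "node:events";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { finished } from "node:stream/promises";
import { setImmediate as yieldToEventLoop } from "node:timers/promises";

// Hypothetical helper: appends lines in batches and yields between batches.
async function writeLinesBatched(path: string, lines: string[], batchSize = 64): Promise<number> {
  const stream = createWriteStream(path, { flags: "a" });
  let batches = 0;
  try {
    for (let i = 0; i < lines.length; i += batchSize) {
      const chunk = lines.slice(i, i + batchSize).join("");
      // write() returns false when the internal buffer is full; wait for "drain".
      if (!stream.write(chunk)) await once(stream, "drain");
      batches++;
      // Let pending I/O callbacks and timers run before the next batch.
      await yieldToEventLoop();
    }
  } finally {
    stream.end();
    await finished(stream); // resolves when flushed, rejects if the stream errored
  }
  return batches;
}

// Usage example: 1000 small JSONL events written in batches of 100.
async function demo(): Promise<{ batches: number; bytes: number }> {
  const dir = mkdtempSync(join(tmpdir(), "trajectory-sketch-"));
  const file = join(dir, "events.jsonl");
  const lines = Array.from({ length: 1000 }, (_, i) => JSON.stringify({ seq: i }) + "\n");
  const batches = await writeLinesBatched(file, lines, 100);
  return { batches, bytes: readFileSync(file).length };
}
```

Batching reduces the number of write calls, and the explicit yield keeps new gateway messages from waiting behind the whole drain. It does not make serializing a single very large event any cheaper.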


Notes

The chosen solution should be carefully evaluated to ensure it does not introduce new issues or performance regressions. It is essential to consider the trade-offs between increasing the timeout, making the cleanup abortable, and implementing streaming/batched writes.

Recommendation

Prefer a build that includes PR #77154, if one is available to you. The OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS environment variable is only a proposal from the issue; nothing on this page confirms that it exists in any release, and a longer timeout would not shorten the flush itself.
