openclaw - ✅(Solved) Fix pi-trajectory-flush: 50MB trajectory file blocks event loop for 25+ minutes after flush timeout [1 pull requests, 1 comments, 2 participants]

loyur · 2026-05-04T05:28:13Z

GitHub stats

openclaw/openclaw#77124 • Fetched 2026-05-05 05:51:51

Author: loyur
Participants: clawsweeper[bot], loyur
Timeline (top): mentioned ×2, subscribed ×2, closed ×1, commented ×1

When a session accumulates a large trajectory file (50MB+), pi-trajectory-flush exceeds its 10s timeout. After timeout, the cleanup continues running in the background but the event loop remains 100% saturated for 25+ minutes, making the gateway completely unresponsive to new messages.

Fix Action

Fix / Workaround

Short-term workaround (proposed in the issue): expose an OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS env var so users can raise the cleanup timeout for large sessions. The merged fix is PR #77154, described below.

PR fix notes

PR #77154: fix: bound trajectory runtime flush

Repository: openclaw/openclaw
Author: steipete
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/77154

Description (problem / solution / changelog)

Summary

Replace #77133 with a bounded trajectory-runtime fix for #77124.
Bound runtime trajectory payload shaping before redaction/stringify, including tool definitions built before recordEvent.
Stop live capture once the runtime sidecar write budget is reached, reserve room for a trace.truncated marker, and keep queued file writes from growing beyond the same budget while yielding before sidecar appends.
Split live capture and export limits: capture stops at 10 MiB, while export keeps accepting existing runtime sidecars up to 50 MiB.
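A rough sketch of the capture budget described in this summary, under stated assumptions. `BudgetedRecorder`, `TraceEvent`, and the in-memory `lines` array are hypothetical stand-ins, not the code in `src/trajectory/runtime.ts`. Only the 10 MiB capture limit and the `trace.truncated` marker name come from the PR description.

```typescript
// Hypothetical sketch: names and structure are illustrative, not openclaw's code.
const CAPTURE_BUDGET_BYTES = 10 * 1024 * 1024; // 10 MiB, per the PR summary

interface TraceEvent {
  type: string;
  data?: unknown;
}

class BudgetedRecorder {
  readonly lines: string[] = []; // stands in for the runtime sidecar file
  private readonly budgetBytes: number;
  private readonly markerLine: string;
  private readonly markerBytes: number;
  private readonly encoder = new TextEncoder();
  private bytesWritten = 0;
  private truncated = false;

  constructor(budgetBytes: number = CAPTURE_BUDGET_BYTES) {
    this.budgetBytes = budgetBytes;
    this.markerLine = JSON.stringify({ type: "trace.truncated" }) + "\n";
    this.markerBytes = this.encoder.encode(this.markerLine).length;
  }

  /** Returns true if the event was recorded, false once capture has stopped. */
  record(event: TraceEvent): boolean {
    if (this.truncated) return false;
    const line = JSON.stringify(event) + "\n";
    const size = this.encoder.encode(line).length;
    // Reserve room for the marker so it always fits inside the budget.
    if (this.bytesWritten + size + this.markerBytes > this.budgetBytes) {
      this.lines.push(this.markerLine);
      this.bytesWritten += this.markerBytes;
      this.truncated = true;
      return false;
    }
    this.lines.push(line);
    this.bytesWritten += size;
    return true;
  }

  get totalBytes(): number {
    return this.bytesWritten;
  }

  get isTruncated(): boolean {
    return this.truncated;
  }
}
```

Checking the budget before serializing further events means a long session stops growing the sidecar instead of queuing more writes for the flush to drain.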

Closes #77124.

Verification

pnpm test src/agents/queued-file-writer.test.ts src/trajectory/runtime.test.ts src/trajectory/export.test.ts
Testbox pnpm check:changed: tbx_01kqs0dbsp3yy5k9cxgq5vc8jw, exit 0

Changed files

CHANGELOG.md (modified, +1/-0)
docs/tools/trajectory.md (modified, +1/-1)
src/agents/queued-file-writer.test.ts (modified, +12/-0)
src/agents/queued-file-writer.ts (modified, +27/-3)
src/auto-reply/reply/followup-delivery.test.ts (modified, +5/-1)
src/trajectory/export.test.ts (modified, +2/-2)
src/trajectory/paths.ts (modified, +1/-0)
src/trajectory/runtime.test.ts (modified, +41/-3)
src/trajectory/runtime.ts (modified, +165/-25)

Environment

OpenClaw: 2026.5.2 (installed via npm)
OS: macOS 15.4, Apple Silicon (arm64)
Node: v22
Gateway: local, port 18789
Model: deepseek/deepseek-v4-flash

Root Cause Analysis

The chain of failures

1. Session accumulates massive trajectory: A Feishu session processed a wiki reorganization task involving 6,743 files. Each tool output (file lists, directory structures) was recorded as a trajectory event, resulting in a 51MB trajectory file with 749 events.
2. Flush exceeds timeout: At turn end, `pi-trajectory-flush` tries to drain the queued file writer. With 50MB+ of pending writes, it exceeds the hardcoded 10s timeout in `runAgentCleanupStep`.
3. Timeout doesn't abort the flush: The `Promise.race` in `runAgentCleanupStep` only logs a warning; the underlying `trajectoryRecorder.flush()` promise continues running indefinitely.
4. Event loop saturation: The `safeJsonStringify` serialization + async file write chain blocks the Node.js event loop at 100% utilization, with P99 delays reaching 34,728ms.
5. Gateway unresponsive: New messages arrive as `queued` instead of `immediate`. The session lane remains occupied by cleanup maintenance. Total downtime: 25+ minutes until forced restart (SIGKILL required).
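The "timeout doesn't abort the flush" step can be reproduced in isolation. The sketch below is not the `runAgentCleanupStep` source; `runStepWithTimeout`, `sleep`, and `demo` are hypothetical names, and the 10 ms / 50 ms delays are scaled-down stand-ins for the 10s timeout and the long flush. It shows that when the timer wins a `Promise.race`, the losing promise keeps running.

```typescript
type StepResult = "completed" | "timed-out";

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical stand-in for a cleanup step guarded only by a timeout race.
async function runStepWithTimeout(
  step: () => Promise<void>,
  timeoutMs: number,
): Promise<StepResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<StepResult>((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  try {
    // The race only decides which result is observed first.
    // Nothing here cancels `step()` if the timer wins.
    return await Promise.race([step().then((): StepResult => "completed"), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

async function demo(): Promise<{
  result: StepResult;
  finishedAtTimeout: boolean;
  finishedLater: boolean;
}> {
  let flushFinished = false;
  const slowFlush = async (): Promise<void> => {
    await sleep(50); // pretend to drain a large queue
    flushFinished = true;
  };
  const result = await runStepWithTimeout(slowFlush, 10);
  const finishedAtTimeout = flushFinished; // still false: flush is mid-flight
  await sleep(80);
  return { result, finishedAtTimeout, finishedLater: flushFinished };
}
```

In `demo`, the caller sees `"timed-out"` while the flush is still in flight, and the flush later completes on its own. In the reported incident that leftover work was heavy enough to keep the event loop saturated.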

Code paths involved

runAgentCleanupStep (attempt.tool-run-context-B2TarhD3.js:440): Hardcoded 10s timeout, no abort mechanism
QueuedFileWriter.flush() (runtime-qu4g1jFz.js): Drains entire promise chain, no backpressure
safeJsonStringify (safe-json-DCDclho7.js:80): Synchronous serialization of large event objects
createTrajectoryRuntimeRecorder (runtime-qu4g1jFz.js:143): maxFileBytes=52428800 (50MB cap exists but doesn't prevent large files)
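Because serialization of a large event object is synchronous, one way to limit its cost is to shrink the payload before it reaches the serializer. The following is a minimal sketch of that idea, not the openclaw implementation: `boundPayload` and all three limits are hypothetical values chosen for illustration.

```typescript
// Hypothetical limits; tune for the real workload.
const MAX_STRING_CHARS = 4096;
const MAX_ARRAY_ITEMS = 100;
const MAX_DEPTH = 6;

/** Returns a copy of `value` with long strings, long arrays, and deep nesting cut down. */
function boundPayload(value: unknown, depth: number = 0): unknown {
  if (typeof value === "string") {
    if (value.length <= MAX_STRING_CHARS) return value;
    const dropped = value.length - MAX_STRING_CHARS;
    return value.slice(0, MAX_STRING_CHARS) + ` [truncated ${dropped} chars]`;
  }
  if (value === null || typeof value !== "object") return value;
  // The depth cap also stops runaway recursion on cyclic structures.
  if (depth >= MAX_DEPTH) return "[depth limit]";
  if (Array.isArray(value)) {
    const head: unknown[] = value
      .slice(0, MAX_ARRAY_ITEMS)
      .map((item) => boundPayload(item, depth + 1));
    if (value.length > MAX_ARRAY_ITEMS) {
      head.push(`[${value.length - MAX_ARRAY_ITEMS} more items]`);
    }
    return head;
  }
  const out: Record<string, unknown> = {};
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    out[key] = boundPayload(child, depth + 1);
  }
  return out;
}
```

Applied to a tool output listing 6,743 file paths, this keeps the first 100 entries plus a one-line note about the rest, so the serialized event stays small regardless of how large the directory was.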

Key diagnostic logs

agent cleanup timed out: runId=ffdf596f-... sessionId=425f129b-... step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1 active=0 waiting=0 queued=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1 active=0 waiting=0 queued=1  (25 min later, still stalled)

Session data scale

agent/main/sessions/: 220MB total
  8f42a5aa-*.trajectory.jsonl: 51MB (749 events)
  7695a95f-*.trajectory.jsonl: 24MB (572 events)
  425f129b-*.trajectory.jsonl: 17MB (480 events)

Related issues

#75839 — Same flush timeout, different perspective
#76340 — Event loop regression tracking
#77115 — Stuck session ghost with similar event loop symptoms
#76421 — Gateway timeout after event loop stall

Proposed solutions

1. Make cleanup abortable: Pass an `AbortSignal` to `runAgentCleanupStep` so the flush can be stopped after timeout, rather than continuing in the background.
2. Streaming/batched writes for trajectory: Replace per-event `appendFile` with a `WriteStream` that buffers writes and yields the event loop between batches.
3. Dynamic timeout: Scale cleanup timeout based on pending queue size (e.g., 10s base + 1s per 100 queued events).
4. Trajectory rotation: Start a new trajectory file when the current one exceeds N MB (e.g., 10MB), preventing any single file from growing too large.
5. Short-term workaround: Expose `OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS` env var to allow users to increase the timeout for large sessions.
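The abortable-cleanup and dynamic-timeout proposals could be combined roughly as follows. This is a sketch under assumptions, not the merged fix: `cleanupTimeoutMs`, `drainQueue`, and `runAbortableCleanup` are hypothetical names, the queue is modeled as an array of write thunks, and rounding up to the next 100 events is one reading of "1s per 100 queued events".

```typescript
type Write = () => Promise<void>;

// Dynamic timeout: 10s base + 1s per (started) block of 100 queued events.
function cleanupTimeoutMs(
  queuedEvents: number,
  baseMs: number = 10_000,
  perHundredMs: number = 1_000,
): number {
  return baseMs + Math.ceil(queuedEvents / 100) * perHundredMs;
}

// Drains the queue but checks the signal between writes, so an abort
// stops the loop after the write currently in flight.
async function drainQueue(queue: Write[], signal: AbortSignal): Promise<number> {
  let written = 0;
  while (!signal.aborted) {
    const write = queue.shift();
    if (write === undefined) break;
    await write();
    written++;
  }
  return written;
}

async function runAbortableCleanup(
  queue: Write[],
  timeoutMs: number,
): Promise<{ written: number; remaining: number }> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const written = await drainQueue(queue, controller.signal);
    return { written, remaining: queue.length };
  } finally {
    clearTimeout(timer);
  }
}
```

For the 749-event session from the report, `cleanupTimeoutMs(749)` gives 18,000 ms under this rounding. Unlike a bare timeout race, the abort actually ends the drain loop, and `remaining` tells the caller how many writes were left unflushed.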

extent analysis

TL;DR

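Bound how much trajectory data is captured and written, or make the cleanup abortable, so that a large trajectory file cannot saturate the event loop. Raising the timeout alone only delays the warning, because the flush keeps running in the background after the timeout fires.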

Guidance

Consider implementing an abort mechanism for the runAgentCleanupStep function to stop the flush after a timeout, rather than letting it continue in the background.
Explore using a WriteStream with buffering to replace per-event appendFile calls, allowing the event loop to yield between batches.
Evaluate the proposed solutions, such as making cleanup abortable, using streaming/batched writes, or implementing dynamic timeouts, to determine the best approach for your specific use case.
As a short-term workaround, consider exposing the OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS environment variable to allow users to increase the timeout for large sessions.
Review related issues (#75839, #76340, #77115, #76421) to ensure that the chosen solution addresses the root cause of the problem.
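The batching idea in the guidance above could look roughly like this. It is a sketch under assumptions: `writeInBatches` and `Sink` are hypothetical, the sink stands in for a buffered file write, and `setTimeout(resolve, 0)` is used as a portable way to yield (in Node.js, `setImmediate` is a common alternative).

```typescript
// Hypothetical sink: in a real writer this would append one chunk to the file.
type Sink = (chunk: string) => Promise<void>;

function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

/** Writes `lines` in chunks of `batchSize`, yielding before each chunk. */
async function writeInBatches(
  lines: string[],
  sink: Sink,
  batchSize: number = 64,
): Promise<number> {
  let batches = 0;
  for (let start = 0; start < lines.length; start += batchSize) {
    // Let pending timers and I/O callbacks run before the next chunk.
    await yieldToEventLoop();
    const chunk = lines.slice(start, start + batchSize).join("");
    await sink(chunk);
    batches++;
  }
  return batches;
}
```

One write per batch replaces one write per event, and the yield before each batch gives incoming messages a chance to be handled while a long drain is in progress.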

Notes

The chosen solution should be carefully evaluated to ensure it does not introduce new issues or performance regressions. It is essential to consider the trade-offs between increasing the timeout, making the cleanup abortable, and implementing streaming/batched writes.

Recommendation

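The issue is closed by PR #77154, which stops live trajectory capture at 10 MiB, so moving to a version that includes that change is the durable fix. The OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS environment variable is only a proposal from the issue; confirm that your version supports it before relying on it as a stopgap.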
