openclaw - ✅(Solved) Fix Stuck session ghost blocks event loop, causes CLI timeout and sustained high CPU (11.5hr+) [1 pull requests, 2 comments, 3 participants]

openclaw | 2026-05-04 04:52:43

openclaw/openclaw#77115•Fetched 2026-05-05 05:51:59

A zombie/stuck session (sessionId=41befe2c-) persists in the Gateway's in-memory state machine for 11.5+ hours while absent from the session store. The diagnostic system repeatedly detects it as stuck, causing sustained 48% CPU, event loop P99 delays up to 12.8s, and CLI commands receiving SIGKILL.

Root Cause

The stuck session likely originates from repeated openclaw cron CLI invocations that timed out during cron list/rm operations. Each CLI invocation creates a Gateway operator session. When multiple CLI calls time out simultaneously, session lifecycle cleanup may fail to remove the session from the Gateway's in-memory state tracking, even though the session store itself is cleaned correctly.

Key observations:

- Session exists only in memory (diagnostic system) but not on disk (store)
- The diagnostic stuck session checker runs every 30 seconds and detects it, likely doing a tight-loop scan each time
- The Gateway restart (clearing in-memory state) resolved the issue immediately

Fix Action

Fix

Gateway restart cleared all in-memory state and restored normal operation (CPU dropped to normal, CLI responsive).

Suggested code fix:

- Add a max-age threshold for stuck session tracking (e.g., if a session has been "stuck" in processing state for > 10 minutes and doesn't exist in the session store, evict it from the state tracker)
- Add exponential backoff to the stuck-session checker to avoid tight-looping on the same stuck reference
- Ensure session cleanup on CLI timeout/SIGKILL scenarios covers both store AND in-memory state
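
The third suggestion (cleanup that covers both the store and in-memory state) might look roughly like the sketch below. Every name here (`SessionStore`, `InMemoryTracker`, `runOperatorSession`) is hypothetical and not taken from the OpenClaw codebase. Because a process that receives SIGKILL cannot run any handler, the sketch assumes the cleanup lives on the Gateway side, around the work done on behalf of the CLI call.

```typescript
// Hypothetical shapes; the real Gateway types are not shown in the issue.
interface SessionStore {
  delete(sessionId: string): void;
}

class InMemoryTracker {
  private readonly states = new Map<string, "processing">();
  begin(sessionId: string): void {
    this.states.set(sessionId, "processing");
  }
  end(sessionId: string): void {
    this.states.delete(sessionId);
  }
  has(sessionId: string): boolean {
    return this.states.has(sessionId);
  }
}

// Runs `work` for an operator session and always releases both the persisted
// entry and the in-memory entry, whether `work` resolves or rejects.
async function runOperatorSession<T>(
  sessionId: string,
  store: SessionStore,
  tracker: InMemoryTracker,
  work: () => Promise<T>,
): Promise<T> {
  tracker.begin(sessionId);
  try {
    return await work();
  } finally {
    // Isolate the store cleanup so a failure there cannot skip the tracker cleanup.
    try {
      store.delete(sessionId);
    } catch {
      /* real code would log this */
    }
    tracker.end(sessionId);
  }
}
```

The point of the `finally` block is that neither cleanup depends on the other having succeeded, which is the failure mode the issue describes (store cleaned, tracker not).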

PR fix notes

PR #77133: fix: prevent event loop saturation from trajectory flush (setImmediate yield + 10MB cap + 30s timeout)

Repository: openclaw/openclaw
Author: loyur
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/77133

Description (problem / solution / changelog)

Problem

When a session accumulates a large trajectory (50MB+, 700+ events), pi-trajectory-flush blocks the event loop for 25+ minutes after the 10s timeout fires. The timeout warns but doesn't stop the flush, and the event loop stays at 100% utilization with P99 delays of 34 seconds — making the gateway completely unresponsive.

agent cleanup timed out: step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1  ← 25 min later

Root Cause

QueuedFileWriter chains writes into an ever-growing promise chain without ever yielding the event loop. With 700+ individual appendFile calls, the chain consumes 100% of the event loop. After the cleanup timeout fires (10s), the chain continues running — the Promise.race in runAgentCleanupStep only logs a warning, it doesn't abort the cleanup.
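
To illustrate the difference between a timeout that only warns and one that actually stops the work, here is a minimal sketch using the standard `AbortController`. The helper names `runStepWithAbort` and `flushLines` are hypothetical; the PR itself does not add cancellation, so this shows one possible approach rather than what the PR implements.

```typescript
// Hypothetical helper: races `step` against a timer and, unlike a bare
// Promise.race, tells the step to stop via AbortSignal when the timer wins.
async function runStepWithAbort(
  step: (signal: AbortSignal) => Promise<void>,
  timeoutMs: number,
): Promise<"done" | "aborted"> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"aborted">((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // signal the step to stop doing work
      resolve("aborted");
    }, timeoutMs);
  });
  try {
    return await Promise.race([
      step(controller.signal).then(() => "done" as const),
      timeout,
    ]);
  } finally {
    clearTimeout(timer);
  }
}

// A cooperative flush: checks the signal between writes and stops early.
async function flushLines(
  lines: string[],
  write: (line: string) => Promise<void>,
  signal: AbortSignal,
): Promise<number> {
  let written = 0;
  for (const line of lines) {
    if (signal.aborted) break;
    await write(line);
    written++;
  }
  return written;
}
```

Cancellation here is cooperative: the flush only stops because it checks `signal.aborted` between writes. A step that never checks the signal would keep running after the timeout, as described above.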

Changes

1. `queued-file-writer.ts` — Yield event loop between writes

Add setImmediate between each queued write so the event loop gets control back:

queue = queue
  .then(() => ready)
  .then(() => new Promise<void>((resolve) => setImmediate(resolve)))  // ← yield
  .then(() => safeAppendFile(filePath, line, options))
  .catch(() => undefined);

This prevents the promise chain from monopolizing the event loop. Each write still happens sequentially, but other work (message dispatch, WebSocket events) can be processed between writes.
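
The effect of the yield can be seen in a small self-contained sketch. It is not the OpenClaw `QueuedFileWriter`: `makeQueue` and `demo` are hypothetical, and the "write" is an in-memory push, so it only demonstrates scheduling order and says nothing about real file I/O timing.

```typescript
// Hypothetical in-memory stand-in for a queued writer.
function makeQueue(yieldBetweenWrites: boolean, sink: string[]) {
  let queue: Promise<void> = Promise.resolve();
  return (line: string): Promise<void> => {
    queue = queue
      .then(() =>
        yieldBetweenWrites
          ? new Promise<void>((resolve) => setImmediate(resolve)) // yield
          : undefined,
      )
      .then(() => {
        sink.push(line); // stands in for the real append
      })
      .catch(() => {});
    return queue;
  };
}

// Queues three writes, then schedules unrelated event-loop work.
async function demo(yieldBetweenWrites: boolean): Promise<string[]> {
  const log: string[] = [];
  const enqueue = makeQueue(yieldBetweenWrites, log);
  let last: Promise<void> = Promise.resolve();
  for (let i = 0; i < 3; i++) last = enqueue(`write-${i}`);
  setImmediate(() => log.push("other-work"));
  await last;
  await new Promise<void>((resolve) => setImmediate(resolve)); // let pending work run
  return log;
}
```

With `demo(true)` the `other-work` entry is logged before the first write, because the chain hands control back to the event loop. With `demo(false)` the whole chain runs as microtasks and `other-work` is logged only after the last write.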

2. `paths.ts` — Reduce trajectory file cap

TRAJECTORY_RUNTIME_FILE_MAX_BYTES: 50MB → 10MB

A single session shouldn't produce 50MB of trajectory. 10MB (~140 events at 70KB avg) is sufficient for debugging while keeping flush time manageable.
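
As a rough check of the numbers above, and to show how such a cap could be tested before an append, consider the sketch below. The constant name comes from the PR description; treating "10MB" as binary megabytes and the helper names `wouldExceedCap` and `eventsThatFit` are assumptions.

```typescript
// Value from the PR description; binary megabytes are an assumption.
const TRAJECTORY_RUNTIME_FILE_MAX_BYTES = 10 * 1024 * 1024;

// Returns true when appending `line` would push the file past the cap.
function wouldExceedCap(
  currentBytes: number,
  line: string,
  maxBytes: number = TRAJECTORY_RUNTIME_FILE_MAX_BYTES,
): boolean {
  return currentBytes + Buffer.byteLength(line, "utf8") > maxBytes;
}

// Rough event budget for a given average event size.
function eventsThatFit(
  avgEventBytes: number,
  maxBytes: number = TRAJECTORY_RUNTIME_FILE_MAX_BYTES,
): number {
  return Math.floor(maxBytes / avgEventBytes);
}
```

Under these assumptions `eventsThatFit(70 * 1024)` gives 146, in line with the "~140 events" estimate.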

3. `run-cleanup-timeout.ts` — Increase cleanup timeout

AGENT_CLEANUP_STEP_TIMEOUT_MS: 10s → 30s

With the event loop yielding (change #1), flushes complete faster. But 10s is still tight for large sessions. 30s provides adequate margin.

Verification

- Tested locally on macOS with OpenClaw 2026.5.2
- Applied equivalent patches to compiled bundle
- Gateway restarted cleanly, Feishu WebSocket reconnected
- No event loop saturation observed after fix
- Existing unit tests pass without modification

Fixes #75839
Related: #76340, #77115, #76421

Changed files

- src/agents/queued-file-writer.ts (modified, +1/-0)
- src/agents/run-cleanup-timeout.ts (modified, +1/-1)
- src/trajectory/paths.ts (modified, +1/-1)

Code Example

[diagnostic] stuck session: sessionId=41befe2c-8c01-4f75-a662-2d47eb34d7bc
  sessionKey=agent:main:main state=processing age=41514s queueDepth=0

---

[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
  cpuCoreRatio=1.289 active=1 waiting=0 queued=5

---

[tools] exec failed: TIMEOUT: node invoke timed out
  raw_params={"command":"sleep 300 && cat /tmp/...progress.json","host":"node","timeout":310}

Summary

Environment

- OpenClaw version: 2026.4.27 (cbc2ba0)
- Gateway: macOS (Darwin 21.6.0, x64), node v22.22.0, port 18789
- Model: deepseek/deepseek-v4-pro (main), minimax-portal/MiniMax-M2.7 (subagents)
- Setup: MCP servers (playwright, qwen, kimi, glm, doubao x5 all stdio), Ubuntu node (192.168.1.18)

Symptoms

1. Stuck Session Ghost

[diagnostic] stuck session: sessionId=41befe2c-8c01-4f75-a662-2d47eb34d7bc
  sessionKey=agent:main:main state=processing age=41514s queueDepth=0

- First detected: ~10:00 AM 2026-05-04
- Age at detection: 11,514 seconds (~3.2 hours into an 11.5+ hour run)
- state=processing, queueDepth oscillates 0→1→0
- NOT in session store: ~/.openclaw/agents/main/sessions/ has no files matching *41befe2c*
- NOT in sessions_list: sessions_list({search:"41befe2c"}) returns count=0
- NOT in sessions.json: no entry in store metadata

2. Event Loop Degradation

[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
  cpuCoreRatio=1.289 active=1 waiting=0 queued=5

- P99 event loop delay spiked to 12,876ms (12.8 seconds)
- Event loop utilization hitting 0.998 (99.8%)
- CPU core ratio at 1.289 with 5 queued requests
- Gateway process (PID 23979): 48.7% sustained CPU, 1.5GB RSS
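
For readers who want to collect comparable numbers, Node's built-in `perf_hooks` module exposes both an event loop delay histogram and event loop utilization. The sketch below is one way to sample them; how OpenClaw's diagnostic system actually computes `eventLoopDelayP99Ms` and `eventLoopUtilization` is not shown in the issue, so `startLivenessSampler` and its field names are illustrative only.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

interface LivenessSample {
  eventLoopDelayP99Ms: number;
  eventLoopUtilization: number;
}

// Starts sampling; `sample()` reports metrics since the previous call,
// `stop()` disables the underlying histogram.
function startLivenessSampler(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let lastElu = performance.eventLoopUtilization();
  return {
    sample(): LivenessSample {
      const delta = performance.eventLoopUtilization(lastElu); // since last sample
      lastElu = performance.eventLoopUtilization();
      const p99Ns = histogram.percentile(99); // histogram values are nanoseconds
      histogram.reset();
      return {
        eventLoopDelayP99Ms: p99Ns / 1e6,
        eventLoopUtilization: delta.utilization,
      };
    },
    stop(): void {
      histogram.disable();
    },
  };
}
```

A caller would typically invoke `sample()` on an interval and log a warning when either value crosses a threshold of its choosing.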

3. CLI Universal Timeout

- openclaw cron list → SIGKILL
- openclaw cron rm → SIGKILL
- openclaw gateway status → SIGKILL
- openclaw sessions list → hanging
- All CLI commands became non-functional

4. Exec Timeout Cascades

[tools] exec failed: TIMEOUT: node invoke timed out
  raw_params={"command":"sleep 300 && cat /tmp/...progress.json","host":"node","timeout":310}

Subagents calling exec({host:"node"}) with sleep-based polling hit repeated timeouts, further blocking the event loop.

Root Cause Analysis

Key observations:

- Session exists only in memory (diagnostic system) but not on disk (store)
- The diagnostic stuck session checker runs every 30 seconds and detects it, likely doing a tight-loop scan each time
- The Gateway restart (clearing in-memory state) resolved the issue immediately

Reproduction Steps

1. Run multiple openclaw cron list/rm CLI commands rapidly during high Gateway load
2. Wait for CLI timeout (SIGKILL)
3. Observe diagnostic stuck session entries persisting in gateway.err.log
4. Observe declining event loop health and rising CPU

Expected Behavior

- Stale session references should be cleaned from the in-memory state tracker within a reasonable timeout (e.g., 5 minutes)
- Session lifecycle cleanup should handle SIGKILL/timeout scenarios gracefully
- The diagnostic stuck session detector should not tight-loop on the same ghost session indefinitely

Actual Behavior

- Ghost session persisted 11.5+ hours until Gateway restart
- Sustained 48% CPU for 11+ hours
- CLI became non-functional
- No automatic recovery

Fix

Gateway restart cleared all in-memory state and restored normal operation (CPU dropped to normal, CLI responsive).

Suggested code fix:

- Add a max-age threshold for stuck session tracking (e.g., if a session has been "stuck" in processing state for > 10 minutes and doesn't exist in the session store, evict it from the state tracker)
- Add exponential backoff to the stuck-session checker to avoid tight-looping on the same stuck reference
- Ensure session cleanup on CLI timeout/SIGKILL scenarios covers both store AND in-memory state
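
The exponential backoff suggestion could be sketched as below. The 30-second base mirrors the checker interval mentioned in the issue; the 30-minute ceiling and all names (`backoffDelayMs`, `StuckReportBackoff`) are assumptions for illustration, not part of OpenClaw.

```typescript
const BASE_INTERVAL_MS = 30_000; // checker interval reported in the issue
const MAX_INTERVAL_MS = 30 * 60_000; // assumed ceiling

// Delay before the next report after `priorReports` earlier reports.
function backoffDelayMs(priorReports: number): number {
  return Math.min(BASE_INTERVAL_MS * 2 ** priorReports, MAX_INTERVAL_MS);
}

interface BackoffState {
  reports: number;
  nextReportAtMs: number;
}

class StuckReportBackoff {
  private readonly states = new Map<string, BackoffState>();

  // Returns true when the session should be inspected and reported on this tick.
  shouldReport(sessionId: string, nowMs: number): boolean {
    const state = this.states.get(sessionId);
    if (state !== undefined && nowMs < state.nextReportAtMs) return false;
    const reports = (state?.reports ?? 0) + 1;
    this.states.set(sessionId, {
      reports,
      nextReportAtMs: nowMs + backoffDelayMs(reports - 1),
    });
    return true;
  }

  // Call when the session leaves the stuck state or is evicted.
  clear(sessionId: string): void {
    this.states.delete(sessionId);
  }
}
```

With this schedule a single ghost session is reported at 0 s, 30 s, 90 s, 210 s and so on, instead of on every 30-second tick. Passing `nowMs` in explicitly keeps the logic easy to unit test.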

Impact

- Severity: High (causes Gateway degradation, blocks agent operation)
- Frequency: Triggered 1x so far (correlated with CLI timeout during cron operations)
- Recovery: Only Gateway restart resolves

extent analysis

TL;DR

Implement a max-age threshold for stuck session tracking and exponential backoff in the stuck-session checker to prevent indefinite looping on ghost sessions.

Guidance

- Review the Gateway's in-memory state tracking mechanism to ensure it properly handles session cleanup on CLI timeout/SIGKILL scenarios.
- Consider adding a timeout for stuck sessions (e.g., 10 minutes) after which they are automatically evicted from the state tracker if not found in the session store.
- Implement exponential backoff in the stuck-session checker to avoid tight-looping on the same stuck reference.
- Verify that session lifecycle cleanup handles SIGKILL/timeout scenarios gracefully.

Example

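// Sketch: evict tracked sessions that exceed a max age and are absent from the session store.
// `sessionStore` is a placeholder for the real store; any object with has(sessionId) works.
const MAX_AGE_MS = 10 * 60 * 1000; // 10 minutes
const stuckSessions = new Map(); // sessionId -> { timestamp }

function checkStuckSessions(sessionStore, now = Date.now()) {
  for (const [sessionId, entry] of stuckSessions) {
    const ageMs = now - entry.timestamp;
    // Evict only when the session is both old and unknown to the store.
    if (ageMs > MAX_AGE_MS && !sessionStore.has(sessionId)) {
      stuckSessions.delete(sessionId); // deleting during Map iteration is safe
    }
  }
}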

Notes

The provided fix is based on the assumption that the stuck session issue is caused by the Gateway's in-memory state tracking mechanism not properly handling session cleanup on CLI timeout/SIGKILL scenarios. The suggested code fix is a pseudo-code example and may need to be adapted to the actual implementation.

Recommendation

Apply the suggested code fix to implement a max-age threshold for stuck session tracking and exponential backoff in the stuck-session checker. This should help prevent indefinite looping on ghost sessions and reduce the likelihood of Gateway degradation.
