openclaw - ✅(Solved) Fix Stuck session ghost blocks event loop, causes CLI timeout and sustained high CPU (11.5hr+) [1 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#77115 · Fetched 2026-05-05 05:51:59
Comments: 2 · Participants: 3 · Timeline events: 6 (cross-referenced ×4, commented ×2) · Reactions: 2

A zombie/stuck session (sessionId=41befe2c-) persists in the Gateway's in-memory state machine for 11.5+ hours while absent from the session store. The diagnostic system repeatedly detects it as stuck, causing sustained 48% CPU, event loop P99 delays up to 12.8s, and CLI commands receiving SIGKILL.

Root Cause

The stuck session likely originates from repeated openclaw cron CLI invocations that timed out during cron list/rm operations. Each CLI invocation creates a Gateway operator session. When multiple CLI calls time out simultaneously, the session lifecycle cleanup may fail to remove the session from the Gateway's internal state tracking, even though the session store is properly cleaned.

Key observations:

  • Session exists only in memory (diagnostic system) but not on disk (store)
  • The diagnostic stuck session checker runs every 30 seconds and detects it, likely doing a tight-loop scan each time
  • The Gateway restart (clearing in-memory state) resolved the issue immediately

Fix Action

Fix

Gateway restart cleared all in-memory state and restored normal operation (CPU dropped to normal, CLI responsive).

Suggested code fix:

  1. Add a max-age threshold for stuck session tracking (e.g., if a session has been "stuck" in processing state for > 10 minutes and doesn't exist in the session store, evict it from the state tracker)
  2. Add exponential backoff to the stuck-session checker to avoid tight-looping on the same stuck reference
  3. Ensure session cleanup on CLI timeout/SIGKILL scenarios covers both store AND in-memory state

PR fix notes

PR #77133: fix: prevent event loop saturation from trajectory flush (setImmediate yield + 10MB cap + 30s timeout)

Description (problem / solution / changelog)

Problem

When a session accumulates a large trajectory (50MB+, 700+ events), pi-trajectory-flush blocks the event loop for 25+ minutes after the 10s timeout fires. The timeout warns but doesn't stop the flush, and the event loop stays at 100% utilization with P99 delays of 34 seconds — making the gateway completely unresponsive.

agent cleanup timed out: step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1  ← 25 min later

Root Cause

QueuedFileWriter chains writes into an ever-growing promise chain without ever yielding the event loop. With 700+ individual appendFile calls, the chain consumes 100% of the event loop. After the cleanup timeout fires (10s), the chain continues running: the Promise.race in runAgentCleanupStep only logs a warning and does not abort the cleanup.
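
The difference between a timeout that only warns and one that actually stops the work can be sketched as below. This is a minimal illustration, not the OpenClaw implementation: `runStepWithTimeout` and `flushLines` are hypothetical names, and the step is assumed to be cooperative (it checks an AbortSignal between units of work).

```typescript
// Hypothetical sketch: these names are illustrative, not OpenClaw's API.
type Step = (signal: AbortSignal) => Promise<void>;

async function runStepWithTimeout(
  step: Step,
  timeoutMs: number,
): Promise<"done" | "aborted"> {
  const controller = new AbortController();
  // On timeout, signal the step to stop instead of only logging a warning.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await step(controller.signal);
    return controller.signal.aborted ? "aborted" : "done";
  } finally {
    clearTimeout(timer);
  }
}

// A cooperative step: it checks the signal before each unit of work
// and yields to the event loop between units.
async function flushLines(
  lines: string[],
  sink: string[],
  signal: AbortSignal,
): Promise<void> {
  for (const line of lines) {
    if (signal.aborted) return; // stop early rather than run to completion
    sink.push(line);
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
  }
}
```

A step that never checks the signal would still run to completion, so this pattern only helps when the flush loop itself is written to observe the abort.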

Changes

1. queued-file-writer.ts — Yield event loop between writes

Add setImmediate between each queued write so the event loop gets control back:

queue = queue
  .then(() => ready)
  .then(() => new Promise<void>((resolve) => setImmediate(resolve)))  // ← yield
  .then(() => safeAppendFile(filePath, line, options))
  .catch(() => undefined);

This prevents the promise chain from monopolizing the event loop. Each write still happens sequentially, but other work (message dispatch, WebSocket events) can be processed between writes.
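
The effect of the yield can be reproduced with a small in-memory stand-in for the writer. The sketch below is an assumption-laden illustration: `makeQueue` and `demo` are hypothetical names, and the "write" is a synchronous array push rather than a real appendFile call, which makes the starvation case easy to observe.

```typescript
// Illustrative in-memory queue; `makeQueue` is a hypothetical stand-in
// for the file-backed writer described above.
const yieldToLoop = () =>
  new Promise<void>((resolve) => setImmediate(resolve));

function makeQueue(sink: string[], yieldBetweenWrites: boolean) {
  let queue: Promise<void> = Promise.resolve();
  return {
    write(line: string): void {
      queue = queue
        .then(() => (yieldBetweenWrites ? yieldToLoop() : undefined))
        .then(() => {
          sink.push(line); // stand-in for the real append
        })
        .catch(() => {});
    },
    idle(): Promise<void> {
      return queue;
    },
  };
}

async function demo(yieldBetweenWrites: boolean): Promise<string[]> {
  const events: string[] = [];
  const q = makeQueue(events, yieldBetweenWrites);
  for (let i = 0; i < 3; i++) q.write(`write-${i}`);
  // Unrelated work competing for the event loop.
  setImmediate(() => events.push("other-work"));
  await q.idle();
  // Let the competing callback run if it has not run yet.
  await yieldToLoop();
  return events;
}
```

In this sketch, `demo(false)` records "other-work" only after all three writes, while `demo(true)` lets it run before the queue drains, because each write now waits for a turn of the event loop.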

2. paths.ts — Reduce trajectory file cap

TRAJECTORY_RUNTIME_FILE_MAX_BYTES: 50MB → 10MB

A single session shouldn't produce 50MB of trajectory. 10MB (~140 events at 70KB avg) is sufficient for debugging while keeping flush time manageable.
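
The event counts quoted here follow from simple division. The sketch below assumes binary units (1 MB = 1024 × 1024 bytes) and the 70KB average from the PR text; `eventsThatFit` is a hypothetical helper, not part of paths.ts.

```typescript
// Assumed units: 1 MB = 1024 * 1024 bytes, 1 KB = 1024 bytes.
const MB = 1024 * 1024;
const KB = 1024;

// Hypothetical helper: how many average-sized events fit under a byte cap.
function eventsThatFit(capBytes: number, avgEventBytes: number): number {
  return Math.floor(capBytes / avgEventBytes);
}

const atNewCap = eventsThatFit(10 * MB, 70 * KB); // 146, i.e. roughly 140
const atOldCap = eventsThatFit(50 * MB, 70 * KB); // 731, i.e. the "700+ events" case
```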

3. run-cleanup-timeout.ts — Increase cleanup timeout

AGENT_CLEANUP_STEP_TIMEOUT_MS: 10s → 30s

With the event loop yielding (change #1), flushes complete faster. But 10s is still tight for large sessions. 30s provides adequate margin.

Verification

  • Tested locally on macOS with OpenClaw 2026.5.2
  • Applied equivalent patches to compiled bundle
  • Gateway restarted cleanly, Feishu WebSocket reconnected
  • No event loop saturation observed after fix
  • Existing unit tests pass without modification

Fixes #75839. Related: #76340, #77115, #76421

Changed files

  • src/agents/queued-file-writer.ts (modified, +1/-0)
  • src/agents/run-cleanup-timeout.ts (modified, +1/-1)
  • src/trajectory/paths.ts (modified, +1/-1)

Code Example

[diagnostic] stuck session: sessionId=41befe2c-8c01-4f75-a662-2d47eb34d7bc
  sessionKey=agent:main:main state=processing age=41514s queueDepth=0

---

[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
  cpuCoreRatio=1.289 active=1 waiting=0 queued=5

---

[tools] exec failed: TIMEOUT: node invoke timed out
  raw_params={"command":"sleep 300 && cat /tmp/...progress.json","host":"node","timeout":310}

Summary

A zombie/stuck session (sessionId=41befe2c-) persists in the Gateway's in-memory state machine for 11.5+ hours while absent from the session store. The diagnostic system repeatedly detects it as stuck, causing sustained 48% CPU, event loop P99 delays up to 12.8s, and CLI commands receiving SIGKILL.

Environment

  • OpenClaw version: 2026.4.27 (cbc2ba0)
  • Gateway: macOS (Darwin 21.6.0, x64), node v22.22.0, port 18789
  • Model: deepseek/deepseek-v4-pro (main), minimax-portal/MiniMax-M2.7 (subagents)
  • Setup: MCP servers (playwright, qwen, kimi, glm, doubao x5 all stdio), Ubuntu node (192.168.1.18)

Symptoms

1. Stuck Session Ghost

[diagnostic] stuck session: sessionId=41befe2c-8c01-4f75-a662-2d47eb34d7bc
  sessionKey=agent:main:main state=processing age=41514s queueDepth=0
  • First detected: ~10:00 AM 2026-05-04
  • Age at detection: 11,514 seconds (~3.2 hours into an 11.5+ hour run)
  • state=processing, queueDepth oscillates 0→1→0
  • NOT in session store: ~/.openclaw/agents/main/sessions/ has no files matching *41befe2c*
  • NOT in sessions_list: sessions_list({search:"41befe2c"}) returns count=0
  • NOT in sessions.json: no entry in store metadata

2. Event Loop Degradation

[diagnostic] liveness warning: eventLoopDelayP99Ms=12876.5 eventLoopUtilization=0.998
  cpuCoreRatio=1.289 active=1 waiting=0 queued=5
  • P99 event loop delay spiked to 12,876ms (12.8 seconds)
  • Event loop utilization hitting 0.998 (99.8%)
  • CPU core ratio at 1.289 with 5 queued requests
  • Gateway process (PID 23979): 48.7% sustained CPU, 1.5GB RSS
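
Metrics of this kind can be measured with Node's built-in perf_hooks module. The sketch below is only an illustration of that API; how the Gateway actually computes eventLoopDelayP99Ms and eventLoopUtilization is not shown in the report, and `sampleLoopHealth` is a hypothetical helper.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical helper: samples event loop health over a window of `windowMs`.
async function sampleLoopHealth(
  windowMs: number,
): Promise<{ delayP99Ms: number; utilization: number }> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  const before = performance.eventLoopUtilization();

  await new Promise<void>((resolve) => setTimeout(resolve, windowMs));

  histogram.disable();
  // Passing the earlier reading returns the utilization for the window only.
  const elu = performance.eventLoopUtilization(before);
  return {
    delayP99Ms: histogram.percentile(99) / 1e6, // histogram values are nanoseconds
    utilization: elu.utilization, // 0 = idle, 1 = fully busy
  };
}
```

On a healthy process the utilization stays well below 1, so a sustained reading near 0.998 together with a multi-second P99 delay indicates the loop is almost never idle.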

3. CLI Universal Timeout

  • openclaw cron list → SIGKILL
  • openclaw cron rm → SIGKILL
  • openclaw gateway status → SIGKILL
  • openclaw sessions list → hanging
  • All CLI commands became non-functional

4. Exec Timeout Cascades

[tools] exec failed: TIMEOUT: node invoke timed out
  raw_params={"command":"sleep 300 && cat /tmp/...progress.json","host":"node","timeout":310}

Subagents calling exec({host:"node"}) with sleep-based polling hit repeated timeouts, further blocking the event loop.

Root Cause Analysis

The stuck session likely originates from repeated openclaw cron CLI invocations that timed out during cron list/rm operations. Each CLI invocation creates a Gateway operator session. When multiple CLI calls time out simultaneously, the session lifecycle cleanup may fail to remove the session from the Gateway's internal state tracking, even though the session store is properly cleaned.

Key observations:

  • Session exists only in memory (diagnostic system) but not on disk (store)
  • The diagnostic stuck session checker runs every 30 seconds and detects it, likely doing a tight-loop scan each time
  • The Gateway restart (clearing in-memory state) resolved the issue immediately

Reproduction Steps

  1. Run multiple openclaw cron list/rm CLI commands rapidly during high Gateway load
  2. Wait for CLI timeout (SIGKILL)
  3. Observe diagnostic stuck session entries persisting in gateway.err.log
  4. Observe declining event loop health and rising CPU

Expected Behavior

  • Stale session references should be cleaned from the in-memory state tracker within a reasonable timeout (e.g., 5 minutes)
  • Session lifecycle cleanup should handle SIGKILL/timeout scenarios gracefully
  • The diagnostic stuck session detector should not tight-loop on the same ghost session indefinitely
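
One way to keep the store and the in-memory tracker in step is to release both in a single finally block on the Gateway side. The sketch below is a minimal illustration under assumptions: `sessionStore`, `stateTracker`, and `withOperatorSession` are hypothetical stand-ins, not the Gateway's real structures.

```typescript
// Hypothetical stand-ins for the session store and the in-memory state tracker.
const sessionStore = new Set<string>();
const stateTracker = new Map<string, { state: string; since: number }>();

async function withOperatorSession<T>(
  sessionId: string,
  work: () => Promise<T>,
): Promise<T> {
  sessionStore.add(sessionId);
  stateTracker.set(sessionId, { state: "processing", since: Date.now() });
  try {
    return await work();
  } finally {
    // Runs on success, on error, and on a timeout rejection alike,
    // so neither structure can be cleaned without the other.
    sessionStore.delete(sessionId);
    stateTracker.delete(sessionId);
  }
}
```

A SIGKILL delivered to the CLI process cannot be handled by the CLI itself, so this only helps if the Gateway-side work rejects or completes when the client disappears, for example through a server-side timeout.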

Actual Behavior

  • Ghost session persisted 11.5+ hours until Gateway restart
  • Sustained 48% CPU for 11+ hours
  • CLI became non-functional
  • No automatic recovery

Fix

Gateway restart cleared all in-memory state and restored normal operation (CPU dropped to normal, CLI responsive).

Suggested code fix:

  1. Add a max-age threshold for stuck session tracking (e.g., if a session has been "stuck" in processing state for > 10 minutes and doesn't exist in the session store, evict it from the state tracker)
  2. Add exponential backoff to the stuck-session checker to avoid tight-looping on the same stuck reference
  3. Ensure session cleanup on CLI timeout/SIGKILL scenarios covers both store AND in-memory state
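
Item 2 above could look roughly like the following per-session backoff. This is a sketch under assumptions: `shouldReport` and the 30-minute ceiling are hypothetical, and the 30-second base is taken from the checker interval mentioned in the observations.

```typescript
// Hypothetical per-session backoff state for stuck-session reports.
interface Backoff {
  nextCheckAt: number; // ms timestamp before which the session is skipped
  delayMs: number; // delay that produced nextCheckAt
}

const BASE_DELAY_MS = 30_000; // matches the 30-second checker interval in the report
const MAX_DELAY_MS = 30 * 60_000; // assumed ceiling, not from the source

function shouldReport(
  backoffs: Map<string, Backoff>,
  sessionId: string,
  now: number,
): boolean {
  const entry = backoffs.get(sessionId);
  if (entry && now < entry.nextCheckAt) return false; // still backing off
  // First sighting uses the base delay; each later sighting doubles it.
  const delayMs = entry
    ? Math.min(entry.delayMs * 2, MAX_DELAY_MS)
    : BASE_DELAY_MS;
  backoffs.set(sessionId, { nextCheckAt: now + delayMs, delayMs });
  return true;
}
```

With this scheme the same ghost session is examined after 30 s, then 60 s, then 120 s, and so on up to the ceiling, instead of on every tick. The entry should be deleted once the session leaves the stuck state so that a later incident starts from the base delay again.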

Impact

  • Severity: High (causes Gateway degradation, blocks agent operation)
  • Frequency: Triggered 1x so far (correlated with CLI timeout during cron operations)
  • Recovery: Only Gateway restart resolves

extent analysis

TL;DR

Implement a max-age threshold for stuck session tracking and exponential backoff in the stuck-session checker to prevent indefinite looping on ghost sessions.

Guidance

  • Review the Gateway's in-memory state tracking mechanism to ensure it properly handles session cleanup on CLI timeout/SIGKILL scenarios.
  • Consider adding a timeout for stuck sessions (e.g., 10 minutes) after which they are automatically evicted from the state tracker if not found in the session store.
  • Implement exponential backoff in the stuck-session checker to avoid tight-looping on the same stuck reference.
  • Verify that session lifecycle cleanup handles SIGKILL/timeout scenarios gracefully.

Example

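// Sketch of a max-age threshold for stuck-session tracking.
// `stuckSessions` and `sessionStore` are stand-ins for the Gateway's real structures.
const maxAgeThresholdMs = 10 * 60 * 1000; // 10 minutes
const stuckSessions = new Map(); // sessionId -> { timestamp: ms since epoch }
const sessionStore = new Set(); // IDs of sessions that still exist in the store

function checkStuckSessions(now = Date.now()) {
  for (const [sessionId, entry] of stuckSessions) {
    const sessionAge = now - entry.timestamp;
    // Evict only ghosts: old enough and no longer present in the store.
    if (sessionAge > maxAgeThresholdMs && !sessionStore.has(sessionId)) {
      stuckSessions.delete(sessionId);
    }
  }
}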

Notes

The provided fix is based on the assumption that the stuck session issue is caused by the Gateway's in-memory state tracking mechanism not properly handling session cleanup on CLI timeout/SIGKILL scenarios. The suggested code fix is a pseudo-code example and may need to be adapted to the actual implementation.

Recommendation

Apply the suggested code fix to implement a max-age threshold for stuck session tracking and exponential backoff in the stuck-session checker. This should help prevent indefinite looping on ghost sessions and reduce the likelihood of Gateway degradation.
