openclaw - 💡(How to fix) Fix [Bug]: Agent processing lane can stall for minutes without timeout recovery, plus memory-core dreaming cron race condition on Gateway restart [6 comments, 6 participants]

GitHub stats
openclaw/openclaw#73581 · Fetched 2026-04-29 06:17:57
Comments: 6 · Participants: 6 · Timeline events: 13 · Reactions: 0
Timeline (top): commented ×6, cross-referenced ×5, mentioned ×1, subscribed ×1


Environment

  • OpenClaw Version: v2026.4.26 (be8c246)
  • Node: v22.22.2
  • OS: Linux 6.17.0-22-generic (Ubuntu/x64)
  • Channel: Feishu (WebSocket mode)
  • Model: bailian/qwen3.6-plus
  • Config: systemd user service with Restart=always

Issue 1: Agent Processing Lane Stalls Without Timeout Recovery

Symptom

The main agent session periodically enters a stuck session state where state=processing persists for 2-4+ minutes with queueDepth=1. During this time, the session cannot process new messages. The Gateway itself remains alive and other sessions work fine.

Reproduction

This is intermittent but has occurred 6 times today under various conditions:

Time  | Session Key     | Duration | Trigger
------|-----------------|----------|--------
18:46 | agent:main:main | 241s     | WS timeout + send data failed
19:28 | agent:main:main | 256s     | write EPIPE (dashboard WS disconnect)
20:01 | feishu session  | 136s     | Heavy web_fetch (multiple concurrent page loads)
20:08 | feishu session  | 143→173s | Multiple concurrent gh CLI exec commands
20:32 | feishu session  | 163s     | Post-restart cold start (first message after Gateway restart)
20:40 | feishu session  | 161s     | Still in recovery from previous restart

Diagnostic Log Evidence

{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=241s queueDepth=1","time":"2026-04-28T18:46:20.982+08:00"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=256s queueDepth=1","time":"2026-04-28T19:28:22.921+08:00"}
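For anyone collecting these entries from their own logs, the following is a minimal sketch of a parser for the stuck-session lines shown above. It assumes only the field layout visible in the two samples; the `StuckSession` type and `parseStuckSession` helper are hypothetical names and not part of OpenClaw.

```typescript
// Fields extracted from one "stuck session" diagnostic line.
// The layout is inferred from the two sample log lines only.
interface StuckSession {
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
  time: string;
}

// Hypothetical helper: returns null for any line that is not a
// stuck-session report (for example the plain-text ws errors).
function parseStuckSession(line: string): StuckSession | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not a JSON log line
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const entry = parsed as { subsystem?: unknown; message?: unknown; time?: unknown };
  const message = entry.message;
  if (entry.subsystem !== "diagnostic" || typeof message !== "string") return null;
  if (!message.startsWith("stuck session:")) return null;

  // Collect the key=value pairs that follow the prefix.
  const fields: Record<string, string> = {};
  const pair = /(\w+)=(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pair.exec(message)) !== null) {
    fields[match[1]] = match[2];
  }

  const age = /^(\d+)s$/.exec(fields["age"] ?? "");
  const queueDepth = Number(fields["queueDepth"]);
  if (!age || !fields["sessionKey"] || !fields["state"] || !Number.isInteger(queueDepth)) {
    return null;
  }
  return {
    sessionKey: fields["sessionKey"],
    state: fields["state"],
    ageSeconds: Number(age[1]),
    queueDepth,
    time: typeof entry.time === "string" ? entry.time : "",
  };
}
```

Filtering a log file through this helper makes it easy to tabulate how often and for how long each session key stalls.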

Preceding errors (both cases):

[error]: [ '[ws]', 'write EPIPE' ]
[info]: [ 'ws', 'unable to connect to the server after trying 1 times")' ]

Analysis

  • Not a model issue: Occurs with both infini/minimax-m2.7 and bailian/qwen3.6-plus
  • Not a network issue: Feishu WebSocket remains connected; messages are received but never dispatched
  • Triggered by: Heavy concurrent tool use, WebSocket instability, post-restart cold start
  • Root cause hypothesis: The agent processing lane lacks a timeout/failover mechanism. When a tool call or LLM request hangs (or the WebSocket connection to the internal control UI drops mid-stream), the lane remains in processing state indefinitely. The diagnostic watchdog detects it but does not automatically recover the stuck session.
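To make the hypothesis concrete, here is a minimal sketch of the kind of deadline that appears to be missing around an agent turn. It is not OpenClaw's code: `withLaneTimeout`, `LaneTimeoutError`, and the `work` callback are hypothetical names, and the sketch assumes the turn can be represented as a single promise.

```typescript
// Error raised when a lane exceeds its deadline (hypothetical type).
class LaneTimeoutError extends Error {
  readonly sessionKey: string;
  constructor(sessionKey: string, timeoutMs: number) {
    super(`lane ${sessionKey} timed out after ${timeoutMs}ms`);
    this.name = "LaneTimeoutError";
    this.sessionKey = sessionKey;
  }
}

// Runs one unit of lane work under a deadline. If the work hangs
// (a stalled tool call, an LLM request, a dropped socket), the
// returned promise rejects instead of staying pending forever.
async function withLaneTimeout<T>(
  sessionKey: string,
  timeoutMs: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // lets cooperative work stop early
      reject(new LaneTimeoutError(sessionKey, timeoutMs));
    }, timeoutMs);
  });
  try {
    return await Promise.race([work(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // avoid a stray timer once the turn settles
  }
}
```

A caller would catch `LaneTimeoutError`, mark the session as failed, and move on to the next queued message. The abort signal only helps if the underlying tool or request honors it; work that ignores the signal keeps running in the background even though the lane is released.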

Current Mitigation

The user must restart the Gateway via systemd (systemctl --user restart openclaw-gateway.service) to clear stuck sessions. This is disruptive because it drops all active connections.

Requested Fix

  1. Lane-level timeout: Add a configurable timeout for agent processing lanes (e.g., 60-120s). If a lane exceeds this, forcibly release it and mark the session as error/ready.
  2. Automatic recovery: When the diagnostic watchdog detects a stuck session, attempt automatic lane reset instead of just logging a warning.
  3. Graceful error injection: If the underlying cause is a WebSocket disconnect, inject a synthetic error into the agent turn so it can complete rather than hanging.
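Items 1 and 2 could be combined in a periodic watchdog pass that releases lanes instead of only logging them. The sketch below is illustrative only: the `Lane` shape and `recoverStuckLanes` are hypothetical, since the Gateway's real lane bookkeeping is not shown in this report.

```typescript
type LaneState = "ready" | "processing" | "error";

// Hypothetical lane record; field names are assumptions.
interface Lane {
  sessionKey: string;
  state: LaneState;
  processingSinceMs: number | null; // epoch ms when the current turn started
  queueDepth: number;
}

// One watchdog pass: any lane that has been "processing" longer than
// timeoutMs is forcibly released and marked as "error" so queued
// messages can be dispatched again. Returns the recovered session keys.
function recoverStuckLanes(lanes: Lane[], nowMs: number, timeoutMs: number): string[] {
  const recovered: string[] = [];
  for (const lane of lanes) {
    if (lane.state !== "processing" || lane.processingSinceMs === null) continue;
    const ageMs = nowMs - lane.processingSinceMs;
    if (ageMs <= timeoutMs) continue; // still within its budget
    lane.state = "error";
    lane.processingSinceMs = null;
    recovered.push(lane.sessionKey);
  }
  return recovered;
}
```

With a 90-second budget, the 241s and 256s stalls logged for agent:main:main would have been released well before the diagnostic warning fired. Item 3 would hook in at the same point: where the lane is marked as failed, a synthetic error result could be handed to the pending turn so that it completes.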

Note on Related Issues

I found related but distinct issues:

  • #53008: Compaction blocking main lane — different trigger (compaction), same symptom
  • #68649: PDF tool hanging — tool-specific, not general lane stall
  • #53889: Session deadlock with dangling toolCall — specific to toolCall/result mismatch
  • #72810: Discord session routable after timeout — channel-specific

This issue is broader: any long-running lane operation can stall the session, and there is no automatic recovery path.


Issue 2: Memory-Core Dreaming Cron Fails to Register on Gateway Restart

Symptom

memory-core: managed dreaming cron could not be reconciled (cron service unavailable).

This occurs during Gateway startup/restart. The memory-core plugin attempts to register its managed dreaming cron jobs, but the OpenClaw cron service is not yet ready.

Reproduction

  1. Restart Gateway: systemctl --user restart openclaw-gateway.service
  2. Check logs ~7 minutes later for the warning

Analysis

This is a race condition between plugin initialization and cron service startup:

  • memory-core plugin initializes during Gateway boot and immediately attempts to reconcile managed cron jobs
  • The cron service takes longer to become fully available
  • The plugin's initial registration attempt fails
  • Even though cron becomes available later, the plugin does not appear to retry
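The ordering problem described above can be illustrated with a small readiness gate: the plugin waits on the gate instead of reconciling immediately, and the cron service opens the gate once it is up. This is a sketch under assumptions; `ReadyGate` is a hypothetical helper, and the report does not say whether OpenClaw exposes any such readiness signal.

```typescript
// Hypothetical one-shot readiness gate. Callers that arrive before
// markReady() are parked and resumed once the service is available.
class ReadyGate {
  private ready = false;
  private waiters: Array<() => void> = [];

  // Called by the service (here: cron) when it can accept work.
  markReady(): void {
    if (this.ready) return;
    this.ready = true;
    for (const resume of this.waiters.splice(0)) resume();
  }

  // Called by dependents (here: the memory-core plugin) before
  // they attempt to register or reconcile jobs.
  whenReady(): Promise<void> {
    if (this.ready) return Promise.resolve();
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }
}

// Usage sketch: the plugin starts first but reconciles second.
const cronReady = new ReadyGate();
const startupOrder: string[] = [];
const pluginInit = cronReady.whenReady().then(() => {
  startupOrder.push("reconcile dreaming cron");
});
startupOrder.push("cron service started");
cronReady.markReady();
```

Because the plugin parks on `whenReady()`, boot order no longer matters: reconciliation runs after the cron service is up, whichever component starts first.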

Impact

The dreaming system (automatic memory consolidation at 3:00 AM) does not run after a Gateway restart. Manual cron jobs created by the user still work fine.

Requested Fix

The memory-core plugin should either:

  1. Delay cron registration until the cron service reports ready, or
  2. Implement a retry/backoff mechanism for cron job reconciliation
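Option 2 might look like the sketch below: retry the reconciliation with exponential backoff until the cron service accepts it. The names `reconcileWithRetry` and `backoffDelayMs` and the option fields are hypothetical, and the `reconcile` callback stands in for whatever the plugin actually does to register its jobs.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

// Calls reconcile() until it succeeds or maxAttempts is reached.
// Resolves with the number of attempts used; rethrows the last error
// if every attempt fails.
async function reconcileWithRetry(
  reconcile: () => Promise<void>,
  options: RetryOptions,
): Promise<number> {
  if (options.maxAttempts < 1) throw new RangeError("maxAttempts must be >= 1");
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      await reconcile();
      return attempt + 1;
    } catch (err) {
      lastError = err; // e.g. "cron service unavailable"
      if (attempt < options.maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, options.baseDelayMs, options.maxDelayMs));
      }
    }
  }
  throw lastError;
}
```

With a 1-second base and a 30-second cap, the waits between attempts are 1s, 2s, 4s, 8s, 16s, then 30s, so a cron service that comes up slowly after a restart would still be reached without tight-loop retries.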

Workaround Applied

Reduced agents.defaults.compaction.timeoutSeconds from 900 to 300 to limit the maximum stall duration, but this does not address the root cause.

extent analysis

TL;DR

Implement a lane-level timeout and automatic recovery mechanism for stuck agent processing sessions to prevent indefinite stalls.

Guidance

  • Introduce a configurable timeout (e.g., 60-120s) for agent processing lanes to detect and release stuck sessions.
  • Develop an automatic recovery mechanism that resets the lane when a stuck session is detected by the diagnostic watchdog.
  • Consider implementing a retry/backoff mechanism for cron job reconciliation in the memory-core plugin to address the race condition during Gateway startup.
  • As a stopgap, lower agents.defaults.compaction.timeoutSeconds (the reporter reduced it from 900 to 300) to cap the maximum stall duration; this does not address the root cause.

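Example

A hypothetical configuration for the proposed lane-level timeout. The key name laneTimeoutSeconds is illustrative only; the issue does not confirm that OpenClaw supports such a setting.

{
  "laneTimeoutSeconds": 90
}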

Notes

The provided workaround (reducing agents.defaults.compaction.timeoutSeconds) only mitigates the issue and does not address the root cause. A more comprehensive solution involving lane-level timeouts and automatic recovery is necessary.

Recommendation

Introduce a lane-level timeout and automatic recovery for stuck sessions. Unlike the compaction-timeout workaround, this targets the hypothesized root cause (a processing lane with no timeout or failover) rather than only shortening stalls.
