openclaw: [Bug]: Agent processing lane can stall for minutes without timeout recovery, plus memory-core dreaming cron race condition on Gateway restart [6 comments, 6 participants]

Environment

OpenClaw Version: v2026.4.26 (be8c246)
Node: v22.22.2
OS: Linux 6.17.0-22-generic (Ubuntu/x64)
Channel: Feishu (WebSocket mode)
Model: bailian/qwen3.6-plus
Config: systemd user service with Restart=always

Issue 1: Agent Processing Lane Stalls Without Timeout Recovery

Symptom

The main agent session periodically enters a stuck session state where state=processing persists for 2-4+ minutes with queueDepth=1. During this time, the session cannot process new messages. The Gateway itself remains alive and other sessions work fine.

Reproduction

This is intermittent but has occurred 6 times today under various conditions:

Time	Session Key	Duration	Trigger
18:46	`agent:main:main`	241s	WS timeout + send data failed
19:28	`agent:main:main`	256s	write EPIPE (dashboard WS disconnect)
20:01	feishu session	136s	Heavy web_fetch (multiple concurrent page loads)
20:08	feishu session	143→173s	Multiple concurrent `gh` CLI exec commands
20:32	feishu session	163s	Post-restart cold start (first message after Gateway restart)
20:40	feishu session	161s	Still in recovery from previous restart

Diagnostic Log Evidence

{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=241s queueDepth=1","time":"2026-04-28T18:46:20.982+08:00"}
{"subsystem":"diagnostic","message":"stuck session: sessionId=unknown sessionKey=agent:main:main state=processing age=256s queueDepth=1","time":"2026-04-28T19:28:22.921+08:00"}

Preceding errors (both cases):

[error]: [ '[ws]', 'write EPIPE' ]
[info]: [ 'ws', 'unable to connect to the server after trying 1 times")' ]

Analysis

Not a model issue: Occurs with both infini/minimax-m2.7 and bailian/qwen3.6-plus
Not a network issue: Feishu WebSocket remains connected; messages are received but never dispatched
Triggered by: Heavy concurrent tool use, WebSocket instability, post-restart cold start
Root cause hypothesis: The agent processing lane lacks a timeout/failover mechanism. When a tool call or LLM request hangs (or the WebSocket connection to the internal control UI drops mid-stream), the lane remains in processing state indefinitely. The diagnostic watchdog detects it but does not automatically recover the stuck session.

Current Mitigation

The user must restart the Gateway via systemd (systemctl --user restart openclaw-gateway.service) to clear stuck sessions. This is disruptive because it drops all active connections.

Requested Fix

Lane-level timeout: Add a configurable timeout for agent processing lanes (e.g., 60-120s). If a lane exceeds this, forcibly release it and mark the session as error/ready.
Automatic recovery: When the diagnostic watchdog detects a stuck session, attempt automatic lane reset instead of just logging a warning.
Graceful error injection: If the underlying cause is a WebSocket disconnect, inject a synthetic error into the agent turn so it can complete rather than hanging.
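
To make the first and third requests concrete, here is a rough sketch of what a lane-level deadline with a synthetic error result could look like. It is an illustration under assumptions, not OpenClaw's actual lane code: `runWithLaneTimeout`, `TurnResult`, and the `runTurn` callback are hypothetical names, and the real lane implementation may be structured quite differently.

```typescript
// Outcome of one agent turn. A timed-out or failed turn still produces a value,
// so the caller can release the lane instead of waiting forever.
// All names in this block are hypothetical.
type TurnResult<T> =
  | { status: "ok"; value: T }
  | { status: "error"; reason: "timeout" | "failed"; message: string };

// Run one turn with a hard deadline. `runTurn` receives an AbortSignal so that
// cooperative work (tool calls, LLM requests) can stop early; the timer guarantees
// the returned promise settles even if the work ignores the signal.
function runWithLaneTimeout<T>(
  laneKey: string,
  timeoutMs: number,
  runTurn: (signal: AbortSignal) => Promise<T>,
): Promise<TurnResult<T>> {
  const controller = new AbortController();
  return new Promise<TurnResult<T>>((resolve) => {
    const timer = setTimeout(() => {
      controller.abort();
      // Synthetic error: the turn completes with an error instead of hanging.
      resolve({
        status: "error",
        reason: "timeout",
        message: `lane ${laneKey} exceeded ${timeoutMs}ms`,
      });
    }, timeoutMs);

    // Start the turn inside a promise chain so a synchronous throw is also captured.
    Promise.resolve()
      .then(() => runTurn(controller.signal))
      .then(
        (value) => {
          clearTimeout(timer);
          resolve({ status: "ok", value });
        },
        (err: unknown) => {
          clearTimeout(timer);
          resolve({ status: "error", reason: "failed", message: String(err) });
        },
      );
  });
}
```

Because the function always resolves, the code that owns the lane can mark the session as error/ready in one place, whether the turn finished, failed, or timed out. A late result from a turn that already timed out is ignored, since a promise settles only once.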

Note on Related Issues

I found related but distinct issues:

#53008: Compaction blocking main lane — different trigger (compaction), same symptom
#68649: PDF tool hanging — tool-specific, not general lane stall
#53889: Session deadlock with dangling toolCall — specific to toolCall/result mismatch
#72810: Discord session routable after timeout — channel-specific

This issue is broader: any long-running lane operation can stall the session, and there is no automatic recovery path.

Issue 2: Memory-Core Dreaming Cron Fails to Register on Gateway Restart

Symptom

memory-core: managed dreaming cron could not be reconciled (cron service unavailable).

This occurs during Gateway startup/restart. The memory-core plugin attempts to register its managed dreaming cron jobs, but the OpenClaw cron service is not yet ready.

Reproduction

Restart Gateway: systemctl --user restart openclaw-gateway.service
Check logs ~7 minutes later for the warning

Analysis

This is a race condition between plugin initialization and cron service startup:

memory-core plugin initializes during Gateway boot and immediately attempts to reconcile managed cron jobs
The cron service takes longer to become fully available
The plugin's initial registration attempt fails
Even though cron becomes available later, the plugin does not appear to retry

Impact

The dreaming system (automatic memory consolidation at 3:00 AM) does not run after a Gateway restart. Manual cron jobs created by the user still work fine.

Requested Fix

The memory-core plugin should either:

Delay cron registration until the cron service reports ready, or
Implement a retry/backoff mechanism for cron job reconciliation
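
The second option could be as small as wrapping the existing reconcile call in a bounded retry loop. The sketch below assumes the reconcile step can be expressed as an async function that rejects while the cron service is unavailable; `reconcileWithBackoff`, `RetryOptions`, and the delay values are hypothetical and only illustrate the shape of the fix.

```typescript
// Retry settings; the names and values are illustrative, not existing OpenClaw options.
interface RetryOptions {
  maxAttempts: number;    // give up after this many tries
  initialDelayMs: number; // wait before the second attempt
  maxDelayMs: number;     // upper bound for the doubled delay
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Call `reconcile` until it succeeds or attempts run out, doubling the wait each time.
// Resolves with the number of attempts used; rethrows the last error on exhaustion.
async function reconcileWithBackoff(
  reconcile: () => Promise<void>,
  options: RetryOptions,
): Promise<number> {
  let delayMs = options.initialDelayMs;
  for (let attempt = 1; ; attempt++) {
    try {
      await reconcile();
      return attempt;
    } catch (err) {
      if (attempt >= options.maxAttempts) throw err;
      await sleep(delayMs);
      delayMs = Math.min(delayMs * 2, options.maxDelayMs);
    }
  }
}
```

With this in place, a cron service that becomes available a few seconds after the plugin starts would be picked up on a later attempt instead of leaving the dreaming job unregistered until the next restart.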

Workaround Applied

Reduced agents.defaults.compaction.timeoutSeconds from 900 to 300 to limit the maximum stall duration described in Issue 1, but this does not address the root cause.

extent analysis

TL;DR

Implement a lane-level timeout and automatic recovery mechanism for stuck agent processing sessions to prevent indefinite stalls.

Guidance

Introduce a configurable timeout (e.g., 60-120s) for agent processing lanes to detect and release stuck sessions.
Develop an automatic recovery mechanism that resets the lane when a stuck session is detected by the diagnostic watchdog.
Consider implementing a retry/backoff mechanism for cron job reconciliation in the memory-core plugin to address the race condition during Gateway startup.
As an interim mitigation, lower agents.defaults.compaction.timeoutSeconds (the reporter reduced it from 900 to 300) to cap how long a stall can last.
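
The automatic recovery step can be sketched as a periodic sweep that resets any lane stuck in processing beyond a threshold, rather than only logging it. This is a hedged illustration: `LaneSnapshot`, `recoverStuckLanes`, and the `resetLane` callback are hypothetical, and how the real watchdog tracks lane state is not shown in the issue.

```typescript
// Minimal view of a lane for the sweep; field names are assumptions for illustration.
interface LaneSnapshot {
  sessionKey: string;
  state: "idle" | "processing" | "error";
  startedAtMs: number; // when the current turn began
}

// Reset every lane that has been processing longer than `thresholdMs`.
// The clock and the reset action are passed in so the sweep stays easy to test.
// Returns the session keys that were reset.
function recoverStuckLanes(
  lanes: LaneSnapshot[],
  nowMs: number,
  thresholdMs: number,
  resetLane: (sessionKey: string) => void,
): string[] {
  const recovered: string[] = [];
  for (const lane of lanes) {
    const ageMs = nowMs - lane.startedAtMs;
    if (lane.state === "processing" && ageMs > thresholdMs) {
      resetLane(lane.sessionKey);
      recovered.push(lane.sessionKey);
    }
  }
  return recovered;
}
```

Run against the first incident in the report (age=241s) with a 120-second threshold, a sweep like this would reset `agent:main:main`, while lanes that are idle or still within the threshold are left alone.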

Example

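Illustrative configuration for the proposed lane-level timeout (laneTimeoutSeconds is a hypothetical key name for a setting that does not exist yet):
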
{
  "laneTimeoutSeconds": 90
}

Notes

The provided workaround (reducing agents.defaults.compaction.timeoutSeconds) only mitigates the issue and does not address the root cause. A more comprehensive solution involving lane-level timeouts and automatic recovery is necessary.

Recommendation

Implement a lane-level timeout with automatic recovery for stuck sessions, since this addresses the root cause rather than working around it. Until that ships, restarting the Gateway and the reduced compaction timeout remain the available mitigations.
