openclaw: A single stalled agent session blocks the entire Gateway event loop (isolation failure)

Root Cause

The stalled session's lock contention appears to trigger a retry storm (session write lock -> fail -> retry -> fail). Although the API calls themselves are async, the lock acquisition/release and retry logic run synchronously on the event loop, and that overhead accumulates until the loop is blocked entirely.
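
One way a retry loop built only from async calls can still starve the event loop is microtask starvation: if every retry awaits a promise that is already resolved, Node.js keeps draining the microtask queue and never reaches timers or I/O callbacks. The sketch below illustrates that effect in isolation. `tightRetry` and `demo` are hypothetical names, and it is an assumption, not a confirmed finding, that OpenClaw's lock retry path behaves this way.

```typescript
// Hypothetical retry loop: each failed attempt awaits an already-resolved
// promise, so the continuation is queued as a microtask and the event loop
// never advances to its timer or I/O phases while the loop is running.
async function tightRetry(tryAcquire: () => boolean, maxAttempts: number): Promise<number> {
  let attempts = 0;
  while (attempts < maxAttempts && !tryAcquire()) {
    attempts += 1;
    await Promise.resolve(); // looks async, but never yields to timers or I/O
  }
  return attempts;
}

// Records the order in which a 0 ms timer and the retry loop complete.
async function demo(): Promise<string[]> {
  const order: string[] = [];
  setTimeout(() => order.push("timer"), 0); // stands in for another session's work
  await tightRetry(() => false, 100_000);   // the lock is never free in this demo
  order.push("retries done");
  await new Promise<void>((resolve) => setTimeout(resolve, 10)); // let the timer fire
  return order;
}

demo().then((order) => console.log(order)); // [ 'retries done', 'timer' ]
```

The timer callback runs only after all 100,000 retries have finished, even though every retry contains an `await`. With an unbounded retry count the timer would never run at all.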

Description

A single agent's stalled session (a model call that hung, likely due to lock contention) blocks the entire Gateway event loop, so all other sessions stop processing messages. This is a session isolation failure: one agent's hang should not affect the availability of other agents or of the Gateway itself.

Environment

OpenClaw version: 2026.5.20-beta.1 (also observed in 2026.5.19-beta.2)
OS: Linux (WSL2) 6.6.114.1-microsoft-standard-WSL2 x64
Node.js: v22.22.0
Channel: Feishu (飞书) group chat
Model: zai/glm-5-turbo

Steps to Reproduce

One agent (agent-architect) spawns a subagent that enters a model call
The model call hangs (likely due to session write lock contention / retry storm)
The session is marked as stalled after ~6 minutes
Gateway event loop reaches 100% utilization
All other sessions stop processing inbound messages
Other sessions' dispatches are aborted silently

Observed Behavior

Event Loop Degradation

16:44:02 liveness warning: eventLoopDelayP99Ms=2548 eventLoopUtilization=1 cpuCoreRatio=1.358
16:46:18 liveness warning: eventLoopDelayP99Ms=2514 eventLoopUtilization=1 cpuCoreRatio=1.291
16:48:37 liveness warning: eventLoopDelayP99Ms=2933 eventLoopUtilization=1 cpuCoreRatio=1.341
16:50:56 liveness warning: eventLoopDelayP99Ms=1019 eventLoopUtilization=1 cpuCoreRatio=1.288
16:53:19 liveness warning: eventLoopDelayP99Ms=19562 eventLoopMaxMs=19562 eventLoopUtilization=1
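
For readers unfamiliar with these metrics, comparable numbers can be sampled with Node.js's built-in `perf_hooks` module. The sketch below is an assumption about how such values could be collected, not OpenClaw's actual liveness monitor. `sampleLiveness` is a hypothetical helper, and its field names are borrowed from the log lines above.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Histogram of event loop delay, sampled every 20 ms; values are in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Baseline reading for event loop utilization (ELU).
let lastElu = performance.eventLoopUtilization();

// Hypothetical helper: returns the metrics for the interval since the last call.
function sampleLiveness(): {
  eventLoopDelayP99Ms: number;
  eventLoopMaxMs: number;
  eventLoopUtilization: number;
} {
  // Passing the previous reading yields the utilization delta for the interval.
  const delta = performance.eventLoopUtilization(lastElu);
  lastElu = performance.eventLoopUtilization();
  const sample = {
    eventLoopDelayP99Ms: histogram.percentile(99) / 1e6, // ns -> ms
    eventLoopMaxMs: histogram.max / 1e6,                 // ns -> ms
    eventLoopUtilization: delta.utilization,             // 0 = idle, 1 = no idle time
  };
  histogram.reset();
  return sample;
}
```

Read this way, `eventLoopUtilization=1` means the loop had no idle time during the sampling interval, and a p99 delay of several seconds means callbacks were waiting that long before they could run.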

Stalled Session

stalled session: sessionId=f9274af7 sessionKey=agent:agent-architect:sub:r1-15-skill-audit
  state=processing
  age=1016s (17 minutes!)
  reason=active_work_without_progress
  activeWorkKind=model_call

Impact on Other Sessions

Other agent sessions in the same Gateway could not process new messages
Dispatched messages returned queuedFinal=false, replies=0 (silent drop)
User messages were aborted before processing
Gateway RSS memory grew from ~1.8GB to 4.4GB during the stall

Recovery

Gateway's stuck session recovery eventually aborted the embedded run after ~17 minutes:

stuck session recovery: action=abort_embedded_run aborted=true

However, the event loop remained blocked even after recovery, and a full Gateway restart was required.
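
A recovery that fires sooner could be built around a per-session no-progress budget. The sketch below is a minimal illustration using the standard `AbortController`. `ProgressWatchdog` and `runWithBudget` are hypothetical names rather than OpenClaw APIs, and the budget values are arbitrary.

```typescript
// Hypothetical watchdog: aborts when no progress is reported within budgetMs.
class ProgressWatchdog {
  readonly controller = new AbortController();
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly budgetMs: number;

  constructor(budgetMs: number) {
    this.budgetMs = budgetMs;
    this.arm();
  }

  // Call whenever the session makes observable progress (e.g. a streamed chunk).
  noteProgress(): void {
    if (!this.controller.signal.aborted) this.arm();
  }

  stop(): void {
    clearTimeout(this.timer);
  }

  // (Re)start the countdown; when it expires the signal is aborted.
  private arm(): void {
    clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.controller.abort(new Error(`no progress for ${this.budgetMs} ms`));
    }, this.budgetMs);
  }
}

// Runs `work` and rejects as soon as the no-progress budget is exceeded.
async function runWithBudget<T>(
  budgetMs: number,
  work: (signal: AbortSignal, noteProgress: () => void) => Promise<T>,
): Promise<T> {
  const watchdog = new ProgressWatchdog(budgetMs);
  const signal = watchdog.controller.signal;
  const aborted = new Promise<never>((_, reject) => {
    signal.addEventListener("abort", () => reject(signal.reason), { once: true });
  });
  try {
    return await Promise.race([work(signal, () => watchdog.noteProgress()), aborted]);
  } finally {
    watchdog.stop();
  }
}
```

Two caveats apply. `Promise.race` only stops waiting; the underlying work is cancelled only if it honors the `AbortSignal`. The watchdog timer also runs on the same event loop it is guarding, so with delays like the 19562 ms reported above the abort itself would fire late, and a check from a worker thread or a separate process may be needed.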

Root Cause Analysis

Related issues:

#83510 — Session takeover lock contention
#84250 — Tolerate in-process session writes
#78123 — Feishu dispatch completes with replies=0
#32903 — dispatch replies=0 silent drop

Expected Behavior

A single agent session stall should NOT:

Block the event loop (other sessions should continue normally)
Cause other sessions to silently drop messages
Require a full Gateway restart to recover

Suggested Fix

Per-session timeout budgets: Abort a session's embedded run after N seconds of no progress, independent of the model call timeout
Better async isolation: Ensure lock contention retries don't block the main event loop (use setImmediate/yield between retries)
Circuit breaker: After repeated lock failures, skip the write instead of retrying
Per-session resource limits: Cap CPU and memory usage per session so one runaway session can't starve others
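
The second and third suggestions can be sketched together: yield to the event loop between lock retries, and stop retrying once a circuit breaker opens. This is a sketch under stated assumptions, not the project's implementation. `SessionLock`, `LockCircuitBreaker`, and `writeWithIsolation` are hypothetical names, since the real session write lock API is not shown in this report.

```typescript
// Hypothetical lock interface; the real session write lock API is not shown here.
interface SessionLock {
  tryAcquire(): boolean; // true when the lock was taken
  release(): void;
}

// Resolves after the current poll phase, so timers, I/O, and other sessions get a turn.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setImmediate(resolve));
}

// Opens after `threshold` consecutive lock failures and stays open for `cooldownMs`.
class LockCircuitBreaker {
  private failures = 0;
  private openUntil = 0;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  isOpen(): boolean {
    return Date.now() < this.openUntil;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openUntil = Date.now() + this.cooldownMs;
      this.failures = 0;
    }
  }
}

// Attempts the write a bounded number of times, yielding between attempts and
// skipping the write entirely while the breaker is open.
async function writeWithIsolation(
  lock: SessionLock,
  breaker: LockCircuitBreaker,
  write: () => void,
  maxAttempts = 5,
): Promise<"written" | "skipped"> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (breaker.isOpen()) return "skipped";
    if (lock.tryAcquire()) {
      try {
        write();
        breaker.recordSuccess();
        return "written";
      } finally {
        lock.release();
      }
    }
    breaker.recordFailure();
    await yieldToEventLoop(); // let other sessions run before retrying
  }
  return "skipped";
}
```

Returning `"skipped"` instead of throwing makes the dropped write visible to the caller, which can then log it or reschedule it. A backoff delay between attempts would be a natural addition, since `setImmediate` alone keeps other work flowing but does not reduce the CPU spent retrying.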
