openclaw: A single stalled agent session blocks the entire Gateway event loop (isolation failure)

Description

A single stalled agent session (a model call that hung, apparently because of session write lock contention) blocks the entire Gateway event loop, so all other sessions stop processing messages. This is a session isolation failure: one agent's hang should not affect the availability of other agents or of the Gateway itself.

Environment

  • OpenClaw version: 2026.5.20-beta.1 (also observed in 2026.5.19-beta.2)
  • OS: Linux (WSL2) 6.6.114.1-microsoft-standard-WSL2 x64
  • Node.js: v22.22.0
  • Channel: Feishu (飞书) group chat
  • Model: zai/glm-5-turbo

Steps to Reproduce

  1. One agent (agent-architect) spawns a subagent that enters a model call
  2. The model call hangs (likely due to session write lock contention / retry storm)
  3. The session is marked as stalled after ~6 minutes
  4. Gateway event loop reaches 100% utilization
  5. All other sessions stop processing inbound messages
  6. Other sessions' dispatches are aborted silently

Observed Behavior

Event Loop Degradation

16:44:02 liveness warning: eventLoopDelayP99Ms=2548 eventLoopUtilization=1 cpuCoreRatio=1.358
16:46:18 liveness warning: eventLoopDelayP99Ms=2514 eventLoopUtilization=1 cpuCoreRatio=1.291
16:48:37 liveness warning: eventLoopDelayP99Ms=2933 eventLoopUtilization=1 cpuCoreRatio=1.341
16:50:56 liveness warning: eventLoopDelayP99Ms=1019 eventLoopUtilization=1 cpuCoreRatio=1.288
16:53:19 liveness warning: eventLoopDelayP99Ms=19562 eventLoopMaxMs=19562 eventLoopUtilization=1
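
For context, the fields in these warnings correspond to metrics that Node.js exposes through `perf_hooks`. The sketch below shows one way such a sample could be collected; it is not OpenClaw's actual liveness code. `startLivenessSampler`, `isDegraded`, the `LivenessSample` shape, and both thresholds are hypothetical names and values chosen for illustration. Only `monitorEventLoopDelay` and `performance.eventLoopUtilization` are real Node.js APIs.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical shape that mirrors the field names in the log lines above.
interface LivenessSample {
  eventLoopDelayP99Ms: number;
  eventLoopMaxMs: number;
  eventLoopUtilization: number;
}

// Assumed thresholds for illustration only, not OpenClaw's real values.
const DELAY_P99_WARN_MS = 1000;
const UTILIZATION_WARN = 0.95;

function isDegraded(sample: LivenessSample): boolean {
  return (
    sample.eventLoopDelayP99Ms >= DELAY_P99_WARN_MS ||
    sample.eventLoopUtilization >= UTILIZATION_WARN
  );
}

// Samples the event loop every `intervalMs` and returns a stop function.
function startLivenessSampler(
  intervalMs: number,
  onSample: (sample: LivenessSample) => void,
): () => void {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  const timer = setInterval(() => {
    const current = performance.eventLoopUtilization();
    // Utilization over the window between the previous and current reading.
    const delta = performance.eventLoopUtilization(current, previous);
    onSample({
      // The histogram reports nanoseconds; convert to milliseconds.
      eventLoopDelayP99Ms: histogram.percentile(99) / 1e6,
      eventLoopMaxMs: histogram.max / 1e6,
      eventLoopUtilization: delta.utilization,
    });
    histogram.reset();
    previous = current;
  }, intervalMs);

  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

An `eventLoopUtilization` of 1 means the loop had no idle time during the sampling window, which matches the symptom that no other session could get work scheduled.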

Stalled Session

stalled session: sessionId=f9274af7 sessionKey=agent:agent-architect:sub:r1-15-skill-audit
  state=processing
  age=1016s (17 minutes!)
  reason=active_work_without_progress
  activeWorkKind=model_call

Impact on Other Sessions

  • Other agent sessions in the same Gateway could not process new messages
  • Dispatched messages returned queuedFinal=false, replies=0 (silent drop)
  • User messages were aborted before processing
  • Gateway RSS memory grew from ~1.8GB to 4.4GB during the stall

Recovery

Gateway's stuck session recovery eventually aborted the embedded run after ~17 minutes:

stuck session recovery: action=abort_embedded_run aborted=true

However, the event loop remained blocked even after recovery, and a full Gateway restart was required.
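
Recovery took about 17 minutes here. A shorter per-session no-progress budget could be sketched as follows. This is an illustration under assumptions, not the Gateway's recovery code: `SessionProgressBudget`, `markProgress`, and `enforce` are hypothetical names, and the clock is passed in explicitly so the logic is easy to test. `AbortController` is the standard global available in Node.js.

```typescript
// Hypothetical per-session watchdog; names are illustrative, not OpenClaw APIs.
class SessionProgressBudget {
  readonly controller: AbortController;
  private readonly budgetMs: number;
  private lastProgressMs: number;

  constructor(budgetMs: number, nowMs: number) {
    this.controller = new AbortController();
    this.budgetMs = budgetMs;
    this.lastProgressMs = nowMs;
  }

  // Call whenever the session makes observable progress (e.g. a streamed token).
  markProgress(nowMs: number): void {
    this.lastProgressMs = nowMs;
  }

  // Aborts the session if it has been idle longer than the budget.
  // Returns true if the session is aborted (now or on an earlier call).
  enforce(nowMs: number): boolean {
    const idleMs = nowMs - this.lastProgressMs;
    if (!this.controller.signal.aborted && idleMs > this.budgetMs) {
      this.controller.abort(new Error(`no progress for ${idleMs}ms`));
    }
    return this.controller.signal.aborted;
  }
}

// Usage: a 60 s budget, checked with explicit timestamps.
const budget = new SessionProgressBudget(60_000, 0);
budget.markProgress(50_000);
const abortedEarly = budget.enforce(100_000); // 50 s idle: within budget
const abortedLate = budget.enforce(111_000); // 61 s idle: aborted
```

The `controller.signal` could be handed to any operation that accepts an `AbortSignal`, such as `fetch`. Note that a watchdog driven by a timer on the same event loop cannot fire while that loop is saturated, so it complements, rather than replaces, keeping retry logic off the hot path.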

Root Cause Analysis

The stalled session's lock contention appears to trigger a retry storm (acquire session write lock -> fail -> retry -> fail). Although the underlying API calls are async, the lock acquisition/release and retry logic runs synchronously on the event loop. With enough retries, this overhead saturates the loop and eventually blocks it entirely.

Related issues:

  • #83510 — Session takeover lock contention
  • #84250 — Tolerate in-process session writes
  • #78123 — Feishu dispatch completes with replies=0
  • #32903 — dispatch replies=0 silent drop
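
To illustrate the suspected mechanism: in Node.js, a retry loop that only awaits already-settled promises keeps running from the microtask queue, which is drained completely before the loop returns to timers and I/O. Awaiting a real timer between attempts hands control back to the event loop. The sketch below assumes a hypothetical synchronous `tryAcquire` callback and hypothetical option names; it is not the actual session lock code. `setTimeout` from `node:timers/promises` is a real Node.js API.

```typescript
import { setTimeout as sleep } from "node:timers/promises";

// Hypothetical retry options; names and values are illustrative.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Exponential backoff, capped at maxDelayMs. `attempt` is zero-based.
function backoffDelayMs(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.baseDelayMs * 2 ** attempt, opts.maxDelayMs);
}

// Tries to take a lock, yielding to the event loop between attempts.
// `tryAcquire` stands in for whatever non-blocking lock check is used.
async function acquireWithBackoff(
  tryAcquire: () => boolean,
  opts: RetryOptions,
): Promise<boolean> {
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    if (tryAcquire()) {
      return true;
    }
    if (attempt < opts.maxAttempts - 1) {
      // Awaiting a timer lets other sessions' callbacks run before the retry.
      await sleep(backoffDelayMs(attempt, opts));
    }
  }
  return false; // Bounded: give up instead of retrying forever.
}
```

The attempt cap matters as much as the delay: an unbounded retry loop with a short delay can still dominate the loop when many writers contend for the same session.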

Expected Behavior

A single agent session stall should NOT:

  1. Block the event loop (other sessions should continue normally)
  2. Cause other sessions to silently drop messages
  3. Require a full Gateway restart to recover

Suggested Fix

  1. Per-session timeout budgets: Abort a session's embedded run after N seconds of no progress, independent of the model call timeout
  2. Better async isolation: Ensure lock contention retries don't block the main event loop (use setImmediate/yield between retries)
  3. Circuit breaker: After repeated lock failures, skip the write instead of retrying
  4. Per-session resource limits: Cap CPU and memory usage per session so one runaway session can't starve others
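
As a rough sketch of suggestion 3, a small circuit breaker around session writes might look like the following. `WriteCircuitBreaker` and its methods are hypothetical names, the threshold and cooldown are placeholders, and timestamps are passed in explicitly. What to do with a skipped write (drop it, buffer it, or surface an error) is left to the caller and is not specified in this report.

```typescript
// Hypothetical circuit breaker for session writes; not an OpenClaw API.
class WriteCircuitBreaker {
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // True if a write attempt should be made at time `nowMs`.
  allowAttempt(nowMs: number): boolean {
    if (this.openedAtMs === null) {
      return true; // Closed: writes proceed normally.
    }
    // Open: skip writes until the cooldown has elapsed, then allow a probe.
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = null;
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      // Open (or re-open after a failed probe) and restart the cooldown.
      this.openedAtMs = nowMs;
    }
  }
}

// Usage: open after 3 consecutive lock failures, probe again after 5 s.
const breaker = new WriteCircuitBreaker(3, 5_000);
breaker.recordFailure(0);
breaker.recordFailure(10);
breaker.recordFailure(20);
const skipWrite = !breaker.allowAttempt(100); // true: breaker is open
```

Skipping writes while the breaker is open bounds the retry work per session, which is the property the report asks for, at the cost of possibly losing or delaying some session writes.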
