Single-threaded agent model_call blocks all 83 agents: eventLoop utilization sustained at 1.0 even after cleaning sessions and patching memory thresholds (not Telegram, not session leak, pure dispatch bottleneck)

openclaw · 2026-05-26 03:56:24

Root Cause

#78808 was closed because Telegram polling was moved to a worker thread. But we are a Feishu (Lark) deployment with 83 agents — no Telegram, no polling bottlenecks. Our Gateway's single-threaded event loop is saturated by agent model_call contention, not channel I/O.

This is a follow-up to #78808 (closed as implemented for Telegram), #78861, #84903, and #86745.

The fundamental problem has NOT been solved for non-Telegram deployments

Real-world evidence (83 agents, v2026.5.24-beta.2)

Time    eventLoop util   delayMax   active agents   Cause
10:33   0.983            14.7s      1               agent-architect model_call stuck
10:39   1.0              24.7s      1               agent-architect subagent model_call
10:50   1.0              13.1s      2               architect + butler model_calls
11:32   0.996            12.5s      2               trading agents model_calls

One agent's slow API response (GLM-5-Turbo taking 15-25s) blocks the ENTIRE Gateway. q=1 in the work queue confirms downstream agents are queued waiting.

This is not a model speed problem — DeepSeek V4 Pro also shows the same pattern. This is an architectural problem: all agent runs share one event loop, and model_call is synchronous-blocking within that loop.
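
The two metrics in the table above can be sampled in-process with Node's `perf_hooks` module. The sketch below is a minimal illustration, not OpenClaw's own diagnostic code: `startLoopSampler`, `isSaturated`, and the default thresholds are hypothetical names and values chosen for this example. Because `performance.eventLoopUtilization()` counts time spent waiting on pending I/O as idle, a sampler like this may help separate time the loop spends executing JavaScript from time spent waiting on the model API.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// One reading of the event loop, covering the interval since the previous reading.
interface LoopSample {
  utilization: number; // fraction of the interval the loop was busy (0..1)
  delayMaxMs: number;  // worst timer delay seen in the interval, in milliseconds
}

// Hypothetical alert rule; the default thresholds are assumptions, not OpenClaw settings.
function isSaturated(s: LoopSample, utilLimit = 0.95, delayLimitMs = 1000): boolean {
  return s.utilization >= utilLimit || s.delayMaxMs >= delayLimitMs;
}

// Calls onSample every intervalMs and returns a function that stops sampling.
function startLoopSampler(intervalMs: number, onSample: (s: LoopSample) => void): () => void {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  const timer = setInterval(() => {
    const current = performance.eventLoopUtilization();
    // Passing both readings yields the utilization for just this interval.
    const delta = performance.eventLoopUtilization(current, previous);
    previous = current;
    // histogram.max is reported in nanoseconds.
    onSample({ utilization: delta.utilization, delayMaxMs: histogram.max / 1e6 });
    histogram.reset();
  }, intervalMs);

  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

A caller might run `const stop = startLoopSampler(5000, (s) => { if (isSaturated(s)) console.warn(s); });` and call `stop()` on shutdown. Whether these readings match the Gateway's own `eventLoop util` and `delayMax` figures depends on how OpenClaw computes them, which the report does not state.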

Agent model_call blocking the gateway

From diagnostic liveness logs:

work=[active=agent:agent-architect:feishu:group:...(processing/model_call,q=1,age=25s)]

q=1 means downstream requests are queued behind the active run. A single model_call that takes 20 seconds stalls the other 82 agents for the full 20 seconds.
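
To track these stalls over time, the liveness line can be parsed into structured fields. The following is a rough sketch that assumes every entry has the shape seen in the single sample above, `active=<session>(<state>/<phase>,q=<n>,age=<n>s)`. The real log format may carry more fields or other units, and `parseActiveWork` and `isStalledModelCall` are hypothetical helper names.

```typescript
// Fields recovered from one "active" entry of a liveness log line.
interface ActiveWork {
  session: string;    // e.g. "agent:agent-architect:feishu:group:..."
  state: string;      // e.g. "processing"
  phase: string;      // e.g. "model_call"
  queued: number;     // the q= value
  ageSeconds: number; // the age= value, assumed to always be in seconds
}

// Assumed shape: active=<session>(<state>/<phase>,q=<n>,age=<n>s)
const ACTIVE_ENTRY = /active=([^()]+)\((\w+)\/(\w+),q=(\d+),age=(\d+)s\)/;

// Returns the first active entry in the line, or null if none matches.
function parseActiveWork(line: string): ActiveWork | null {
  const m = ACTIVE_ENTRY.exec(line);
  if (m === null) return null;
  return {
    session: m[1] ?? "",
    state: m[2] ?? "",
    phase: m[3] ?? "",
    queued: Number(m[4]),
    ageSeconds: Number(m[5]),
  };
}

// Flags a model_call that has run at least maxAgeSeconds while other work is queued.
function isStalledModelCall(w: ActiveWork, maxAgeSeconds: number): boolean {
  return w.phase === "model_call" && w.queued > 0 && w.ageSeconds >= maxAgeSeconds;
}
```

Applied to the sample line, this yields `phase: "model_call"`, `queued: 1`, and `ageSeconds: 25`.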

After cleaning 796 stale subagent sessions

We cleaned 796 completed subagent sessions (see #86745 for the subagent session leak). After cleanup:

RSS dropped from 3.2GB to 2.1GB
session-locks dropped from 36s to 64ms
eventLoop utilization stayed at 0.85-1.0, so the bottleneck is NOT session accumulation; it is the single-threaded agent dispatch

Comparison: 18-agent deployment runs perfectly

A separate 18-agent deployment on identical hardware (31GB, same Node version) has been running for 77 days with:

Gateway RSS: 667MB
Zero memory pressure alerts
eventLoop utilization: normal
System load: 0.02

The 83-agent deployment has the same per-agent workload but 4.6x the agents (83 / 18 ≈ 4.6), which means roughly 4.6x the model_call contention and, in turn, eventLoop saturation.

What we need

Agent model_call dispatch needs to be offloaded from the main event loop. Suggestions:

Worker-thread pool for agent model_call: similar to how Telegram polling was isolated, but for the actual agent inference dispatch
Per-agent event loop isolation: each agent gets its own worker thread or, at minimum, model calls are dispatched to a thread pool
Configurable max concurrent model_calls: if the architecture can't change, at least prevent all 83 agents from competing for the event loop simultaneously
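
The third suggestion could look something like the sketch below: a small in-process limiter that lets at most N model calls run at once and queues the rest in FIFO order. This is only an illustration of the idea. `ModelCallLimiter` is a hypothetical name, nothing here is an existing OpenClaw API, and where such a limiter would hook into the Gateway's dispatch path is an open question. It only caps concurrency and does not move any work off the main thread.

```typescript
// Caps how many model calls run at once; extra calls wait in FIFO order.
class ModelCallLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    if (!Number.isInteger(maxConcurrent) || maxConcurrent < 1) {
      throw new RangeError("maxConcurrent must be a positive integer");
    }
    this.maxConcurrent = maxConcurrent;
  }

  get activeCount(): number { return this.active; }
  get queuedCount(): number { return this.waiting.length; }

  // Runs task now if a slot is free, otherwise once an earlier call settles.
  run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = (): void => {
        this.active += 1;
        // Frees the slot and hands it to the next waiting call, if any.
        const finish = (): void => {
          this.active -= 1;
          const next = this.waiting.shift();
          if (next !== undefined) next();
        };
        let pending: Promise<T>;
        try {
          pending = task();
        } catch (err) {
          // task threw before returning a promise
          finish();
          reject(err);
          return;
        }
        pending.then(
          (value) => { finish(); resolve(value); },
          (err) => { finish(); reject(err); },
        );
      };
      if (this.active < this.maxConcurrent) start();
      else this.waiting.push(start);
    });
  }
}
```

A dispatcher would then wrap each call, for example `await limiter.run(() => callModel(request))`, where `callModel` stands in for whatever function actually issues the model request.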

Environment

OpenClaw: 2026.5.24-beta.2
Node.js: v22.22.0
Agents: 83 (82 on GLM-5-Turbo, 1 on DeepSeek V4 Pro)
Channel: Feishu (Lark) WebSocket — NOT polling-based
NODE_OPTIONS=--max-old-space-size=8192 configured
Memory pressure thresholds already patched to 8GB/10GB
796 stale subagent sessions already cleaned
OS: Linux (WSL2), 31GB RAM, RTX 3090
