openclaw - 💡(How to fix) Fix [Bug]: Embedded agent initialization blocks event loop for 33s, causing WebSocket handshake timeouts and 45-85% CPU spike [1 comments, 2 participants]

GitHub stats
openclaw/openclaw#75689 (fetched 2026-05-02 05:31:42)
Comments: 1 · Participants: 2 · Timeline events: 10 · Reactions: 3
Timeline (top): mentioned ×3, subscribed ×3, labeled ×2, closed ×1


Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

Embedded agent initialization takes ~33 seconds of synchronous execution on the Node.js main thread, completely blocking the event loop. This causes WebSocket handshake timeouts, "gateway connect failed" errors, stuck sessions, and CPU spikes to 45-85%.

Steps to reproduce

  1. Start OpenClaw Gateway 2026.4.29 with embedded agents enabled
  2. Send any message to trigger session initialization
  3. Observe Gateway logs:
    • embedded-run startup stages begin
    • eventLoopDelayP99 spikes to 10-16s
    • handshake timeout warnings appear
    • gateway connect failed errors occur
  4. Check CPU — Gateway process shows 45-85% CPU

Reproducibility: 100% (every embedded run triggers the same 33s block)

Expected behavior

Embedded agent initialization should run off the main thread (Worker Threads / child_process) or yield to the event loop periodically. The Gateway should remain responsive during agent initialization.

Actual behavior

Gateway becomes completely unresponsive for ~33 seconds during each embedded run initialization:

  • WebSocket connections time out and close (code 1000/1005)
  • Health checks fail
  • Session messages queue up (queueDepth=1)
  • CPU spikes to 45-85%

OpenClaw version

2026.4.29

Operating system

OpenCloudOS 9.4 (Linux 6.6.117-45.1.oc9.x86_64, x64)

Install method

npm global

Model

kimi/k2.6 (also tested with openai-codex/gpt-5.4)

Provider / routing chain

openclaw -> kimi:default (API key, direct)

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

Embedded run startup stages (33 seconds synchronous):

[embedded-run] startup stages: attempt-dispatch totalMs=16844
  stages: workspace:0ms, runtime-plugins:2ms, hooks:0ms, 
          model-resolution:3843ms, auth:7003ms, context-engine:0ms, 
          attempt-dispatch:5996ms

[embedded-run] prep stages: stream-ready totalMs=33258
  stages: workspace-sandbox:24ms, skills:1ms, 
          core-plugin-tools:9474ms, bootstrap-context:242ms, 
          bundle-tools:2131ms, system-prompt:9420ms, 
          session-resource-loader:2405ms, agent-session:3ms, 
          stream-setup:9420ms
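
To see which stages dominate, the stage timings can be parsed and ranked. The following is a minimal sketch, not OpenClaw tooling: `parseStages` is a hypothetical helper, and it assumes the `name:123ms` layout shown in the log above. For the prep log, the three largest stages (core-plugin-tools, system-prompt, stream-setup) account for roughly 85% of the summed stage time.

```javascript
// Hypothetical helper: extract "name:123ms" pairs from an [embedded-run] log
// entry and return them sorted by duration, longest first.
function parseStages(logText) {
  const stages = [];
  // Stage names consist of letters, digits and hyphens, followed by ":<n>ms".
  const re = /([a-z][a-z0-9-]*):(\d+)ms/gi;
  let match;
  while ((match = re.exec(logText)) !== null) {
    stages.push({ name: match[1], ms: Number(match[2]) });
  }
  return stages.sort((a, b) => b.ms - a.ms);
}

// Log text copied from the issue report.
const prepLog = `[embedded-run] prep stages: stream-ready totalMs=33258
  stages: workspace-sandbox:24ms, skills:1ms,
          core-plugin-tools:9474ms, bootstrap-context:242ms,
          bundle-tools:2131ms, system-prompt:9420ms,
          session-resource-loader:2405ms, agent-session:3ms,
          stream-setup:9420ms`;

const ranked = parseStages(prepLog);
const total = ranked.reduce((sum, stage) => sum + stage.ms, 0);
for (const { name, ms } of ranked) {
  const share = ((100 * ms) / total).toFixed(1);
  console.log(`${name.padEnd(24)} ${String(ms).padStart(6)} ms  ${share}%`);
}
```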

Event loop metrics during block:

[diagnostic] liveness warning: 
  eventLoopDelayP99Ms=12826.2, eventLoopUtilization=0.989, cpuCoreRatio=1.028
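
Comparable figures can be collected independently of the Gateway's own diagnostics with Node's `perf_hooks` module. This is a sketch under assumptions: `startLoopMonitor` is a hypothetical name, and how OpenClaw computes its `eventLoopDelayP99Ms` and `eventLoopUtilization` values is not stated in the issue, so the numbers may not match exactly.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// Hypothetical helper: tracks event loop delay and utilization between samples.
function startLoopMonitor() {
  const histogram = monitorEventLoopDelay({ resolution: 10 }); // sample every 10 ms
  histogram.enable();
  let lastElu = performance.eventLoopUtilization();

  return {
    // Returns metrics for the window since the previous sample() call.
    sample() {
      const delta = performance.eventLoopUtilization(lastElu);
      lastElu = performance.eventLoopUtilization();
      const metrics = {
        p99Ms: histogram.percentile(99) / 1e6, // histogram values are nanoseconds
        utilization: delta.utilization,        // 0 = idle, 1 = fully busy
      };
      histogram.reset();
      return metrics;
    },
    stop() {
      histogram.disable();
    },
  };
}

// Usage idea: call monitor.sample() on an interval and log the result,
// then compare the values before and after a candidate fix.
```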

WebSocket handshake timeout:

[gateway/ws] handshake timeout conn=... durationMs=20733
[gateway/ws] closed before connect code=1000
error: gateway connect failed: Error: gateway closed (1000)

Error counts (single day):

  • handshake timeout/failed: 51
  • gateway connect failed: 33
  • stuck session: 83
  • embedded run startup: 49

System resources (all normal):

Load average: 0.80 / 1.10 / 1.12
Memory: 1.5G / 3.5G (43% used, 2.0G available)
Disk: 25G / 60G (41%)
Disk IO: 0.23% util
Network: 0.048ms localhost latency

Impact and severity

Affected: Any deployment using embedded agents (not just active-memory)

Severity: Critical — Gateway becomes unresponsive during every session initialization

Frequency: 100% reproducible

Consequence: Messages delayed 2-3 minutes or lost, repeated restarts, stuck sessions

Additional information

Related issues with the same root cause:

  • #65517 — [Bug]: Active-memory embedded sub-agent run blocks event loop, starving Telegram polling
  • #56733 — Gateway process alive but event loop frozen — all HTTP requests silently timeout
  • #43178 — Telegram polling watchdog triggers full gateway restart under concurrent multi-agent load
  • #52231 — Embedded run timeout leaves zombie handle blocking heartbeat delivery
  • #65309 — Active Memory blocks direct-chat turns for ~30s and times out
  • #72606 — Active Memory timeoutMs clock starts at plugin level, not at LLM call — embedded run setup overhead causes 100% timeout

The common root cause: synchronous embedded operations on the single-process Node.js event loop starve all I/O operations.
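
The starvation mechanism can be reproduced in isolation. In this sketch, `busyWait` is a hypothetical stand-in for any synchronous initialization stage: a timer scheduled for 10 ms cannot fire until the synchronous work returns, which is the same reason a pending WebSocket handshake cannot complete during the block.

```javascript
// Stand-in for synchronous work: spins on the main thread for `ms` milliseconds.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Schedules a 10 ms timer, then blocks the thread; resolves with how long
// the timer actually took to fire.
function measureTimerLateness(blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 10);
    busyWait(blockMs);
  });
}

measureTimerLateness(200).then((elapsedMs) => {
  // The timer was due after 10 ms but fires only once busyWait has returned.
  console.log(`10 ms timer fired after ${elapsedMs} ms`);
});
```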

Workarounds attempted:

  • Reduced bootstrapMaxChars from 30000 to 10000: helped slightly (CPU from 68.7% to 40.9%) but core issue remains
  • Gateway restart: does not help, new process immediately encounters same 33s block

Proposed solutions:

  1. Move embedded initialization to Worker Thread
  2. Async initialization with yielding (setImmediate between stages)
  3. Pre-warm embedded agents at Gateway startup
  4. Separate Gateway and Agent processes
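
Proposed solution 1 could look roughly like the sketch below. Everything here is illustrative: `initAgentInWorker`, the inline worker source, and the simulated busy loop are assumptions, not OpenClaw code. It also assumes the initialization result can be sent back as a structured-cloneable message; live objects such as open sockets cannot cross the thread boundary this way, so a real refactor may need a different split.

```javascript
const { Worker } = require('node:worker_threads');

// Worker body, inlined for a self-contained example. The busy loop is a
// hypothetical stand-in for the CPU-heavy initialization stages.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const end = Date.now() + workerData.simulateMs;
  while (Date.now() < end) {}
  parentPort.postMessage({ ready: true, agentId: workerData.agentId });
`;

// Runs the initialization on a separate thread and resolves with its result.
function initAgentInWorker(agentId, simulateMs = 200) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { agentId, simulateMs },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`init worker exited with code ${code}`));
    });
  });
}

// The main thread stays free to serve I/O while the worker spins.
initAgentInWorker('demo-agent').then((result) => console.log(result));
```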

Extent analysis

TL;DR

The most likely fix is to move embedded agent initialization off the main thread, for example into a Worker Thread, so that it no longer blocks the Node.js event loop.

Guidance

  • Use the stage timings in the logs to find the stages that dominate the block. In the prep phase these are core-plugin-tools (9474 ms), system-prompt (9420 ms), and stream-setup (9420 ms); in the startup phase, auth (7003 ms), attempt-dispatch (5996 ms), and model-resolution (3843 ms). Prioritize optimizing these stages or moving them off the main thread.
  • Consider yielding with setImmediate between initialization stages so the event loop can service other work. Yielding helps only between stages: a stage that itself runs for several seconds still blocks until it returns.
  • Evaluate the four proposed solutions (Worker Thread, async initialization with yielding, pre-warming at Gateway startup, separate Gateway and Agent processes) against the Gateway's constraints before committing to one.
  • Monitor eventLoopDelayP99Ms and eventLoopUtilization before and after any change; an effective fix should bring both well below the 12826.2 ms and 0.989 reported here.

Example

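// Sketch: yield to the event loop between initialization stages.
// The three stage functions are hypothetical placeholders, not OpenClaw APIs.
const { setImmediate: yieldToEventLoop } = require('node:timers/promises');

const setupWorkspace = () => {};        // Stage 1: workspace setup
const resolveModel = () => {};          // Stage 2: model resolution
const authAndContextEngine = () => {};  // Stage 3: auth and context engine

async function initEmbeddedAgent() {
  for (const stage of [setupWorkspace, resolveModel, authAndContextEngine]) {
    stage();
    // Let pending I/O (for example WebSocket handshakes) run before the next stage.
    // A single long stage still blocks until it returns.
    await yieldToEventLoop();
  }
}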
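
The example above yields only between stages, so a stage that runs for several seconds on its own would still stall the loop. If such a stage iterates over many items (an assumption; the issue does not say what the slow stages do internally), it could yield on a time budget instead. `processInSlices` and its parameters are hypothetical names for this sketch.

```javascript
const { performance } = require('node:perf_hooks');
const { setImmediate: yieldToEventLoop } = require('node:timers/promises');

// Applies handleItem to each item, yielding to the event loop whenever the
// current slice has used up its time budget.
async function processInSlices(items, handleItem, budgetMs = 10) {
  const results = [];
  let sliceStart = performance.now();
  for (const item of items) {
    results.push(handleItem(item));
    if (performance.now() - sliceStart >= budgetMs) {
      await yieldToEventLoop(); // pending I/O callbacks run here
      sliceStart = performance.now();
    }
  }
  return results;
}

// Usage: double a list of numbers without holding the loop for more than
// about 10 ms at a time.
processInSlices([1, 2, 3], (n) => n * 2).then((doubled) => console.log(doubled));
```

This bounds the blocking time per slice but does not reduce total CPU work, so it complements rather than replaces moving the work off the main thread.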

Notes

The provided logs and error messages indicate a clear issue with synchronous embedded agent initialization blocking the event loop. However, the optimal solution may depend on the specific requirements and constraints of the OpenClaw Gateway and embedded agents.
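
Pre-warming (proposed solution 3) may be the least invasive option to try first, with the caveat that it moves the block to Gateway startup rather than removing it. A minimal sketch, assuming initialization can be expressed as a function returning the agent or a promise for it; `getAgent`, `warmAgents`, and `initAgent` are hypothetical names.

```javascript
// Hypothetical cache of in-flight or completed initializations, keyed by agent id.
const warmAgents = new Map();

// Returns the cached initialization promise, starting it on first use.
// Calling this at Gateway startup pre-warms the agent; later callers share
// the same promise instead of triggering a second initialization.
function getAgent(agentId, initAgent) {
  if (!warmAgents.has(agentId)) {
    const pending = Promise.resolve()
      .then(() => initAgent(agentId))
      .catch((err) => {
        warmAgents.delete(agentId); // allow a retry after a failed init
        throw err;
      });
    warmAgents.set(agentId, pending);
  }
  return warmAgents.get(agentId);
}
```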

Recommendation

Apply a workaround by moving embedded agent initialization to a Worker Thread or implementing async initialization with yielding, as these approaches are likely to mitigate the event loop blockage and improve Gateway responsiveness.
