openclaw - ✅(Solved) Fix [Bug]: Gateway repeatedly closes connections (1000/1005/1006) due to event‑loop starvation caused by stuck tool call [2 pull requests, 8 comments, 6 participants]

GitHub stats
openclaw/openclaw#78402, fetched 2026-05-07 03:37:22
Comments
8
Participants
6
Timeline
18
Reactions
3
Assignees
Timeline (top)
commented ×8, cross-referenced ×2, mentioned ×2, subscribed ×2

After upgrading to OpenClaw 2026.5.5, the local gateway becomes unresponsive shortly after startup. WebSocket connections fail with codes 1000, 1005, and 1006, the UI disconnects repeatedly, and the CLI cannot establish a stable connection.

Diagnostics show severe event‑loop starvation and a single long‑running exec tool call blocking the entire runtime for 10–20+ minutes.

This results in:

  • Gateway handshake failures
  • Telegram channel timeouts
  • Extremely slow node.list, sessions.list, and chat.history calls
  • Webchat repeatedly disconnecting
  • CLI unable to connect (gateway closed (1000))

Environment

  • OpenClaw version: 2026.5.5
  • Platform: Linux
  • Agent: agent:marcus:main
  • Channels enabled: Telegram (multiple accounts)
  • Runtime: Local Gateway (127.0.0.1:18789)

Error Message

Key Log Excerpts (Truncated for clarity)

warn diagnostic eventLoopDelayP99Ms=35500.6 eventLoopUtilization=1 cpuCoreRatio=1.031
warn fetch-timeout timer delayed 40171ms, likely event-loop starvation
warn gateway/ws handshake-timeout ... closed before connect code=1000
warn gateway/ws closed before connect code=1006
activeTool=exec activeToolCallId=call_function_pt65b1sa6ex9_1 activeToolAge=697s
activeWorkKind=tool_call queued_behind_active_work

webchat disconnected code=1006
gateway connect failed: Error: gateway closed (1000)

fetch timeout reached; aborting operation (Telegram getMe)
elapsedMs=50171 timerDelayMs=40171
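The fetch-timeout line can be read as simple arithmetic. The sketch below assumes that timerDelayMs is the gap between when the timer was due and when it actually fired; the log does not define the field, so this is an interpretation. The helper name is hypothetical and not part of OpenClaw.

```typescript
// Hypothetical helper, not an OpenClaw API. It assumes
// timerDelayMs = elapsedMs - configuredTimeoutMs, which the log does not state.
function inferConfiguredTimeoutMs(elapsedMs: number, timerDelayMs: number): number {
  return elapsedMs - timerDelayMs;
}

// Values copied from the Telegram getMe log line.
const configuredTimeoutMs = inferConfiguredTimeoutMs(50171, 40171);

// How many times longer than intended the caller actually waited.
const slowdownFactor = 50171 / configuredTimeoutMs;

console.log(configuredTimeoutMs); // 10000
console.log(slowdownFactor.toFixed(2)); // "5.02"
```

Under that assumption, a timeout that was probably set to 10 s took about 50 s to fire, which matches the "timer delayed 40171ms" warning.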

Root Cause

Based on the logs, the root cause appears to be a single frozen exec tool call that blocks the event loop; every other symptom is downstream of it.

Fix Action

Fixed

PR fix notes

PR #78479: fix(gateway): close stale WebSocket connections on ping/pong timeout (#78402)

Description (problem / solution / changelog)

Summary

  • Problem: Gateway WebSocket connections drop with codes 1000/1005/1006 during event-loop starvation. When a long-running tool call (exec) monopolises the event loop (utilization=1, P99 delay=35 000 ms), the ws library cannot process incoming frames. When the event loop recovers, pending pings fire simultaneously without any pong ever being tracked, leaving connections in a limbo state until the OS drops them abruptly.
  • Why it matters: Users see repeated reconnect cycles in Control UI during any heavy agent session; the gateway logs show close(1000/1005/1006) chains that look like flaky network but are actually a dead-event-loop artifact.
  • What changed: setClient inside attachGatewayWsConnectionHandler now initialises a pongReceived flag and registers a socket.on("pong") listener. The existing setInterval ping loop checks the flag before each tick: if no pong arrived since the previous ping, it calls close(1001, "ping timeout") and returns without sending another ping.
  • What did NOT change: Pre-authentication connection budget, handshake timeout, close-cause tracking, payload limits, and the lazy-loaded message handler are all unchanged.
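The pong-tracking check described in "What changed" can be sketched roughly as follows. This is not the code from the PR: the interface, the factory function, the fake socket, and the initial value of the flag are all assumptions made for illustration. Only the pongReceived flag, the "pong" listener, and close(1001, "ping timeout") come from the description above.

```typescript
// Minimal socket surface used by this sketch. The `ws` WebSocket offers ping(),
// close(code, reason) and a "pong" event; the interface itself is hypothetical.
interface PingableSocket {
  ping(): void;
  close(code: number, reason: string): void;
  on(event: "pong", listener: () => void): void;
}

function createPingLiveness(socket: PingableSocket) {
  // Assumption: start as true so the first tick sends a ping instead of closing.
  let pongReceived = true;
  let closed = false;

  socket.on("pong", () => {
    pongReceived = true;
  });

  return {
    // Call once per ping interval (25 s in the PR description).
    tick(): void {
      if (closed) return;
      if (!pongReceived) {
        // No pong since the previous ping: treat the connection as stale.
        closed = true;
        socket.close(1001, "ping timeout");
        return; // do not send another ping
      }
      pongReceived = false;
      socket.ping();
    },
    isClosed: (): boolean => closed,
  };
}

// Fake socket so the logic can be exercised without a network.
class FakeSocket implements PingableSocket {
  pings = 0;
  closedWith: { code: number; reason: string } | null = null;
  private pongListeners: Array<() => void> = [];
  ping(): void { this.pings += 1; }
  close(code: number, reason: string): void { this.closedWith = { code, reason }; }
  on(_event: "pong", listener: () => void): void { this.pongListeners.push(listener); }
  emitPong(): void { for (const listener of this.pongListeners) listener(); }
}

// Healthy peer: a pong arrives between ticks, so pings continue.
const healthy = new FakeSocket();
const healthyLiveness = createPingLiveness(healthy);
healthyLiveness.tick(); // ping 1
healthy.emitPong();
healthyLiveness.tick(); // ping 2

// Stale peer: no pong arrives, so the second tick closes the socket.
const stale = new FakeSocket();
const staleLiveness = createPingLiveness(stale);
staleLiveness.tick(); // ping 1
staleLiveness.tick(); // close(1001, "ping timeout")
```

In a real server the tick would typically be driven by something like setInterval(() => liveness.tick(), 25_000), with the timer cleared when the socket closes.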

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration

Linked Issue/PR

  • Closes #78402
  • This PR fixes a bug or regression

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: Connections silently dying with 1000/1005/1006 codes during exec-tool starvation instead of being cleanly closed with 1001.
  • Real environment tested: Local development build (Node.js 22, macOS).
  • Exact steps or command run after this patch: pnpm test src/gateway/server/ws-connection.test.ts --reporter=verbose
  • Evidence after fix (terminal output):
 RUN  v4.1.5 /Users/dev/openclaw

 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > threads current auth getters into the handshake handler instead of a stale snapshot 14ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > uses the gateway TLS scheme for canvas host URLs 3ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > rejects late client registration after a pre-connect socket close 2ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > sends protocol pings until the connection closes 3ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > closes with code 1001 when a pong is not received before the next ping (event-loop starvation guard) 3ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > does not close when pong is received before the next ping interval 2ms

 Test Files  1 passed (1)
      Tests  6 passed (6)
   Start at  09:08:18
   Duration  1.57s (transform 554ms, setup 162ms, import 1.25s, tests 30ms, environment 0ms)
  • Observed result after fix: All 6 tests pass, including 2 new regression tests that cover the timeout close and the healthy-pong no-close paths.
  • What was not tested: Live event-loop starvation with a real exec tool call on a running gateway instance (unit tests cover the interval logic via fake timers).

Root Cause (if applicable)

  • Root cause: pingTimer in setClient called socket.ping() on a 25-second interval but never registered a pong event listener. There was no way to detect that a pong was missed — pings accumulated silently during starvation and the connection was left half-open.
  • Missing detection / guardrail: No pong-received tracking and no liveness close path.
  • Contributing context: Node.js event-loop starvation from a long-running synchronous exec call prevents the ws library from reading pong frames from the TCP buffer, so the client's pong reply never gets processed even if it was sent.

Regression Test Plan (if applicable)

Two new tests in src/gateway/server/ws-connection.test.ts using vi.useFakeTimers():

  1. "closes with code 1001 when a pong is not received before the next ping (event-loop starvation guard)" — advances timer 25 s twice without emitting pong; asserts socket.close(1001, "ping timeout") and logWsControl.warn(...) are called.
  2. "does not close when pong is received before the next ping interval" — advances 25 s, emits pong, advances 25 s again; asserts socket.close was never called and a second ping was sent.

Environment

  • OS: macOS (Darwin 25.x)
  • Runtime/container: Node.js 22, local dev
  • Model/provider: N/A (infrastructure fix)
  • Integration/channel (if any): N/A
  • Relevant config (redacted): N/A

Steps

  1. Trigger a heavy exec tool call that blocks the Node.js event loop for >25 seconds.
  2. Observe gateway logs — before fix: close(1000/1005/1006) repeated on reconnect. After fix: close(1001, "ping timeout") on first stale interval.

Expected

  • Connection closed cleanly with 1001 and a warning log entry containing "ping pong timeout".

Actual (after fix)

  • Exactly that — socket.close(1001, "ping timeout") called after one missed pong interval; ping loop stopped immediately.

Evidence

  • Failing test/log before + passing after (terminal output above)

Human Verification (required)

  • Verified scenarios: Ping-timeout close path (fake timers, no pong); normal pong path (pong received, no spurious close); existing ping-until-close test still passes.
  • Edge cases checked: Socket that closes between pings (timer already cleared by close() listener); ping() throw race with a socket in CLOSING state (existing try/catch handles it).
  • What you did not verify: Live starvation scenario on a real gateway with >600 s exec tool call.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: False-positive closes if the event loop is briefly slow and a pong frame arrives just after the timer fires.
    • Mitigation: At 25-second intervals this is extremely unlikely in practice; any genuinely slow loop that delays a pong by a full 25 s is already harming the user session and should be closed.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/gateway/server/ws-connection.test.ts (modified, +153/-0)
  • src/gateway/server/ws-connection.ts (modified, +23/-0)

PR #78645: fix(agents): bound live exec output events

Description (problem / solution / changelog)

Summary

  • Problem: high-volume exec output could emit thousands of live Gateway agent events with growing aggregated output, starving unrelated Gateway RPCs and making clients/channels look disconnected.
  • Why it matters: issue #78402 reports WebSocket reconnects, Telegram timeouts, and slow sessions.list / chat.history / node.list while tool-heavy agents run.
  • What changed: exec/bash live output deltas are rate-limited per tool call, skipped before redaction/serialization, and oversized live command-output payloads are capped to a bounded tail.
  • What did NOT change (scope boundary): command execution, exit status, approval parsing, provider calls, and underlying tool semantics are not changed.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #78402
  • Related #78479
  • This PR fixes a bug or regression

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: Gateway event-loop/RPC starvation during noisy live exec output.
  • Real environment tested: local macOS source checkout, rebuilt runtime bundle, isolated Gateway state under /private/tmp/openclaw-78402-natural, live OpenAI openai/gpt-5.5 API route.
  • Exact steps or command run after this patch: node scripts/tsdown-build.mjs, then pnpm exec tsx /private/tmp/openclaw-78402-natural-repro.ts with credentials loaded from /private/tmp/openclaw-78402-live-creds.env.

Human-readable result: before this patch, the noisy exec run turned one command into a live Gateway event flood. The Gateway sent 9,297 agent WebSocket events while also trying to answer normal RPCs, so unrelated client calls saw multi-second tail latency. After this patch, the same live repro sent 57 agent WebSocket events, command-output updates appeared at a bounded cadence, the final live output was capped, and the normal RPC p99s stayed around half a second. The remaining isolated max values line up with cold plugin-tool loading at agent startup rather than the exec-output stream.

| Measurement | Before patch | After patch | What changed |
| --- | --- | --- | --- |
| Agent WebSocket events | 9,297 | 57 | Live event volume dropped by about 99.4%. |
| WebSocket close | Clean harness shutdown | Clean harness shutdown | No disconnect introduced by the fix. |
| sessions.list p99 / max | 1,728ms / 12,173ms | 551ms / 5,735ms | Tail latency dropped; remaining max is startup/plugin-load shaped. |
| chat.history p99 / max | 3,529ms / 4,685ms | 516ms / 1,947ms | Tail latency dropped from multi-second to roughly half-second p99. |
| node.list p99 / max | 1,925ms / 6,111ms | 566ms / 5,194ms | Tail latency dropped; no request failures. |
| Event-loop warning | max about 5,087ms during event flood | max about 3,706ms after bounded stream | Improved, with residual cold-start work still visible. |
  • Evidence after fix: copied summary from /private/tmp/openclaw-78402-natural/summary.json:
    • WebSocket agent events: 57 total, down from 9,297 in the same repro before the fix.
    • sessions.list: count 125, failures 0, p50 23ms, p95 33ms, p99 551ms, max 5735ms.
    • chat.history: count 125, failures 0, p50 5ms, p95 9ms, p99 516ms, max 1947ms.
    • node.list: count 125, failures 0, p50 3ms, p95 6ms, p99 566ms, max 5194ms.
    • WebSocket closed cleanly with code 1000 after the repro completed.
  • Observed result after fix: live exec command-output events are emitted about every 250ms, final live output is capped, and normal RPC p99s stay around 0.5s instead of multi-second tails from event flood.
  • What was not tested: full channel-specific Telegram or Discord live roundtrip after the patch; the root bottleneck was reproduced at the shared Gateway/agent event stream layer.
  • Before evidence: same harness before the fix emitted 9,297 agent events; sessions.list p99 1728ms/max 12173ms, chat.history p99 3529ms/max 4685ms, node.list p99 1925ms/max 6111ms, with event-loop delay max about 5087ms.
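For readers checking the numbers, here is a small sketch of how p50/p99/max figures and the 99.4% reduction can be computed. It uses the nearest-rank percentile definition, which is an assumption: the repro harness may use a different method. The latency sample is synthetic and is not the harness data.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
// Assumption: the repro harness may define percentiles differently.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Relative drop between two counts, as a percentage.
function percentDrop(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Synthetic sample of 125 calls: 123 fast calls plus two slow outliers.
const latencies: number[] = Array.from({ length: 123 }, () => 5);
latencies.push(516, 1947);

console.log(percentile(latencies, 50)); // 5
console.log(percentile(latencies, 99)); // 516
console.log(percentile(latencies, 100)); // 1947
console.log(percentDrop(9297, 57).toFixed(1)); // "99.4"
```

With only 125 samples, the nearest-rank p99 is the 124th sorted value, so a single slow call shows up in max but not in p99. That is one way the reported p99 and max values can differ by an order of magnitude.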

Root Cause (if applicable)

  • Root cause: exec tool updates could carry growing details.aggregated output, and the embedded agent event handler sanitized, serialized, and broadcast every live update plus an unbounded final output payload.
  • Missing detection / guardrail: no rate limit or payload cap existed on the live command-output event stream.
  • Contributing context (if known): PR #78479 addressed stale WebSocket cleanup, but did not remove the event production pressure that starved the Gateway.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/agents/pi-embedded-subscribe.handlers.tools.test.ts
  • Scenario the test should lock in: high-frequency exec output updates are throttled even when output grows quickly, and oversized live final command output is capped.
  • Why this is the smallest reliable guardrail: the bug lives in the embedded event translation layer, so direct handler tests cover the rate/cap behavior without needing provider/network flakiness.
  • Existing test that already covers this (if any): none before this PR.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

Live command-output streams for very noisy exec/bash runs are now bounded. Clients see periodic output updates and a capped final live tail instead of every byte over the Gateway event stream.

Diagram (if applicable)

Before:
exec noisy output -> every aggregated update -> sanitize + serialize + broadcast thousands of Gateway events -> unrelated RPCs stall

After:
exec noisy output -> first/periodic live updates + capped final tail -> bounded Gateway events -> RPC handling remains responsive
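The "After" path can be sketched as a per-tool-call limiter that drops updates arriving too soon and caps whatever it does emit to a tail. This is an illustration, not the PR's implementation: every name here is hypothetical, and the 16-character cap is chosen only to keep the demo small. The 250 ms interval matches the cadence reported in the observed result above.

```typescript
// Hypothetical sketch of per-tool-call rate limiting plus tail capping.
interface LiveOutputLimiterOptions {
  minIntervalMs: number; // minimum gap between emitted live updates
  maxTailChars: number; // maximum characters kept from the end of the output
}

type Emit = (toolCallId: string, output: string) => void;

function capTail(output: string, maxTailChars: number): string {
  return output.length <= maxTailChars ? output : output.slice(-maxTailChars);
}

function createLiveOutputLimiter(options: LiveOutputLimiterOptions, emit: Emit) {
  const lastEmitAt = new Map<string, number>();

  return {
    // Called for every aggregated update. Skipped updates return before any
    // expensive work (such as redaction or serialization) would run.
    update(toolCallId: string, aggregated: string, nowMs: number): boolean {
      const last = lastEmitAt.get(toolCallId);
      if (last !== undefined && nowMs - last < options.minIntervalMs) {
        return false; // too soon after the previous live update
      }
      lastEmitAt.set(toolCallId, nowMs);
      emit(toolCallId, capTail(aggregated, options.maxTailChars));
      return true;
    },
    // The final update is always emitted, but still capped.
    finish(toolCallId: string, aggregated: string): void {
      lastEmitAt.delete(toolCallId);
      emit(toolCallId, capTail(aggregated, options.maxTailChars));
    },
  };
}

// Simulate 1,000 updates arriving 1 ms apart with growing output.
const emitted: string[] = [];
const limiter = createLiveOutputLimiter(
  { minIntervalMs: 250, maxTailChars: 16 },
  (_id, output) => {
    emitted.push(output);
  },
);
let aggregated = "";
for (let t = 0; t < 1000; t++) {
  aggregated += "x";
  limiter.update("call_1", aggregated, t);
}
limiter.finish("call_1", aggregated);

console.log(emitted.length); // 5: updates at t = 0, 250, 500, 750, plus the final one
```

The clock is passed in as nowMs rather than read from Date.now() so the behavior can be checked deterministically in a test.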

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? Yes
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: live exec output presentation is now rate-limited/capped. Command execution and approval semantics are unchanged; the cap reduces the amount of command output broadcast through live Gateway events.

Repro + Verification

Environment

  • OS: macOS local development machine
  • Runtime/container: local source checkout, rebuilt dist via node scripts/tsdown-build.mjs
  • Model/provider: OpenAI openai/gpt-5.5
  • Integration/channel (if any): shared Gateway/WebSocket event stream, no channel-specific transport required
  • Relevant config (redacted): isolated Gateway on loopback port 19814, token auth, seeded session store, live API credentials loaded from /private/tmp/openclaw-78402-live-creds.env

Steps

  1. Build the runtime bundle with node scripts/tsdown-build.mjs.
  2. Run the natural repro harness: pnpm exec tsx /private/tmp/openclaw-78402-natural-repro.ts.
  3. The harness seeds 420 sessions, runs concurrent sessions.list, chat.history, and node.list probes, creates 20 WebChat-like sessions, then runs a live OpenAI agent that executes a noisy command.

Expected

  • No WebSocket disconnect during the live run.
  • Bounded agent event count.
  • RPC p99 remains near sub-second under noisy exec output.

Actual

  • No WebSocket disconnect; close code 1000 at harness shutdown.
  • 57 agent events after the fix, versus 9,297 before.
  • sessions.list, chat.history, and node.list p99s were roughly 0.5s in the final live pass.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:
    • pnpm test src/agents/pi-embedded-subscribe.handlers.tools.test.ts
    • pnpm exec oxfmt --check --threads=1 src/agents/pi-embedded-subscribe.handlers.tools.ts src/agents/pi-embedded-subscribe.handlers.tools.test.ts
    • git diff --check
    • node scripts/tsdown-build.mjs
    • live OpenAI repro via /private/tmp/openclaw-78402-natural-repro.ts
  • Edge cases checked: status-less live update payloads with details.aggregated, large immediate output growth, final oversized output capping, and secret redaction still applies before live output emission.
  • What you did not verify: broad pnpm check, full pnpm test, and channel-specific Telegram/Discord live E2E.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: a live client that expected complete command output from Gateway command_output events will now see a capped tail for very large outputs.
    • Mitigation: live command-output events are presentation/progress events; keeping them unbounded is the starvation source. Command execution, status, and approval parsing remain intact.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/pi-embedded-subscribe.handlers.tools.test.ts (modified, +98/-0)
  • src/agents/pi-embedded-subscribe.handlers.tools.ts (modified, +97/-9)


Bug type

Regression (worked before, now fails)

Beta release blocker

No

Steps to reproduce

  • Start OpenClaw 2026.5.5
  • Start an agent session that triggers a tool call
  • Observe the gateway logs after a few minutes
  • Attempt to connect via UI or CLI
  • Observe repeated disconnects and handshake failures

Expected behavior

  • Gateway should accept WebSocket connections reliably
  • Tool calls should not block the event loop
  • Telegram channel startup should not freeze the runtime
  • UI and CLI should remain connected

Actual behavior

  • Gateway repeatedly closes connections with codes 1000, 1005, 1006
  • WebSocket handshakes time out (handshake-timeout)
  • Telegram getMe calls time out after 10–50 seconds
  • node.list, sessions.list, and chat.history take 50–120 seconds
  • UI disconnects (webchat disconnected code=1006)

CLI fails with: gateway connect failed: Error: gateway closed (1000)

Diagnostic subsystem reports:

  • eventLoopUtilization = 1
  • eventLoopDelayP99Ms = 35500ms
  • cpuCoreRatio ≈ 1.03
  • phase = channels.telegram.start-account
  • activeTool = exec stuck for 600–1200 seconds

OpenClaw version

2026.5.5

Operating system

Ubuntu

Install method

Script

Model

Minimax m2.7

Provider / routing chain

agent:marcus:main/ Channels enabled: Telegram (multiple accounts)

Additional provider/model setup details

No response

Impact and severity

Analysis

Based on the logs, the root cause appears to be:

  1. A single exec tool call is frozen
    • call_function_pt65b1sa6ex9_1 runs for 10–20+ minutes
    • Blocks the event loop completely
    • Prevents all other work from progressing
  2. Event‑loop starvation cascades into system-wide failures
    • WebSocket handshakes cannot complete
    • Timers fire late by 7–40 seconds
    • Telegram channel startup loops fail
    • UI and CLI disconnect
    • Gateway appears “dead” but is actually blocked
  3. All symptoms are downstream of the blocked tool

This is consistent with:

  • synchronous subprocess calls (execSync, spawnSync)
  • Python scripts that never exit
  • tools that produce unbounded stdout/stderr
  • synchronous filesystem operations on large files

Impact

This issue makes the gateway effectively unusable:

  • UI cannot stay connected
  • CLI cannot connect
  • Agents cannot run
  • Channels cannot start
  • Tool calls never complete

Additional information

Could the maintainers please investigate:

  • Whether the gateway should isolate tool execution to prevent event‑loop starvation
  • Whether exec tools should be sandboxed or forced async
  • Whether the Telegram channel retry loop should yield to the event loop
  • Whether watchdog logic should terminate long‑running tool calls

I can provide full logs or reproduce the issue again if needed.
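As a rough illustration of the watchdog idea raised above, the check itself can be as simple as comparing each active tool call's age against a limit. Everything in this sketch is hypothetical: the type, the function, the second call ID, and the 300 s limit are made up for the example. Only the first call ID and its 697 s age come from the log excerpt.

```typescript
// Hypothetical watchdog check: flags tool calls whose age exceeds a limit.
// Field names loosely mirror the log keys; the types are assumptions.
interface ActiveToolCall {
  tool: string;
  toolCallId: string;
  startedAtMs: number;
}

function findOverdueToolCalls(
  active: ActiveToolCall[],
  nowMs: number,
  maxAgeMs: number,
): ActiveToolCall[] {
  return active.filter((call) => nowMs - call.startedAtMs > maxAgeMs);
}

const nowMs = 1_000_000;
const active: ActiveToolCall[] = [
  // 697 s old, like activeToolAge=697s in the log excerpt.
  { tool: "exec", toolCallId: "call_function_pt65b1sa6ex9_1", startedAtMs: nowMs - 697_000 },
  // A made-up second call that is only 2 s old.
  { tool: "exec", toolCallId: "call_example_recent", startedAtMs: nowMs - 2_000 },
];

// Assumed limit of 300 s; a real limit would depend on the tool.
const overdue = findOverdueToolCalls(active, nowMs, 300_000);
console.log(overdue.map((call) => call.toolCallId)); // [ 'call_function_pt65b1sa6ex9_1' ]
```

One caveat: a check like this cannot fire while the same event loop is blocked by synchronous work, so it would only help if it runs in a separate thread or process, or if tool execution is already asynchronous.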
