openclaw - ✅(Solved) Fix [Bug]: Gateway repeatedly closes connections (1000/1005/1006) due to event‑loop starvation caused by stuck tool call [2 pull requests, 8 comments, 6 participants]

najef1979-code · 2026-05-06T09:56:02Z


openclaw/openclaw#78402•Fetched 2026-05-07 03:37:22

Timeline: commented ×8, cross-referenced ×2, mentioned ×2, subscribed ×2

After upgrading to OpenClaw 2026.5.5, the local gateway becomes unresponsive shortly after startup. WebSocket connections fail with codes 1000, 1005, and 1006, the UI disconnects repeatedly, and the CLI cannot establish a stable connection.

Diagnostics show severe event‑loop starvation and a single long‑running exec tool call blocking the entire runtime for 10–20+ minutes.

This results in:

Gateway handshake failures
Telegram channel timeouts
Extremely slow node.list, sessions.list, and chat.history calls
Webchat repeatedly disconnecting
CLI unable to connect (gateway closed (1000))
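
For reference, the close codes in this report have fixed meanings in RFC 6455. A minimal TypeScript lookup is sketched here; `describeCloseCode` is a hypothetical helper for annotating logs and is not part of OpenClaw.

```typescript
// Close codes defined by RFC 6455, section 7.4.1 (subset relevant to this issue).
const CLOSE_CODES: Record<number, string> = {
  1000: "normal closure",
  1001: "going away",
  1005: "no status received (reserved; never sent on the wire)",
  1006: "abnormal closure (reserved; connection dropped without a close frame)",
};

// Hypothetical helper for annotating log lines; not an OpenClaw API.
function describeCloseCode(code: number): string {
  return CLOSE_CODES[code] ?? "unknown or application-defined code";
}
```

Codes 1005 and 1006 are never sent by a peer; the local WebSocket implementation reports them, which is why they tend to appear when the other side stops responding rather than closing deliberately.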

Environment

OpenClaw version: 2026.5.5
Platform: Linux
Agent: agent:marcus:main
Channels enabled: Telegram (multiple accounts)
Runtime: Local Gateway (127.0.0.1:18789)

Error Message

Key Log Excerpts (Truncated for clarity)

warn diagnostic eventLoopDelayP99Ms=35500.6 eventLoopUtilization=1 cpuCoreRatio=1.031
warn fetch-timeout timer delayed 40171ms, likely event-loop starvation
warn gateway/ws handshake-timeout ... closed before connect code=1000
warn gateway/ws closed before connect code=1006
activeTool=exec activeToolCallId=call_function_pt65b1sa6ex9_1 activeToolAge=697s
activeWorkKind=tool_call queued_behind_active_work

webchat disconnected code=1006
gateway connect failed: Error: gateway closed (1000)

fetch timeout reached; aborting operation (Telegram getMe)
elapsedMs=50171 timerDelayMs=40171
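
The diagnostic lines above carry their data as `key=value` pairs, so they are easy to scan mechanically. The following is a minimal sketch of doing so; `parseKeyValues`, `looksStarved`, and the 1000 ms threshold are assumptions made for illustration, not part of OpenClaw's diagnostics.

```typescript
// Extracts every key=value token from a log line (values end at whitespace).
function parseKeyValues(line: string): Record<string, string> {
  const out: Record<string, string> = {};
  const re = /(\w+)=(\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(line)) !== null) {
    out[m[1]] = m[2];
  }
  return out;
}

// Flags a diagnostic line whose P99 event-loop delay exceeds a threshold.
// The default threshold is an arbitrary assumption for this sketch.
function looksStarved(line: string, maxDelayMs: number = 1000): boolean {
  const delay = Number(parseKeyValues(line)["eventLoopDelayP99Ms"]);
  return Number.isFinite(delay) && delay > maxDelayMs;
}
```

Applied to the first excerpt line, `looksStarved` returns true because the reported P99 delay is 35500.6 ms; lines without that key are never flagged.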

Root Cause

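Analysis: based on the logs, the root cause appears to be a single frozen exec tool call (call_function_pt65b1sa6ex9_1) that runs for 10–20+ minutes and blocks the event loop, so no other work can progress.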

Fix Action

Fixed

Fixed by PR: fix(gateway): close stale WebSocket connections on ping/pong timeout (#78402) (https://github.com/openclaw/openclaw/pull/78479)
Fixed by PR: fix(agents): bound live exec output events (https://github.com/openclaw/openclaw/pull/78645)

PR fix notes

PR #78479: fix(gateway): close stale WebSocket connections on ping/pong timeout (#78402)

Repository: openclaw/openclaw
Author: Beandon13
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/78479

Description (problem / solution / changelog)

Summary

Problem: Gateway WebSocket connections drop with codes 1000/1005/1006 during event-loop starvation. When a long-running tool call (exec) monopolises the event loop (utilization=1, P99 delay=35 000 ms), the ws library cannot process incoming frames. When the event loop recovers, pending pings fire simultaneously without any pong ever being tracked, leaving connections in a limbo state until the OS drops them abruptly.
Why it matters: Users see repeated reconnect cycles in Control UI during any heavy agent session; the gateway logs show close(1000/1005/1006) chains that look like flaky network but are actually a dead-event-loop artifact.
What changed: setClient inside attachGatewayWsConnectionHandler now initialises a pongReceived flag and registers a socket.on("pong") listener. The existing setInterval ping loop checks the flag before each tick: if no pong arrived since the previous ping, it calls close(1001, "ping timeout") and returns without sending another ping.
What did NOT change: Pre-authentication connection budget, handshake timeout, close-cause tracking, payload limits, and the lazy-loaded message handler are all unchanged.
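
The pong-tracking logic described under "What changed" can be sketched roughly as follows. This is not the PR's code: `PingableSocket` and `createPingGuard` are hypothetical names, and the interface only mirrors the small part of a `ws`-style socket that the description mentions.

```typescript
// Minimal socket surface assumed by this sketch (hypothetical interface).
interface PingableSocket {
  ping(): void;
  close(code: number, reason: string): void;
  on(event: "pong", listener: () => void): void;
}

// Returns a tick function intended to be called from the existing ping interval.
// It returns false once the socket has been closed for a missed pong.
function createPingGuard(socket: PingableSocket): () => boolean {
  let pongReceived = true; // nothing is outstanding before the first ping
  socket.on("pong", () => {
    pongReceived = true;
  });
  return function tick(): boolean {
    if (!pongReceived) {
      // No pong since the previous ping: close instead of pinging again.
      socket.close(1001, "ping timeout");
      return false;
    }
    pongReceived = false;
    socket.ping();
    return true;
  };
}
```

A caller would invoke the returned function from its 25-second `setInterval` and clear the interval when it returns false. Keeping the check in a plain function, separate from the timer, is what makes it straightforward to exercise with fake timers as the regression tests do.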

Change Type (select all)

Bug fix

Scope (select all touched areas)

Gateway / orchestration

Linked Issue/PR

Closes #78402
This PR fixes a bug or regression

Real behavior proof (required for external PRs)

Behavior or issue addressed: Connections silently dying with 1000/1005/1006 codes during exec-tool starvation instead of being cleanly closed with 1001.
Real environment tested: Local development build (Node.js 22, macOS).
Exact steps or command run after this patch: pnpm test src/gateway/server/ws-connection.test.ts --reporter=verbose
Evidence after fix (terminal output):

 RUN  v4.1.5 /Users/dev/openclaw

 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > threads current auth getters into the handshake handler instead of a stale snapshot 14ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > uses the gateway TLS scheme for canvas host URLs 3ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > rejects late client registration after a pre-connect socket close 2ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > sends protocol pings until the connection closes 3ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > closes with code 1001 when a pong is not received before the next ping (event-loop starvation guard) 3ms
 ✓ |gateway| src/gateway/server/ws-connection.test.ts > attachGatewayWsConnectionHandler > does not close when pong is received before the next ping interval 2ms

 Test Files  1 passed (1)
      Tests  6 passed (6)
   Start at  09:08:18
   Duration  1.57s (transform 554ms, setup 162ms, import 1.25s, tests 30ms, environment 0ms)

Observed result after fix: All 6 tests pass, including 2 new regression tests that cover the timeout close and the healthy-pong no-close paths.
What was not tested: Live event-loop starvation with a real exec tool call on a running gateway instance (unit tests cover the interval logic via fake timers).

Root Cause (if applicable)

Root cause: pingTimer in setClient called socket.ping() on a 25-second interval but never registered a pong event listener. There was no way to detect that a pong was missed — pings accumulated silently during starvation and the connection was left half-open.
Missing detection / guardrail: No pong-received tracking and no liveness close path.
Contributing context: Node.js event-loop starvation from a long-running synchronous exec call prevents the ws library from reading pong frames from the TCP buffer, so the client's pong reply never gets processed even if it was sent.

Regression Test Plan (if applicable)

Two new tests in src/gateway/server/ws-connection.test.ts using vi.useFakeTimers():

"closes with code 1001 when a pong is not received before the next ping (event-loop starvation guard)" — advances timer 25 s twice without emitting pong; asserts socket.close(1001, "ping timeout") and logWsControl.warn(...) are called.
"does not close when pong is received before the next ping interval" — advances 25 s, emits pong, advances 25 s again; asserts socket.close was never called and a second ping was sent.

Environment

OS: macOS (Darwin 25.x)
Runtime/container: Node.js 22, local dev
Model/provider: N/A (infrastructure fix)
Integration/channel (if any): N/A
Relevant config (redacted): N/A

Steps

Trigger a heavy exec tool call that blocks the Node.js event loop for >25 seconds.
Observe gateway logs — before fix: close(1000/1005/1006) repeated on reconnect. After fix: close(1001, "ping timeout") on first stale interval.

Expected

Connection closed cleanly with 1001 and a warning log entry containing "ping pong timeout".

Actual (after fix)

Exactly that — socket.close(1001, "ping timeout") called after one missed pong interval; ping loop stopped immediately.

Evidence

Failing test/log before + passing after (terminal output above)

Human Verification (required)

Verified scenarios: Ping-timeout close path (fake timers, no pong); normal pong path (pong received, no spurious close); existing ping-until-close test still passes.
Edge cases checked: Socket that closes between pings (timer already cleared by close() listener); ping() throw race with a socket in CLOSING state (existing try/catch handles it).
What you did not verify: Live starvation scenario on a real gateway with >600 s exec tool call.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Risks and Mitigations

Risk: False-positive closes if the event loop is briefly slow and a pong frame arrives just after the timer fires.
- Mitigation: At 25-second intervals this is extremely unlikely in practice; any genuinely slow loop that delays a pong by a full 25 s is already harming the user session and should be closed.

Changed files

CHANGELOG.md (modified, +1/-0)
src/gateway/server/ws-connection.test.ts (modified, +153/-0)
src/gateway/server/ws-connection.ts (modified, +23/-0)

PR #78645: fix(agents): bound live exec output events

Repository: openclaw/openclaw
Author: joshavant
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/78645

Description (problem / solution / changelog)

Summary

Problem: high-volume exec output could emit thousands of live Gateway agent events with growing aggregated output, starving unrelated Gateway RPCs and making clients/channels look disconnected.
Why it matters: issue #78402 reports WebSocket reconnects, Telegram timeouts, and slow sessions.list / chat.history / node.list while tool-heavy agents run.
What changed: exec/bash live output deltas are rate-limited per tool call, skipped before redaction/serialization, and oversized live command-output payloads are capped to a bounded tail.
What did NOT change (scope boundary): command execution, exit status, approval parsing, provider calls, and underlying tool semantics are not changed.
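
A rough sketch of the rate limit and tail cap described above follows. The 250 ms cadence comes from the PR's observed result; the tail size, the function names, and the event shape are assumptions for illustration and do not reflect the actual handler in src/agents/pi-embedded-subscribe.handlers.tools.ts.

```typescript
const MIN_INTERVAL_MS = 250;    // cadence reported in the PR notes
const MAX_TAIL_CHARS = 8_000;   // assumed cap; the real limit is not stated

// Last emit time per tool call, so each call is throttled independently.
const lastEmitAt = new Map<string, number>();

// Keeps only the last maxChars characters of the output.
function capTail(output: string, maxChars: number = MAX_TAIL_CHARS): string {
  return output.length <= maxChars ? output : output.slice(output.length - maxChars);
}

function shouldEmitLiveUpdate(toolCallId: string, nowMs: number): boolean {
  const last = lastEmitAt.get(toolCallId);
  if (last !== undefined && nowMs - last < MIN_INTERVAL_MS) return false;
  lastEmitAt.set(toolCallId, nowMs);
  return true;
}

// Returns a serialized event, or null when the update is dropped.
function buildLiveEvent(toolCallId: string, aggregated: string, nowMs: number): string | null {
  // Decide first, so dropped updates never pay for serialization.
  if (!shouldEmitLiveUpdate(toolCallId, nowMs)) return null;
  return JSON.stringify({ toolCallId, output: capTail(aggregated) });
}
```

The ordering is the point of the sketch: the cheap throttle check runs before any expensive per-update work, so a noisy command costs one map lookup per skipped update instead of a full sanitize, serialize, and broadcast cycle.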

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #78402
Related #78479
This PR fixes a bug or regression

Real behavior proof (required for external PRs)

Behavior or issue addressed: Gateway event-loop/RPC starvation during noisy live exec output.
Real environment tested: local macOS source checkout, rebuilt runtime bundle, isolated Gateway state under /private/tmp/openclaw-78402-natural, live OpenAI openai/gpt-5.5 API route.
Exact steps or command run after this patch: node scripts/tsdown-build.mjs, then pnpm exec tsx /private/tmp/openclaw-78402-natural-repro.ts with credentials loaded from /private/tmp/openclaw-78402-live-creds.env.

Human-readable result: before this patch, the noisy exec run turned one command into a live Gateway event flood. The Gateway sent 9,297 agent WebSocket events while also trying to answer normal RPCs, so unrelated client calls saw multi-second tail latency. After this patch, the same live repro sent 57 agent WebSocket events, command-output updates appeared at a bounded cadence, the final live output was capped, and the normal RPC p99s stayed around half a second. The remaining isolated max values line up with cold plugin-tool loading at agent startup rather than the exec-output stream.

| Measurement | Before patch | After patch | What changed |
| --- | --- | --- | --- |
| Agent WebSocket events | 9,297 | 57 | Live event volume dropped by about 99.4%. |
| WebSocket close | Clean harness shutdown | Clean harness shutdown | No disconnect introduced by the fix. |
| `sessions.list` p99 / max | 1,728ms / 12,173ms | 551ms / 5,735ms | Tail latency dropped; remaining max is startup/plugin-load shaped. |
| `chat.history` p99 / max | 3,529ms / 4,685ms | 516ms / 1,947ms | Tail latency dropped from multi-second to roughly half-second p99. |
| `node.list` p99 / max | 1,925ms / 6,111ms | 566ms / 5,194ms | Tail latency dropped; no request failures. |
| Event-loop warning | max about 5,087ms during event flood | max about 3,706ms after bounded stream | Improved, with residual cold-start work still visible. |
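
The "about 99.4%" figure for agent WebSocket events can be checked directly from the before and after counts. `percentDrop` is a hypothetical helper written for this check.

```typescript
// Relative reduction from `before` to `after`, as a percentage.
function percentDrop(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// (9297 - 57) / 9297 = 9240 / 9297, which rounds to 99.4%.
const eventDrop: string = percentDrop(9297, 57).toFixed(1);
```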

Evidence after fix: copied summary from /private/tmp/openclaw-78402-natural/summary.json:
- WebSocket agent events: 57 total, down from 9,297 in the same repro before the fix.
- sessions.list: count 125, failures 0, p50 23ms, p95 33ms, p99 551ms, max 5735ms.
- chat.history: count 125, failures 0, p50 5ms, p95 9ms, p99 516ms, max 1947ms.
- node.list: count 125, failures 0, p50 3ms, p95 6ms, p99 566ms, max 5194ms.
- WebSocket closed cleanly with code 1000 after the repro completed.
Observed result after fix: live exec command-output events are emitted about every 250ms, final live output is capped, and normal RPC p99s stay around 0.5s instead of multi-second tails from event flood.
What was not tested: full channel-specific Telegram or Discord live roundtrip after the patch; the root bottleneck was reproduced at the shared Gateway/agent event stream layer.
Before evidence: same harness before the fix emitted 9,297 agent events; sessions.list p99 1728ms/max 12173ms, chat.history p99 3529ms/max 4685ms, node.list p99 1925ms/max 6111ms, with event-loop delay max about 5087ms.
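
To read the p50/p95/p99 figures above, it helps to recall how a percentile falls out of a small sample. The sketch below uses the nearest-rank method; the harness's actual percentile method is not stated in the PR, so treat this as one plausible definition.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}
```

With 125 samples per RPC, as reported above, nearest-rank p99 is the 124th-smallest value, so a single slow request can set the max without moving p99. That is consistent with the gap between the p99 and max columns.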

Root Cause (if applicable)

Root cause: exec tool updates could carry growing details.aggregated output, and the embedded agent event handler sanitized, serialized, and broadcast every live update plus an unbounded final output payload.
Missing detection / guardrail: no rate limit or payload cap existed on the live command-output event stream.
Contributing context (if known): PR #78479 addressed stale WebSocket cleanup, but did not remove the event production pressure that starved the Gateway.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/agents/pi-embedded-subscribe.handlers.tools.test.ts
Scenario the test should lock in: high-frequency exec output updates are throttled even when output grows quickly, and oversized live final command output is capped.
Why this is the smallest reliable guardrail: the bug lives in the embedded event translation layer, so direct handler tests cover the rate/cap behavior without needing provider/network flakiness.
Existing test that already covers this (if any): none before this PR.
If no new test is added, why not: N/A.

User-visible / Behavior Changes

Live command-output streams for very noisy exec/bash runs are now bounded. Clients see periodic output updates and a capped final live tail instead of every byte over the Gateway event stream.

Diagram (if applicable)

Before:
exec noisy output -> every aggregated update -> sanitize + serialize + broadcast thousands of Gateway events -> unrelated RPCs stall

After:
exec noisy output -> first/periodic live updates + capped final tail -> bounded Gateway events -> RPC handling remains responsive

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? Yes
Data access scope changed? No
If any Yes, explain risk + mitigation: live exec output presentation is now rate-limited/capped. Command execution and approval semantics are unchanged; the cap reduces the amount of command output broadcast through live Gateway events.

Repro + Verification

Environment

OS: macOS local development machine
Runtime/container: local source checkout, rebuilt dist via node scripts/tsdown-build.mjs
Model/provider: OpenAI openai/gpt-5.5
Integration/channel (if any): shared Gateway/WebSocket event stream, no channel-specific transport required
Relevant config (redacted): isolated Gateway on loopback port 19814, token auth, seeded session store, live API credentials loaded from /private/tmp/openclaw-78402-live-creds.env

Steps

Build the runtime bundle with node scripts/tsdown-build.mjs.
Run the natural repro harness: pnpm exec tsx /private/tmp/openclaw-78402-natural-repro.ts.
The harness seeds 420 sessions, runs concurrent sessions.list, chat.history, and node.list probes, creates 20 WebChat-like sessions, then runs a live OpenAI agent that executes a noisy command.

Expected

No WebSocket disconnect during the live run.
Bounded agent event count.
RPC p99 remains near sub-second under noisy exec output.

Actual

No WebSocket disconnect; close code 1000 at harness shutdown.
57 agent events after the fix, versus 9,297 before.
sessions.list, chat.history, and node.list p99s were roughly 0.5s in the final live pass.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

Verified scenarios:
- pnpm test src/agents/pi-embedded-subscribe.handlers.tools.test.ts
- pnpm exec oxfmt --check --threads=1 src/agents/pi-embedded-subscribe.handlers.tools.ts src/agents/pi-embedded-subscribe.handlers.tools.test.ts
- git diff --check
- node scripts/tsdown-build.mjs
- live OpenAI repro via /private/tmp/openclaw-78402-natural-repro.ts
Edge cases checked: status-less live update payloads with details.aggregated, large immediate output growth, final oversized output capping, and secret redaction still applies before live output emission.
What you did not verify: broad pnpm check, full pnpm test, and channel-specific Telegram/Discord live E2E.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: a live client that expected complete command output from Gateway command_output events will now see a capped tail for very large outputs.
- Mitigation: live command-output events are presentation/progress events; keeping them unbounded is the starvation source. Command execution, status, and approval parsing remain intact.

Changed files

CHANGELOG.md (modified, +1/-0)
src/agents/pi-embedded-subscribe.handlers.tools.test.ts (modified, +98/-0)
src/agents/pi-embedded-subscribe.handlers.tools.ts (modified, +97/-9)


Bug type

Regression (worked before, now fails)

Beta release blocker

Steps to reproduce

Start OpenClaw 2026.5.5
Start an agent session that triggers a tool call
Observe the gateway logs after a few minutes
Attempt to connect via UI or CLI
Observe repeated disconnects and handshake failures

Expected behavior

Gateway should accept WebSocket connections reliably
Tool calls should not block the event loop
Telegram channel startup should not freeze the runtime
UI and CLI should remain connected

Actual behavior

Gateway repeatedly closes connections with codes 1000, 1005, 1006
WebSocket handshakes time out (handshake-timeout)
Telegram getMe calls time out after 10–50 seconds
node.list, sessions.list, and chat.history take 50–120 seconds
UI disconnects (webchat disconnected code=1006)

CLI fails with: gateway connect failed: Error: gateway closed (1000)

Diagnostic subsystem reports:

eventLoopUtilization = 1
eventLoopDelayP99Ms = 35500ms
cpuCoreRatio ≈ 1.03
phase = channels.telegram.start-account
activeTool = exec stuck for 600–1200 seconds

OpenClaw version

2026.5.5

Operating system

Ubuntu

Install method

Script

Model

Minimax m2.7

Provider / routing chain

agent:marcus:main/ Channels enabled: Telegram (multiple accounts)

Additional provider/model setup details

No response

Impact and severity

Analysis

Based on the logs, the root cause appears to be:

A single exec tool call is frozen:

call_function_pt65b1sa6ex9_1 runs for 10–20+ minutes
Blocks the event loop completely
Prevents all other work from progressing

Event‑loop starvation cascades into system-wide failures:

WebSocket handshakes cannot complete
Timers fire late by 7–40 seconds
Telegram channel startup loops fail
UI and CLI disconnect
Gateway appears “dead” but is actually blocked

All symptoms are downstream of the blocked tool. This is consistent with:

synchronous subprocess calls (execSync, spawnSync)
Python scripts that never exit
tools that produce unbounded stdout/stderr
synchronous filesystem operations on large files

Impact

This issue makes the gateway effectively unusable:

UI cannot stay connected
CLI cannot connect
Agents cannot run
Channels cannot start
Tool calls never complete

Additional information

Could the maintainers please investigate:

Whether the gateway should isolate tool execution to prevent event‑loop starvation
Whether exec tools should be sandboxed or forced async
Whether the Telegram channel retry loop should yield to the event loop
Whether watchdog logic should terminate long‑running tool calls

I can provide full logs or reproduce the issue again if needed.
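
As an illustration of the watchdog question, the age check itself is simple. The sketch below is hypothetical: `ActiveToolCall`, its field names, and `findStuckToolCalls` are invented for this example and do not describe OpenClaw's internals.

```typescript
// Hypothetical shape of an in-flight tool call; field names are illustrative only.
interface ActiveToolCall {
  id: string;
  tool: string;
  startedAtMs: number;
}

// Returns the calls that have been running for longer than maxAgeMs.
function findStuckToolCalls(
  active: ActiveToolCall[],
  nowMs: number,
  maxAgeMs: number,
): ActiveToolCall[] {
  return active.filter((call) => nowMs - call.startedAtMs > maxAgeMs);
}
```

One caveat follows from the nature of the problem: a watchdog timer running on the same event loop cannot fire while that loop is blocked by synchronous work. A check like this is only effective if the tool runs in a separate process or worker, which ties the watchdog question to the isolation question.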
