openclaw - ✅(Solved) Fix [Bug] SafeGatewayPlugin: Unhandled exception on WebSocket close code 1005 crashes gateway (triggered by health-monitor stale-socket restart) [3 pull requests, 2 comments, 3 participants]

openclaw · 2026-03-27 04:11:38

openclaw/openclaw#55554 • Fetched 2026-04-08 01:38:08

Timeline (top)

cross-referenced ×4, commented ×2, closed ×1, locked ×1

When the health-monitor restarts a stale socket (e.g. [discord:default] health-monitor: restarting (reason: stale-socket)), the resulting WebSocket close triggers an unhandled exception in `SafeGatewayPlugin` that kills the entire gateway process.

Error Message

2026-03-27T03:51:01.274Z info gateway/health-monitor [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-27T03:51:01.377Z error [openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (provider-CAlWEl41.js:3307:9)
    at WebSocket.emit (node:events:508:28)
    at WebSocket.emitClose (ws/lib/websocket.js:273:10)
    at TLSSocket.socketOnClose (ws/lib/websocket.js:1346:15)

Workaround

A 5-minute watchdog task (Windows Scheduled Task checking port 18789) limits downtime to under a minute per crash, but does not prevent the crash itself.

PR fix notes

PR #55558: fix(discord): suppress expected reconnect-exhausted on stale-socket restart

Repository: openclaw/openclaw
Author: ShionEria
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/55558

Description (problem / solution / changelog)

Summary

- Treat reconnect-exhausted as expected whenever Discord gateway reconnect has already been forced into maxAttempts=0 intentional shutdown/restart mode.
- Keep the lifecycle from surfacing the expected code 1005 reconnect exhaustion as an uncaught fatal error during stale-socket restarts.
- Update the lifecycle regression test to cover the queued-event ordering that previously crashed the gateway.

Testing

pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts (fails in this environment because node_modules is missing)
pnpm install (fails in this environment due to an npm DNS/network error: EAI_AGAIN to registry.npmjs.org)

Closes #55554

Changed files

extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +8/-8)
extensions/discord/src/monitor/provider.lifecycle.ts (modified, +5/-6)

PR #55699: fix(discord): suppress expected reconnect-exhausted on stale-socket restart

Repository: openclaw/openclaw
Author: rzblues
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/55699

Description (problem / solution / changelog)

Summary

Problem: When the health-monitor triggers a stale-socket restart, it sets gateway.options.reconnect.maxAttempts = 0 then calls disconnect(). Carbon immediately emits "Max reconnect attempts (0) reached after code 1005", which lands in the supervisor's pending queue while the lifecycle is still awaiting an early step (e.g. execApprovalsHandler.start()).
Why it matters: The previous guard checked lifecycleStopping, which is only set in the finally block — after drainPendingGatewayErrors() has already run. The queued event fell through to throw event.err, crashing the entire gateway process (~54s downtime per occurrence, recurring every ~35 minutes).
What changed: Replaced the timing-sensitive lifecycleStopping check with isExpectedReconnectExhausted, which checks gateway.options.reconnect.maxAttempts === 0. This value is set synchronously before disconnect(), so it is reliably true at drain time regardless of async ordering. Added a regression test confirming that reconnect-exhausted with maxAttempts !== 0 (genuine Carbon retry exhaustion) still propagates as a fatal error.
What did NOT change: No change to reconnect logic, health-monitor behavior, or any other lifecycle path.
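The check described above can be sketched as a small predicate. This is a minimal illustration, not the PR's actual code: the PR text names isExpectedReconnectExhausted and gateway.options.reconnect.maxAttempts, but the function signature and the GatewayLike and ReconnectOptions shapes below are assumptions.

```typescript
// Hypothetical shapes: only `options.reconnect.maxAttempts` is named in the PR text.
interface ReconnectOptions {
  maxAttempts?: number;
}

interface GatewayLike {
  options: { reconnect?: ReconnectOptions };
}

// True only when reconnects were deliberately disabled (maxAttempts forced to 0).
// An undefined maxAttempts (default options) counts as a genuine failure.
function isExpectedReconnectExhausted(gateway: GatewayLike): boolean {
  return gateway.options.reconnect?.maxAttempts === 0;
}

const intentionalRestart: GatewayLike = { options: { reconnect: { maxAttempts: 0 } } };
const genuineExhaustion: GatewayLike = { options: { reconnect: { maxAttempts: 5 } } };
const defaultOptions: GatewayLike = { options: {} };

console.log(isExpectedReconnectExhausted(intentionalRestart)); // true
console.log(isExpectedReconnectExhausted(genuineExhaustion)); // false
console.log(isExpectedReconnectExhausted(defaultOptions)); // false
```

Because the value is read from the options object rather than from a lifecycle flag, the result does not depend on how far the lifecycle has progressed when the queued event is drained.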

Change Type (select all)

Bug fix

Scope (select all touched areas)

Gateway / orchestration
Integrations

Linked Issue/PR

Closes #55554
This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

Root cause: lifecycleStopping is set in the finally block of runDiscordGatewayLifecycle, but drainPendingGatewayErrors() is called inside the try block — before finally ever runs. On a stale-socket restart, the reconnect-exhausted event is queued synchronously (Carbon emits in the same tick as disconnect()), so it is always drained before lifecycleStopping becomes true.
Missing detection / guardrail: No test covered the queued-before-drain timing window.
Prior context: The lifecycleStopping guard was added to handle intentional shutdowns, but did not account for the async ordering between drainPendingGatewayErrors and the finally block.
Why this regressed now: The health-monitor stale-socket path sets maxAttempts=0 and disconnects — Carbon's error fires faster than the lifecycle flag advances.
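The timing window described above can be reproduced in a few lines. The sketch below is a simplified, synchronous stand-in for the lifecycle. The function name, the GatewayEvent shape, and the catch that turns the error into a string are all assumptions made for illustration; in the real gateway the error is not caught.

```typescript
// Hypothetical event shape; names mirror the PR text but the bodies are illustrative.
type GatewayEvent = { type: "reconnect-exhausted"; err: Error };

function runLifecycleWithFlagGuard(pending: GatewayEvent[]): string {
  let lifecycleStopping = false;
  try {
    // The drain runs inside `try`, so it always sees the flag's initial value.
    for (const event of pending) {
      if (!lifecycleStopping) {
        throw event.err; // the guard cannot help: the flag is still false here
      }
    }
    return "resolved";
  } catch (err) {
    // Caught only so the sketch can report the outcome as a string.
    return `crashed: ${(err as Error).message}`;
  } finally {
    lifecycleStopping = true; // set too late to influence the drain above
  }
}

const queued: GatewayEvent[] = [
  { type: "reconnect-exhausted", err: new Error("Max reconnect attempts (0) reached after code 1005") },
];

console.log(runLifecycleWithFlagGuard(queued));
// crashed: Max reconnect attempts (0) reached after code 1005
```

Any guard that depends on state written in finally has the same problem, which is why the PR switches to a value that is set before disconnect() is called.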

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
Target test: extensions/discord/src/monitor/provider.lifecycle.test.ts
Scenario: queue a reconnect-exhausted event before abort fires; verify lifecycle resolves cleanly (not rejects) when maxAttempts=0; verify it still rejects when maxAttempts is not 0.
Why this is the smallest reliable guardrail: directly exercises the drain-before-finally timing window without requiring real WebSocket infrastructure.
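The scenario in this test plan can be sketched without any WebSocket infrastructure. The code below is an assumption-laden model, not the project's test. drainPendingGatewayErrors is named in the PR, but its signature here, the QueuedError shape, and the outcome helper are hypothetical.

```typescript
type DrainResult = "continue" | "stop";

// Hypothetical queued-event shape.
interface QueuedError {
  type: "reconnect-exhausted";
  err: Error;
}

// Suppresses reconnect exhaustion only when maxAttempts was forced to 0;
// any other queued error is rethrown as fatal.
function drainPendingGatewayErrors(
  pending: QueuedError[],
  maxAttempts: number | undefined,
  log: string[],
): DrainResult {
  for (const event of pending) {
    if (event.type === "reconnect-exhausted" && maxAttempts === 0) {
      log.push("info: expected reconnect exhaustion during intentional restart");
      return "stop";
    }
    throw event.err;
  }
  return "continue";
}

// Helper that maps a thrown error to "rejected" so both cases can be compared.
function outcome(maxAttempts: number | undefined, log: string[]): string {
  const pending: QueuedError[] = [
    { type: "reconnect-exhausted", err: new Error("Max reconnect attempts reached") },
  ];
  try {
    return drainPendingGatewayErrors(pending, maxAttempts, log);
  } catch {
    return "rejected";
  }
}

const infoLog: string[] = [];
console.log(outcome(0, infoLog)); // stop
console.log(outcome(5, infoLog)); // rejected
console.log(outcome(undefined, infoLog)); // rejected
```

The second and third calls correspond to the negative guard in the plan: exhaustion with any maxAttempts other than 0, including the undefined default, still propagates.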

User-visible / Behavior Changes

Gateway no longer crashes on health-monitor stale-socket restarts. The restart proceeds silently (log at info level instead of an uncaught exception).

Diagram (if applicable)

Before:
[health-monitor abort] -> [maxAttempts=0, disconnect()] -> [Carbon emits error]
  -> [pending queue] -> [drainPendingGatewayErrors]
  -> lifecycleStopping=false -> throw event.err -> CRASH

After:
[health-monitor abort] -> [maxAttempts=0, disconnect()] -> [Carbon emits error]
  -> [pending queue] -> [drainPendingGatewayErrors]
  -> maxAttempts===0 -> log + return "stop" -> lifecycle resolves cleanly

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: macOS (darwin arm64) / Windows 10 x64 (as reported in issue)
Runtime: Node 24.14.1
Integration/channel: Discord

Steps

1. Gateway running with Discord channel enabled
2. Health-monitor detects stale socket and triggers restart
3. WebSocket closes with code 1005

Expected

Gateway handles reconnect-exhausted gracefully, logs at info level, continues running.

Actual (before fix)

Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005 — gateway process crashes.

Evidence

Failing test before + passing after: repurposed existing test "does not suppress reconnect-exhausted already queued before shutdown" → now asserts resolves.toBeUndefined() and confirms the info log. Added "does not suppress reconnect-exhausted when maxAttempts is not 0" as the negative regression guard. All 22 tests pass.

Human Verification (required)

Verified scenarios: lifecycle test suite (22/22), full discord extension test suite — all pass.
Edge cases checked: maxAttempts undefined (default gateway options) still propagates as fatal; maxAttempts=0 with code 1005 resolves cleanly.
What I did not verify: live Discord environment with actual stale-socket trigger (no Discord bot token available).

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Risks and Mitigations

Risk: isExpectedReconnectExhausted might suppress a genuine exhaustion if something else sets maxAttempts=0.
- Mitigation: maxAttempts=0 is only set in one place (provider.lifecycle.reconnect.ts:404, the onAbort handler). The negative regression test guards against over-suppression.

🤖 Generated with Claude Code

Changed files

extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +24/-9)
extensions/discord/src/monitor/provider.lifecycle.ts (modified, +5/-6)

PR #55000: fix(discord): prevent gateway crash on abort by fixing supervisor teardown ordering

Repository: openclaw/openclaw
Author: openperf
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/55000

Description (problem / solution / changelog)

Summary

A transient Discord WebSocket disconnection (e.g. close code 1005) or an intentional health-monitor restart can trigger an uncaught Max reconnect attempts (0) exception that crashes the entire gateway process. This kills all channels (Feishu, WhatsApp, Telegram, etc.), not just Discord.

This PR fixes the root cause of these gateway crashes by correcting the supervisor phase transition ordering during an abort.

Fixes #54931
Fixes #54894

Root Cause

The gateway crash happens due to an error routing mismatch during teardown:

1. onAbort() sets maxAttempts to 0 and calls gateway.disconnect().
2. disconnect() synchronously triggers @buape/carbon's handleClose → handleReconnectionAttempt, which emits a Max reconnect attempts (0) error on the gateway emitter.
3. The gateway-supervisor is still in the active phase at this point, so it routes the error to the lifecycle handler instead of suppressing it.
4. The error surfaces as an uncaught exception, causing the entire gateway process to exit.

The supervisor already has the correct teardown suppression logic (it logs and swallows late errors during the teardown phase). The bug is simply that onAbort() never transitions the supervisor to the teardown phase before disconnecting.
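The ordering problem described above can be modeled with two small classes. This is a sketch under stated assumptions: SupervisorSketch, GatewaySketch, and the detachFirst parameter are hypothetical stand-ins. Only detachLifecycle(), disconnect(), onAbort(), and the active/teardown phases come from the PR text.

```typescript
type Phase = "active" | "teardown";

// Hypothetical supervisor: suppresses errors in teardown, routes them otherwise.
class SupervisorSketch {
  phase: Phase = "active";
  suppressed: string[] = [];
  routedToLifecycle: string[] = [];

  detachLifecycle(): void {
    this.phase = "teardown";
  }

  handleError(err: Error): void {
    if (this.phase === "teardown") {
      this.suppressed.push(err.message);
    } else {
      this.routedToLifecycle.push(err.message);
    }
  }
}

// Hypothetical gateway whose disconnect() emits the error synchronously.
class GatewaySketch {
  maxAttempts = 5;
  supervisor: SupervisorSketch;

  constructor(supervisor: SupervisorSketch) {
    this.supervisor = supervisor;
  }

  disconnect(): void {
    this.supervisor.handleError(
      new Error(`Max reconnect attempts (${this.maxAttempts}) reached after code 1005`),
    );
  }
}

function onAbort(gateway: GatewaySketch, supervisor: SupervisorSketch, detachFirst: boolean): void {
  gateway.maxAttempts = 0;
  if (detachFirst) {
    supervisor.detachLifecycle(); // enter teardown before the error is emitted
  }
  gateway.disconnect();
}

const buggySupervisor = new SupervisorSketch();
onAbort(new GatewaySketch(buggySupervisor), buggySupervisor, false);
console.log(buggySupervisor.routedToLifecycle.length); // 1

const fixedSupervisor = new SupervisorSketch();
onAbort(new GatewaySketch(fixedSupervisor), fixedSupervisor, true);
console.log(fixedSupervisor.suppressed.length); // 1
```

Because the error is emitted during disconnect() itself, the phase transition only helps if it happens before that call.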

Changes

1. Fix teardown ordering in onAbort()

Call params.gatewaySupervisor.detachLifecycle() before gateway.disconnect(). This ensures the supervisor enters the teardown phase and correctly suppresses the synchronous error emitted during disconnect.

2. Add lifecycleStopping safety net

In the catch block, when lifecycleStopping is already true, we no longer re-throw errors. This acts as a defense-in-depth guard for any edge case where an error might still escape during an intentional shutdown.

Test Results

The extension-fast (extension-fast-discord, discord) CI check passes.
Verified locally that triggering an abort correctly suppresses the disconnect error and allows the gateway to restart gracefully without crashing the main process.

Changed files

extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +100/-4)
extensions/discord/src/monitor/provider.lifecycle.ts (modified, +25/-0)

RAW_BUFFER

Version

OpenClaw `2026.3.24` · Windows 10.0.19045 (x64) · Node 24.14.1

Steps to Reproduce

1. Gateway running with Discord channel enabled
2. Health-monitor detects a stale socket and triggers a restart
3. WebSocket closes with code 1005
4. Gateway process crashes with uncaught exception

Impact

Full gateway crash (~54s downtime before watchdog recovery)
All channels (Telegram, Discord) go offline until watchdog restarts the process
Recurring — happens every ~35 minutes since updating to 2026.3.24
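To put the figures above in perspective, the cumulative downtime can be estimated with simple arithmetic. The sketch assumes the ~35-minute interval is measured from crash to crash and that every crash costs the full ~54 s; the variable names are illustrative.

```typescript
// Figures from the report: ~54 s of downtime per crash, one crash roughly every 35 minutes.
const downtimePerCrashSec = 54;
const crashIntervalSec = 35 * 60;

// Assumption: the interval is crash to crash and includes the downtime itself.
const downtimeFraction = downtimePerCrashSec / crashIntervalSec;
const crashesPerDay = (24 * 60 * 60) / crashIntervalSec;
const downtimePerDayMin = (crashesPerDay * downtimePerCrashSec) / 60;

console.log(`${(downtimeFraction * 100).toFixed(1)}% downtime`); // 2.6% downtime
console.log(`${crashesPerDay.toFixed(1)} crashes/day`); // 41.1 crashes/day
console.log(`${downtimePerDayMin.toFixed(1)} min/day offline`); // 37.0 min/day offline
```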

Expected Behavior

Health-monitor stale-socket restarts should be handled gracefully without crashing the gateway process. `SafeGatewayPlugin` should catch the reconnection failure and recover internally rather than throwing an uncaught exception.

extent analysis

Fix Plan

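To fix the issue, the reconnect failure that follows WebSocket close code 1005 has to be handled rather than surfacing as an uncaught exception. The linked PRs apply their fix in extensions/discord/src/monitor/provider.lifecycle.ts; the steps below sketch a plugin-level approach inside SafeGatewayPlugin: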

Modify the handleClose method in SafeGatewayPlugin to catch the error and recover internally:

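handleClose(code, reason) {
  // reconnectAttempts and reconnect() are illustrative names, not confirmed plugin API.
  if (code === 1005) {
    // 1005: the connection closed without a status code (seen on the stale-socket restart).
    // Reset the attempt counter and reconnect instead of throwing.
    this.reconnectAttempts = 0;
    this.reconnect();
  } else {
    // Handle other close codes as before
  }
}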

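Raise the reconnect attempt limit so that transient failures do not immediately hit the "Max reconnect attempts reached" error. On the stale-socket path the limit is deliberately set to 0 before disconnecting (see PR #55699), so this step only helps with genuine retry exhaustion: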

this.reconnectAttemptsLimit = 5; // Increase the limit to 5 attempts

Add a retry mechanism to the handleReconnectionAttempt method to handle temporary connection failures:

handleReconnectionAttempt() {
  try {
    // Reconnection attempt code here
  } catch (error) {
    if (this.reconnectAttempts < this.reconnectAttemptsLimit) {
      // Below the limit: count the attempt and retry after 1 second
      this.reconnectAttempts++;
      setTimeout(() => this.handleReconnectionAttempt(), 1000);
    } else {
      // Limit reached: report the failure without throwing from the socket callback
    }
  }
}

Verification

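To verify the fix, restart the gateway process and simulate a stale-socket restart by closing the WebSocket connection without a status code, which is reported as close code 1005. (Code 1005 is reserved and cannot be sent explicitly in a close frame.) The gateway process should recover internally without crashing.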

Extra Tips

Monitor the gateway process logs to ensure the fix is working as expected.
Consider adding additional logging and metrics to track reconnect attempts and failures.
Review the SafeGatewayPlugin code to ensure it is handling other potential errors and edge cases correctly.
