openclaw - ✅(Solved) Fix [Bug] SafeGatewayPlugin: Unhandled exception on WebSocket close code 1005 crashes gateway (triggered by health-monitor stale-socket restart) [3 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#55554 · Fetched 2026-04-08 01:38:08
Comments: 2 · Participants: 3 · Timeline events: 9 · Reactions: 0
Timeline (top): cross-referenced ×4, commented ×2, closed ×1, locked ×1

When the health-monitor restarts a stale socket (e.g. [discord:default] health-monitor: restarting (reason: stale-socket)), the resulting WebSocket close triggers an unhandled exception in SafeGatewayPlugin that kills the entire gateway process.

Error Message

2026-03-27T03:51:01.274Z info gateway/health-monitor [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-27T03:51:01.377Z error [openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (provider-CAlWEl41.js:3307:9)
    at WebSocket.emit (node:events:508:28)
    at WebSocket.emitClose (ws/lib/websocket.js:273:10)
    at TLSSocket.socketOnClose (ws/lib/websocket.js:1346:15)

Root Cause

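The health-monitor's stale-socket restart sets gateway.options.reconnect.maxAttempts = 0 and then calls disconnect(). The resulting WebSocket close (code 1005) makes SafeGatewayPlugin.handleReconnectionAttempt raise "Max reconnect attempts (0) reached after code 1005". According to the linked PRs, the lifecycle had not yet been marked as stopping at that point (PR #55699) and the supervisor was still in its active phase (PR #55000), so this expected error was treated as fatal and surfaced as an uncaught exception that killed the entire gateway process.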

Fix Action

Workaround

A 5-minute watchdog task (Windows Scheduled Task checking port 18789) limits downtime to under a minute per crash, but does not prevent the crash itself.

PR fix notes

PR #55558: fix(discord): suppress expected reconnect-exhausted on stale-socket restart

Description (problem / solution / changelog)

Summary

  • treat reconnect-exhausted as expected whenever Discord gateway reconnect has already been forced into maxAttempts=0 intentional shutdown/restart mode
  • keep the lifecycle from surfacing the expected code 1005 reconnect exhaustion as an uncaught fatal error during stale-socket restarts
  • update the lifecycle regression test to cover the queued-event ordering that previously crashed the gateway

Testing

  • pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts (fails in this environment because node_modules is missing)
  • pnpm install (fails in this environment due to an npm DNS/network error: EAI_AGAIN to registry.npmjs.org)

Closes #55554

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +8/-8)
  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +5/-6)

PR #55699: fix(discord): suppress expected reconnect-exhausted on stale-socket restart

Description (problem / solution / changelog)

Summary

  • Problem: When the health-monitor triggers a stale-socket restart, it sets gateway.options.reconnect.maxAttempts = 0 then calls disconnect(). Carbon immediately emits "Max reconnect attempts (0) reached after code 1005", which lands in the supervisor's pending queue while the lifecycle is still awaiting an early step (e.g. execApprovalsHandler.start()).
  • Why it matters: The previous guard checked lifecycleStopping, which is only set in the finally block — after drainPendingGatewayErrors() has already run. The queued event fell through to throw event.err, crashing the entire gateway process (~54s downtime per occurrence, recurring every ~35 minutes).
  • What changed: Replaced the timing-sensitive lifecycleStopping check with isExpectedReconnectExhausted, which checks gateway.options.reconnect.maxAttempts === 0. This value is set synchronously before disconnect(), so it is reliably true at drain time regardless of async ordering. Added a regression test confirming that reconnect-exhausted with maxAttempts !== 0 (genuine Carbon retry exhaustion) still propagates as a fatal error.
  • What did NOT change: No change to reconnect logic, health-monitor behavior, or any other lifecycle path.
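
The guard described in this summary can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the GatewayLike and GatewayEvent shapes and the handleQueuedEvent helper are hypothetical, and only the name isExpectedReconnectExhausted and the gateway.options.reconnect.maxAttempts path come from the PR description. The real signature may differ.

```typescript
// Hypothetical, simplified shapes; the real Carbon/OpenClaw types are richer.
interface GatewayLike {
  options: { reconnect?: { maxAttempts?: number } };
}

interface GatewayEvent {
  type: "reconnect-exhausted" | "other";
  err: Error;
}

// Expected only when the restart path has already forced maxAttempts to 0.
// Strict equality matters: an undefined (default) value must not match.
function isExpectedReconnectExhausted(gateway: GatewayLike, event: GatewayEvent): boolean {
  return event.type === "reconnect-exhausted" && gateway.options.reconnect?.maxAttempts === 0;
}

// Returns "stop" for the expected case. For simplicity this sketch treats
// every other queued event as fatal and rethrows it.
function handleQueuedEvent(gateway: GatewayLike, event: GatewayEvent): "stop" {
  if (isExpectedReconnectExhausted(gateway, event)) {
    console.info(`expected during restart: ${event.err.message}`);
    return "stop";
  }
  throw event.err;
}
```

Because the check reads a value that is set synchronously before disconnect(), it gives the same answer no matter when the queued event is drained. A gateway with default options (maxAttempts undefined) or a non-zero limit still rethrows.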

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration
  • Integrations

Linked Issue/PR

  • Closes #55554
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

  • Root cause: lifecycleStopping is set in the finally block of runDiscordGatewayLifecycle, but drainPendingGatewayErrors() is called inside the try block — before finally ever runs. On a stale-socket restart, the reconnect-exhausted event is queued synchronously (Carbon emits in the same tick as disconnect()), so it is always drained before lifecycleStopping becomes true.
  • Missing detection / guardrail: No test covered the queued-before-drain timing window.
  • Prior context: The lifecycleStopping guard was added to handle intentional shutdowns, but did not account for the async ordering between drainPendingGatewayErrors and the finally block.
  • Why this regressed now: The health-monitor stale-socket path sets maxAttempts=0 and disconnects — Carbon's error fires faster than the lifecycle flag advances.
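
The drain-before-finally window can be reproduced in a few lines. The sketch below is a synchronous simulation under simplifying assumptions: runLifecycle, its log format, and the starting value 50 are hypothetical, and the real runDiscordGatewayLifecycle is asynchronous. It only illustrates why a flag raised in finally cannot protect a drain that runs inside try.

```typescript
// Simulates one lifecycle run: the error is queued early, drained inside try,
// and the stopping flag is only raised in finally.
function runLifecycle(useMaxAttemptsGuard: boolean): string[] {
  const log: string[] = [];
  const reconnect = { maxAttempts: 50 }; // arbitrary non-zero starting value
  let lifecycleStopping = false;
  const pending: Error[] = [];

  // Stale-socket restart: both steps happen synchronously, in this order.
  reconnect.maxAttempts = 0;
  pending.push(new Error("Max reconnect attempts (0) reached after code 1005"));

  try {
    for (const err of pending) {
      // Old guard reads the flag; new guard reads the already-set option.
      const expected = useMaxAttemptsGuard ? reconnect.maxAttempts === 0 : lifecycleStopping;
      log.push(`drain: expected=${expected}`);
      if (!expected) throw err;
    }
    log.push("resolved");
  } catch (err) {
    // In the real gateway this error escaped and crashed the process.
    log.push(`fatal: ${(err as Error).message}`);
  } finally {
    lifecycleStopping = true; // too late for the drain above
    log.push("finally: lifecycleStopping=true");
  }
  return log;
}
```

With runLifecycle(false) the drain sees expected=false and the run ends in the fatal branch. With runLifecycle(true) the same queued error is recognized and the run reaches "resolved".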

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
  • Target test: extensions/discord/src/monitor/provider.lifecycle.test.ts
  • Scenario: queue a reconnect-exhausted event before abort fires; verify lifecycle resolves cleanly (not rejects) when maxAttempts=0; verify it still rejects when maxAttempts is not 0.
  • Why this is the smallest reliable guardrail: directly exercises the drain-before-finally timing window without requiring real WebSocket infrastructure.

User-visible / Behavior Changes

Gateway no longer crashes on health-monitor stale-socket restarts. The restart proceeds silently (log at info level instead of an uncaught exception).

Diagram (if applicable)

Before:
[health-monitor abort] -> [maxAttempts=0, disconnect()] -> [Carbon emits error]
  -> [pending queue] -> [drainPendingGatewayErrors]
  -> lifecycleStopping=false -> throw event.err -> CRASH

After:
[health-monitor abort] -> [maxAttempts=0, disconnect()] -> [Carbon emits error]
  -> [pending queue] -> [drainPendingGatewayErrors]
  -> maxAttempts===0 -> log + return "stop" -> lifecycle resolves cleanly

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (darwin arm64) / Windows 10 x64 (as reported in issue)
  • Runtime: Node 24.14.1
  • Integration/channel: Discord

Steps

  1. Gateway running with Discord channel enabled
  2. Health-monitor detects stale socket and triggers restart
  3. WebSocket closes with code 1005

Expected

Gateway handles reconnect-exhausted gracefully, logs at info level, continues running.

Actual (before fix)

Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005 — gateway process crashes.

Evidence

  • Failing test before + passing after: repurposed existing test "does not suppress reconnect-exhausted already queued before shutdown" → now asserts resolves.toBeUndefined() and confirms the info log. Added "does not suppress reconnect-exhausted when maxAttempts is not 0" as the negative regression guard. All 22 tests pass.

Human Verification (required)

  • Verified scenarios: lifecycle test suite (22/22), full discord extension test suite — all pass.
  • Edge cases checked: maxAttempts undefined (default gateway options) still propagates as fatal; maxAttempts=0 with code 1005 resolves cleanly.
  • What I did not verify: live Discord environment with actual stale-socket trigger (no Discord bot token available).

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: isExpectedReconnectExhausted might suppress a genuine exhaustion if something else sets maxAttempts=0.
    • Mitigation: maxAttempts=0 is only set in one place (provider.lifecycle.reconnect.ts:404, the onAbort handler). The negative regression test guards against over-suppression.

🤖 Generated with Claude Code

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +24/-9)
  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +5/-6)

PR #55000: fix(discord): prevent gateway crash on abort by fixing supervisor teardown ordering

Description (problem / solution / changelog)

Summary

A transient Discord WebSocket disconnection (e.g. close code 1005) or an intentional health-monitor restart can trigger an uncaught Max reconnect attempts (0) exception that crashes the entire gateway process. This kills all channels (Feishu, WhatsApp, Telegram, etc.), not just Discord.

This PR fixes the root cause of these gateway crashes by correcting the supervisor phase transition ordering during an abort.

Fixes #54931 Fixes #54894

Root Cause

The gateway crash happens due to an error routing mismatch during teardown:

  1. onAbort() sets maxAttempts to 0 and calls gateway.disconnect().
  2. disconnect() synchronously triggers @buape/carbon's handleClose → handleReconnectionAttempt, which emits a Max reconnect attempts (0) error on the gateway emitter.
  3. The gateway-supervisor is still in the active phase at this point, so it routes the error to the lifecycle handler instead of suppressing it.
  4. The error surfaces as an uncaught exception, causing the entire gateway process to exit.

The supervisor already has the correct teardown suppression logic (it logs and swallows late errors during the teardown phase). The bug is simply that onAbort() never transitions the supervisor to the teardown phase before disconnecting.

Changes

1. Fix teardown ordering in onAbort()

Call params.gatewaySupervisor.detachLifecycle() before gateway.disconnect(). This ensures the supervisor enters the teardown phase and correctly suppresses the synchronous error emitted during disconnect.

2. Add lifecycleStopping safety net

In the catch block, when lifecycleStopping is already true, we no longer re-throw errors. This acts as a defense-in-depth guard for any edge case where an error might still escape during an intentional shutdown.
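
The teardown ordering from change 1 can be illustrated with a small model. Everything here is a hypothetical stand-in: SupervisorSketch, GatewaySketch, and the detachFirst flag exist only for this example, and the real supervisor and Carbon gateway have different APIs. The sketch assumes, as the PR describes, that disconnect() reports the error synchronously.

```typescript
type Phase = "active" | "teardown";

// Hypothetical stand-in for the gateway supervisor described in the PR.
class SupervisorSketch {
  phase: Phase = "active";
  readonly suppressed: string[] = [];

  detachLifecycle(): void {
    this.phase = "teardown";
  }

  // Active phase: route to the lifecycle (fatal in this sketch).
  // Teardown phase: record and swallow the late error.
  handleError(err: Error): void {
    if (this.phase === "teardown") {
      this.suppressed.push(err.message);
      return;
    }
    throw err;
  }
}

// Hypothetical gateway whose disconnect() reports the error synchronously
// when maxAttempts is 0.
class GatewaySketch {
  maxAttempts = 50; // arbitrary non-zero starting value
  readonly onError: (err: Error) => void;

  constructor(onError: (err: Error) => void) {
    this.onError = onError;
  }

  disconnect(): void {
    if (this.maxAttempts === 0) {
      this.onError(new Error("Max reconnect attempts (0) reached after code 1005"));
    }
  }
}

function onAbort(supervisor: SupervisorSketch, gateway: GatewaySketch, detachFirst: boolean): void {
  gateway.maxAttempts = 0;
  if (detachFirst) supervisor.detachLifecycle(); // the ordering fix
  gateway.disconnect();
}
```

Calling onAbort with detachFirst set to true records the error in suppressed and returns normally. With detachFirst set to false the supervisor is still active when disconnect() fires, so the error is thrown out of onAbort.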

Test Results

  • The extension-fast (extension-fast-discord, discord) CI check passes.
  • Verified locally that triggering an abort correctly suppresses the disconnect error and allows the gateway to restart gracefully without crashing the main process.

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +100/-4)
  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +25/-0)
RAW_BUFFER

Version

OpenClaw 2026.3.24 · Windows 10.0.19045 (x64) · Node 24.14.1

Steps to Reproduce

  1. Gateway running with Discord channel enabled
  2. Health-monitor detects a stale socket and triggers a restart
  3. WebSocket closes with code 1005
  4. Gateway process crashes with uncaught exception

Impact

  • Full gateway crash (~54s downtime before watchdog recovery)
  • All channels (Telegram, Discord) go offline until watchdog restarts the process
  • Recurring — happens every ~35 minutes since updating to 2026.3.24

Expected Behavior

Health-monitor stale-socket restarts should be handled gracefully without crashing the gateway process. SafeGatewayPlugin should catch the reconnection failure and recover internally rather than throwing an uncaught exception.

extent analysis

Fix Plan

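One way to address the crash inside SafeGatewayPlugin itself is to handle the WebSocket close event with code 1005 so that it no longer raises an uncaught exception. The linked PRs take a different route: they leave the reconnect logic unchanged and suppress the expected error in the Discord gateway lifecycle. The steps for the plugin-side approach are: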

  • Modify the handleClose method in SafeGatewayPlugin to catch the error and recover internally:
handleClose(code, reason) {
  if (code === 1005) {
    // Handle the stale socket restart and recover internally
    this.reconnectAttempts = 0;
    this.reconnect();
  } else {
    // Handle other close codes as before
  }
}
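  • Increase the reconnect attempts limit to avoid the "Max reconnect attempts reached" error on genuine connection failures. This does not help the stale-socket restart path, where the limit is deliberately set to 0 before disconnect() (see PR #55699):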
this.reconnectAttemptsLimit = 5; // Increase the limit to 5 attempts
  • Add a retry mechanism to the handleReconnectionAttempt method to handle temporary connection failures:
handleReconnectionAttempt() {
  try {
    // Reconnection attempt code here
  } catch (error) {
    if (this.reconnectAttempts < this.reconnectAttemptsLimit) {
      this.reconnectAttempts++;
      setTimeout(() => this.handleReconnectionAttempt(), 1000); // Retry after 1 second
    } else {
      // Handle max reconnect attempts reached error
    }
  }
}

Verification

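To verify the fix, restart the gateway process and simulate a stale-socket restart by closing the WebSocket connection without a status code. Code 1005 is reserved and cannot be sent in a close frame; it is what the receiving side reports when the close carries no status. The gateway process should recover internally without crashing.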

Extra Tips

  • Monitor the gateway process logs to ensure the fix is working as expected.
  • Consider adding additional logging and metrics to track reconnect attempts and failures.
  • Review the SafeGatewayPlugin code to ensure it is handling other potential errors and edge cases correctly.
