openclaw - ✅(Solved) Fix BUG: Discord health-monitor triggers uncaught exception crash loop (v2026.3.24) [4 pull requests, 10 comments, 8 participants]

openclaw/openclaw#54931 (fetched 2026-04-08 01:34:25)

After upgrading from v2026.3.11 to v2026.3.24, the gateway crashes every ~35 minutes due to Discord's health-monitor detecting stale sockets and triggering a reconnection path that throws an uncaught exception. Zero crashes occurred across 5+ days on v2026.3.11. On v2026.3.24, 16 crashes occurred in a single day.

Error Message

[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (provider-CAlWEl41.js:3307:9)

Root cause in provider-CAlWEl41.js:

Line 6952 — onAbort sets: gateway.options.reconnect = { maxAttempts: 0 };

Lines 3316-3318 — Reconnection handler checks:

const { maxAttempts = 5 } = this.options.reconnect ?? {};
if (this.reconnectAttempts >= maxAttempts) {
    this.emitter.emit("error", new Error(`Max reconnect attempts (${maxAttempts}) reached...`));
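
To see why maxAttempts: 0 trips this check on the very first close, here is a minimal sketch. FakeGateway is a hypothetical stand-in, not the real SafeGatewayPlugin; it reproduces only the destructuring default and the comparison quoted above. It also relies on standard Node.js behavior: emitting "error" on an EventEmitter that has no "error" listener throws the error.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical stand-in for the gateway plugin; only the quoted check is reproduced.
class FakeGateway {
  constructor(options = {}) {
    this.options = options;
    this.reconnectAttempts = 0;
    this.emitter = new EventEmitter();
  }

  handleReconnectionAttempt(code) {
    // The default of 5 applies only when maxAttempts is undefined, so 0 stays 0.
    const { maxAttempts = 5 } = this.options.reconnect ?? {};
    if (this.reconnectAttempts >= maxAttempts) {
      // With no "error" listener attached, this emit throws.
      this.emitter.emit(
        "error",
        new Error(`Max reconnect attempts (${maxAttempts}) reached after code ${code}`)
      );
      return false;
    }
    this.reconnectAttempts += 1;
    return true;
  }
}

// Returns the thrown error message, or null if the attempt was allowed.
function tryReconnect(options) {
  const gateway = new FakeGateway(options);
  try {
    gateway.handleReconnectionAttempt(1005);
    return null;
  } catch (err) {
    return err.message;
  }
}

console.log(tryReconnect({})); // default limit of 5: attempt allowed
console.log(tryReconnect({ reconnect: { maxAttempts: 0 } })); // limit of 0: error on first close
```

In the real process nothing wraps the emit in a try/catch, which is consistent with the "Uncaught exception" line in the log above.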

Crash frequency data:

• Mar 17-24 (v2026.3.11): 0 crashes across 5+ days
• Mar 25 (v2026.3.24): 16 crashes in one day, every ~35 min

Fix / Workaround

Workaround: Disable Discord (channels.discord.enabled: false).

Note: Also observed a secondary issue — with Discord channel disabled but Discord plugin still enabled (plugins.entries.discord.enabled: true), message-action-discovery still tries to resolve the Discord token SecretRef, causing a separate crash ("Unhandled promise rejection: channels.discord.token: unresolved SecretRef"). Both the channel AND plugin must be disabled as workaround.
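
As a sketch of the combined workaround, the check below treats the configuration as a plain object with the two key paths named above. The issue does not show the actual config file format, so the object layout, the helper name isDiscordFullyDisabled, and the choice to treat a missing key as enabled are all illustrative assumptions.

```javascript
// Hypothetical helper: true only when both the channel and the plugin are disabled.
// Assumption: a missing key is treated as "enabled".
function isDiscordFullyDisabled(config) {
  const channelEnabled = config?.channels?.discord?.enabled ?? true;
  const pluginEnabled = config?.plugins?.entries?.discord?.enabled ?? true;
  return !channelEnabled && !pluginEnabled;
}

// Channel disabled, plugin still enabled: the secondary SecretRef crash is still possible.
const channelOnly = {
  channels: { discord: { enabled: false } },
  plugins: { entries: { discord: { enabled: true } } },
};

// Both disabled: the full workaround described in the note.
const channelAndPlugin = {
  channels: { discord: { enabled: false } },
  plugins: { entries: { discord: { enabled: false } } },
};

console.log(isDiscordFullyDisabled(channelOnly));
console.log(isDiscordFullyDisabled(channelAndPlugin));
```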

PR fix notes

PR #54973: fix(discord): suppress reconnect-exhausted crash during health-monitor restart

Description (problem / solution / changelog)

Summary

Fixes #54931 — Discord health-monitor triggers uncaught exception crash loop.

Root Cause

When the health-monitor detects a stale socket and restarts a Discord channel:

  1. stopChannel() aborts the controller
  2. onAbort() sets gateway.options.reconnect = { maxAttempts: 0 } and calls disconnect()
  3. Carbon emits "Max reconnect attempts" error asynchronously

Two race conditions caused this to crash:

  1. Error rethrown during intentional shutdown: startDiscordLifecycle catch block rethrew the reconnect-exhausted error even when lifecycleStopping=true (abort was intentional)
  2. Supervisor disposed too early: gatewaySupervisor.dispose() removed the error listener immediately in the finally block. Late async errors from Carbon had no listener and became uncaught EventEmitter errors crashing the process.

Fix

  1. provider.lifecycle.ts: Suppress reconnect-exhausted errors in the catch block when lifecycleStopping is true (intentional shutdown)
  2. provider.ts: Defer gatewaySupervisor.dispose() by 5 seconds so late errors are handled in the supervisor's teardown phase instead of crashing
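
The second change can be pictured with a small sketch: keep the "error" listener attached for a grace period after shutdown so a late error still has a handler. createSupervisor and disposeLater are hypothetical names, not the PR's code; only the 5 second delay comes from the description above.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical supervisor: records errors while its listener is attached.
function createSupervisor(emitter, log = []) {
  const onError = (err) => log.push(err.message);
  emitter.on("error", onError);
  return {
    log,
    // Remove the listener after delayMs instead of immediately.
    disposeLater(delayMs) {
      const timer = setTimeout(() => emitter.off("error", onError), delayMs);
      timer.unref(); // do not keep the process alive just for the cleanup
      return timer;
    },
  };
}

const emitter = new EventEmitter();
const supervisor = createSupervisor(emitter);
supervisor.disposeLater(5000);

// A late error arriving within the grace period is recorded, not thrown.
emitter.emit("error", new Error("Max reconnect attempts (0) reached after code 1005"));
console.log(supervisor.log);
```

A fixed delay is a heuristic: an error arriving after the grace period would again find no listener.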

Testing

The crash occurs ~35 minutes after startup when the health-monitor fires. Both changes are defensive guards that only activate during intentional shutdown (abort signal), so they cannot affect normal reconnection behavior.

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +8/-1)
  • extensions/discord/src/monitor/provider.ts (modified, +8/-1)

PR #54974: fix(discord): prevent gateway crash during health-monitor restart

Description (problem / solution / changelog)

Summary

The Discord health-monitor triggers an uncaught exception crash loop, bringing down the entire gateway every ~35 minutes after upgrading from v2026.3.11 to v2026.3.24.

Discord WebSocket reconnect failure crashes entire gateway

Root Cause

An uncaught exception occurred due to a timing issue during the channel teardown sequence:

  1. In the onAbort handler, setting gateway.options.reconnect = { maxAttempts: 0 } and calling gateway.disconnect() synchronously triggered a Max reconnect attempts (0) reached error from @buape/carbon.
  2. Because gatewaySupervisor.detachLifecycle() was previously only called in the finally block, the supervisor was still in the active phase when this synchronous error was emitted.
  3. The error was routed to the lifecycle handler, treated as a fatal reconnect-exhausted event, and eventually rejected the wait promise.
  4. The catch block in runDiscordGatewayLifecycle only swallowed disallowed-intents errors, so this reconnect error was re-thrown, causing an uncaught exception that crashed the entire gateway process.

Changes

extensions/discord/src/monitor/provider.lifecycle.ts

  • Transitioned the supervisor to the teardown phase by calling params.gatewaySupervisor.detachLifecycle() before disconnecting the gateway in the onAbort handler. This ensures the synchronous reconnect error is safely suppressed by the supervisor's existing logLateTeardownEvent mechanism.
  • Added a safety net in the catch block to swallow all errors when lifecycleStopping is true, preventing any residual reconnect errors from propagating as uncaught exceptions during an intentional shutdown.
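
The safety net in the second bullet can be sketched as follows. This is a simplified, synchronous illustration; runLifecycle, run, and isStopping are hypothetical names, and the real lifecycle code awaits a promise rather than calling a function directly.

```javascript
// Hypothetical, simplified synchronous version of the lifecycle catch block.
function runLifecycle({ run, isStopping }) {
  try {
    run();
    return "completed";
  } catch (err) {
    if (isStopping()) {
      // Intentional shutdown: swallow the error instead of rethrowing it.
      return `suppressed: ${err.message}`;
    }
    throw err; // Unexpected failure: propagate as before.
  }
}

const failing = () => {
  throw new Error("Max reconnect attempts (0) reached after code 1005");
};

console.log(runLifecycle({ run: failing, isStopping: () => true }));
```

When isStopping returns false, the same error is rethrown, so normal failure handling is unchanged.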

Test Results

All existing Discord lifecycle and supervisor tests pass. The idempotency of detachLifecycle() ensures the repeated call in the finally block remains safe.

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +15/-1)

PR #55000: fix(discord): prevent gateway crash on abort by fixing supervisor teardown ordering

Description (problem / solution / changelog)

Summary

A transient Discord WebSocket disconnection (e.g. close code 1005) or an intentional health-monitor restart can trigger an uncaught Max reconnect attempts (0) exception that crashes the entire gateway process. This kills all channels (Feishu, WhatsApp, Telegram, etc.), not just Discord.

This PR fixes the root cause of these gateway crashes by correcting the supervisor phase transition ordering during an abort.

Fixes #54931 Fixes #54894

Root Cause

The gateway crash happens due to an error routing mismatch during teardown:

  1. onAbort() sets maxAttempts to 0 and calls gateway.disconnect().
  2. disconnect() synchronously triggers @buape/carbon's handleClose → handleReconnectionAttempt, which emits a Max reconnect attempts (0) error on the gateway emitter.
  3. The gateway-supervisor is still in the active phase at this point, so it routes the error to the lifecycle handler instead of suppressing it.
  4. The error surfaces as an uncaught exception, causing the entire gateway process to exit.

The supervisor already has the correct teardown suppression logic (it logs and swallows late errors during the teardown phase). The bug is simply that onAbort() never transitions the supervisor to the teardown phase before disconnecting.

Changes

1. Fix teardown ordering in onAbort()

Call params.gatewaySupervisor.detachLifecycle() before gateway.disconnect(). This ensures the supervisor enters the teardown phase and correctly suppresses the synchronous error emitted during disconnect.

2. Add lifecycleStopping safety net

In the catch block, when lifecycleStopping is already true, we no longer re-throw errors. This acts as a defense-in-depth guard for any edge case where an error might still escape during an intentional shutdown.
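
The ordering fix can be illustrated with a toy supervisor. The class below is a hypothetical model built from the PR description (two phases, an idempotent detachLifecycle()); it is not the project's actual supervisor, and the synchronous emit inside disconnect is simulated.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical supervisor with two phases, loosely modeled on the PR description.
class GatewaySupervisor {
  constructor(emitter, onLifecycleError) {
    this.phase = "active";
    this.lateEvents = [];
    emitter.on("error", (err) => {
      if (this.phase === "teardown") {
        this.lateEvents.push(err.message); // log and swallow during teardown
      } else {
        onLifecycleError(err); // treated as fatal while active
      }
    });
  }

  // Idempotent: calling it again in a finally block is harmless.
  detachLifecycle() {
    this.phase = "teardown";
  }
}

// Simulates onAbort(); disconnect() synchronously emits the reconnect error.
function abort({ detachFirst }) {
  const emitter = new EventEmitter();
  const fatal = [];
  const supervisor = new GatewaySupervisor(emitter, (err) => fatal.push(err.message));
  const disconnect = () =>
    emitter.emit("error", new Error("Max reconnect attempts (0) reached"));

  if (detachFirst) supervisor.detachLifecycle(); // the fix: enter teardown first
  disconnect();
  supervisor.detachLifecycle(); // stands in for the existing call in finally
  return { fatal: fatal.length, late: supervisor.lateEvents.length };
}

console.log(abort({ detachFirst: false })); // { fatal: 1, late: 0 }
console.log(abort({ detachFirst: true })); // { fatal: 0, late: 1 }
```

Without the early call, the error reaches the lifecycle handler as fatal; with it, the same error is recorded as a late teardown event.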

Test Results

  • The extension-fast (extension-fast-discord, discord) CI check passes.
  • Verified locally that triggering an abort correctly suppresses the disconnect error and allows the gateway to restart gracefully without crashing the main process.

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +100/-4)
  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +25/-0)

PR #202: 🦅 Scout: Critical Inherited Defect Report - 2026-03-24

Description (problem / solution / changelog)

🦅 Scout: Critical Inherited Defect Report - 2026-03-24

Scanned the upstream OpenClaw repository and identified 3 defect patterns representing critical/high-impact regressions introduced in the recent v2026.3.24 release that are present in our local codebase:

Upstream Issue #54931: BUG: Discord health-monitor triggers uncaught exception crash loop

  • Location in our code: src/discord/monitor/provider.lifecycle.ts
  • Observed Behavior: The Discord health-monitor handles aborts by setting gateway.options.reconnect = { maxAttempts: 0 } and calling gateway.disconnect(). When the WebSocket close handler immediately fires, it checks reconnectAttempts(0) >= maxAttempts(0), evaluating to true, which emits a new Error that goes uncaught, crashing the entire Node.js process and disrupting all subagent sessions.
  • Expected Behavior: Health-monitor should gracefully restart the Discord channel without crashing the gateway process. The forced disconnection path shouldn't invoke the unexpected connection-retry limits error logic, or if it does, the error should be correctly caught instead of bringing down the gateway process.
  • Impact Severity: High — Gateway crashes periodically (every ~35 mins) if Discord goes stale. All running subagent sessions are disrupted or killed.

Upstream Issue #54936: BUG: Subagent runTimeoutSeconds default fallback resolves to infinite timeout instead of configured default

  • Location in our code: src/agents/subagent-registry.ts
  • Observed Behavior: When a subagent is spawned via sessions_spawn without explicitly passing a runTimeoutSeconds parameter, the parameter initializer mistakenly falls back to 0. The downstream timeout resolver interprets 0 as an explicit "disable timeout" instruction, causing the subagent to run indefinitely rather than falling back to the configured default (agents.defaults.subagents.runTimeoutSeconds).
  • Expected Behavior: Subagents should be killed after the configured default runtime timeout when no explicit timeout is passed during the spawn call.
  • Impact Severity: Medium — Subagents spawned without explicit timeouts will run indefinitely, potentially creating zombie processes that stall and exhaust system resources.
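
The fallback problem described for #54936 can be reproduced in a few lines. Both functions are hypothetical illustrations, not the code in subagent-registry.ts; the only behavior taken from the report is that 0 is read as "timeout disabled" and that a missing value should fall back to the configured default.

```javascript
// Buggy pattern: defaulting the parameter to 0 hides the "not provided" case,
// so the configured default is never consulted.
function buggyResolve(explicit = 0, configuredDefault) {
  return explicit ?? configuredDefault;
}

// Sketch of the expected behavior: only a missing value falls back;
// an explicit 0 (disable timeout) is preserved.
function resolveRunTimeoutSeconds(explicit, configuredDefault) {
  return explicit ?? configuredDefault;
}

console.log(buggyResolve(undefined, 600)); // 0, read downstream as "no timeout"
console.log(resolveRunTimeoutSeconds(undefined, 600)); // 600
console.log(resolveRunTimeoutSeconds(0, 600)); // 0, explicitly disabled
```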

Upstream Issue #54975: BUG: All tools with required parameters receive empty {} arguments - Gateway parameter parsing failure

  • Location in our code: Provider integrations (Anthropic/OpenAI compat)
  • Observed Behavior: Third-party providers fail to correctly serialize/send tool parameters, causing validation failures where the agent's intent is lost. Tools receive empty {} arguments instead of the correctly passed parameters.
  • Expected Behavior: Tools should receive the correctly passed parameters and execute successfully.
  • Impact Severity: High — Complete automation blockage when using specific models.

The .jules/scout.md journal has been updated with these patterns for future tracking.


PR created automatically by Jules for task 2111446781902950035 started by @MillionthOdin16


Summary by CodeRabbit

  • Documentation
    • Added documentation for known defect patterns and tracked issues.

Changed files

  • .jules/scout.md (added, +14/-0)
  • report.txt (added, +13/-0)

Bug type

Regression (worked before, now fails)

Summary

After upgrading from v2026.3.11 to v2026.3.24, the gateway crashes every ~35 minutes due to Discord's health-monitor detecting stale sockets and triggering a reconnection path that throws an uncaught exception. Zero crashes occurred across 5+ days on v2026.3.11. On v2026.3.24, 16 crashes occurred in a single day.

Steps to reproduce

  1. Install v2026.3.24 with Discord channel enabled (single guild, allowlist-only)
  2. Gateway runs normally for ~30-35 minutes
  3. Discord health-monitor detects a stale WebSocket (no events within staleSocketMinutes, default 30)
  4. Health-monitor calls stopChannel() → triggers onAbort()
  5. onAbort() sets gateway.options.reconnect = { maxAttempts: 0 } then calls gateway.disconnect()
  6. WebSocket closes with code 1005 ("No Status Received")
  7. handleClose(1005) → handleReconnectionAttempt() → checks reconnectAttempts(0) >= maxAttempts(0) → true
  8. Emits new Error("Max reconnect attempts (0) reached after code 1005")
  9. Error is uncaught → entire Node.js process crashes
  10. systemd restarts → cycle repeats every ~35 minutes
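
Step 3 is what starts the sequence, and it may explain the ~35 minute period. A stale-socket check could look roughly like the sketch below; isSocketStale is a hypothetical helper, and the inclusive comparison is an assumption. Only the staleSocketMinutes name and its default of 30 come from the steps above.

```javascript
// Hypothetical helper: a socket is stale once it has been idle for the threshold.
// Assumption: idle time equal to the threshold already counts as stale.
function isSocketStale(lastEventAtMs, nowMs, staleSocketMinutes = 30) {
  const idleMs = nowMs - lastEventAtMs;
  return idleMs >= staleSocketMinutes * 60_000;
}

const startedAt = 0;
console.log(isSocketStale(startedAt, 29 * 60_000)); // idle for 29 minutes
console.log(isSocketStale(startedAt, 35 * 60_000)); // idle for 35 minutes
```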

Expected behavior

Health-monitor should gracefully restart the Discord channel without crashing the gateway process.

Actual behavior

The onAbort handler sets maxAttempts: 0 before disconnecting. The WebSocket close handler then fires and immediately triggers the max-attempts error path (0 >= 0 is true), emitting an uncaught exception that crashes the entire Node.js process.

OpenClaw version

2026.3.24 (upgraded from 2026.3.11)

Operating system

Ubuntu 24.04 LTS (Linux 6.18.7 x64)

Install method

npm global

Model

anthropic/claude-opus-4-6 / anthropic/claude-sonnet-4-6

Provider / routing chain

openclaw -> anthropic (direct)

Additional provider/model setup details

Bug is in Discord WebSocket lifecycle management, not model-specific.

Logs, screenshots, and evidence

[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (provider-CAlWEl41.js:3307:9)

Root cause in provider-CAlWEl41.js:

Line 6952 — onAbort sets: gateway.options.reconnect = { maxAttempts: 0 };

Lines 3316-3318 — Reconnection handler checks:

const { maxAttempts = 5 } = this.options.reconnect ?? {};
if (this.reconnectAttempts >= maxAttempts) {
    this.emitter.emit("error", new Error(`Max reconnect attempts (${maxAttempts}) reached...`));

Crash frequency data:

• Mar 17-24 (v2026.3.11): 0 crashes across 5+ days
• Mar 25 (v2026.3.24): 16 crashes in one day, every ~35 min

Impact and severity

High — Gateway crashes every ~35 minutes. All running subagent sessions are disrupted or killed. Subagent completion announce-back fails after restart ("Outbound not configured for channel: telegram"). Long-running subagent tasks (30-75 min) have near-zero chance of completing.

Additional information

Suggested fixes:

  1. (Preferred) Set a flag to suppress the close handler rather than manipulating maxAttempts — lifecycleStopping already exists on line 6944, add a check in handleClose
  2. Set maxAttempts to a sentinel value that handleReconnectionAttempt treats as "intentional shutdown, don't emit error"
  3. Catch the error in the health-monitor's restart flow so it doesn't propagate as uncaught
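
Suggested fix 2 could be sketched as below. INTENTIONAL_SHUTDOWN is a hypothetical sentinel; the real library defines no such value, and the function is a standalone illustration rather than a patch to its reconnection handler.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical sentinel meaning "the close was requested, do not report exhaustion".
const INTENTIONAL_SHUTDOWN = -1;

function handleReconnectionAttempt(state, emitter, code) {
  const { maxAttempts = 5 } = state.options.reconnect ?? {};
  if (maxAttempts === INTENTIONAL_SHUTDOWN) {
    return "shutdown"; // stay quiet during an intentional stop
  }
  if (state.reconnectAttempts >= maxAttempts) {
    emitter.emit(
      "error",
      new Error(`Max reconnect attempts (${maxAttempts}) reached after code ${code}`)
    );
    return "exhausted";
  }
  state.reconnectAttempts += 1;
  return "retry";
}

const emitter = new EventEmitter(); // deliberately has no "error" listener
const stopping = {
  reconnectAttempts: 0,
  options: { reconnect: { maxAttempts: INTENTIONAL_SHUTDOWN } },
};
const running = { reconnectAttempts: 0, options: {} };

console.log(handleReconnectionAttempt(stopping, emitter, 1005)); // "shutdown"
console.log(handleReconnectionAttempt(running, emitter, 1005)); // "retry"
```

The trade-off is that the sentinel changes the meaning of maxAttempts, which is why the issue lists the flag-based approach as preferred.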

Workaround: Disable Discord (channels.discord.enabled: false).

Note: Also observed a secondary issue — with Discord channel disabled but Discord plugin still enabled (plugins.entries.discord.enabled: true), message-action-discovery still tries to resolve the Discord token SecretRef, causing a separate crash ("Unhandled promise rejection: channels.discord.token: unresolved SecretRef"). Both the channel AND plugin must be disabled as workaround.

extent analysis

Fix Plan

To resolve the issue, we will implement the preferred suggested fix: set a flag to suppress the close handler rather than manipulating maxAttempts.

  1. Add a flag: Introduce a new flag, isLifecycleStopping, to track whether the lifecycle is stopping.
  2. Modify onAbort: Set isLifecycleStopping to true in the onAbort handler.
  3. Check flag in handleClose: Add a check for isLifecycleStopping in the handleClose method. If true, suppress the close handler.

Example code:

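// Sketch only: GatewayLifecycle is a hypothetical wrapper, not the real provider class.
class GatewayLifecycle {
  constructor() {
    // Add flag: true only while an intentional shutdown is in progress
    this.isLifecycleStopping = false;
  }

  // Modify onAbort: mark the shutdown as intentional before disconnecting
  onAbort() {
    this.isLifecycleStopping = true;
    // ...
  }

  // Check flag in handleClose: skip the reconnect path during an intentional shutdown
  handleClose(code) {
    if (this.isLifecycleStopping) {
      this.isLifecycleStopping = false; // reset so later closes are handled normally
      return;
    }
    // ...
  }
}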

Verification

To verify the fix, monitor the gateway for crashes after implementing the changes. The gateway should no longer crash every ~35 minutes.
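
One way to make that monitoring concrete is to count occurrences of the crash signature in the gateway log. The helper below is a hypothetical sketch that works on log text already loaded into a string; where the log lives and how it is read depend on the deployment and are not covered here.

```javascript
// Counts log lines that contain the crash signature from this issue.
function countReconnectCrashes(logText) {
  const signature = "Uncaught exception: Error: Max reconnect attempts";
  return logText.split("\n").filter((line) => line.includes(signature)).length;
}

const sampleLog = [
  "[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)",
  "[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005",
].join("\n");

console.log(countReconnectCrashes(sampleLog));
```

After the fix, the health-monitor restart line may still appear, but the count of uncaught exceptions should stay at zero.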

Extra Tips

  • Ensure that the isLifecycleStopping flag is properly reset after the lifecycle stopping process is complete.
  • Consider adding additional logging to track the state of the isLifecycleStopping flag for debugging purposes.
  • Review the code for any other potential issues that may be related to the lifecycle management of the Discord WebSocket connection.
