openclaw - ✅ (Solved) Fix [Bug]: Discord stale-socket health-monitor recoveries are frequent and restart scope is unclear [1 pull request, 1 comment, 2 participants]

openclaw/openclaw#54851 · Fetched 2026-04-08 01:35:18

PR fix notes

PR #55000: fix(discord): prevent gateway crash on abort by fixing supervisor teardown ordering

Description (problem / solution / changelog)

Summary

A transient Discord WebSocket disconnection (e.g. close code 1005) or an intentional health-monitor restart can trigger an uncaught Max reconnect attempts (0) exception that crashes the entire gateway process. This kills all channels (Feishu, WhatsApp, Telegram, etc.), not just Discord.

This PR fixes the root cause of these gateway crashes by correcting the supervisor phase transition ordering during an abort.

Fixes #54931 Fixes #54894

Root Cause

The gateway crash happens due to an error routing mismatch during teardown:

  1. onAbort() sets maxAttempts to 0 and calls gateway.disconnect().
  2. disconnect() synchronously triggers the handleClose → handleReconnectionAttempt path in @buape/carbon, which emits a Max reconnect attempts (0) error on the gateway emitter.
  3. The gateway-supervisor is still in the active phase at this point, so it routes the error to the lifecycle handler instead of suppressing it.
  4. The error surfaces as an uncaught exception, causing the entire gateway process to exit.

The supervisor already has the correct teardown suppression logic (it logs and swallows late errors during the teardown phase). The bug is simply that onAbort() never transitions the supervisor to the teardown phase before disconnecting.

Changes

1. Fix teardown ordering in onAbort()

Call params.gatewaySupervisor.detachLifecycle() before gateway.disconnect(). This ensures the supervisor enters the teardown phase and correctly suppresses the synchronous error emitted during disconnect.
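
The ordering problem can be reproduced in isolation. The following is a minimal sketch, not the PR's actual code: createSupervisor, createGateway, and abort are hypothetical stand-ins, and only detachLifecycle(), disconnect(), and maxAttempts are names taken from the description above. It relies on the fact that Node's EventEmitter calls listeners synchronously, so an error thrown by a listener propagates out of the disconnect() call.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical stand-in for the gateway supervisor: it routes gateway errors
// by phase. Only detachLifecycle() is a name taken from the PR description.
function createSupervisor(gateway, onLifecycleError) {
  let phase = 'active';
  const suppressed = [];
  gateway.on('error', (err) => {
    if (phase === 'teardown') {
      suppressed.push(err.message); // late error: record and swallow
      return;
    }
    onLifecycleError(err); // active phase: forward to the lifecycle handler
  });
  return {
    detachLifecycle() { phase = 'teardown'; },
    suppressed,
  };
}

// Hypothetical gateway whose disconnect() emits the error synchronously,
// mirroring the behaviour described for the reconnection path.
function createGateway() {
  const gateway = new EventEmitter();
  gateway.maxAttempts = 5;
  gateway.disconnect = () => {
    gateway.emit('error', new Error(`Max reconnect attempts (${gateway.maxAttempts})`));
  };
  return gateway;
}

// Simulates onAbort() with either ordering.
function abort({ detachFirst }) {
  const gateway = createGateway();
  const supervisor = createSupervisor(gateway, (err) => { throw err; });
  gateway.maxAttempts = 0;
  if (detachFirst) supervisor.detachLifecycle(); // the ordering fix
  gateway.disconnect();
  return supervisor.suppressed;
}

let crashed = null;
try { abort({ detachFirst: false }); } catch (err) { crashed = err.message; }
console.log('disconnect first:', crashed);
console.log('detach first:', abort({ detachFirst: true }));
```

With the old ordering the error escapes from disconnect(); with detachLifecycle() called first, the same error is recorded and swallowed.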

2. Add lifecycleStopping safety net

In the catch block, when lifecycleStopping is already true, we no longer re-throw errors. This acts as a defense-in-depth guard for any edge case where an error might still escape during an intentional shutdown.
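
A guard of this kind might look roughly like the sketch below. It assumes a hypothetical runLifecycle wrapper with illustrative run, isStopping, and log parameters; the real catch block in provider.lifecycle.ts may be structured differently.

```javascript
// Hypothetical lifecycle runner: `run`, `isStopping`, and `log` are
// illustrative names, not the project's actual API.
async function runLifecycle({ run, isStopping, log = console.warn }) {
  try {
    await run();
    return 'completed';
  } catch (err) {
    if (isStopping()) {
      // An intentional shutdown is already in progress: log and swallow.
      log(`suppressed error during shutdown: ${err.message}`);
      return 'suppressed';
    }
    throw err; // unexpected failure while running: surface it
  }
}

runLifecycle({
  run: async () => { throw new Error('Max reconnect attempts (0)'); },
  isStopping: () => true,
}).then((outcome) => console.log(outcome));
```

Errors raised while the lifecycle is not stopping are still re-thrown, so genuine failures are not hidden by the guard.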

Test Results

  • The extension-fast (extension-fast-discord, discord) CI check passes.
  • Verified locally that triggering an abort correctly suppresses the disconnect error and allows the gateway to restart gracefully without crashing the main process.

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +100/-4)
  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +25/-0)

Bug Description

The Discord health-monitor frequently logs restarting (reason: stale-socket), and these events occasionally coincide with actual BOOT/restart notifications. However, this does not appear to be a simple regression limited to v2026.3.24.

After deeper log review, the underlying Discord socket instability (stale-socket and websocket close code 1005) has existed since at least 2026-03-17. What is unclear from current logs is which restart actions are provider/channel-level recoveries versus full gateway restarts.

Why this issue is important

The problem is easy to underestimate because the service usually self-recovers quickly, so Discord may appear responsive most of the time. But the logs show chronic health-monitor intervention over many days.

This creates three practical problems:

  1. Frequent Discord connection recovery actions indicate unhealthy long-lived socket stability.
  2. Some events now appear to coincide with BOOT.md restart notifications, suggesting at least some recovery paths escalate beyond simple socket reconnects.
  3. Current logs do not make it easy to distinguish:
    • provider-level reconnect/restart
    • channel-level restart
    • full gateway process restart

Environment

  • OpenClaw version currently observed: 2026.3.24 (cff6dc9)
  • Node: v25.8.1
  • OS: Darwin 25.3.0 (arm64)
  • Channels configured: Discord + Feishu
  • Gateway managed via LaunchAgent on macOS

Observed Symptoms

Long-running Discord instability

The following patterns appear repeatedly in logs over multiple days:

[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
[discord] gateway: WebSocket connection closed with code 1005

Important clarification

This is not just a one-day issue after upgrading to 2026.3.24.

Historical evidence shows repeated stale-socket and code 1005 events since 2026-03-17.

Today’s stronger symptom

On 2026-03-26, the user reports receiving multiple BOOT.md gateway-restart notifications on Feishu. That suggests at least some of these health-monitor recovery events are reaching the restart-notification path, not just silently reconnecting a Discord socket.

Log Evidence

Examples of repeated stale-socket recoveries

2026-03-26T00:22:36.826+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T00:57:36.833+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T01:32:36.846+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T02:27:36.874+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T03:02:36.917+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T03:37:36.942+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T04:12:36.967+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T04:47:36.981+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T05:22:37.002+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T05:57:37.021+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T06:42:37.046+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T09:37:15.419+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T10:12:21.075+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T11:17:27.190+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
2026-03-26T12:51:08.554+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
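
The spacing between these events is easier to judge when computed. Below is a small sketch that assumes each line starts with an ISO 8601 timestamp, as in the excerpt above; staleSocketGapsMinutes is a hypothetical helper name. It returns the gap between consecutive stale-socket events in whole minutes.

```javascript
// Parse "<ISO timestamp> [health-monitor] ... (reason: stale-socket)" lines
// and return the gap between consecutive events, rounded to whole minutes.
function staleSocketGapsMinutes(logText) {
  const times = logText
    .split('\n')
    .filter((line) => line.includes('reason: stale-socket'))
    .map((line) => Date.parse(line.trim().split(' ')[0]))
    .filter((ms) => !Number.isNaN(ms)); // skip lines without a timestamp
  return times.slice(1).map((ms, i) => Math.round((ms - times[i]) / 60000));
}

// First four lines of the excerpt above.
const sample = [
  '2026-03-26T00:22:36.826+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)',
  '2026-03-26T00:57:36.833+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)',
  '2026-03-26T01:32:36.846+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)',
  '2026-03-26T02:27:36.874+08:00 [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)',
].join('\n');

console.log(staleSocketGapsMinutes(sample)); // [ 35, 35, 55 ]
```

Many of the gaps in the full excerpt round to 35 minutes, which could be worth comparing against the health-monitor's check interval and staleness threshold.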

Matching websocket closures

[discord] gateway: WebSocket connection closed with code 1005

These also appear repeatedly across earlier days.

Historical scope

The same stale-socket / code-1005 pattern is visible back to 2026-03-17, so this should not be framed as a cleanly isolated regression introduced only in 2026.3.24.

Current Assessment

This looks like a chronic Discord socket health problem and/or overly aggressive stale-socket detection, with recovery behavior that is insufficiently observable.

The key unresolved question is:

When health-monitor logs restarting, what is actually being restarted?

The current behavior makes it difficult to tell whether a given recovery event is:

  • only reconnecting the Discord websocket
  • restarting the Discord provider/channel
  • or escalating into a full gateway restart

Because BOOT.md notifications were observed today, at least some events appear to reach a deeper restart path.

Expected Behavior

  1. Health-monitor should expose more precise diagnostics for Discord socket health.
  2. Logs should clearly distinguish:
    • websocket reconnect
    • provider/channel restart
    • full gateway restart
  3. stale-socket detection should avoid excessive false positives if the connection is still recoverable.
  4. Restart-notification paths should be traceable so operators can tell whether BOOT.md notifications correspond to actual process restarts.
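
One possible way to make points 2 and 4 traceable is to stamp every recovery log line with an identifier generated once per process. The sketch below is an assumption for illustration only: bootId, recoveryLogLine, and the JSON field names are hypothetical and do not describe OpenClaw's existing log format.

```javascript
const { randomUUID } = require('node:crypto');

// Generated once per process: if bootId differs between two log lines,
// the process itself restarted in between. All names are illustrative.
const bootId = randomUUID();

const SCOPES = ['discord-reconnect', 'provider-restart', 'gateway-restart'];

function recoveryLogLine({ channel, reason, scope }) {
  if (!SCOPES.includes(scope)) {
    throw new RangeError(`unknown recovery scope: ${scope}`);
  }
  return JSON.stringify({
    source: 'health-monitor',
    channel,
    reason,
    scope, // what is actually being restarted
    bootId,
    pid: process.pid,
  });
}

console.log(recoveryLogLine({
  channel: 'discord:default',
  reason: 'stale-socket',
  scope: 'provider-restart',
}));
```

An operator could then match a BOOT.md notification to a real process restart by checking whether bootId changed around that time.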

Suggested Improvements

  • Include additional structured diagnostics when stale-socket is detected:
    • socket state
    • last inbound activity timestamp
    • ping/pong timing
    • reconnect attempt count
  • Log explicit recovery mode labels, e.g.:
    • discord-reconnect
    • provider-restart
    • gateway-restart
  • Clarify whether code 1005 should be treated as expected transient behavior or restart-worthy failure.
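
For context on the last point: under RFC 6455, close code 1005 is a reserved value meaning that no status code was present in the close frame, so it carries little information about why the socket closed. The sketch below shows one way a classification could be written. The descriptions follow the RFC, but the choice of action per code is an assumption for illustration, not project policy, and classifyClose is a hypothetical helper.

```javascript
// Close codes defined by RFC 6455, section 7.4.1. Codes 1005 and 1006 are
// reserved: they are reported locally and never sent in a Close frame.
const CLOSE_CODES = {
  1000: 'normal closure',
  1001: 'going away',
  1005: 'no status code present in the close frame',
  1006: 'connection closed abnormally, without a close frame',
};

// Assumption for illustration: codes that carry no failure detail are
// treated as transient (reconnect); unknown codes are flagged for review.
function classifyClose(code) {
  const description = CLOSE_CODES[code] ?? 'unrecognised close code';
  let action = 'investigate';
  if (code === 1000) action = 'none';
  else if (code === 1001 || code === 1005 || code === 1006) action = 'reconnect';
  return { code, description, action };
}

console.log(classifyClose(1005));
```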

Why I’m filing this

Even though the bot usually remains responsive, the health-monitor is intervening frequently enough that operators cannot tell whether the system is healthy or repeatedly self-recovering from a latent Discord transport issue.

Extended analysis

Fix Plan

To address the chronic Discord socket health problem and improve diagnostics, follow these steps:

  1. Enhance logging for stale-socket detection:
    • Include socket state, last inbound activity timestamp, ping/pong timing, and reconnect attempt count when logging stale-socket events.
    • Example log format: [discord] gateway: WebSocket connection closed with code 1005, socket state: <state>, last activity: <timestamp>, ping/pong: <timing>, reconnect attempts: <count>
  2. Implement explicit recovery mode labels:
    • Log discord-reconnect, provider-restart, or gateway-restart labels to clearly distinguish between recovery modes.
    • Example log format: [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket, mode: discord-reconnect)
  3. Improve restart-notification paths:
    • Include a unique identifier or correlation ID in BOOT.md notifications to trace the corresponding restart event.
    • Example log format: [health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket, mode: gateway-restart, correlation-id: <id>)
  4. Review and adjust stale-socket detection:
    • Investigate whether code 1005 should be treated as expected transient behavior or restart-worthy failure.
    • Consider implementing a retry mechanism with exponential backoff to reduce false positives.

Example code snippet (Node.js):

// Stand-in logger so the snippet runs on its own; replace with the project's logger.
const logger = console;

// Step 1: log a stale-socket closure together with the proposed diagnostics.
function logStaleSocket(socketState, lastActivity, pingPongTiming, reconnectAttempts) {
  logger.info(`[discord] gateway: WebSocket connection closed with code 1005, socket state: ${socketState}, last activity: ${lastActivity}, ping/pong: ${pingPongTiming}, reconnect attempts: ${reconnectAttempts}`);
}

// Step 2: log which recovery mode ran (discord-reconnect, provider-restart, or gateway-restart).
function logRecoveryMode(mode) {
  logger.info(`[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket, mode: ${mode})`);
}

// Step 3: log a full gateway restart with a correlation ID that a BOOT.md notification can reference.
function logRestartNotification(correlationId) {
  logger.info(`[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket, mode: gateway-restart, correlation-id: ${correlationId})`);
}
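
Step 4 of the plan mentions exponential backoff but gives no example. The following is a rough sketch under stated assumptions: isStale and backoffDelayMs are hypothetical helpers, and the default base delay and cap are placeholder values rather than recommended settings.

```javascript
// A socket counts as stale when nothing has been received for `thresholdMs`.
function isStale(lastInboundAtMs, nowMs, thresholdMs) {
  return nowMs - lastInboundAtMs >= thresholdMs;
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ... capped at maxMs.
// `attempt` is zero-based. The defaults are placeholders, not tuned values.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Delays for the first five reconnect attempts.
console.log([0, 1, 2, 3, 4].map((n) => backoffDelayMs(n)));
```

Spacing reconnect attempts this way gives a recoverable connection time to come back before a heavier restart is considered.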

Verification

To verify the fix, monitor the logs for the following:

  • Improved logging for stale-socket detection, including socket state, last inbound activity timestamp, ping/pong timing, and reconnect attempt count.
  • Explicit recovery mode labels (discord-reconnect, provider-restart, or gateway-restart) in the logs.
  • Unique identifiers or correlation IDs in BOOT.md notifications to trace the corresponding restart event.
  • Reduced frequency of stale-socket events and restart notifications.

Extra Tips

  • Regularly review logs
