openclaw - 💡(How to fix) Fix Reconnect supervisor hangs on zombie WSS — no event/log emitted, requires manual restart [1 comments, 2 participants]

openclaw/openclaw#77249 · Fetched 2026-05-05 05:50:45

Summary

The Slack socket-mode reconnect supervisor in provider-B16DhfKN.js waits exclusively on emitter events (disconnected, error, unable_to_socket_mode_start). When the underlying WebSocket enters a zombie state — TCP keepalive holding but ping/pong stalled mid-frame — none of those events fire, the supervisor blocks indefinitely on disconnectWaiter.promise, and no log line is emitted at all. The only recovery path is systemctl restart openclaw.

We hit this in production (2026-05-03 08:09:31Z–08:14:14Z, ~5min Slack unresponsiveness, openclaw 2026.5.2). Service was active (running) the entire time; journal was silent; bot did not reply to @-mentions until the systemd unit was restarted by an unrelated config reload.

Evidence

In dist/provider-B16DhfKN.js (2026.5.2):

  • Line 1769: OPENCLAW_SLACK_CLIENT_PING_TIMEOUT_MS = 15e3 (default 15s).
  • Line 1839: autoReconnectEnabled: false is hard-coded — Bolt's own auto-reconnect is intentionally disabled, leaving recovery entirely to openclaw's supervisor.
  • Lines 1718–1750: waitForSlackSocketDisconnect listens only to emitter events; no liveness probe, no timeout.
  • Lines 2984–3025: supervisor while loop awaits disconnectWaiter.promise with no upper bound.

If the underlying @slack/socket-mode client never emits any of those three events (which can happen on stalled keepalive WSS), the supervisor and the entire Slack channel are wedged.

Trigger amplification

socketMode.serverPingTimeout/clientPingTimeout raise the dead-window between a real ping miss and the eventual disconnected emission. Increasing them from the 15s default (we tried 30s) measurably worsened detection latency without addressing the underlying gap. We have since reverted that config — but the underlying race exists at any timeout value, since the failure mode requires no event at all, not just a delayed one.

Reproduction (observational, not deterministic)

We cannot reproduce on demand — the trigger appears to be an intermittent network-stack condition (NLB/NAT idle, TCP keepalive misalignment with the WS ping cadence). What we observed:

  1. openclaw running normally, Slack channel responsive.
  2. Last journal entry from openclaw.service at T0; service status active (running).
  3. From T0 onward: journal silent, bot ignores all messages, systemctl status healthy.
  4. An external systemctl restart openclaw (or any other restart) recovers the channel immediately.

Proposed fix

Either of these would close the gap:

  1. Message-flow watchdog: track the timestamp of the last received Slack event (any kind — message, ping ack, hello). If no event in N seconds (configurable, default ~120s), abort disconnectWaiter and reconnect.
  2. Periodic health probe: every 60–120s during steady-state, call auth.test (or any cheap web-API call) on a separate code path. On failure, abort and reconnect.

(1) is preferable because it reuses the existing event stream and adds no new Slack API traffic.
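For comparison, option (2) could look roughly like the sketch below. Everything here is an assumption for illustration: `startHealthProbe`, its option names, and the `onUnhealthy` callback are hypothetical, and `probe` is an injected async function that in practice might wrap a cheap Web API call such as auth.test. The probe gets its own timeout so that a hung HTTP call cannot wedge the prober the same way the socket wedged the supervisor.

```javascript
// Periodic health probe (sketch). Calls `probe` every `intervalMs`; if the
// probe rejects or does not settle within `probeTimeoutMs`, it invokes
// `onUnhealthy` once and stops. Returns a function that cancels the prober.
function startHealthProbe({ probe, intervalMs, probeTimeoutMs, onUnhealthy }) {
  let stopped = false;
  let timer = null;

  async function tick() {
    if (stopped) return;
    let deadline;
    try {
      await Promise.race([
        probe(),
        new Promise((_, reject) => {
          deadline = setTimeout(
            () => reject(new Error("probe timed out")),
            probeTimeoutMs,
          );
        }),
      ]);
    } catch (err) {
      if (stopped) return;
      stopped = true;
      onUnhealthy(err); // caller aborts the disconnect wait and reconnects
      return;
    } finally {
      clearTimeout(deadline);
    }
    if (!stopped) timer = setTimeout(tick, intervalMs);
  }

  timer = setTimeout(tick, intervalMs);
  return function stop() {
    stopped = true;
    clearTimeout(timer);
  };
}
```

The supervisor would call the returned `stop` function whenever a normal disconnect event arrives, so the prober does not outlive the connection it is guarding. This costs one API call per interval, which is the trade-off that makes option (1) the cheaper choice.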

Workaround (current, defense-in-depth)

We added a third detection path to our external watchdog (stuck-session-watchdog.sh): if openclaw.service is active (running) but the journal has been silent for ≥5 minutes, issue systemctl restart openclaw. This recovers within one watchdog cycle (60s) but is brittle — it depends on every healthy openclaw run emitting at least one log line per 5 minutes.

Environment

  • openclaw 2026.5.2 (npm global install)
  • @slack/bolt + @slack/socket-mode (versions per 2026.5.2 lockfile)
  • Node.js on Linux, behind AWS NLB
  • Long-lived WSS to Slack

Related

  • 85a6a98 (your commit raising socketMode timeouts to 30s) improved happy-path latency but widened the dead-window for fault detection — orthogonal to this issue but worth flagging.

extent analysis

TL;DR

Implement a message-flow watchdog or periodic health probe to detect and recover from the Slack WebSocket zombie state.

Guidance

  • Introduce a message-flow watchdog to track the timestamp of the last received Slack event and abort disconnectWaiter if no event is received within a configurable time (e.g., 120s).
  • Alternatively, implement a periodic health probe (e.g., every 60-120s) using a cheap web-API call like auth.test to detect and recover from the zombie state.
  • Review the stuck-session-watchdog.sh script to ensure it correctly detects and restarts openclaw.service when the journal has been silent for an extended period.
  • Consider increasing the frequency of the external watchdog cycle to reduce recovery time.

Example

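// Example message-flow watchdog (sketch).
// onSlackEvent and abortAndReconnect are hypothetical hooks, not openclaw APIs.
const WATCHDOG_TIMEOUT_MS = 120000; // 120s without any event counts as a zombie socket
const CHECK_INTERVAL_MS = 1000; // check once per second
let lastEventAt = Date.now();

// Refresh the timestamp on every received Slack event (message, ping ack, hello)
onSlackEvent(() => {
  lastEventAt = Date.now();
});

// Periodically check for inactivity; on timeout, abort disconnectWaiter and reconnect
const watchdog = setInterval(() => {
  if (Date.now() - lastEventAt > WATCHDOG_TIMEOUT_MS) {
    clearInterval(watchdog); // fire at most once per connection
    abortAndReconnect(new Error("no Slack event within watchdog timeout"));
  }
}, CHECK_INTERVAL_MS);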

Notes

Both proposed fixes address the symptom (no events arriving on a socket that still looks open), not the trigger. The underlying network-stack condition, such as NLB/NAT idle handling or TCP keepalive misaligned with the WS ping cadence, may still occur; the watchdog or probe only bounds the resulting outage to roughly the configured timeout.

Recommendation

Apply the message-flow watchdog fix: it reuses the existing event stream and adds no new Slack API traffic, which makes it the preferable option.
