openclaw: Gateway should self-SIGTERM after K=3 failed Slack reconnects (silent 15h outage in v2026.5.7) [1 comment, 2 participants]

GitHub stats: openclaw/openclaw#81491, fetched 2026-05-14 03:31:32. Comments: 1. Participants: 2. Timeline events: 2 (commented ×1, cross-referenced ×1). Reactions: 2.


Summary

The channels.slack provider can die inside a live gateway with no recovery: the in-process health-monitor logs `restarting (reason: disconnected)`, but the restart silently fails and never escalates to a process-level restart. HTTP /health keeps returning 200, so external watchdogs that probe HTTP see nothing wrong.

This caused a 15h 30m silent Slack outage for me, starting on 2026-05-12.

Version

openclaw v2026.5.7 (Homebrew npm install on macOS, Darwin 25.3.0)

Repro

  1. Start gateway, observe slack socket mode connected.
  2. Drop the WebSocket inside the running gateway process (natural network event, or kill the underlying socket).
  3. Observe:
    [slack] socket disconnected (disconnect); reconnecting in 2s (attempt 1/12)
    [health-monitor] [slack:default] health-monitor: restarting (reason: disconnected)
  4. Observe no subsequent slack socket mode connected line. Ever.
  5. HTTP /health returns 200 throughout — process is alive, provider is dead.

My actual outage (timestamps from /tmp/openclaw/openclaw-2026-05-12.log)

2026-05-12 19:58:11  slack socket mode connected
2026-05-12 20:37:29  socket disconnected (disconnect); reconnecting in 2s (attempt 1/12)
2026-05-12 20:38:10  [slack:default] health-monitor: restarting (reason: disconnected)
                     ↑ 15h 29m of silence — no further slack lines, no error escalation
2026-05-13 12:07:48  manual `openclaw gateway stop`
2026-05-13 12:07:56  gateway ready
2026-05-13 12:07:57  slack socket mode connected   ← recovered in 1s once process restarted

The fact that a manual stop+start recovered in 1 second confirms the gateway-internal slack provider got stuck and only a fresh process recovers it.

Proposed fix

After K in-process reconnect attempts fail within a window (K=3 by default, configurable via e.g. channels.slack.reconnect.processRestartAfter), the gateway should call process.kill(process.pid, 'SIGTERM') so that launchd, systemd, or another watchdog respawns the whole process.

This is the only reliable recovery path observed; in-process provider restart appears unreliable under socket-state corruption.
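
A minimal sketch of what that escalation could look like, assuming the gateway can report each failed in-process reconnect (including a restart that never reaches "connected" before some deadline). Every name here (`ReconnectEscalator`, `EscalationOptions`, `windowMs`, `recordFailure`, `recordSuccess`) is hypothetical and not taken from the openclaw codebase; only `processRestartAfter` comes from the proposal above. The terminate action is injected so the counting logic can be exercised without killing anything.

```typescript
// Hypothetical sketch: none of these types exist in openclaw.
interface EscalationOptions {
  processRestartAfter: number; // K: failed reconnects that trigger escalation
  windowMs: number;            // W: sliding window, in milliseconds
}

class ReconnectEscalator {
  private readonly opts: EscalationOptions;
  private readonly terminate: () => void;
  private failures: number[] = [];
  private escalated = false;

  constructor(opts: EscalationOptions, terminate: () => void) {
    this.opts = opts;
    this.terminate = terminate;
  }

  // Record one failed in-process reconnect at time nowMs.
  // Returns true only on the call that triggers the escalation.
  recordFailure(nowMs: number): boolean {
    const cutoff = nowMs - this.opts.windowMs;
    // Keep only failures that are still inside the sliding window.
    this.failures = this.failures.filter((t) => t > cutoff);
    this.failures.push(nowMs);
    if (!this.escalated && this.failures.length >= this.opts.processRestartAfter) {
      this.escalated = true; // fire at most once per process lifetime
      this.terminate();
      return true;
    }
    return false;
  }

  // A successful reconnect clears the failure history.
  recordSuccess(): void {
    this.failures = [];
  }
}
```

In the gateway, the injected callback would be `() => process.kill(process.pid, 'SIGTERM')`, and each failed or timed-out reconnect would call `recordFailure(Date.now())`. One thing to check when wiring this up: under systemd, `Restart=on-failure` treats termination by SIGTERM as a clean exit, so the unit would need `Restart=always` for the respawn to actually happen.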

Why this matters

External HTTP watchdogs cannot detect this — process is healthy by every observable signal except channel-provider state. Operators must either:

  1. Build their own Slack-aware probe (what I did as a workaround — see below)
  2. Stare at logs and hope they notice the gap
  3. Accept silent outages

My workaround

I added an external slack_socket_healthy() function to my watchdog. It greps the gateway log for the most recent `slack socket mode connected` and `socket disconnected` timestamps and force-restarts the gateway if the socket has been disconnected for more than 300s. Code: https://github.com/Lakescape/DockBotclaw/pull/13

This shouldn't be necessary — fixing it upstream removes the need for external log parsing.
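
For readers who need a stopgap, here is a rough sketch of the same idea. It is not the code from the linked PR. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the excerpt above, and that timestamps can be compared as if they were UTC; the real log file may be formatted differently. The name `slackSocketHealthy` and its parameters are illustrative.

```typescript
// Assumed line shape: "2026-05-12 20:37:29  <message>" (taken from the excerpt above).
const LOG_LINE = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})\s+(.*)$/;

// Returns false when the most recent Slack socket event is a disconnect
// that is older than maxDisconnectedMs; returns true otherwise.
function slackSocketHealthy(
  logText: string,
  nowMs: number,
  maxDisconnectedMs: number = 300 * 1000,
): boolean {
  let lastConnected = -Infinity;
  let lastDisconnected = -Infinity;
  for (const line of logText.split("\n")) {
    const m = LOG_LINE.exec(line);
    if (m === null) continue; // skip lines without a leading timestamp
    const ts = Date.UTC(
      Number(m[1]), Number(m[2]) - 1, Number(m[3]),
      Number(m[4]), Number(m[5]), Number(m[6]),
    );
    const msg = m[7] ?? "";
    if (/slack socket mode connected/.test(msg)) {
      lastConnected = ts;
    } else if (/socket disconnected/.test(msg)) {
      lastDisconnected = ts;
    }
  }
  // Healthy if a connect is the latest event (a log with no Slack lines
  // at all is also treated as healthy by this sketch).
  if (lastDisconnected <= lastConnected) return true;
  return nowMs - lastDisconnected <= maxDisconnectedMs;
}
```

A watchdog would read the current log file, call this with `Date.now()`, and restart the gateway when it returns false. Matching on `socket disconnected` rather than just `disconnected` keeps the health-monitor's `restarting (reason: disconnected)` line from being counted as a second disconnect.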

Acceptance

  • Config flag channels.slack.reconnect.processRestartAfter (default 3) controls the escalation threshold
  • After K failed reconnects within a window W, the gateway sends SIGTERM to itself
  • Documented in CHANGELOG + README "monitoring" section
  • Backward compatible: existing in-process reconnect behavior is preserved up to the threshold
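
To illustrate the first and last criteria, here is one way the flag could be resolved so that a missing setting falls back to the default of 3. The config shape below is only inferred from the dotted path channels.slack.reconnect.processRestartAfter; `GatewayConfig` and `resolveProcessRestartAfter` are hypothetical names, and the actual openclaw config schema may differ.

```typescript
// Hypothetical config shape, inferred from the dotted path in the proposal.
interface GatewayConfig {
  channels?: {
    slack?: {
      reconnect?: { processRestartAfter?: number };
    };
  };
}

const DEFAULT_PROCESS_RESTART_AFTER = 3;

// Resolve the escalation threshold K, falling back to the default when unset.
function resolveProcessRestartAfter(config: GatewayConfig): number {
  const raw = config.channels?.slack?.reconnect?.processRestartAfter;
  if (raw === undefined) return DEFAULT_PROCESS_RESTART_AFTER;
  // Reject values that cannot be used as a failure count.
  if (!isFinite(raw) || Math.floor(raw) !== raw || raw < 1) {
    throw new RangeError("processRestartAfter must be a positive integer, got " + String(raw));
  }
  return raw;
}
```

Whether invalid values should throw at startup or silently fall back to the default is a design choice left open here; failing loudly seemed closer to the spirit of the issue.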

Happy to test against a pre-release if useful. Logs from outage available on request.
