openclaw: Gateway should self-SIGTERM after K=3 failed Slack reconnects (silent 15h outage in v2026.5.7) [1 comment, 2 participants]

openclaw • 2026-05-13 17:45:19


openclaw/openclaw#81491•Fetched 2026-05-14 03:31:32


Author

Lakescape

Participants

clawsweeper[bot]

Lakescape


The channels.slack provider can die inside a live gateway with no recovery: the in-process health-monitor logs `restarting (reason: disconnected)`, but the restart silently fails and never escalates to a process-level restart. HTTP /health keeps returning 200, so external watchdogs that probe HTTP see nothing wrong.


Summary

Caused a 15h 30m silent Slack outage for me on 2026-05-12.

Version

openclaw@2026.5.7 (Homebrew npm install on macOS, Darwin 25.3.0)

Repro

1. Start the gateway and observe `slack socket mode connected`.
2. Drop the WebSocket inside the running gateway process (natural network event, or kill the underlying socket).
3. Observe:

   [slack] socket disconnected (disconnect); reconnecting in 2s (attempt 1/12)
   [health-monitor] [slack:default] health-monitor: restarting (reason: disconnected)

4. Observe no subsequent `slack socket mode connected` line. Ever.
5. HTTP /health returns 200 throughout: the process is alive, the provider is dead.

My actual outage (timestamps from `/tmp/openclaw/openclaw-2026-05-12.log`)

2026-05-12 19:58:11  slack socket mode connected
2026-05-12 20:37:29  socket disconnected (disconnect); reconnecting in 2s (attempt 1/12)
2026-05-12 20:38:10  [slack:default] health-monitor: restarting (reason: disconnected)
                     ↑ 15h 29m of silence — no further slack lines, no error escalation
2026-05-13 12:07:48  manual `openclaw gateway stop`
2026-05-13 12:07:56  gateway ready
2026-05-13 12:07:57  slack socket mode connected   ← recovered in 1s once process restarted

The fact that a manual stop+start recovered in 1 second confirms the gateway-internal slack provider got stuck and only a fresh process recovers it.

Proposed fix

After K in-process reconnect attempts fail within a window (default K=3, configurable, e.g. channels.slack.reconnect.processRestartAfter), the gateway should call process.kill(process.pid, 'SIGTERM') so that launchd, systemd, or another watchdog respawns the whole process.

This is the only reliable recovery path observed; in-process provider restart appears unreliable under socket-state corruption.
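
To make the proposal concrete, here is a minimal sketch of the escalation logic. It is not taken from the openclaw codebase: `ReconnectEscalator`, `EscalationOptions`, `windowMs`, `recordFailure` and `recordSuccess` are hypothetical names, and only `processRestartAfter` and the `process.kill(process.pid, 'SIGTERM')` call come from the text above. Since the outage log shows no explicit failure line after the restart, this assumes a reconnect that does not complete within some timeout would also be reported as a failure.

```typescript
// Hypothetical sketch; none of these types exist in openclaw as far as this issue shows.
interface EscalationOptions {
  processRestartAfter: number; // K: failed reconnects that trigger escalation
  windowMs: number;            // W: sliding window in which failures are counted
}

class ReconnectEscalator {
  private failures: number[] = []; // timestamps (ms) of recent failed reconnects
  private readonly opts: EscalationOptions;
  private readonly escalate: () => void;

  // `escalate` is injected so the logic can be tested without killing anything.
  // In the gateway it would presumably be:
  //   () => process.kill(process.pid, "SIGTERM")
  constructor(opts: EscalationOptions, escalate: () => void) {
    this.opts = opts;
    this.escalate = escalate;
  }

  // Call when an in-process reconnect attempt fails or times out.
  recordFailure(nowMs: number): void {
    const cutoff = nowMs - this.opts.windowMs;
    // Keep only failures that are still inside the window, then add this one.
    this.failures = this.failures.filter((t) => t > cutoff);
    this.failures.push(nowMs);
    if (this.failures.length >= this.opts.processRestartAfter) {
      this.failures = [];
      this.escalate();
    }
  }

  // Call when the socket reconnects successfully.
  recordSuccess(): void {
    this.failures = [];
  }
}
```

Up to the threshold, the existing in-process reconnect path is untouched; the escalator only observes its outcomes and hands control to the process supervisor once K failures accumulate inside the window.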

Why this matters

External HTTP watchdogs cannot detect this — process is healthy by every observable signal except channel-provider state. Operators must either:

- Build their own Slack-aware probe (what I did as a workaround; see below)
- Stare at logs and hope they notice the gap
- Accept silent outages

My workaround

I added an external slack_socket_healthy() function to my watchdog. It grep-parses the gateway log, compares the timestamps of the last `slack socket mode connected` and `socket disconnected` lines, and force-restarts the gateway if the socket has been disconnected for more than 300s. Code: https://github.com/Lakescape/DockBotclaw/pull/13

This shouldn't be necessary — fixing it upstream removes the need for external log parsing.
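
For readers who need a stopgap, the check described above could look roughly like the sketch below. This is not the code from the linked PR: `slackSocketHealthy` and `parseLine` are hypothetical, and the `YYYY-MM-DD HH:MM:SS  <message>` line shape is assumed from the outage excerpt, so the real gateway log format may need a different pattern.

```typescript
// Assumed line shape: "YYYY-MM-DD HH:MM:SS  <message>" (from the excerpt in this issue).
const LOG_LINE = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})\s+(.*)$/;

function parseLine(line: string): { atMs: number; message: string } | null {
  const m = LOG_LINE.exec(line);
  if (m === null) return null;
  // Timestamps are read as UTC; only differences matter, so the zone cancels out
  // as long as `nowMs` is produced the same way.
  const atMs = Date.UTC(
    Number(m[1]), Number(m[2]) - 1, Number(m[3]),
    Number(m[4]), Number(m[5]), Number(m[6]),
  );
  return { atMs, message: m[7] ?? "" };
}

// Returns false when the most recent socket event is a disconnect that is
// older than `maxDisconnectedMs` (300s by default, as in the workaround).
function slackSocketHealthy(
  logText: string,
  nowMs: number,
  maxDisconnectedMs: number = 300 * 1000,
): boolean {
  let lastConnected = -Infinity;
  let lastDisconnected = -Infinity;
  for (const raw of logText.split("\n")) {
    const entry = parseLine(raw.trim());
    if (entry === null) continue;
    if (entry.message.indexOf("slack socket mode connected") !== -1) {
      lastConnected = entry.atMs;
    } else if (entry.message.indexOf("socket disconnected") !== -1) {
      lastDisconnected = entry.atMs;
    }
  }
  // No disconnect after the last connect (or no socket lines at all): treat as healthy.
  if (lastDisconnected <= lastConnected) return true;
  return nowMs - lastDisconnected <= maxDisconnectedMs;
}
```

A watchdog would call this on the current log file at each probe interval and restart the gateway when it returns false. Note that an empty or rotated log reads as healthy here, which is a design choice to avoid restart loops, not something the issue specifies.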

Acceptance

- Config flag channels.slack.reconnect.processRestartAfter (default 3) controls the escalation threshold
- After K failed reconnects within window W, the gateway SIGTERMs itself
- Documented in the CHANGELOG and the README "monitoring" section
- Backward compatible: existing in-process reconnect behavior is preserved up to the threshold

Happy to test against a pre-release if useful. Logs from outage available on request.
