openclaw - ✅(Solved) Fix Slack DM: Embedded agent run not triggered when socket reconnects (stale-socket) [1 pull requests, 1 comments, 1 participants]

Dutch-Forward · 2026-03-31T19:27:30Z



openclaw/openclaw#58540•Fetched 2026-04-08 02:01:20


Fix Action

Fix / Workaround

Issue #28037 (Slack Socket Mode routing) — similar pattern, previously resolved by switching from channel names to channel IDs. This may be a separate bug in DM handling post-reconnect.

PR fix notes

PR #68253: fix(slack): opt out of stale-socket health-monitor (Socket Mode owns liveness)

Repository: openclaw/openclaw
Author: mjamiv
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/68253

Description (problem / solution / changelog)

Summary

Slack Socket Mode bots were being restarted by the gateway every ~35 minutes on idle — even when the WebSocket was fully healthy, Slack's server pings were flowing, and the bot was simply waiting for the next message. The fix opts the Slack channel out of the gateway's stale-socket heuristic, since @slack/socket-mode already runs its own liveness deadman. A latent bug in the status-adapter factory that silently dropped this opt-out field for any plugin is fixed at the same time, so the existing Telegram opt-out now works for the right reason rather than by coincidence.

Fixes #61072, #64009, #58540.

Problem

evaluateChannelHealth in src/gateway/channel-health-policy.ts restarts any channel whose lastEventAt is older than channelStaleEventThresholdMinutes (default 30). In the Slack plugin, lastEventAt is bumped only on user-facing Slack events — message, app_mention, reactions, member/channel/pin events. Socket Mode envelopes like hello, the raw WebSocket ping/pong frames that Slack sends every ~30s, and the internal deadman that the library runs on outgoing pings are not surfaced as slack_event emissions and therefore never advance the counter.

On a quiet DM bot (e.g. a one-owner personal bot) this triggers spurious restarts every 30 min + up to one 5-min health-check tick. In production on one tenant I observed 22 such restarts in an 8-hour window, each causing a ~1-2s offline interval, a Socket Mode reconnect, and occasional dropped DMs (see #58540).
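The restart cadence described above can be reproduced with a small model of the policy. This is a minimal sketch, not the gateway's code: `evaluateHealthSketch`, the `ChannelSnapshot` shape, the `"disconnected"` reason label, and the strict `>` comparison are all assumptions made for illustration. Only the 30-minute threshold, the 5-minute tick, and the `stale-socket` reason come from the description.

```typescript
// Hypothetical, simplified model of the policy described above.
// Names and the exact comparison are illustrative, not the gateway's real API.
interface ChannelSnapshot {
  connected: boolean;
  lastEventAt: number; // epoch ms of the last user-facing event
}

type HealthVerdict =
  | { restart: false }
  | { restart: true; reason: "disconnected" | "stale-socket" };

function evaluateHealthSketch(
  snapshot: ChannelSnapshot,
  nowMs: number,
  staleThresholdMinutes = 30,
): HealthVerdict {
  // Real disconnects are handled by their own branch.
  if (!snapshot.connected) {
    return { restart: true, reason: "disconnected" };
  }
  // The heuristic: connected, but no user-facing event for too long.
  const idleMs = nowMs - snapshot.lastEventAt;
  if (idleMs > staleThresholdMinutes * 60_000) {
    return { restart: true, reason: "stale-socket" };
  }
  return { restart: false };
}

// An idle but healthy bot: lastEventAt is seeded at connect time (t = 0)
// and never advances, while the health check ticks every 5 minutes.
const tickMs = 5 * 60_000;
let firstRestartMinute = -1;
for (let t = tickMs; t <= 60 * 60_000; t += tickMs) {
  const verdict = evaluateHealthSketch({ connected: true, lastEventAt: 0 }, t);
  if (verdict.restart) {
    firstRestartMinute = t / 60_000;
    break;
  }
}
console.log(firstRestartMinute); // prints 35
```

Under these assumptions the first restart lands on the tick after the threshold is crossed, which matches the roughly 35-minute cadence reported for idle bots.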

Why the current behavior is wrong for Slack

The @slack/socket-mode client already maintains its own liveness signal independently of anything the gateway does:

Server pings — Slack emits ping frames roughly every 30s (serverPingTimeoutMS=30000). If the client stops receiving them, the library reconnects on its own.
Client pings — the library itself pings the server and expects a pong within clientPingTimeoutMS=5000; three consecutive missed pongs force an internal reconnect.
On real disconnects, the library emits disconnecting → disconnected, which flows into the plugin's runtime state, sets snapshot.connected = false, and is handled by the separate connected === false branch of evaluateChannelHealth. That branch still restarts the channel.

So the stale-socket check in the gateway is a second, weaker liveness signal that only fires on a heuristic (no user event in 30 min) that doesn't correlate with socket health. For Slack specifically, opting out of it removes a source of false positives without removing any real-failure coverage.
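The client-ping rule described above (a pong expected within the timeout, three consecutive misses forcing a reconnect) amounts to a small counter. The following is a self-contained model of that idea, not the `@slack/socket-mode` implementation; `PongDeadman` and `recordPing` are hypothetical names.

```typescript
// Self-contained model of a "three missed pongs" deadman.
// This is NOT the @slack/socket-mode source; names are hypothetical.
class PongDeadman {
  private consecutiveMisses = 0;
  private readonly maxMisses: number;

  constructor(maxMisses = 3) {
    this.maxMisses = maxMisses;
  }

  // Record the outcome of one ping: did a pong arrive within the timeout?
  // Returns true when the caller should reconnect.
  recordPing(pongArrivedInTime: boolean): boolean {
    if (pongArrivedInTime) {
      this.consecutiveMisses = 0; // any timely pong resets the counter
      return false;
    }
    this.consecutiveMisses += 1;
    return this.consecutiveMisses >= this.maxMisses;
  }
}

const deadman = new PongDeadman();
// Two misses, a recovery, then three misses in a row.
const outcomes = [true, false, false, true, false, false, false];
const reconnectAt = outcomes.findIndex((ok) => deadman.recordPing(ok));
console.log(reconnectAt); // prints 6
```

Nothing in this signal depends on user messages, which is why a socket can be demonstrably alive while `lastEventAt` stays unchanged.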

What changed

extensions/slack/src/channel.ts — sets skipStaleSocketHealthCheck: true on the Slack status adapter, with an inline comment explaining why. Mirrors what extensions/telegram/src/channel.ts already declares.
src/plugin-sdk/status-helpers.ts — createComputedAccountStatusAdapter and createAsyncComputedAccountStatusAdapter now forward skipStaleSocketHealthCheck from the options onto the returned adapter. Both factories previously whitelisted their output fields and silently dropped this one. The field is already declared on ChannelStatusAdapter (see src/channels/plugins/types.adapters.ts) and is read by the health monitor via getChannelPlugin(channelId)?.status?.skipStaleSocketHealthCheck.
src/plugin-sdk/status-helpers.test.ts — regression test that verifies the flag round-trips through both the sync and async factory, plus a negative case.
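The factory gap described above can be illustrated with a reduced example. The types and function names here (`StatusAdapterOptions`, `createAdapterBefore`, `createAdapterAfter`) are hypothetical stand-ins; the real `ChannelStatusAdapter` and factories have more fields, and only the `skipStaleSocketHealthCheck` name is taken from the PR.

```typescript
// Illustrative types only; the real ChannelStatusAdapter has more fields.
interface StatusAdapterOptions {
  describe: () => string;
  skipStaleSocketHealthCheck?: boolean;
}

interface StatusAdapterSketch {
  describe: () => string;
  skipStaleSocketHealthCheck?: boolean;
}

// Before: the factory copies an explicit whitelist of fields and omits the flag.
function createAdapterBefore(options: StatusAdapterOptions): StatusAdapterSketch {
  return { describe: options.describe };
}

// After: the flag is forwarded, but only when the plugin actually set it.
function createAdapterAfter(options: StatusAdapterOptions): StatusAdapterSketch {
  return {
    describe: options.describe,
    ...(options.skipStaleSocketHealthCheck !== undefined
      ? { skipStaleSocketHealthCheck: options.skipStaleSocketHealthCheck }
      : {}),
  };
}

const slackOptions: StatusAdapterOptions = {
  describe: () => "slack",
  skipStaleSocketHealthCheck: true,
};

console.log(createAdapterBefore(slackOptions).skipStaleSocketHealthCheck); // prints undefined
console.log(createAdapterAfter(slackOptions).skipStaleSocketHealthCheck); // prints true
```

Because the option type already accepts the flag, the "before" version compiles without complaint, which is how a plugin could declare the opt-out and never have it reach the health monitor.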

Out of scope: no change to the health monitor, the stale-socket branch itself, or the Telegram opt-out (other than the factory now propagating the flag it has always accepted). No behavior change for any plugin that does not set skipStaleSocketHealthCheck: true.

A note on Telegram

extensions/telegram/src/channel.ts has had skipStaleSocketHealthCheck: true for a while. Before this patch that line was effectively a no-op, because the factory dropped it. Telegram isn't affected in practice, because long-polling never sets snapshot.connected === true, so the stale-socket branch is skipped on a different condition (snapshot.connected === true must hold). With the factory fix Telegram's existing opt-out starts working for the reason its comment implies it should.

Testing

New unit tests in src/plugin-sdk/status-helpers.test.ts cover both factories and both presence/absence of the flag.
npx vitest run src/plugin-sdk/status-helpers.test.ts → 23 passed.
Existing Slack tests unaffected: npx vitest run extensions/slack/src/channel.test.ts → 26 passed.
Production validation: applied the equivalent patch to a 4.11 deployment running an idle Slack DM bot; set channelStaleEventThresholdMinutes: 1 and health-check interval=60s (aggressive worst case). Ran 8+ minutes idle with zero stale-socket events. Pre-patch under default 30/5 settings, the same bot had been averaging one spurious restart every ~35 min (22 over 8h).

Compatibility

No config changes. Default behavior for Slack shifts from "spurious restarts on idle" to "no spurious restarts on idle." Real socket failures still restart via the connected === false branch, which is unchanged.
No public API changes. The ChannelStatusAdapter.skipStaleSocketHealthCheck field was already exported and already read by the health monitor; this PR just closes the factory gap.
No plugin opt-in needed. Slack already has the opt-out in this PR; Telegram has always had it.

Test plan (for reviewers)

Unit: npx vitest run src/plugin-sdk/status-helpers.test.ts
Unit: npx vitest run extensions/slack/src/channel.test.ts
Idle-bot repro: configure a Slack bot, leave it untouched for ~45 min with default settings, confirm no [health-monitor] ... reason: stale-socket lines in the gateway log.
Real-disconnect check: briefly block outbound traffic to wss-primary.slack.com, confirm the library emits disconnected and the connected === false branch triggers a restart (i.e. we didn't accidentally silence genuine failures).
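For the idle-bot repro, counting restart lines by hand is error-prone over a 45-minute log. A small helper along these lines could do the check; it is a sketch that assumes the log line format quoted in the linked issue, and `countStaleSocketRestarts` is a hypothetical name, not an OpenClaw utility.

```typescript
// Matches the restart line format quoted in the issue's log evidence.
const STALE_RESTART =
  /\[gateway\/health-monitor\] \[([^\]]+)\] health-monitor: restarting \(reason: stale-socket\)/;

// Counts stale-socket restarts per "[channel:account]" tag in a gateway log.
function countStaleSocketRestarts(logText: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of logText.split("\n")) {
    const match = STALE_RESTART.exec(line);
    if (match) {
      const account = match[1] ?? "unknown";
      counts.set(account, (counts.get(account) ?? 0) + 1);
    }
  }
  return counts;
}

const sampleLog = [
  "[gateway/health-monitor] [slack:default] health-monitor: restarting (reason: stale-socket)",
  "[gateway/channels/slack] [default] starting provider",
  "[gateway/channels/slack] slack socket mode connected",
].join("\n");

const restartCounts = countStaleSocketRestarts(sampleLog);
console.log(restartCounts.get("slack:default")); // prints 1
```

After the patch, the expectation for an idle run is an empty map.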

Related issues

#61072 — originally reported symptom (~39-min restart cadence)
#64009 — "Slack socket-mode connection becomes stale, misses ping/pong, and restarts repeatedly"
#58540 — downstream impact: Slack DM not triggered after stale-socket reconnect
#65632 — open Discord PR taking the same opt-out approach for a sibling channel
#38643 / #39083 — earlier stale-socket work that seeded lastEventAt on connect but did not close the post-connect gap

Changed files

extensions/slack/src/channel.ts (modified, +11/-0)
src/plugin-sdk/status-helpers.test.ts (modified, +46/-0)
src/plugin-sdk/status-helpers.ts (modified, +2/-0)



Problem

Slack DM messages sent to the bot are not triggering embedded agent runs when they arrive during or shortly after a socket reconnect event.

Environment

OpenClaw version: 2026.3.24 (npm)
Mode: local, Socket Mode
OS: macOS Darwin 25.3.0 (x64)
Node: v22.22.0

Symptoms

Socket Mode connects successfully (slack socket mode connected)
User ID resolved in allowlist (slack users resolved: U031ZR9M9C2→U031ZR9M9C2)
DM sent by approved user
Gateway health monitor detects stale socket and reconnects (health-monitor: restarting (reason: stale-socket))
No embedded_run triggered for the DM — message silently dropped

Log Evidence

[gateway/health-monitor] [slack:default] health-monitor: restarting (reason: stale-socket)
[gateway/channels/slack] [default] starting provider
[gateway/channels/slack] slack socket mode connected
[gateway/channels/slack] slack channels resolved: C0AHF5SPSF4→C0AHF5SPSF4
[gateway/channels/slack] slack users resolved: U031ZR9M9C2→U031ZR9M9C2

No embedded_run_agent_start or embedded_run_agent_end entries follow.

Additional Context

Earlier DMs did reach the embedded runner but failed with overloaded_error from Anthropic (separate issue). After adding model fallbacks (Gemini Flash → Haiku), the socket went stale and reconnected — the next DM after reconnect was silently dropped entirely.

Expected Behavior

DMs from approved users should trigger embedded agent runs regardless of socket reconnect events. Messages arriving during the reconnect window should be queued or retried.
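One way the "queued" half of this expectation could look is a buffer that holds inbound messages while the provider is restarting and drains them in arrival order once it is ready. This is a sketch under assumptions: `ReconnectBuffer` and its methods are hypothetical, it is not part of OpenClaw or `@slack/socket-mode`, and the issue does not establish at which layer the DM is actually lost.

```typescript
// Hypothetical buffer; not part of OpenClaw or @slack/socket-mode.
type Handler<T> = (message: T) => void;

class ReconnectBuffer<T> {
  private ready = false;
  private readonly pending: T[] = [];
  private readonly handler: Handler<T>;

  constructor(handler: Handler<T>) {
    this.handler = handler;
  }

  // Called for every inbound message.
  receive(message: T): void {
    if (this.ready) {
      this.handler(message);
    } else {
      this.pending.push(message); // hold until the provider is ready again
    }
  }

  // Called when a restart begins.
  markRestarting(): void {
    this.ready = false;
  }

  // Called once the provider has finished starting; drains in arrival order.
  markReady(): void {
    this.ready = true;
    while (this.pending.length > 0) {
      const next = this.pending.shift();
      if (next !== undefined) this.handler(next);
    }
  }
}

const delivered: string[] = [];
const buffer = new ReconnectBuffer<string>((m) => {
  delivered.push(m);
});

buffer.receive("dm-1"); // arrives mid-restart, so it is held
buffer.markReady(); // provider is up, held message is delivered
buffer.receive("dm-2"); // delivered immediately
console.log(delivered.join(",")); // prints dm-1,dm-2
```

A buffer like this only helps if the message reaches the process at all; if the DM is dropped before the plugin sees it, a retry on the sending side would be needed instead.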

Related

Issue #28037 (Slack Socket Mode routing) — similar pattern, previously resolved by switching from channel names to channel IDs. This may be a separate bug in DM handling post-reconnect.

extent analysis

TL;DR

Implement a message queue or retry mechanism to handle DMs arriving during or shortly after a socket reconnect event in Slack Socket Mode.

Guidance

Investigate the Slack Socket Mode reconnect event handling to identify why DMs are being silently dropped instead of being queued or retried.
Review the health-monitor and gateway/channels/slack logs to understand the timing and sequence of events during a reconnect.
Consider adding a temporary buffer or queue to store incoming DMs during the reconnect window, allowing them to be processed once the socket is reestablished.
Examine the differences between channel and DM handling in the Slack Socket Mode implementation. Issue #28037 showed a similar pattern and was resolved by switching from channel names to channel IDs, but the reporter notes that this may be a separate bug in DM handling post-reconnect.

Example

No code snippet is provided due to the lack of specific implementation details in the issue.

Notes

The solution may require modifications to the OpenClaw library or the custom implementation using it. The exact changes will depend on the internal workings of the library and the specific requirements of the application.

Recommendation

Apply a workaround by implementing a message queue or retry mechanism for DMs that arrive during or shortly after a socket reconnect event, so that the application can recover from the reconnect and still process those DMs.
