Root Cause

The `@slack/socket-mode` SDK uses a 5 s `clientPingTimeout`: the time the client waits for the server's pong reply to its own ping before considering the socket dead. When the main loop is blocked, the JS callback that processes the pong frame is delayed past 5 s, so the SDK treats the socket as broken and reconnects, even though the WebSocket itself is fine.

Because the entire OpenClaw gateway shares one Node thread, any concurrent agent turn (large JSON serialization, prompt build, tool result stringification, etc.) can cause this. Adding hardware does not solve it: Node runs JavaScript callbacks on a single thread, so extra cores do not shorten a blocked turn of the event loop.
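
The delay mechanism can be reproduced in isolation. The sketch below is illustrative only and is not OpenClaw or SDK code: `busyWait` stands in for a long synchronous step of an agent turn, and `measureLateness` is a hypothetical helper that reports how late a timer callback runs when the loop stays blocked past its due time.

```javascript
const { performance } = require("node:perf_hooks");

// Spin synchronously for `ms` milliseconds, standing in for a large
// JSON.stringify or prompt build on the main thread.
function busyWait(ms) {
  const start = performance.now();
  while (performance.now() - start < ms) {
    // intentionally empty: the event loop cannot run while we spin
  }
  return performance.now() - start;
}

// Schedule a callback due in `dueMs`, then block the loop for `blockMs`.
// Resolves with how many milliseconds late the callback actually ran.
function measureLateness(dueMs, blockMs) {
  return new Promise((resolve) => {
    const scheduled = performance.now();
    setTimeout(() => resolve(performance.now() - scheduled - dueMs), dueMs);
    busyWait(blockMs);
  });
}

measureLateness(10, 200).then((late) => {
  console.log(`callback ran ~${late.toFixed(0)} ms late`);
});
```

With a block longer than 5 s in place of the 200 ms used here, any callback due inside that window, pong handling included, would run only after the block ends.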

Slack socket-mode disconnects under main-thread load (need worker thread or configurable pong timeout)

jared-rebel · 2026-04-28T23:14:53Z

Summary

OpenClaw's Slack socket-mode plugin runs on the main Node.js event loop. When the loop is busy with concurrent agent turns (LLM streaming, JSON parsing of large transcripts, trajectory writes), the Slack SDK's hardcoded 5-second pong-timeout watchdog mis-fires and force-reconnects, even though the underlying network is fine. Under sustained load this cascades to the in-process health-monitor, which then SIGTERMs the gateway and lets systemd restart it.

Net effect for users: Slack messages arrive late or are bunched up after the reconnect window, and the gateway intermittently restarts itself.

Environment

- OpenClaw `2026.4.26` (commit `be8c246`)
- Node `v22.22.2`
- Linux `6.8.0-110-generic` (Ubuntu, x64)
- DigitalOcean Premium Intel droplet, 4 vCPU / 8 GB RAM
- 2 Slack accounts configured (socket mode, both with valid bot+app tokens)
- ~50 cron jobs across 4 agents; mixed Opus / Sonnet / Haiku

Reproduction

1. Configure two `channels.slack.accounts` entries in socket mode.
2. Run several concurrent cron-driven agent turns (any combination that drives the JS event loop above ~50% CPU on a single core for >5 s).
3. Tail the gateway log: `/tmp/openclaw/openclaw-<date>.log`
4. Observe:
   - `[WARN] socket-mode:SlackWebSocket:N "A pong wasn't received from the server before the timeout of 5000ms!"`
   - `{"subsystem":"gateway/channels/slack"} slack socket disconnected (disconnect). retry 1/12 in 2s`
   - Pairs of socket-mode warnings appear within milliseconds for both accounts simultaneously (shared-loop signature).
   - After enough consecutive disconnects: `[slack:default] health-monitor: restarting (reason: disconnected)` followed by `signal SIGTERM received`.
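
To put numbers on these symptoms, the log can be scanned for the three message fragments quoted above. This is a rough sketch: `summarizeGatewayLog` is a hypothetical helper, and it assumes only that each event occupies one line containing the quoted text, not any particular timestamp or JSON layout.

```javascript
// Substrings copied from the log lines quoted in the reproduction steps.
const PONG_TIMEOUT = "A pong wasn't received from the server before the timeout of";
const SOCKET_DISCONNECT = "slack socket disconnected";
const MONITOR_RESTART = "health-monitor: restarting";

// Count how many lines of `logText` contain each symptom.
function summarizeGatewayLog(logText) {
  const summary = { pongTimeouts: 0, disconnects: 0, monitorRestarts: 0 };
  for (const line of logText.split(/\r?\n/)) {
    if (line.includes(PONG_TIMEOUT)) summary.pongTimeouts += 1;
    if (line.includes(SOCKET_DISCONNECT)) summary.disconnects += 1;
    if (line.includes(MONITOR_RESTART)) summary.monitorRestarts += 1;
  }
  return summary;
}

// Sample input built from the lines quoted above (socket id left as "N").
const sampleLog = [
  `[WARN] socket-mode:SlackWebSocket:N "A pong wasn't received from the server before the timeout of 5000ms!"`,
  `{"subsystem":"gateway/channels/slack"} slack socket disconnected (disconnect). retry 1/12 in 2s`,
  `[slack:default] health-monitor: restarting (reason: disconnected)`,
].join("\n");

console.log(summarizeGatewayLog(sampleLog));
```

Against a real log, the same function could be fed the file contents, for example via `fs.readFileSync(path, "utf8")`.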

Evidence the network and tokens are not the cause

A standalone Node script (using the same ws library version OpenClaw bundles, the same Slack app token, on the same host, at the same time) holds a Slack socket-mode WSS connection open indefinitely with zero pong timeouts:

2026-04-28T21:11:48.689Z opened
2026-04-28T21:11:48.889Z WS OPEN
2026-04-28T21:12:03.695Z pings=2 drops=0 state=1
2026-04-28T21:12:18.699Z pings=3 drops=0 state=1
...
2026-04-28T21:13:18.699Z DONE pings=9 drops=0

Meanwhile, in the gateway process during the exact same minute, both Slack sockets flap multiple times. Pong-miss timestamps for the two accounts match within ~3 ms, e.g.:

2026-04-28T21:09:31.325+00:00 ws38
2026-04-28T21:09:31.327+00:00 ws39   (different account)

That lockstep behaviour rules out network/Slack-side issues and locates the fault in the shared JS event loop.
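
The lockstep check can be automated along these lines. This is a sketch under stated assumptions: `findLockstepPairs` and its 10 ms default window are illustrative choices, and the input shape (an ISO-8601 timestamp plus a socket label) simply mirrors the excerpt above.

```javascript
// Find consecutive pong-miss events on different sockets that fall within
// `windowMs` of each other. The 10 ms default is an arbitrary assumption.
function findLockstepPairs(events, windowMs = 10) {
  const sorted = events
    .map((e) => ({ ...e, t: Date.parse(e.timestamp) }))
    .sort((a, b) => a.t - b.t);
  const pairs = [];
  for (let i = 1; i < sorted.length; i += 1) {
    const prev = sorted[i - 1];
    const curr = sorted[i];
    const deltaMs = curr.t - prev.t;
    if (prev.socket !== curr.socket && deltaMs <= windowMs) {
      pairs.push({ sockets: [prev.socket, curr.socket], deltaMs });
    }
  }
  return pairs;
}

// The two timestamps from the excerpt above.
const lockstepPairs = findLockstepPairs([
  { timestamp: "2026-04-28T21:09:31.325+00:00", socket: "ws38" },
  { timestamp: "2026-04-28T21:09:31.327+00:00", socket: "ws39" },
]);
console.log(lockstepPairs); // one pair, 2 ms apart
```

Independent network faults on two accounts would rarely land within a few milliseconds of each other, so a high share of such pairs points at a cause inside the process.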

What I tried (in order)

| Fix | Effect |
|---|---|
| Stagger heavy `*/15` and `0 * * * *` crons off shared minute marks | Reduced peaks |
| Archive trajectory files >5 MB / >60 min idle (145 MB freed) | Reduced disk I/O during writes |
| Resize droplet 2 → 4 vCPU | Halved flap rate (extra cores help libuv pool / OS scheduler, not main event loop) |
| Stagger 9 hour-boundary crons with `--stagger 8m` | Smoothed the 17:00 / 20:00 clusters; flap variance dropped |
| Search for a config knob to cap concurrent agent turns | Not exposed |
| Search for a config knob to extend the Slack pong timeout | Not exposed |

After all of the above, residual flap rate is ~30/hr in normal windows, and health-monitor still triggers gateway restarts during cron-heavy hours.

Proposed fixes (ranked by ROI)

1. Configurable Slack `clientPingTimeout` (smallest change, biggest immediate win)

Expose `channels.slack.clientPingTimeoutMs` (default 5000, recommended fallback 15000–30000) and pass it through to `@slack/socket-mode`'s `SocketModeClient` constructor.

This alone would eliminate >90% of spurious reconnects without any architectural change. The Slack server-side ping interval is ~10 s, so 15–30 s clientPingTimeout is safe.
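
A minimal sketch of the pass-through, assuming the proposed `clientPingTimeoutMs` key is read from the Slack channel config. `buildSocketModeOptions` and its validation policy are hypothetical; `clientPingTimeout` is the SDK option named above, and the SDK call itself is left in a comment so the snippet runs without the package installed.

```javascript
const DEFAULT_CLIENT_PING_TIMEOUT_MS = 5000; // default cited in this issue

// Map the proposed config key onto the option the SDK constructor accepts.
function buildSocketModeOptions(appToken, slackConfig = {}) {
  const raw = slackConfig.clientPingTimeoutMs;
  const timeout = raw === undefined ? DEFAULT_CLIENT_PING_TIMEOUT_MS : Number(raw);
  if (!Number.isFinite(timeout) || timeout <= 0) {
    throw new RangeError(`clientPingTimeoutMs must be a positive number, got ${raw}`);
  }
  return { appToken, clientPingTimeout: timeout };
}

// In the gateway, the result would be handed to the SDK, roughly:
//   const { SocketModeClient } = require("@slack/socket-mode");
//   const client = new SocketModeClient(buildSocketModeOptions(token, cfg));
console.log(buildSocketModeOptions("xapp-placeholder", { clientPingTimeoutMs: 15000 }));
```

Falling back to 5000 when the key is absent keeps existing deployments on today's behaviour.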

2. Worker-thread socket-mode

Run each Slack account's `SocketModeClient` in a dedicated `worker_threads.Worker` and forward inbound events to the main thread via `postMessage`. The worker's event loop is independent of the main thread, so heavy agent turns never starve socket-level keepalives.

This is the correct long-term fix; the SDK is small and message-passing-friendly.
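
The independence of the worker's event loop can be checked with a small experiment. This is a sketch only, not the proposed plugin: the worker merely ticks a shared counter from a timer, where the real design would host `SocketModeClient` and forward events with `parentPort.postMessage`. `countWorkerTicksWhileBlocked` is a hypothetical helper.

```javascript
const { Worker } = require("node:worker_threads");

// Worker body: bump a shared counter every 10 ms from the worker's own timers.
const WORKER_SOURCE = `
  const { workerData } = require("node:worker_threads");
  const ticks = new Int32Array(workerData.shared);
  setInterval(() => Atomics.add(ticks, 0, 1), 10);
`;

// Start the worker, block the main thread for `blockMs`, and return how many
// times the worker's timer fired in the meantime.
function countWorkerTicksWhileBlocked(blockMs) {
  const shared = new SharedArrayBuffer(4);
  const ticks = new Int32Array(shared);
  const worker = new Worker(WORKER_SOURCE, { eval: true, workerData: { shared } });
  const start = Date.now();
  while (Date.now() - start < blockMs) {
    // main event loop is blocked here, as during a heavy agent turn
  }
  const seen = Atomics.load(ticks, 0);
  worker.terminate();
  return seen;
}

const ticksSeen = countWorkerTicksWhileBlocked(500);
console.log(`worker timer fired ${ticksSeen} times during a 500 ms main-thread block`);
```

A shared counter is used instead of messages so the result can be read without the main loop having to run. A non-zero count indicates the worker's timers kept firing while the main thread was busy.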

3. Optional: `runtime.maxConcurrentAgentTurns`

A configurable cap on simultaneous in-flight LLM turns (queue the rest). Independent of Slack; would also reduce general gateway sluggishness.
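
One way such a cap could work is a small promise queue. The sketch below is an assumption about shape, not OpenClaw code: `TurnLimiter` is a hypothetical class, and `maxConcurrent` plays the role of the proposed `runtime.maxConcurrentAgentTurns` setting.

```javascript
// Run at most `maxConcurrent` async tasks at once; queue the rest in FIFO order.
class TurnLimiter {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
    this.queue = [];
  }

  // Enqueue a task (a function returning a promise); resolves with its result.
  run(task) {
    return new Promise((resolve, reject) => {
      this.queue.push({ task, resolve, reject });
      this.#drain();
    });
  }

  // Start queued tasks while there is spare capacity.
  #drain() {
    while (this.active < this.maxConcurrent && this.queue.length > 0) {
      const { task, resolve, reject } = this.queue.shift();
      this.active += 1;
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => {
          this.active -= 1;
          this.#drain();
        });
    }
  }
}

const limiter = new TurnLimiter(2);
const fakeTurn = (name) => () =>
  new Promise((resolve) => setTimeout(() => resolve(name), 20));
for (const name of ["a", "b", "c"]) {
  limiter.run(fakeTurn(name)).then((done) => console.log(`turn ${done} finished`));
}
console.log(`active=${limiter.active} queued=${limiter.queue.length}`); // prints "active=2 queued=1"
```

A cap like this lowers how often turns overlap, but a single turn that blocks the loop for more than 5 s would still trip the watchdog, which is why it ranks below the other two fixes.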

Concrete asks

- Land #1 (configurable pong timeout) as a small follow-up: single-digit lines of code, immediate user-visible improvement.
- Track #2 (worker-thread socket-mode) as a structural fix.
- Consider #3 for general scheduler hygiene (the same root cause affects other channel plugins, not just Slack).

Happy to test patches against this exact workload — full logs, trajectory exports, and reproduction scripts available on request.

extent analysis

TL;DR

Implementing a configurable Slack clientPingTimeout is the most likely fix to reduce spurious reconnects.

Guidance

- Increase the `clientPingTimeout` value to 15–30 seconds to give the Node.js event loop more time to process pong frames.
- Consider running each Slack account's `SocketModeClient` in a dedicated `worker_threads.Worker` to isolate the event loop and prevent starvation of socket-level keepalives.
- Implementing a configurable cap on simultaneous in-flight LLM turns (`runtime.maxConcurrentAgentTurns`) could also help reduce gateway sluggishness.
- Verify the fix by monitoring the gateway log for reconnects and pong timeouts after applying the changes.

Example

No code snippet is provided as the issue is more related to configuration and architecture.

Notes

The proposed fixes assume that the issue is indeed caused by the Slack SDK's hardcoded 5-second pong timeout and the shared Node.js event loop. The effectiveness of the fixes may vary depending on the specific workload and environment.

Recommendation

Apply workaround #1 (configurable pong timeout) as a short-term fix to immediately improve user experience, and track #2 (worker-thread socket-mode) as a long-term structural fix.

openclaw - 💡(How to fix) Fix Slack socket-mode disconnects under main-thread load (need worker thread or configurable pong timeout) [1 comments, 2 participants]
