openclaw - 💡(How to fix) Fix Slack socket-mode disconnects under main-thread load (need worker thread or configurable pong timeout) [1 comments, 2 participants]

GitHub stats: openclaw/openclaw#73857 (fetched 2026-04-29 06:14:12)
Comments: 1 · Participants: 2 · Timeline events: 2 (closed ×1, commented ×1) · Reactions: 0

Slack socket-mode disconnects under main-thread load (need worker thread or configurable pong timeout)

Summary

OpenClaw's Slack socket-mode plugin runs on the main Node.js event loop. When the loop is busy with concurrent agent turns (LLM streaming, JSON parsing of large transcripts, trajectory writes), the Slack SDK's hardcoded 5-second pong-timeout watchdog mis-fires and force-reconnects, even though the underlying network is fine. Under sustained load this cascades to the in-process health-monitor, which then SIGTERMs the gateway and lets systemd restart it.

Net effect for users: Slack messages arrive late or are bunched up after the reconnect window, and the gateway intermittently restarts itself.

Environment

  • OpenClaw 2026.4.26 (commit be8c246)
  • Node v22.22.2
  • Linux 6.8.0-110-generic (Ubuntu, x64)
  • DigitalOcean Premium Intel droplet, 4 vCPU / 8 GB RAM
  • 2 Slack accounts configured (socket mode, both with valid bot+app tokens)
  • ~50 cron jobs across 4 agents; mixed Opus / Sonnet / Haiku

Reproduction

  1. Configure two channels.slack.accounts entries in socket mode.
  2. Run several concurrent cron-driven agent turns (any combination that drives the JS event loop above ~50% CPU on a single core for >5 s).
  3. Tail the gateway log: /tmp/openclaw/openclaw-<date>.log
  4. Observe:
    • [WARN] socket-mode:SlackWebSocket:N "A pong wasn't received from the server before the timeout of 5000ms!"
    • {"subsystem":"gateway/channels/slack"} slack socket disconnected (disconnect). retry 1/12 in 2s
    • Pairs of socket-mode warnings appear within milliseconds for both accounts simultaneously (shared-loop signature).
    • After enough consecutive disconnects: [slack:default] health-monitor: restarting (reason: disconnected) followed by signal SIGTERM received.

Evidence the network and tokens are not the cause

A standalone Node script (using the same ws library version OpenClaw bundles, the same Slack app token, on the same host, at the same time) holds a Slack socket-mode WSS connection open indefinitely with zero pong timeouts:

2026-04-28T21:11:48.689Z opened
2026-04-28T21:11:48.889Z WS OPEN
2026-04-28T21:12:03.695Z pings=2 drops=0 state=1
2026-04-28T21:12:18.699Z pings=3 drops=0 state=1
...
2026-04-28T21:13:18.699Z DONE pings=9 drops=0

Meanwhile, in the gateway process during the exact same minute, both Slack sockets flap multiple times. Pong-miss timestamps for the two accounts match within ~3 ms, e.g.:

2026-04-28T21:09:31.325+00:00 ws38
2026-04-28T21:09:31.327+00:00 ws39   (different account)

That lockstep behaviour rules out network/Slack-side issues and locates the fault in the shared JS event loop.
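
The lockstep check can be automated. The following is a minimal sketch, assuming pong-miss events have already been extracted from the gateway log into timestamp/socket pairs; the names `PongMiss` and `findLockstepMisses` and the 5 ms default window are illustrative and not part of OpenClaw.

```typescript
// One pong-miss warning, as extracted from the gateway log.
interface PongMiss {
  timestamp: string; // ISO 8601, e.g. "2026-04-28T21:09:31.325+00:00"
  socketId: string;  // e.g. "ws38"
}

// Returns every pair of misses on DIFFERENT sockets that happened within
// windowMs of each other. Many such pairs suggest a shared-event-loop stall
// rather than independent network problems.
function findLockstepMisses(
  misses: PongMiss[],
  windowMs = 5,
): Array<[PongMiss, PongMiss]> {
  const toMs = (m: PongMiss): number => Date.parse(m.timestamp);
  const sorted = [...misses].sort((a, b) => toMs(a) - toMs(b));
  const pairs: Array<[PongMiss, PongMiss]> = [];

  sorted.forEach((first, i) => {
    for (const second of sorted.slice(i + 1)) {
      // Input is sorted, so once the gap exceeds the window we can stop.
      if (toMs(second) - toMs(first) > windowMs) break;
      if (first.socketId !== second.socketId) pairs.push([first, second]);
    }
  });
  return pairs;
}
```

Fed the two timestamps quoted above (ws38 and ws39, 2 ms apart), the function reports them as one lockstep pair.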

What I tried (in order)

Fix | Effect
Stagger heavy */15 and 0 * * * * crons off shared minute marks | Reduced peaks
Archive trajectory files >5 MB / >60 min idle (145 MB freed) | Reduced disk I/O during writes
Resize droplet 2 → 4 vCPU | Halved flap rate (extra cores help libuv pool / OS scheduler, not main event loop)
Stagger 9 hour-boundary crons with --stagger 8m | Smoothed the 17:00 / 20:00 clusters; flap variance dropped
Search for a config knob to cap concurrent agent turns | Not exposed
Search for a config knob to extend the Slack pong timeout | Not exposed

After all of the above, residual flap rate is ~30/hr in normal windows, and health-monitor still triggers gateway restarts during cron-heavy hours.

Root cause analysis

The Slack @slack/socket-mode SDK uses a 5 s clientPingTimeout (the time it waits for the server's pong reply to a client ping before considering the socket dead). Under main-loop blocking, the JS callback that processes the pong frame is delayed past 5 s, so the SDK treats the socket as broken and reconnects, even though the WebSocket itself is fine.

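Because the entire OpenClaw gateway shares one Node thread, any concurrent agent turn (large JSON serialization, prompt build, tool result stringification, etc.) can cause this. Adding hardware does not solve it: Node runs JavaScript on a single thread, so extra cores only help the libuv pool and the OS scheduler (the 2 → 4 vCPU resize halved the flap rate but did not eliminate it).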
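
The timing argument can be written out as a small model. This is a deliberately simplified sketch, not the SDK's implementation: it assumes the pong callback runs as soon as both the frame has arrived and the main thread is free, and it ignores event-loop phase ordering. `PingScenario`, `pongHandledAt` and `watchdogFires` are illustrative names.

```typescript
// All times are milliseconds on one shared clock.
interface PingScenario {
  pingSentAtMs: number;       // when the client sent its ping
  pongArrivedAtMs: number;    // when the pong frame reached the host
  loopBlockedUntilMs: number; // when the main thread finishes its current synchronous work
  timeoutMs: number;          // the watchdog timeout (5000 in the issue)
}

// The pong callback cannot run before the frame has arrived AND the
// event loop is free to dispatch it.
function pongHandledAt(s: PingScenario): number {
  return Math.max(s.pongArrivedAtMs, s.loopBlockedUntilMs);
}

// The watchdog declares the socket dead when the pong was not handled
// within timeoutMs of the ping, regardless of when the frame arrived.
function watchdogFires(s: PingScenario): boolean {
  return pongHandledAt(s) - s.pingSentAtMs > s.timeoutMs;
}
```

For example, a pong that arrives 40 ms after the ping but is only handled once a 5200 ms block ends trips a 5000 ms watchdog, while the same scenario with a 15000 ms timeout does not.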

Proposed fixes (ranked by ROI)

1. Configurable Slack clientPingTimeout (smallest change, biggest immediate win)

Expose channels.slack.clientPingTimeoutMs (default 5000, recommended fallback 15000–30000) and pass it through to @slack/socket-mode's SocketModeClient constructor.

This alone would eliminate >90% of spurious reconnects without any architectural change. The Slack server-side ping interval is ~10 s, so 15–30 s clientPingTimeout is safe.
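
A minimal sketch of the pass-through, assuming a per-account config object. `SlackAccountConfig` and `resolveSocketModeOptions` are hypothetical names, and `clientPingTimeoutMs` is the key proposed in this issue rather than an existing OpenClaw setting. The returned object uses the `appToken` and `clientPingTimeout` option names that the SDK's SocketModeClient constructor is understood to accept; check them against the bundled SDK version.

```typescript
// Hypothetical shape of one channels.slack.accounts entry.
interface SlackAccountConfig {
  appToken: string;
  clientPingTimeoutMs?: number; // proposed knob; not an existing OpenClaw key
}

// Only the SocketModeClient options this sketch cares about.
interface SocketModeOptionsSubset {
  appToken: string;
  clientPingTimeout: number;
}

// Default named in the proposal; matches the 5000 ms seen in the warning.
const DEFAULT_CLIENT_PING_TIMEOUT_MS = 5000;

// Falls back to the default when the value is missing or not a positive,
// finite number, so a bad config cannot disable the watchdog by accident.
function resolveSocketModeOptions(config: SlackAccountConfig): SocketModeOptionsSubset {
  const requested = config.clientPingTimeoutMs;
  if (typeof requested === "number" && Number.isFinite(requested) && requested > 0) {
    return { appToken: config.appToken, clientPingTimeout: requested };
  }
  return { appToken: config.appToken, clientPingTimeout: DEFAULT_CLIENT_PING_TIMEOUT_MS };
}

// In the plugin the result would be handed to the SDK, for example:
//   new SocketModeClient(resolveSocketModeOptions(accountConfig));
```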

2. Worker-thread socket-mode

Run each Slack account's SocketModeClient in a dedicated worker_threads.Worker and forward inbound events to the main thread via postMessage. The worker's event loop is independent of the main thread, so heavy agent turns never starve socket-level keepalives.

This is the correct long-term fix; the SDK is small and message-passing-friendly.
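
The isolation property can be demonstrated without Slack at all. In this sketch a 10 ms interval timer stands in for keepalive handling; one copy runs on the main thread and one in a worker, and the main thread is then blocked with a busy loop to imitate a heavy agent turn. `compareKeepalives`, `busyWait` and `keepaliveWorkerSource` are illustrative names; a real implementation would host the SocketModeClient in the worker and forward events with postMessage.

```typescript
import { Worker } from "node:worker_threads";

// Worker body (plain JavaScript evaluated inside the worker). It bumps a
// shared counter every 10 ms, standing in for ping/pong handling.
const keepaliveWorkerSource = `
  const { workerData } = require("node:worker_threads");
  const ticks = new Int32Array(workerData.shared);
  setInterval(() => Atomics.add(ticks, 0, 1), 10);
`;

// Synchronously occupies the calling thread, like a large JSON.stringify.
function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

function compareKeepalives(blockMs: number): { mainTicks: number; workerTicks: number } {
  const shared = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
  const workerTicksView = new Int32Array(shared);
  const worker = new Worker(keepaliveWorkerSource, { eval: true, workerData: { shared } });
  worker.unref(); // do not keep the process alive for the demo worker

  let mainTicks = 0;
  const mainTimer = setInterval(() => {
    mainTicks++;
  }, 10);

  busyWait(blockMs); // the main event loop cannot run any callback here

  clearInterval(mainTimer);
  const workerTicks = Atomics.load(workerTicksView, 0);
  void worker.terminate();
  return { mainTicks, workerTicks };
}
```

Calling `compareKeepalives(1500)` returns `mainTicks` of 0, because the main-thread timer never gets a turn during the block, while `workerTicks` keeps climbing once the worker has started. That is the behaviour proposal 2 relies on.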

3. Optional: runtime.maxConcurrentAgentTurns

A configurable cap on simultaneous in-flight LLM turns (queue the rest). Independent of Slack; would also reduce general gateway sluggishness.
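
One possible shape for such a cap is a small promise-based limiter. This is a sketch only: `TurnLimiter` is a hypothetical class, and how `runtime.maxConcurrentAgentTurns` would be wired into OpenClaw's scheduler is not specified in this issue.

```typescript
// Runs at most maxConcurrent tasks at once; the rest wait in FIFO order.
class TurnLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = Math.max(1, Math.floor(maxConcurrent));
  }

  get activeCount(): number {
    return this.active;
  }

  get queuedCount(): number {
    return this.waiting.length;
  }

  run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = (): void => {
        this.active++;
        // Wrapping in a promise turns a synchronous throw into a rejection,
        // so the slot is always released in finally().
        new Promise<T>((settle) => settle(task()))
          .then(resolve, reject)
          .finally(() => {
            this.active--;
            const next = this.waiting.shift();
            if (next) next();
          });
      };
      if (this.active < this.maxConcurrent) start();
      else this.waiting.push(start);
    });
  }
}
```

With a limit of 2, submitting five long-running turns starts two immediately and queues three. Note that a cap only reduces how often the loop is saturated; a single turn that blocks for more than 5 s would still delay pong handling.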

Concrete asks

  1. Land #1 (configurable pong timeout) as a small follow-up — single-digit lines of code, immediate user-visible improvement.
  2. Track #2 (worker-thread socket-mode) as a structural fix.
  3. Consider #3 for general scheduler hygiene (the same root cause affects other channel plugins, not just Slack).

Happy to test patches against this exact workload — full logs, trajectory exports, and reproduction scripts available on request.

extent analysis

TL;DR

Implementing a configurable Slack clientPingTimeout is the most likely fix to reduce spurious reconnects.

Guidance

  • Increase the clientPingTimeout value to 15-30 seconds to give the Node.js event loop more time to process pong frames.
  • Consider running each Slack account's SocketModeClient in a dedicated worker_threads.Worker to isolate the event loop and prevent starvation of socket-level keepalives.
  • Implementing a configurable cap on simultaneous in-flight LLM turns (runtime.maxConcurrentAgentTurns) could also help reduce gateway sluggishness.
  • Verify the fix by monitoring the gateway log for reconnects and pong timeouts after applying the changes.
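
For the verification step, a rough counter over the gateway log is usually enough. This sketch matches only the message fragments quoted in the issue's reproduction section; `countFlaps` and `FlapCounts` are illustrative names, and the log text is assumed to have been read from /tmp/openclaw/openclaw-<date>.log beforehand.

```typescript
interface FlapCounts {
  pongTimeouts: number;
  disconnects: number;
  healthRestarts: number;
}

// Fragments taken from the log lines quoted in the issue. The "." in
// "wasn.t" tolerates either a straight or a curly apostrophe.
const PONG_TIMEOUT = /A pong wasn.t received from the server before the timeout/;
const DISCONNECT = /slack socket disconnected/;
const HEALTH_RESTART = /health-monitor: restarting/;

// Counts each kind of event once per matching log line.
function countFlaps(logText: string): FlapCounts {
  const counts: FlapCounts = { pongTimeouts: 0, disconnects: 0, healthRestarts: 0 };
  for (const line of logText.split(/\r?\n/)) {
    if (PONG_TIMEOUT.test(line)) counts.pongTimeouts++;
    if (DISCONNECT.test(line)) counts.disconnects++;
    if (HEALTH_RESTART.test(line)) counts.healthRestarts++;
  }
  return counts;
}
```

Comparing the counts for an hour of logs before and after a change gives a direct read on whether the flap rate (about 30/hr in the report) has actually dropped.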

Notes

The proposed fixes assume that the issue is indeed caused by the Slack SDK's hardcoded 5-second pong timeout and the shared Node.js event loop. The effectiveness of the fixes may vary depending on the specific workload and environment.

Recommendation

Apply workaround #1 (configurable pong timeout) as a short-term fix to immediately improve user experience, and track #2 (worker-thread socket-mode) as a long-term structural fix.
