openclaw - ✅(Solved) Fix [Bug]: Gateway event-loop stalls cause cross-channel latency, missed replies, and channel disconnects [1 pull requests, 7 comments, 4 participants]

GitHub stats
openclaw/openclaw#75882 · Fetched 2026-05-02 05:28:32
Comments: 7 · Participants: 4 · Timeline: 15 · Reactions: 2
Timeline (top): commented ×7, cross-referenced ×4, subscribed ×2, closed ×1

Gateway intermittently stalls its Node event loop for tens to hundreds of seconds, causing cross-channel latency/failures. This is not limited to WhatsApp: during the same periods Telegram polling/send actions stall or fail, Slack socket pings/pongs time out, and WhatsApp Web repeatedly disconnects/exits. WhatsApp additionally hits recurring 408/428 reconnect/session-expiry failures, tracked separately in #75736.

Error Message

Gateway reachable.

  • Slack default: enabled, configured, stopped, disconnected, error: channel stop timed out after 5000ms
  • Telegram default: enabled, configured, running, connected, mode: polling, works
  • WhatsApp default: enabled, configured, linked, stopped, disconnected, error: channel exited without an error

Root Cause

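Not pinned down in the issue itself. The reporter's leading hypothesis is that synchronous, CPU-heavy, or file-lock-heavy work blocks the Gateway's Node event loop for tens to hundreds of seconds, so every channel transport (Telegram polling, Slack socket pings/pongs, WhatsApp Web) misses its heartbeats and timeouts during the same periods. The linked PR #75922, which is marked as fixing this issue, targets two latency regressions: core coding tool construction when an allowlist only requests plugin tools, and plugin registry cache handling. WhatsApp's recurring 408/428 reconnect/session-expiry failures are a separate, channel-specific problem tracked in #75736.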

Fix Action

PR fix notes

PR #75922: Fix plugin-only tool and registry latency regressions

Description (problem / solution / changelog)

Summary

  • Skip core coding tool construction when an explicit allowlist only requests plugin tools.
  • Keep the full workspace plugin registry cache separate from scoped plugin registry loads.
  • Add regressions for both latency paths.

Tests

  • OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test src/agents/pi-embedded-runner/run/attempt.tools-allow-regression.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/tool-policy.plugin-only-allowlist.test.ts src/agents/pi-tools.create-openclaw-coding-tools.test.ts src/plugins/plugin-lru-cache.test.ts src/plugins/loader.runtime-registry.test.ts src/plugins/loader.test.ts
  • pnpm exec oxfmt --check --threads=1 src/agents/pi-embedded-runner/run/attempt.ts src/agents/pi-embedded-runner/run/attempt.spawn-workspace.test-support.ts src/agents/pi-embedded-runner/run/attempt.tools-allow-regression.test.ts src/plugins/loader.ts src/plugins/loader.runtime-registry.test.ts
  • git diff --check origin/main...HEAD

Fixes #75882 Fixes #75907 Fixes #75906 Fixes #75887 Fixes #75851

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/pi-embedded-runner/run/attempt.spawn-workspace.test-support.ts (modified, +36/-20)
  • src/agents/pi-embedded-runner/run/attempt.tools-allow-regression.test.ts (added, +59/-0)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +53/-4)
  • src/agents/pi-tools.ts (modified, +190/-145)
  • src/plugins/loader.runtime-registry.test.ts (modified, +28/-1)
  • src/plugins/loader.ts (modified, +40/-20)

Bug type

Performance / reliability regression

Summary

Gateway intermittently stalls its Node event loop for tens to hundreds of seconds, causing cross-channel latency/failures. This is not limited to WhatsApp: during the same periods Telegram polling/send actions stall or fail, Slack socket pings/pongs time out, and WhatsApp Web repeatedly disconnects/exits. WhatsApp additionally hits recurring 408/428 reconnect/session-expiry failures, tracked separately in #75736.

User-visible impact

  • WhatsApp messages sometimes get an automatic reaction, but no assistant reply is sent or the reply is delayed by minutes.
  • Telegram is also occasionally slow and has send/dispatch failures.
  • Slack socket mode repeatedly disconnects/restarts.
  • Gateway status probes can show channels as connected briefly, then stopped/disconnected shortly after.

Why the WhatsApp reaction happens but no answer follows

Logs show WhatsApp inbound/reaction handling can complete before the assistant run/delivery path finishes. After the reaction, the gateway/agent path can stall on event-loop delay, session/lane waits, file lock timeouts, LLM timeout, or WhatsApp listener flapping. This creates the visible pattern: ✅ reaction arrives, but no final answer is delivered.

Environment

  • OpenClaw: 2026.4.29 (a448042)
  • OS: Linux 6.8.0-100-generic x64
  • Node: 22.22.0
  • Gateway: systemd user service
  • Install: npm/pnpm global CLI
  • Channels enabled: WhatsApp, Telegram, Slack
  • Host resources during investigation: RAM available ~1.5–1.6GiB, disk ~88% full, gateway process around 35–40% RSS and CPU spikes during stalls.

Current channel state example

Gateway reachable.
- Slack default: enabled, configured, stopped, disconnected, error: channel stop timed out after 5000ms
- Telegram default: enabled, configured, running, connected, mode: polling, works
- WhatsApp default: enabled, configured, linked, stopped, disconnected, error: channel exited without an error

Sanitized evidence

Event-loop / liveness stalls

[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=173s eventLoopDelayP99Ms=171798.7 eventLoopDelayMaxMs=171798.7 eventLoopUtilization=1 cpuCoreRatio=1.055 active=0 waiting=0 queued=0
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=47s eventLoopDelayP99Ms=22666 eventLoopDelayMaxMs=22666 eventLoopUtilization=0.997 cpuCoreRatio=1.038 active=2 waiting=0 queued=1
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=34s eventLoopDelayP99Ms=12146.7 eventLoopDelayMaxMs=12146.7 eventLoopUtilization=1 cpuCoreRatio=1.031 active=0 waiting=0 queued=2
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=181s eventLoopDelayP99Ms=171530.3 eventLoopDelayMaxMs=171530.3 eventLoopUtilization=1 cpuCoreRatio=1.054 active=1 waiting=0 queued=2
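
To compare stalls across samples, the key=value fields of these warnings can be parsed into numbers. The sketch below is illustrative only: `parseLivenessWarning` and `stalledFraction` are hypothetical helpers, not part of OpenClaw, and the field names are simply copied from the log lines above. For the first sample, a worst-case delay of 171798.7 ms against a 173 s interval suggests the loop was unresponsive for roughly 99% of that window.

```typescript
// Hypothetical shape for one "[diagnostic] liveness warning" line.
interface LivenessWarning {
  reasons: string[];
  intervalSec: number;
  eventLoopDelayP99Ms: number;
  eventLoopDelayMaxMs: number;
  eventLoopUtilization: number;
  cpuCoreRatio: number;
  active: number;
  waiting: number;
  queued: number;
}

// Parses the key=value tokens that follow "liveness warning:".
// Returns null for lines that are not liveness warnings.
function parseLivenessWarning(line: string): LivenessWarning | null {
  const marker = "liveness warning:";
  const start = line.indexOf(marker);
  if (start === -1) return null;

  const fields = new Map<string, string>();
  for (const token of line.slice(start + marker.length).trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields.set(token.slice(0, eq), token.slice(eq + 1));
  }
  // parseFloat tolerates the trailing unit in values such as "173s".
  const num = (key: string): number => Number.parseFloat(fields.get(key) ?? "");

  return {
    reasons: (fields.get("reasons") ?? "").split(",").filter(Boolean),
    intervalSec: num("interval"),
    eventLoopDelayP99Ms: num("eventLoopDelayP99Ms"),
    eventLoopDelayMaxMs: num("eventLoopDelayMaxMs"),
    eventLoopUtilization: num("eventLoopUtilization"),
    cpuCoreRatio: num("cpuCoreRatio"),
    active: num("active"),
    waiting: num("waiting"),
    queued: num("queued"),
  };
}

// Worst-case delay as a fraction of the sampling interval.
function stalledFraction(w: LivenessWarning): number {
  return w.eventLoopDelayMaxMs / (w.intervalSec * 1000);
}
```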

Telegram affected too

[telegram] Polling stall detected (active getUpdates stuck for 172.51s); forcing restart.
[telegram] [diag] polling cycle finished reason=polling stall detected durationMs=172510 error=Network request for 'getUpdates' failed!
[telegram] polling runner stopped (polling stall detected); restarting in 2.49s.
[telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
[telegram] dispatch failed: SessionWriteLockTimeoutError: session file locked (timeout 10000ms): .../sessions.json.lock

Slack affected too

[WARN] socket-mode:SlackWebSocket A pong wasn't received from the server before the timeout of 15000ms!
[slack] socket disconnected (disconnect). retry 1/12 in 2s
[health-monitor] [slack:default] health-monitor: restarting (reason: disconnected)
[slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown

WhatsApp flapping / delivery failures

[whatsapp] Web connection closed (status 408). Retry 1/12 in 2.2s… (status=408 Request Time-out Connection was lost)
[whatsapp] Web connection closed (status 428: session expired or precondition required). Relink with `openclaw channels login --channel whatsapp`. Stopping web monitoring.
[whatsapp] [default] channel exited without an error
[whatsapp] [default] auto-restart attempt 3/10 in 22s
[tools] message failed: Error: No active WhatsApp Web listener (account: default).

Reaction without timely answer pattern

[whatsapp] Sending reaction "✅" -> message ...
[whatsapp] Inbound message ... (direct, 106 chars)
[whatsapp] Sent reaction "✅" -> message ...
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=47s eventLoopDelayP99Ms=22666 eventLoopDelayMaxMs=22666 eventLoopUtilization=0.997 cpuCoreRatio=1.038 active=2 waiting=0 queued=1
[diagnostic] lane wait exceeded: lane=session:agent:main:hook:ingress waitedMs=403418 queueAhead=4

Agent/session/lane symptoms

[diagnostic] lane wait exceeded: lane=session:agent:main:hook:ingress waitedMs=594650 queueAhead=4
[diagnostic] lane task error: lane=cron-nested durationMs=508892 error="FailoverError: LLM request timed out."
[diagnostic] lane task error: lane=session:agent:main:cron:... durationMs=636726 error="FailoverError: LLM request timed out."
[agent/embedded] agent cleanup timed out: ... step=pi-trajectory-flush timeoutMs=10000
[agent/embedded] [context-overflow-diag] sessionKey=agent:main:whatsapp:default:direct:... source=assistantError ... error=Context overflow: estimated context size exceeds safe threshold during tool loop.

Counts from a 12h log sample

event_loop_liveness: 173
telegram_polling_stall: 9
whatsapp_428: 8
wa_no_listener: 8
wa_channel_exited: 11
stuck_session: 84
gateway_timeout: 113
slack_disconnect: 64
assistant_error/context-overflow: 3
send_fail: 7
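
The issue does not say how these counts were produced. One way to reproduce a tally like this is to match each log line against a pattern per category; the sketch below assumes that approach. `countCategories` and every regular expression in `assumedPatterns` are hypothetical, written only from the sanitized log lines shown in this report, and cover just the categories whose log text is visible here.

```typescript
type CategoryPatterns = Record<string, RegExp>;

// Assumed patterns, derived from the sanitized log excerpts in this report.
// The reporter's real matching rules are not given.
const assumedPatterns: CategoryPatterns = {
  event_loop_liveness: /\[diagnostic\] liveness warning/,
  telegram_polling_stall: /\[telegram\] Polling stall detected/,
  whatsapp_428: /\[whatsapp\].*status 428/,
  wa_no_listener: /No active WhatsApp Web listener/,
  wa_channel_exited: /\[whatsapp\].*channel exited/,
  slack_disconnect: /\[slack\] socket disconnected/,
};

// Counts how many lines match each category pattern.
// A line may count toward more than one category.
function countCategories(
  lines: Iterable<string>,
  patterns: CategoryPatterns,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of Object.keys(patterns)) counts[name] = 0;
  for (const line of lines) {
    for (const [name, pattern] of Object.entries(patterns)) {
      if (pattern.test(line)) counts[name] = (counts[name] ?? 0) + 1;
    }
  }
  return counts;
}
```

For a real log file, the lines could be streamed with Node's `readline` module rather than loaded into memory at once.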

Hypotheses

  1. Gateway event loop is being blocked by one or more synchronous/CPU-heavy or file-lock-heavy operations, causing all channel transports to miss heartbeats/timeouts.
  2. Session persistence / trajectory flushing may be contributing: sessions.json.lock timeout and pi-trajectory-flush cleanup timeout appear near stalls.
  3. LLM/tool-loop timeouts and context-overflow diagnostics may be leaving sessions in long processing_without_queue states, causing lane waits and downstream delivery delays.
  4. WhatsApp has an additional channel-specific reconnect/session-expiry bug (#75736), which becomes more visible under event-loop stalls.
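
Hypothesis 1 rests on a general property of Node: while synchronous work runs, no timer, socket read, or heartbeat callback can fire. A minimal sketch of the usual mitigation, splitting a long loop into time-budgeted slices that yield back to the event loop, is shown below. `processInChunks` and `yieldToEventLoop` are hypothetical names, and nothing here is taken from the OpenClaw codebase.

```typescript
// Resolves on a later turn of the event loop, giving pending timers
// and I/O callbacks a chance to run first.
const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, 0));

// Runs `handle` over every item, but never holds the event loop for much
// longer than `budgetMs` (plus the cost of one `handle` call) at a time.
async function processInChunks<T>(
  items: readonly T[],
  handle: (item: T) => void,
  budgetMs = 10,
): Promise<void> {
  let sliceStart = Date.now();
  for (const item of items) {
    handle(item);
    if (Date.now() - sliceStart >= budgetMs) {
      await yieldToEventLoop();
      sliceStart = Date.now();
    }
  }
}
```

Chunking only helps when a single `handle` call is short. One long synchronous call, such as a large synchronous file read or JSON parse, would still block the loop and is usually a candidate for `worker_threads` or an asynchronous API instead.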

Expected behavior

  • A stuck agent run or trajectory flush should not block channel polling/websocket heartbeats for 10–170s.
  • Inbound ack/reaction and assistant reply delivery should not diverge silently; if a reply cannot be delivered, the failure should be recoverable/observable.
  • Telegram/Slack/WhatsApp transports should remain responsive even when one session or cron is stuck.
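
The second expectation, that an ack and its reply should not diverge silently, can be sketched as a delivery wrapper that always ends in an explicit outcome. This is an assumption-laden illustration: `withTimeout`, `ackThenReply`, `DeliveryOutcome`, and the callback parameters are all hypothetical and do not reflect OpenClaw's actual delivery path.

```typescript
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(label, ms)), ms);
  });
  // Clear the timer once either side wins so it cannot keep the process alive.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

type DeliveryOutcome =
  | { status: "delivered" }
  | { status: "failed"; reason: string };

// Sends the ack, then the reply; a reply that fails or times out is
// reported through `reportFailure` instead of disappearing.
async function ackThenReply(
  sendAck: () => Promise<void>,
  produceAndSendReply: () => Promise<void>,
  replyTimeoutMs: number,
  reportFailure: (reason: string) => Promise<void>,
): Promise<DeliveryOutcome> {
  await sendAck();
  try {
    await withTimeout(produceAndSendReply(), replyTimeoutMs, "reply delivery");
    return { status: "delivered" };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    await reportFailure(reason);
    return { status: "failed", reason };
  }
}
```

A timer-based deadline bounds how long the caller waits, but it cannot fire while the event loop itself is blocked, so it complements rather than replaces removing the blocking work.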

Actual behavior

Gateway event-loop stalls correlate with Telegram polling stalls, Slack pings timing out, WhatsApp disconnects/exits, lane waits, session lock failures, and missed/delayed user replies.

Related

  • WhatsApp 428/channel-exit issue: #75736

extent analysis

TL;DR

The Gateway's event loop stalls are likely caused by synchronous or CPU-heavy operations, session persistence issues, or LLM/tool-loop timeouts, leading to cross-channel latency and failures.

Guidance

  1. Investigate event-loop blocking operations: Profile the Gateway during a stall and look for synchronous, CPU-heavy, or file-lock-heavy work on the main thread. In the liveness warnings, eventLoopUtilization at or near 1 together with a cpuCoreRatio of about 1.03 to 1.06 suggests one thread kept busy rather than a loop idling on I/O.
  2. Improve session persistence and trajectory flushing: Shorten how long sessions.json.lock is held and keep trajectory flushing off the reply path, since the sessions.json.lock timeout (10000ms) and the pi-trajectory-flush cleanup timeout (10000ms) both appear near stalls.
  3. Bound LLM/tool-loop operations: The logged lane tasks fail with "LLM request timed out" only after about 509 s and 637 s (durationMs=508892 and 636726). Tighter timeouts with bounded retries would keep sessions from sitting in long processing_without_queue states and holding up their lanes.
  4. Monitor and analyze event loop utilization: Use diagnostics like eventLoopUtilization and eventLoopDelayP99Ms to monitor event loop performance and correlate each stall with the operation running at that time.
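
For item 4, Node's built-in `perf_hooks` module exposes both metrics. The following is a minimal sketch of a sampler built on `monitorEventLoopDelay()` and `performance.eventLoopUtilization()`; `createLivenessSampler` and the `LivenessSample` field names are hypothetical and are not claimed to match how OpenClaw produces its own liveness warnings.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

interface LivenessSample {
  delayP99Ms: number;
  delayMaxMs: number;
  utilization: number;
}

// Creates a sampler; each sample() call reports event-loop delay and
// utilization since the previous call, then resets the delay histogram.
function createLivenessSampler(resolutionMs = 20): {
  sample: () => LivenessSample;
  stop: () => void;
} {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  const sample = (): LivenessSample => {
    const current = performance.eventLoopUtilization();
    // Passing two readings yields the utilization for the time between them.
    const delta = performance.eventLoopUtilization(current, previous);
    previous = current;
    const result: LivenessSample = {
      delayP99Ms: histogram.percentile(99) / 1e6, // histogram values are nanoseconds
      delayMaxMs: histogram.max / 1e6,
      utilization: delta.utilization,
    };
    histogram.reset();
    return result;
  };

  const stop = (): void => {
    histogram.disable();
  };

  return { sample, stop };
}
```

A caller would typically invoke `sample()` on a `setInterval` and log a warning when `delayP99Ms` crosses a chosen threshold. Because that interval runs on the same event loop, it can only report a stall after the stall ends, which may be why the warnings above show intervals such as 173s; detecting a stall while it is happening would need a watchdog outside the blocked thread.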

Example

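The issue includes no source code, so only an illustrative direction can be given. In Node.js, stalls can be measured with `perf_hooks.monitorEventLoopDelay()` and `performance.eventLoopUtilization()`, and awaited LLM/tool-loop calls can be bounded with `AbortSignal.timeout()` or a `setTimeout`-based race. Note that async/await alone does not unblock synchronous or CPU-bound work; that work has to be split into chunks that yield to the event loop or moved off the main thread (for example with `worker_threads`).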

Notes

The provided information suggests that the issue is complex and multifaceted, requiring a thorough investigation of the Gateway's event loop, session persistence, and LLM/tool-loop operations. The hypotheses provided in the issue body offer a good starting point for further analysis.

Recommendation

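Prefer a build that includes PR #75922, which is marked as fixing this issue. Where upgrading is not yet possible, the mitigations above (removing event-loop blocking work, tightening session persistence, and bounding LLM/tool-loop operations) are the likeliest to reduce the stalls. The WhatsApp 408/428 failures are tracked separately in #75736.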
