openclaw - ✅(Solved) Fix [Bug]: WhatsApp health-monitor re-starts channels stopped by terminal DisconnectReason (loggedOut, connectionReplaced) — 12-channel multi-tenant restart loop, 12.9GB heap, 24s RPC latency on 2026.5.3-1 [1 pull requests, 3 comments, 3 participants]

GitHub stats
openclaw/openclaw#78419 · Fetched 2026-05-07 03:37:06
Comments: 3 · Participants: 3 · Timeline events: 10 · Reactions: 2
Timeline (top): referenced ×6, commented ×3, cross-referenced ×1

On a multi-tenant gateway (12 WhatsApp accounts), all 12 channels enter a perpetual health-monitor: restarting (reason: stopped) loop after their underlying WhatsApp Web sessions are invalidated. The health monitor restarts channels regardless of why they stopped, so channels stopped by terminal DisconnectReasons (loggedOut, connectionReplaced, badSession) get re-started indefinitely with no chance of success — they need a fresh QR pairing, not a restart. Over ~20 hours this drove gateway RSS from <500 MB to 12.9 GB peak, CPU to 16 h of 20 h wall time, and RPC agents.files.get latency from <200 ms to 24-25 s, blocking unrelated tenants on the same gateway.

Fix Action

Fixed

PR fix notes

PR #78511: fix(gateway): skip health-monitor restart for terminal WhatsApp disconnects

Description (problem / solution / changelog)

Summary

  • Problem: On multi-tenant gateways running ≥1 WhatsApp account, any account that receives a terminal disconnect (loggedOut 401, connectionReplaced 440) enters a perpetual restart loop — the health monitor calls startChannel() on every check interval, each attempt leaks a full Baileys WebSocket + Noise handshake, and over time this exhausts gateway heap (12.9 GB reported over 20 h on 12 accounts), saturates the event loop (~80% CPU), and degrades RPC latency for all tenants on the same gateway (~24 s p99 vs. <200 ms baseline). Recovery requires manual intervention.

  • Root Cause: Two independent restart paths both lacked a terminal-disconnect guard, and the status surface also missed the signal. (1) ChannelHealthSnapshot had no terminalDisconnect field, so evaluateChannelHealth returned "not-running" for every stopped channel regardless of why it stopped, causing the health-monitor to unconditionally call startChannel(). (2) The ChannelManager task-exit recovery handler fires immediately when the channel task settles — before the health monitor is involved — and only checked manuallyStopped, so terminal exits still triggered the backoff-and-restart loop independently. (3) WhatsApp's custom buildAccountSnapshot handler forwarded healthState from the runtime snapshot but not terminalDisconnect, so channels.status recomputed "not-running" instead of "terminal-disconnect" for logged-out accounts.

  • Fix: Add terminalDisconnect?: boolean to ChannelHealthSnapshot and the public ChannelAccountSnapshot. The WhatsApp status controller sets the flag to true in markStopped() when healthState is "logged-out" or "conflict" (the only two states that clear auth and require QR re-pairing), and resets it to undefined in noteConnected() on successful re-auth. Both restart paths now observe the flag: evaluateChannelHealth returns a new "terminal-disconnect" reason that the health monitor logs and skips; the ChannelManager task-exit handler returns early with an info-level log. The flag is also forwarded in WhatsApp's buildAccountSnapshot so channels.status reflects the correct terminal state. The flag is channel-agnostic: any future extension can set it on its own ChannelAccountSnapshot to opt in to the same no-restart behavior without modifying core.

  • What changed:

    • src/channels/plugins/types.core.ts — added terminalDisconnect?: boolean to ChannelAccountSnapshot
    • src/gateway/channel-health-policy.ts — added field to ChannelHealthSnapshot; added "terminal-disconnect" reason; added guard before the not-running branch
    • src/gateway/channel-health-monitor.ts — added skip block for terminal-disconnect reason with an info-level log line
    • src/gateway/server-channels.ts — added terminalDisconnect guard in the task-exit recovery handler after the manuallyStopped check
    • extensions/whatsapp/src/auto-reply/types.ts — added terminalDisconnect?: boolean to WebChannelStatus
    • extensions/whatsapp/src/auto-reply/monitor-state.ts — set flag in markStopped() for "logged-out" / "conflict"; clear in noteConnected()
    • extensions/whatsapp/src/channel.ts — forwarded terminalDisconnect from the runtime snapshot in the custom buildAccountSnapshot handler
    • src/gateway/channel-health-policy.test.ts — 3 new cases covering terminal-disconnect evaluation, no-flag fallthrough, and running-channel flag ignore
    • src/gateway/channel-health-monitor.test.ts — 2 new cases proving the monitor skips terminal-disconnect channels and still restarts non-terminal stopped channels
    • src/gateway/server-channels.test.ts — 1 new case proving the manager task-exit path does not auto-restart a terminal-disconnect channel
    • src/gateway/server-methods/channels.status.test.ts — 1 new case proving channels.status reports "terminal-disconnect" health state when the flag is set
  • What did NOT change (scope boundary):

    • No changes to reconnect policy, cooldown config, or hourly restart cap
    • No changes to any other channel extension (Telegram, Discord, Slack, etc.)
    • isManuallyStopped() logic is untouched; manually stopped channels continue to be skipped by the existing guard that runs before health evaluation
    • resolveChannelRestartReason is not reachable for terminal-disconnect (monitor skips before calling it) and is not modified
    • No new public-facing config keys or gateway events in this change
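
The guard described in the Fix bullet could look roughly like the sketch below. It is an illustration only, not the code from PR #78511: the field name terminalDisconnect and the reasons "terminal-disconnect" and "not-running" come from the PR description, while the simplified snapshot shape, the healthy result, and the channelsToRestart helper are assumptions made for the example.

```typescript
// Hypothetical, simplified snapshot; the real ChannelHealthSnapshot has more fields.
interface ChannelHealthSnapshot {
  id: string;
  running: boolean;
  terminalDisconnect?: boolean;
}

type HealthEvaluation =
  | { healthy: true }
  | { healthy: false; reason: "terminal-disconnect" | "not-running" };

function evaluateChannelHealth(snapshot: ChannelHealthSnapshot): HealthEvaluation {
  if (snapshot.running) {
    // A running channel is treated as healthy even if a stale flag is still set.
    return { healthy: true };
  }
  // The guard sits before the not-running branch, as in the PR description.
  if (snapshot.terminalDisconnect === true) {
    return { healthy: false, reason: "terminal-disconnect" };
  }
  return { healthy: false, reason: "not-running" };
}

// Hypothetical monitor sweep: returns the ids that would be passed to startChannel().
function channelsToRestart(snapshots: ChannelHealthSnapshot[]): string[] {
  const restart: string[] = [];
  for (const snapshot of snapshots) {
    const result = evaluateChannelHealth(snapshot);
    if (result.healthy) continue;
    // Terminal disconnects need a fresh QR pairing, so a restart cannot succeed.
    if (result.reason === "terminal-disconnect") continue;
    restart.push(snapshot.id);
  }
  return restart;
}
```

With this shape, a stopped channel without the flag still falls through to "not-running" and is restarted as before, which matches the no-flag fallthrough case listed for channel-health-policy.test.ts.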

Reproduction

  1. Run a gateway with ≥1 WhatsApp account:
    channels:
      whatsapp:
        accounts:
          default:
            enabled: true
  2. From the linked phone: Settings → Linked Devices → unlink the session.
  3. Observe gateway.log:
    [health-monitor] [whatsapp:default] health-monitor: restarting (reason: stopped)
    firing every check interval indefinitely.
  4. After this fix:
    [default] auto-restart skipped, terminal disconnect
    [health-monitor] [whatsapp:default] health-monitor: skipping restart, terminal disconnect
    fires with no startChannel() call, and channels.status reports healthState: "terminal-disconnect".

Risk / Mitigation

  • Risk: A terminal-disconnect channel will no longer be auto-recovered by either restart path. If terminalDisconnect is accidentally set true on a channel that could have reconnected, that channel would stay stopped until manually restarted.
  • Mitigation: The flag is set only for "logged-out" and "conflict" healthStates, which both clear stored auth credentials — the channel genuinely cannot reconnect without operator QR action. noteConnected() unconditionally clears the flag, so a successful re-auth restores normal behavior on all paths. Three policy unit tests, two monitor integration tests, one manager recovery test, and one channels.status regression test cover the boundary cases.
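
The flag lifecycle that the mitigation relies on can be sketched as a small status controller. The method names markStopped() and noteConnected() and the "logged-out" / "conflict" states come from the PR description. The other state names, the parameter of markStopped(), and the createStatusController factory are assumptions for illustration.

```typescript
// "stopped" and "connected" are placeholder states assumed for this sketch.
type StoppedState = "logged-out" | "conflict" | "stopped";

interface WebChannelStatus {
  running: boolean;
  healthState: StoppedState | "connected";
  terminalDisconnect?: boolean;
}

function createStatusController() {
  const status: WebChannelStatus = { running: false, healthState: "stopped" };
  return {
    markStopped(healthState: StoppedState): void {
      status.running = false;
      status.healthState = healthState;
      // Only the two states that clear auth and require QR re-pairing are terminal.
      if (healthState === "logged-out" || healthState === "conflict") {
        status.terminalDisconnect = true;
      }
    },
    noteConnected(): void {
      status.running = true;
      status.healthState = "connected";
      // Cleared unconditionally, so a successful re-auth restores normal restarts.
      status.terminalDisconnect = undefined;
    },
    snapshot(): WebChannelStatus {
      // Return a copy so callers cannot mutate controller state.
      return { ...status };
    },
  };
}
```

In this sketch a later non-terminal markStopped("stopped") does not clear the flag; only noteConnected() does, which mirrors the "unconditionally clears" wording above.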

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway
  • WhatsApp plugin
  • Channel status types (additive, backward-compatible)
  • Tests

Linked Issue/PR

Fixes #78419

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/whatsapp/src/auto-reply/monitor-state.ts (modified, +3/-0)
  • extensions/whatsapp/src/auto-reply/types.ts (modified, +1/-0)
  • extensions/whatsapp/src/channel.ts (modified, +3/-0)
  • src/channels/plugins/types.core.ts (modified, +1/-0)
  • src/gateway/channel-health-monitor.test.ts (modified, +30/-0)
  • src/gateway/channel-health-monitor.ts (modified, +6/-0)
  • src/gateway/channel-health-policy.test.ts (modified, +55/-0)
  • src/gateway/channel-health-policy.ts (modified, +5/-0)
  • src/gateway/server-channels.test.ts (modified, +23/-0)
  • src/gateway/server-channels.ts (modified, +4/-0)
  • src/gateway/server-methods/channels.status.test.ts (modified, +30/-0)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

  1. Run an OpenClaw gateway with multiple WhatsApp Web accounts (channels.whatsapp.accounts.*) on 2026.5.3-1.
  2. Wait for one or more accounts to receive a terminal disconnect from WhatsApp servers — easiest reproductions:
    • User logs out the linked device from the phone (Settings → Linked Devices → unlink) → DisconnectReason.loggedOut (401).
    • User links a new device that replaces the OpenClaw session → DisconnectReason.connectionReplaced (440).
    • Auth state corruption → DisconnectReason.badSession.
  3. Observe that the channel transitions to running=false after the close handler.
  4. Watch ~/.openclaw/logs/gateway.log (or the systemd-user journal) for ~30 min:
    [health-monitor] [whatsapp:cp-tenantN] health-monitor: restarting (reason: stopped)
    firing every health-check interval, per channel, with each restart immediately failing the same way.
  5. With 12 stuck channels, observe gateway RSS climb steadily (Baileys WebSocket + Noise handshake state per attempt), event-loop delay rise, and other RPCs degrade.

Expected behavior

  • Channels stopped by terminal DisconnectReasons are marked in a state distinct from "stopped" (e.g. reauth_required) and not restarted by the health-monitor.
  • A channel.reauth_required event (or equivalent) is emitted on the gateway event stream so control planes can surface a re-pairing prompt to the operator.
  • channels.status exposes a reauthRequired: boolean field so downstream consumers can distinguish "needs human action" from "transient outage".
  • Channels that exit for retryable reasons (connectionClosed, connectionLost, restartRequired, timedOut) continue to be restarted as today.
  • A bounded restart budget exists for any restart loop that escapes the DisconnectReason classification (defense-in-depth — see Additional information).

Actual behavior

  • Channel exits with connection: 'close' and a terminal DisconnectReason from Baileys.
  • Gateway sets the channel to running=false (status: stopped).
  • Health monitor wakes on its interval, sees running=false, and calls startChannel().
  • New connection immediately fails with the same terminal reason (or a connectionReplaced 440 because the previous Noise key is now stale on WhatsApp's side).
  • Loop continues indefinitely. Per-attempt cost (Baileys auth state + protobuf parsing + WebSocket setup) accumulates: in our case 12 channels × ~one attempt every health-check interval × no upper bound on consecutive failures.

Observed metrics (gateway PID, single instance, 20h window):

  • RSS: <500 MB → 12.9 GB peak (~26× growth).
  • CPU: 16 h of 20 h (~80%).
  • agents.files.get RPC p50/p99: ~150 ms / ~24-25 s (normal <200 ms).
  • Other tenants' control-plane RPCs queued behind the saturated event loop.

OpenClaw version

2026.5.3-1

Operating system

Ubuntu 24.04 (Linux 6.8.0-110-generic), VPS

Install method

npm global, gateway as systemd --user unit (openclaw-gateway.service)

Model

Not relevant to this bug — failure is upstream of any model invocation. (Mixed: Anthropic Claude Opus 4.7, Sonnet 4.6 for tenant agents.)

Provider / routing chain

Not relevant — failure is in the WhatsApp transport before any agent runs.

Additional provider/model setup details

Multi-tenant SaaS control plane: ~12 active WhatsApp accounts on a single gateway, each tied to a distinct tenant. This is the relevant difference from prior single-user reports — the blast radius spans tenants.

Logs, screenshots, and evidence

# 12 channels entering health-monitor restart on the same global sweep
[health-monitor] [whatsapp:cp-tenant1]   health-monitor: restarting (reason: stopped)
[health-monitor] [whatsapp:cp-tenant2]   health-monitor: restarting (reason: stopped)
[health-monitor] [whatsapp:cp-tenant3]   health-monitor: restarting (reason: stopped)
... (12 lines, all within the same second) ...

# RPC latency during saturation vs. recovery
[ws] ⇄ res ✓ agents.files.get  24913ms  conn=<redacted>  id=<redacted>
[ws] ⇄ res ✓ agents.files.get  24241ms  conn=<redacted>  id=<redacted>
# After the offending channel was removed from config:
[ws] ⇄ res ✓ agents.files.get     36ms  conn=<redacted>  id=<redacted>

# Gateway RSS over 20h (sampled via systemd-cgtop / RSS column of /proc/<pid>/status)
00:00 RSS=  462 MB
04:00 RSS=  3.1 GB
08:00 RSS=  6.4 GB
12:00 RSS=  9.8 GB
16:00 RSS= 11.7 GB
20:00 RSS= 12.9 GB  ← peak before manual intervention

Impact and severity

  • Affected: any operator running >1 WhatsApp account on a single gateway where one or more accounts are invalidated upstream (logout from phone, account ban, device replaced). Severity scales with channel count.
  • Severity: high — degrades the entire gateway, not just the affected channel. In multi-tenant deployments, one tenant's logout can cause cross-tenant outage.
  • Frequency: 100% reproducible whenever a WhatsApp Web session ends with a terminal DisconnectReason. Frequency in the wild depends on user behavior; in our population we see it weekly.
  • Consequence: gateway memory exhaustion, event-loop saturation, missed cross-channel replies, RPC latency that cascades into agent-side timeouts. Recovery currently requires manual gateway restart + per-channel re-pairing.

Additional information

Relation to prior closed issues

This is the same problem family as #48390, #49305, #51342, #54614, #70463, but the reason logged is now stopped (not stale-socket), and the symptom is observed on 2026.5.3-1 after those fixes shipped. Reading the closed-issue history, the post-fix gateway appears to have moved the health-monitor trigger from "stale event timestamp" to "channel running=false" — which closes the stale-socket false positive, but reintroduces the same loop because the health-monitor still cannot tell why a channel is stopped.

The architectural gap: the health-monitor and the close-reason classifier are decoupled. The close handler knows the channel was logged out; that information does not reach the health-monitor.

Proposed fix #1 — terminal DisconnectReason classification (root cause)

Capture connection.update from Baileys and, when connection === 'close', read lastDisconnect.error.output.statusCode (or lastDisconnect.error.data.reason) and map to DisconnectReason:

| DisconnectReason | Code | Action |
| --- | --- | --- |
| loggedOut | 401 | Mark channel reauth_required; do not restart; clear stored auth state |
| connectionReplaced | 440 | Mark channel reauth_required; do not restart |
| badSession | | Mark channel reauth_required; clear auth-profiles.json; do not restart |
| connectionClosed / connectionLost / restartRequired / timedOut | 428 / 408 / 515 / 408 | Restart as today (retryable) |
| multideviceMismatch | | Mark reauth_required; do not restart |
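
A minimal sketch of the classification in the table, under stated assumptions: the numeric values are copied from the Baileys DisconnectReason enum as I understand it (including 500 for badSession and 411 for multideviceMismatch, which the table leaves blank) and should be checked against the installed Baileys version. The names classifyClose, statusCodeOf, and CloseAction are hypothetical, and the constants are declared locally so the example does not depend on the library.

```typescript
// Local copy of the relevant codes; verify against the installed Baileys release.
const DisconnectReason = {
  connectionClosed: 428,
  connectionLost: 408,
  connectionReplaced: 440,
  timedOut: 408,
  loggedOut: 401,
  badSession: 500,
  restartRequired: 515,
  multideviceMismatch: 411,
} as const;

type CloseAction = "reauth_required" | "restart" | "unclassified";

const TERMINAL_CODES: readonly number[] = [
  DisconnectReason.loggedOut,
  DisconnectReason.connectionReplaced,
  DisconnectReason.badSession,
  DisconnectReason.multideviceMismatch,
];

const RETRYABLE_CODES: readonly number[] = [
  DisconnectReason.connectionClosed,
  DisconnectReason.connectionLost,
  DisconnectReason.restartRequired,
  DisconnectReason.timedOut,
];

// Minimal structural type for the lastDisconnect field of a connection.update payload.
interface LastDisconnectLike {
  error?: { output?: { statusCode?: number } };
}

function statusCodeOf(lastDisconnect: LastDisconnectLike | undefined): number | undefined {
  return lastDisconnect?.error?.output?.statusCode;
}

function classifyClose(statusCode: number | undefined): CloseAction {
  if (statusCode === undefined) return "unclassified";
  if (TERMINAL_CODES.indexOf(statusCode) !== -1) return "reauth_required";
  if (RETRYABLE_CODES.indexOf(statusCode) !== -1) return "restart";
  // Unknown codes are left to a fallback such as a bounded restart budget.
  return "unclassified";
}
```

Returning "unclassified" instead of guessing keeps unmapped codes visible to whatever fallback policy handles them.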

Additions to public surface:

  • channels.status RPC: add optional reauthRequired: boolean (and ideally reauthReason: 'logged_out' | 'connection_replaced' | 'bad_session' | ...). Optional → does not break existing clients.
  • New gateway event: channel.reauth_required on the event stream, with { channelId, accountId, reason } payload, so control planes can proactively show a re-pair / new-QR prompt.
  • When a fresh QR scan completes successfully, clear reauthRequired and resume normal health-monitor behavior automatically.
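
The proposed additions could be typed roughly as below. This is a sketch of the proposal in this issue, not of shipped behavior (the merged PR states that it adds no new gateway events). The field names reauthRequired and reauthReason and the { channelId, accountId, reason } payload come from the list above. The type field on the event, the entry shape, and both helper functions are assumptions.

```typescript
type ReauthReason = "logged_out" | "connection_replaced" | "bad_session";

// Hypothetical channels.status entry; both new fields are optional so existing clients keep working.
interface ChannelStatusEntry {
  channelId: string;
  accountId: string;
  running: boolean;
  reauthRequired?: boolean;
  reauthReason?: ReauthReason;
}

interface ReauthRequiredEvent {
  type: "channel.reauth_required";
  channelId: string;
  accountId: string;
  reason: ReauthReason;
}

// Flags the entry and emits the event so a control plane can show a re-pair prompt.
function markReauthRequired(
  entry: ChannelStatusEntry,
  reason: ReauthReason,
  emit: (event: ReauthRequiredEvent) => void,
): ChannelStatusEntry {
  emit({
    type: "channel.reauth_required",
    channelId: entry.channelId,
    accountId: entry.accountId,
    reason,
  });
  return { ...entry, running: false, reauthRequired: true, reauthReason: reason };
}

// Called after a successful QR scan: drops both fields and resumes normal monitoring.
function clearReauthRequired(entry: ChannelStatusEntry): ChannelStatusEntry {
  return { channelId: entry.channelId, accountId: entry.accountId, running: true };
}
```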

Proposed fix #2 — bounded restart budget (defense-in-depth)

For close reasons that fall outside the mapping (new Baileys constants, unmapped 4xx, undocumented disconnects, or close-before-connection.update-fires races):

  • After N consecutive restart attempts (suggested default 5) within a rolling window of X minutes (suggested 10 min), stop restarting.
  • Mark the channel disconnected (max_retries_exceeded); surface the same channel.reauth_required event so operators see the failure.
  • Reset the counter when the channel stays connected for Y minutes (suggested 5 min) without a close.
  • Make N, X, Y configurable under gateway.channelHealthMonitor.{maxConsecutiveRestarts, restartWindowMinutes, healthyResetMinutes}.
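
One way to read the budget rules above is the sketch below. It assumes a latch: once N attempts have been recorded inside the rolling window, restarts stay blocked until the channel has been connected for Y without a close. The option names follow the config keys suggested above, converted to milliseconds. The RestartBudget class and its method names are hypothetical.

```typescript
interface RestartBudgetOptions {
  maxConsecutiveRestarts: number; // N
  restartWindowMs: number; // X
  healthyResetMs: number; // Y
}

class RestartBudget {
  private readonly options: RestartBudgetOptions;
  private attempts: number[] = [];
  private connectedSince: number | undefined = undefined;
  private exhausted = false;

  constructor(options: RestartBudgetOptions) {
    this.options = options;
  }

  // Returns true and records the attempt if a restart is allowed at nowMs.
  tryRestart(nowMs: number): boolean {
    if (this.exhausted) return false;
    const cutoff = nowMs - this.options.restartWindowMs;
    // Keep only attempts that are still inside the rolling window.
    this.attempts = this.attempts.filter((t) => t > cutoff);
    if (this.attempts.length >= this.options.maxConsecutiveRestarts) {
      this.exhausted = true; // corresponds to max_retries_exceeded
      return false;
    }
    this.attempts.push(nowMs);
    return true;
  }

  noteConnected(nowMs: number): void {
    this.connectedSince = nowMs;
  }

  noteClosed(): void {
    this.connectedSince = undefined;
  }

  // Call periodically; resets the budget after a long enough healthy period.
  noteTick(nowMs: number): void {
    if (
      this.connectedSince !== undefined &&
      nowMs - this.connectedSince >= this.options.healthyResetMs
    ) {
      this.attempts = [];
      this.exhausted = false;
    }
  }

  isExhausted(): boolean {
    return this.exhausted;
  }
}
```

With the suggested defaults this would be constructed as new RestartBudget({ maxConsecutiveRestarts: 5, restartWindowMs: 10 * 60 * 1000, healthyResetMs: 5 * 60 * 1000 }). Timestamps are passed in explicitly so the policy can be tested without a real clock.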

Why both fixes, not just one:

  • Fix #1 alone fails on classes the gateway doesn't yet know about (new Baileys releases, undocumented WhatsApp server codes). Fix #2 catches that.
  • Fix #2 alone preserves the bad UX of silent failure: operators only learn after the budget exhausts (5 attempts × ~30 s ≈ 2-3 min lag) and lose the strong "logged out → please re-pair" signal that Fix #1 provides.

Last known good / first known bad

We have not bisected: this is structural rather than a regression. Prior closed issues in the same family suggest the loop has existed across several versions with different surface symptoms (stale-socket → stopped).
