openclaw - 💡(How to fix) WhatsApp watchdog app-silent timeout detects zombie state but never triggers reconnect [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#75915 · Fetched 2026-05-03 04:44:19
Comments: 1 · Participants: 2 · Timeline: 2 · Reactions: 2
Timeline (top): closed ×1, commented ×1

Bug

The onWatchdogTimeout handler in the WhatsApp web monitor (#63855 / #66920) correctly detects when a post-408 reconnect produces a zombie connection (transport frames flowing, but messages.upsert handler not re-registered), but never actually forces a reconnect. The handler logs a warning, sets healthState = "stale", and returns — leaving the dead connection running indefinitely.

Root Cause

In monitor-Cd6tuzSy.js (v2026.4.29), the onWatchdogTimeout callback:

onWatchdogTimeout: (snapshot) => {
    const watchdogReason = transportSilentMs > transportTimeoutMs 
        ? "transport-inactive" : "app-silent";
    statusController.noteWatchdogStale();
    heartbeatLogger.warn({...}, "WhatsApp watchdog timeout detected - forcing reconnect");
    whatsappHeartbeatLog.warn(`WhatsApp watchdog timeout (${watchdogReason}) - restarting connection`);
    // ← nothing happens here. No reconnect. Just logging.
}

The log message says "forcing reconnect" but no reconnect is forced.

Meanwhile:

  • noteWatchdogStale() sets healthState = "stale" — but nobody listens for "stale" to trigger recovery
  • isTerminalHealthState() only recognizes "conflict", "logged-out", "stopped" — not "stale"
  • The connection loop stays in its monitoring state waiting for a Baileys-level disconnect that never comes (because the transport is healthy — the app-level event handler is what is dead)
  • controller.forceClose() exists and is correctly wired for unhandled rejection / crypto errors, but is not called from the watchdog timeout path
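The gap described in the bullets above can be illustrated with a minimal sketch. Only the state names and `isTerminalHealthState` come from the issue; `TERMINAL_STATES`, `shouldLeaveMonitorLoop`, and the `socketClosed` flag are hypothetical stand-ins, not the actual OpenClaw source.

```javascript
// Hypothetical model of the health-state check described above.
// Only the state names and isTerminalHealthState come from the issue.
const TERMINAL_STATES = new Set(["conflict", "logged-out", "stopped"]);

function isTerminalHealthState(state) {
  return TERMINAL_STATES.has(state);
}

// Assumed, simplified exit condition for the monitoring loop:
// it leaves only on a terminal state or an explicit socket close.
function shouldLeaveMonitorLoop({ healthState, socketClosed }) {
  return socketClosed || isTerminalHealthState(healthState);
}

// After the watchdog fires: state is "stale", socket is still open.
const afterWatchdog = { healthState: "stale", socketClosed: false };
console.log(shouldLeaveMonitorLoop(afterWatchdog)); // false

// If something closed the socket, the loop would exit and reconnect.
const afterForceClose = { healthState: "stale", socketClosed: true };
console.log(shouldLeaveMonitorLoop(afterForceClose)); // true
```

Under these assumptions, marking the state "stale" changes nothing on its own: the loop needs either a terminal state or a closed socket to move on.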

Expected Behavior

The onWatchdogTimeout handler should call controller.forceClose() to tear down the Baileys socket and re-enter the connection loop with fresh event listeners:

onWatchdogTimeout: (snapshot) => {
    // ... existing detection + logging ...
    controller.forceClose({
        status: 499,
        isLoggedOut: false,
        error: new Error(`watchdog ${watchdogReason} timeout`)
    });
}

This would close the dead socket, break out of the monitoring loop, and trigger the reconnect path — which re-registers messages.upsert and other event handlers.
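A rough simulation of that recovery path, using Node's `EventEmitter` in place of the Baileys socket. `createSocket`, `createMonitor`, and the internals of `forceClose` here are hypothetical; only the `forceClose` argument shape and the `messages.upsert` event name are taken from the issue.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical stand-ins; this is not the OpenClaw or Baileys API.
function createSocket() {
  return { ev: new EventEmitter(), closed: false };
}

function createMonitor(onInbound) {
  const state = { socket: null, reconnects: 0, lastClose: null };

  function connect() {
    state.socket = createSocket();
    // Fresh listeners are attached on every (re)connect.
    state.socket.ev.on("messages.upsert", onInbound);
  }

  const controller = {
    // Argument shape mirrors the call proposed in the issue.
    forceClose({ status, isLoggedOut, error }) {
      state.lastClose = { status, isLoggedOut, error };
      state.socket.closed = true;
      state.socket.ev.removeAllListeners();
      if (!isLoggedOut) {
        state.reconnects += 1;
        connect();
      }
    },
  };

  function onWatchdogTimeout(watchdogReason) {
    controller.forceClose({
      status: 499,
      isLoggedOut: false,
      error: new Error(`watchdog ${watchdogReason} timeout`),
    });
  }

  connect();
  return { state, onWatchdogTimeout };
}

const received = [];
const monitor = createMonitor((msg) => received.push(msg));

// Simulate the zombie: the socket is open but the handler is gone.
monitor.state.socket.ev.removeAllListeners("messages.upsert");
monitor.state.socket.ev.emit("messages.upsert", "lost");

monitor.onWatchdogTimeout("app-silent");
monitor.state.socket.ev.emit("messages.upsert", "delivered");

console.log(received); // [ 'delivered' ]
```

The message emitted while the handler is missing is dropped silently, which matches the symptom in the issue; the one emitted after the forced close reaches the handler again.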

Impact

Complete silent inbound message loss. After any 408/428 disconnect-reconnect cycle, the WhatsApp channel can enter a state where:

  • Outbound messages send successfully (masking the failure)
  • Gateway status shows "Listening" / "connected"
  • Transport heartbeats flow normally
  • But zero inbound messages are processed
  • The watchdog correctly identifies app-silent and logs "forcing reconnect" — but does not reconnect

The only current recovery is a full gateway process kill + restart. The watchdog forceClose reconnect path (499 status code) also fails to recover because it appears to be the same incomplete path.
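Because transport heartbeats keep flowing in this state, the two kinds of silence have to be tracked separately. The sketch below follows the ternary in the handler, with an added healthy case. `transportSilentMs` and `transportTimeoutMs` appear in the issue; `classifyWatchdog`, `appSilentMs`, `appTimeoutMs`, and the timeout values are assumptions made for illustration.

```javascript
// Hypothetical watchdog classifier. transportSilentMs and transportTimeoutMs
// appear in the issue; appSilentMs and appTimeoutMs are assumed names.
function classifyWatchdog({ transportSilentMs, appSilentMs, transportTimeoutMs, appTimeoutMs }) {
  if (transportSilentMs > transportTimeoutMs) return "transport-inactive";
  if (appSilentMs > appTimeoutMs) return "app-silent";
  return null; // neither threshold exceeded
}

// Illustrative thresholds only; not OpenClaw defaults.
const limits = { transportTimeoutMs: 60_000, appTimeoutMs: 30 * 60_000 };

// Zombie: heartbeat frames seen 5 s ago, last inbound event 45 min ago.
const zombie = classifyWatchdog({ ...limits, transportSilentMs: 5_000, appSilentMs: 45 * 60_000 });
console.log(zombie); // app-silent

// Dead transport: nothing on the wire for 2 min.
const dead = classifyWatchdog({ ...limits, transportSilentMs: 120_000, appSilentMs: 120_000 });
console.log(dead); // transport-inactive
```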

Reproduction

  1. Run OpenClaw v2026.4.29 with WhatsApp channel enabled
  2. Wait for a 408 server-forced disconnect (happens naturally every few hours on most connections)
  3. Baileys reconnects the WebSocket but fails to re-register messages.upsert
  4. Watchdog detects app-silent after configured timeout
  5. Observe: warning logged, healthState set to "stale", but no reconnect occurs
  6. Inbound messages are permanently lost until process restart

Environment

  • OpenClaw v2026.4.29 (build a448042)
  • macOS 15 (arm64), Node v25.8.2
  • Baileys v7.0.0-rc.9
  • Two WhatsApp accounts configured (default + secondary)
  • pmset sleep 0 (sleep is not a factor)

Related

  • #63855 / #66920 — the PR that added the detection but not the recovery
  • #73580 — added web.whatsapp.* socket timing config (helps prevent 408s but does not fix zombie recovery)
  • #72656 — transport liveness / earlier reconnect on silent stalls
  • #73914 — upstream Baileys zombie reconnect tracking

extent analysis

TL;DR

The onWatchdogTimeout handler should call controller.forceClose() to tear down the Baileys socket and re-enter the connection loop with fresh event listeners.

Guidance

  • The onWatchdogTimeout handler detects a zombie connection but does not force a reconnect, leading to silent inbound message loss.
  • To fix this, call controller.forceClose() with a 499 status code and an error message in the onWatchdogTimeout handler.
  • Verify that the reconnect path is triggered by checking for the re-registration of messages.upsert and other event handlers.
  • Test the fix by reproducing the issue using the provided reproduction steps and observing if inbound messages are processed correctly after a reconnect.
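One way to check the re-registration step is to count listeners after a reconnect. This sketch uses a plain Node `EventEmitter` as a stand-in; whether the Baileys `ev` object exposes `listenerCount` is an assumption to verify, and `findMissingHandlers` and `REQUIRED_EVENTS` are hypothetical names.

```javascript
const { EventEmitter } = require("node:events");

// The issue names messages.upsert explicitly; extend this list as needed.
const REQUIRED_EVENTS = ["messages.upsert"];

// Returns the names of required events that have no listener attached.
function findMissingHandlers(emitter, required = REQUIRED_EVENTS) {
  return required.filter((name) => emitter.listenerCount(name) === 0);
}

// Stand-in for a socket whose handlers were re-registered after reconnect.
const healthy = new EventEmitter();
healthy.on("messages.upsert", () => {});

// Stand-in for the zombie socket: connected, but no handlers attached.
const zombieSocket = new EventEmitter();

console.log(findMissingHandlers(healthy)); // []
console.log(findMissingHandlers(zombieSocket)); // [ 'messages.upsert' ]
```

A non-empty result after a reconnect would indicate the zombie state described in the issue.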

Notes

The provided fix assumes that the controller.forceClose() method is correctly wired and functional. If this method is not working as expected, additional debugging may be required.

Recommendation

Call controller.forceClose() from the onWatchdogTimeout handler. According to the issue, this should close the zombie socket, trigger the reconnect path, and restore inbound message processing.
