openclaw - ✅(Solved) Fix Mattermost channel health monitor fails to detect and recover from silent WebSocket disconnection [2 pull requests, 1 comment, 1 participant]

GitHub stats
openclaw/openclaw#50138 · Fetched 2026-04-08 00:58:46
Comments: 1 · Participants: 1 · Timeline: 3 (cross-referenced ×2, referenced ×1) · Reactions: 0

Error Message

  1. No error or warning is logged after the disconnect

Root Cause

Looking at the source code in gateway-cli-Ol-vpIk7.js, the evaluateChannelHealth() function has a gap in its detection logic:

// Line ~1596: stale-socket detection is gated on connected === true
if (policy.channelId !== "telegram" && snapshot.mode !== "webhook" 
    && snapshot.connected === true && snapshot.lastEventAt != null) {
    if (now - snapshot.lastEventAt > staleEventThresholdMs) 
        return { healthy: false, reason: "stale-socket" };
}

If the Mattermost WebSocket disconnects and the internal snapshot.connected state becomes false, the stale-socket detection path is skipped entirely. The monitor then falls through to the disconnected branch, which only triggers a restart if reconnectAttempts >= 10. If the plugin's internal reconnection logic stalls (e.g., exponential backoff grows too large, or the reconnect count never reaches 10), the health monitor considers the channel "still trying" and never intervenes.

Additionally, there appears to be a scenario where the channel enters a zombie state — the process is alive, the channel object exists, but no events are processed and no logs are written. This suggests the event loop or message processing pipeline may be blocked/deadlocked, which the health monitor cannot detect because it only checks metadata snapshots, not actual liveness.
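
To make the gap concrete, here is a minimal sketch of the decision flow described above. It is a reconstruction from the prose, not the actual OpenClaw source: the type names, the evaluate function, the "disconnected" reason string, and the constant names are assumptions for illustration. Only the stale-socket condition, the reconnectAttempts >= 10 threshold, and the 30-minute default come from the issue.

```typescript
// Hypothetical reconstruction of the health decision; not the real OpenClaw code.
interface Snapshot {
  connected: boolean;
  mode: "websocket" | "webhook";
  lastEventAt: number | null; // epoch ms of the last event, if any
  reconnectAttempts: number;
}

interface Verdict {
  healthy: boolean;
  reason?: string;
}

const STALE_EVENT_THRESHOLD_MS = 30 * 60 * 1000; // 30 min default quoted in the issue
const MAX_RECONNECT_ATTEMPTS = 10; // threshold quoted in the issue

function evaluate(channelId: string, snapshot: Snapshot, now: number): Verdict {
  // Stale-socket path: only reachable while connected === true.
  if (
    channelId !== "telegram" &&
    snapshot.mode !== "webhook" &&
    snapshot.connected === true &&
    snapshot.lastEventAt != null
  ) {
    if (now - snapshot.lastEventAt > STALE_EVENT_THRESHOLD_MS) {
      return { healthy: false, reason: "stale-socket" };
    }
  }
  // Disconnected path: restart only after enough reconnect attempts.
  // The "disconnected" reason string is an assumption.
  if (!snapshot.connected && snapshot.reconnectAttempts >= MAX_RECONNECT_ATTEMPTS) {
    return { healthy: false, reason: "disconnected" };
  }
  return { healthy: true };
}

// A channel that dropped 12 hours ago and stalled after 3 reconnect attempts.
const now = Date.now();
const stalled: Snapshot = {
  connected: false,
  mode: "websocket",
  lastEventAt: now - 12 * 60 * 60 * 1000,
  reconnectAttempts: 3,
};
console.log(evaluate("mattermost", stalled, now)); // neither branch fires
```

Under these assumptions, a channel that has been silent for 12 hours is still reported as healthy, because it is neither "connected and stale" nor "disconnected with 10 or more attempts".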

Fix Action

Workaround

Reduce the health check interval to catch issues faster:

openclaw config set gateway.channelHealthCheckMinutes 1
openclaw gateway restart

This does not fix the root cause but reduces the window of undetected failure.

PR fix notes

PR #50143: Fix Mattermost stale-zombie sockets by tracking lastEventAt + health policy fallback

Description (problem / solution / changelog)

Summary

This PR improves Mattermost channel liveness detection and recovery in the gateway health monitor.

Changes

  1. Mattermost WebSocket monitor now updates lastEventAt on every incoming WS message.

    • File: extensions/mattermost/src/mattermost/monitor-websocket.ts
    • This enables stale-socket detection to work for Mattermost channels.
  2. Health policy now detects zombie state when connected === true but lastEventAt is missing for too long.

    • File: src/gateway/channel-health-policy.ts
    • Added fallback: if a non-webhook/non-telegram channel has been connected longer than stale threshold and never produced events, mark as stale-socket.
  3. Added tests for the new zombie-state behavior.

    • File: src/gateway/channel-health-policy.test.ts
    • Covers:
      • healthy within threshold
      • stale zombie after threshold
      • telegram exclusion
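
The fallback in item 2 can be sketched roughly as follows. This is not the code from the PR: the ChannelSnapshot shape, the connectedAt field, and the isStaleSocket helper are assumptions chosen to illustrate the rule "connected longer than the stale threshold without ever producing an event".

```typescript
// Hypothetical sketch of the zombie-state fallback; field names such as
// connectedAt are assumed, not taken from channel-health-policy.ts.
interface ChannelSnapshot {
  connected: boolean;
  mode: "websocket" | "webhook";
  connectedAt: number | null; // epoch ms when the socket connected
  lastEventAt: number | null; // epoch ms of the last inbound event
}

function isStaleSocket(
  channelId: string,
  s: ChannelSnapshot,
  now: number,
  thresholdMs: number,
): boolean {
  // Telegram and webhook channels are excluded, as in the existing check.
  if (channelId === "telegram" || s.mode === "webhook" || !s.connected) {
    return false;
  }
  // Normal case: events were seen, so compare against the last one.
  if (s.lastEventAt != null) {
    return now - s.lastEventAt > thresholdMs;
  }
  // Fallback: connected, but no event has ever been recorded.
  return s.connectedAt != null && now - s.connectedAt > thresholdMs;
}
```

The three test scenarios listed above map onto this helper directly: a recent connectedAt is healthy, an old connectedAt with no events is stale, and the telegram channel ID is never flagged.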

Why

In production, Mattermost channels can become silently unresponsive for many hours while the gateway process remains alive. The previous logic depended on lastEventAt, but the Mattermost runtime did not update it, so the stale-socket restart path could be skipped.

This PR ensures Mattermost contributes event timestamps and adds a defensive health-policy fallback for channels that report connected but never emit events.
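
As an illustration of "contributing event timestamps", the sketch below stamps a status object every time an inbound message arrives. The ChannelStatus and MessageSource interfaces and the trackLastEvent helper are hypothetical; the real monitor-websocket.ts may wire this differently.

```typescript
// Hypothetical status object and message source; the real monitor's API may differ.
interface ChannelStatus {
  connected: boolean;
  lastEventAt: number | null;
}

interface MessageSource {
  onMessage(handler: (data: string) => void): void;
}

// Records the arrival time of every inbound frame on the status object.
// The clock parameter exists only so the behavior can be tested deterministically.
function trackLastEvent(
  source: MessageSource,
  status: ChannelStatus,
  clock: () => number = Date.now,
): void {
  source.onMessage(() => {
    // Any inbound frame counts as proof of life, whether or not it is a post.
    status.lastEventAt = clock();
  });
}
```

With a timestamp recorded on every frame, the existing stale-socket comparison against staleEventThresholdMs has a value to work with for Mattermost channels.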

Notes

  • Local test execution in this environment is currently blocked by a missing optional native dependency (@rolldown/binding-darwin-arm64) in the toolchain bootstrap. Tests were updated and should run in CI/tooling-complete environments.
  • Linked issue: #50138

Changed files

  • extensions/mattermost/src/mattermost/monitor-websocket.ts (modified, +11/-0)
  • src/gateway/channel-health-policy.test.ts (modified, +80/-9)
  • src/gateway/channel-health-policy.ts (modified, +15/-0)

PR #57621: fix(mattermost): add WebSocket ping/pong keepalive to detect silent connection drops

Description (problem / solution / changelog)

Summary

  • Problem: When TCP silently dies (NAT expiry, proxy timeout, network switch), no WebSocket close/error event fires. The bot appears connected but stops receiving messages — potentially for 12+ hours (#50138) before detection.
  • Why it matters: Multiple users on macOS arm64 + Node.js v25.8.x report silent WebSocket death with no automatic recovery (#51104, #44160, #41837, #50138).
  • What changed: Added standard WebSocket ping/pong keepalive (30s interval) alongside the existing getBotUpdateAt health check. On pong timeout, the connection is terminated so runWithReconnect kicks in.
  • What did NOT change (scope boundary): Existing health check logic (getBotUpdateAt, healthCheckIntervalMs) is untouched. No changes to reconnect logic, message handling, or auth flow.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Related #51104
  • Related #44160
  • Related #41837
  • Related #50138
  • Supersedes #46345 (closed by author for PR count management, not for code quality — Greptile rated 4/5 "safe to merge")
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

  • Root cause: WebSocket connections behind NAT/proxy can enter a half-open state where the TCP connection is dead but no FIN/RST is received. Without application-level keepalive, the ws library never fires close/error events.
  • Missing detection / guardrail: No ping/pong keepalive was implemented. The existing getBotUpdateAt health check only detects bot account modifications, not dead TCP connections.
  • Prior context: PR #46345 by @Br1an67 implemented the same fix but was closed for PR count management. Greptile review noted a minor robustness issue with clearTimeout ordering.
  • Why this regressed now: Not a regression — this has always been the behavior. It surfaces in environments with NAT/proxy layers (reverse proxies, corporate firewalls).
  • If unknown, what was ruled out: N/A

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: monitor-websocket.test.ts
  • Scenario the test should lock in: When ping is sent and no pong received within the next interval, ws.terminate() should be called.
  • Why this is the smallest reliable guardrail: Unit test on the FakeWebSocket can verify timer behavior without network dependencies.
  • Existing test that already covers this: None.
  • If no new test is added, why not: The existing FakeWebSocket class needs ping() and pong event support. Happy to add tests with maintainer guidance on the preferred mock pattern.

User-visible / Behavior Changes

  • Bots behind NAT/proxy now automatically reconnect within ~60s of a silent TCP drop (previously could hang indefinitely).
  • New optional config: pingIntervalMs (default 30000ms, set to 0 to disable).
  • No change to existing behavior when connections are healthy.
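
A minimal sketch of how the pingIntervalMs option could be resolved, assuming the semantics stated above (default 30000ms, 0 disables keepalive). The WebSocketOptions type and the resolvePingIntervalMs helper are hypothetical, and falling back to the default for negative or non-finite values is an assumption the PR does not confirm.

```typescript
const DEFAULT_PING_INTERVAL_MS = 30000; // default stated in the PR

// Hypothetical options shape; only pingIntervalMs is named in the PR.
interface WebSocketOptions {
  pingIntervalMs?: number;
}

// Returns the interval in ms, or null when keepalive is disabled.
function resolvePingIntervalMs(opts: WebSocketOptions = {}): number | null {
  const value = opts.pingIntervalMs ?? DEFAULT_PING_INTERVAL_MS;
  // Assumption: invalid values fall back to the default instead of throwing.
  if (!Number.isFinite(value) || value < 0) {
    return DEFAULT_PING_INTERVAL_MS;
  }
  return value === 0 ? null : value;
}
```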

Diagram (if applicable)

Before:
[TCP dies silently] -> [no close/error event] -> [bot hangs forever] -> [manual restart needed]

After:
[TCP dies silently] -> [ping sent, no pong] -> [next ping: awaitingPong=true] -> [ws.terminate()]
  -> [close event fires] -> [runWithReconnect] -> [fresh connection] -> [bot recovers]
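
The "After" flow can be sketched as a small keepalive loop. This is an illustration of the ping/pong pattern under stated assumptions, not the PR's actual code: PingableSocket, Keepalive, and startKeepalive are hypothetical names, and the socket is only assumed to expose ping(), terminate(), and "pong"/"close" events in the style of the ws library. The log text is modeled on the message quoted in the PR.

```typescript
// Minimal surface of the socket that the keepalive needs (assumed shape).
interface PingableSocket {
  ping(): void;
  terminate(): void;
  on(event: "pong" | "close", listener: () => void): void;
}

interface Keepalive {
  tick(): void; // runs one keepalive cycle; normally driven by setInterval
  stop(): void;
}

function startKeepalive(
  ws: PingableSocket,
  intervalMs: number,
  log: (msg: string) => void = console.warn,
): Keepalive {
  let awaitingPong = false;
  let timer: ReturnType<typeof setInterval> | undefined;

  const stop = (): void => {
    if (timer !== undefined) clearInterval(timer);
    timer = undefined;
    awaitingPong = false;
  };

  const tick = (): void => {
    if (awaitingPong) {
      // The previous ping was never answered: treat the link as dead.
      log("mattermost websocket pong timeout, terminating connection for reconnect");
      stop();
      ws.terminate(); // forces a close event so the reconnect wrapper can run
      return;
    }
    awaitingPong = true;
    ws.ping();
  };

  ws.on("pong", () => {
    awaitingPong = false;
  });
  ws.on("close", stop);

  // An interval of 0 disables the timer; tick() can still be called manually.
  if (intervalMs > 0) timer = setInterval(tick, intervalMs);
  return { tick, stop };
}
```

Because a missed pong is only noticed on the following tick, a drop that happens right after a successful pong takes up to two intervals to detect, which matches the ~60s figure at the default 30s interval.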

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No — WebSocket ping/pong frames are protocol-level, not application-level HTTP calls.
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS 15.4 (Apple Silicon, Mac mini M4)
  • Runtime/container: Node.js 25.8.1
  • Model/provider: openai/gpt-5.4 and vllm/openai/gpt-oss-120b
  • Integration/channel: Mattermost (self-hosted behind Nginx reverse proxy)
  • Relevant config (redacted):
    {
      "channels": {
        "mattermost": {
          "websocket": {
            "pingIntervalMs": 10000,
            "reconnectIntervalMs": 3000
          }
        }
      }
    }

Steps

  1. Deploy OpenClaw with two Mattermost bot accounts behind a reverse proxy.
  2. Simulate silent TCP drop (e.g., firewall rule, proxy timeout, network switch).
  3. Without patch: bot hangs indefinitely, no recovery.
  4. With patch: bot detects pong timeout within ~60s and reconnects.

Expected

Bot automatically reconnects within about two ping intervals of a silent TCP drop (~60s at the default 30000ms interval).

Actual

Verified on production Mattermost instance with two bot accounts. Both accounts maintain keepalive and would auto-recover on connection loss.

Evidence

  • Trace/log snippets

Gateway log on successful keepalive (silent — no log output when healthy):

[mattermost] connected as @jason_wang_mbot
[mattermost] connected as @gpt-oss-120b

On pong timeout (would appear):

mattermost websocket pong timeout — terminating connection for reconnect
  • Prior art: PR #46345 validated the same approach (Greptile 4/5 confidence).

Human Verification (required)

  • Verified scenarios: Deployed to production with two Mattermost bot accounts (codex + gpt-oss-120b). Both accounts connected successfully with keepalive active. Gateway running stable.
  • Edge cases checked: Verified clearTimers() cleans up both ping timer and health check timer. Verified awaitingPong is reset on close. Verified pingIntervalMs: 0 disables keepalive.
  • What you did not verify: Actual silent TCP drop simulation (impractical in production). Relied on code review and prior art (#46345) for correctness.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No — new optional pingIntervalMs defaults to 30000ms.
  • Migration needed? No

Risks and Mitigations

  • Risk: ws.ping() not available on all WebSocket implementations.
    • Mitigation: The ws library (used by OpenClaw) natively supports ping()/pong. The MattermostWebSocketLike interface is extended to require it, so any custom implementation must provide it.
  • Risk: Ping/pong traffic adds overhead.
    • Mitigation: Ping frames are 2 bytes + framing. At 30s intervals, this is ~4 bytes/minute — negligible.

Changed files

  • extensions/mattermost/src/mattermost/monitor-websocket.test.ts (modified, +14/-1)
  • extensions/mattermost/src/mattermost/monitor-websocket.ts (modified, +32/-0)


Bug Description

The Mattermost WebSocket channel can silently disconnect and remain unresponsive for 12+ hours while the gateway process itself stays alive. The channel health monitor fails to detect and recover from this state.

Environment

  • OpenClaw version: 2026.3.13
  • OS: macOS 15 (arm64)
  • Channel: Mattermost (WebSocket mode)
  • Node.js: v25.8.1

Steps to Reproduce

  1. Run OpenClaw gateway with Mattermost channel connected via WebSocket
  2. Wait for a network interruption or WebSocket disconnect event
  3. Observe that the gateway process remains alive (confirmed via macOS system log processID tracking), but the Mattermost channel stops processing messages
  4. No error or warning is logged after the disconnect

Expected Behavior

The channel health monitor should detect the silent disconnection within a few minutes and automatically restart the Mattermost channel.

Actual Behavior

  • Gateway process (PID 85440) stayed alive from 21:04 to 09:27 next day (12+ hours)
  • macOS system log confirmed the process was running the entire time (network path checks every minute showed satisfied)
  • Last OpenClaw application-level log entry was at 21:05:10 (delivered reply to user:...)
  • Zero log entries from 21:05 to 09:27 — no errors, no warnings, no health monitor output
  • The health monitor (configured at default 300s interval) did not detect or restart the channel

Suggested Fixes

  1. Add active liveness probing: The health monitor should periodically send a ping/probe through the actual WebSocket connection, not just check metadata snapshots
  2. Treat prolonged silence as unhealthy: If a channel has connected === true but lastEventAt has not updated in staleEventThresholdMs (default 30 min), it should be considered unhealthy regardless — the current code does this, but the zombie state where connected is neither clearly true nor false may bypass all checks
  3. Log health monitor evaluations: Even when healthy, periodically log the health status so that silent failures are diagnosable (currently zero log output for 12 hours makes debugging very difficult)
  4. Add a watchdog for application-level log silence: If the gateway writes no application log entries for N minutes while the process is alive, something is wrong

extent analysis

Fix Plan

To address the silent disconnection issue in the Mattermost WebSocket channel, we will implement the following fixes:

  • Add active liveness probing to the health monitor
  • Treat prolonged silence as unhealthy
  • Log health monitor evaluations
  • Add a watchdog for application-level log silence

Step-by-Step Solution

  1. Modify the evaluateChannelHealth() function to include active liveness probing:
// Probe the actual WebSocket connection instead of relying only on snapshot metadata.
// sendPingProbe() is a hypothetical helper, not an existing OpenClaw function.
if (policy.channelId !== "telegram" && snapshot.mode !== "webhook"
    && snapshot.connected === true && snapshot.lastEventAt != null) {
    const pingResult = sendPingProbe(snapshot);
    if (!pingResult) {
        return { healthy: false, reason: "unresponsive-connection" };
    }
    // ...
}
  2. Treat prolonged silence as unhealthy:
// If a channel has `connected === true` but `lastEventAt` has not updated in `staleEventThresholdMs`
if (snapshot.connected === true && snapshot.lastEventAt != null
    && now - snapshot.lastEventAt > staleEventThresholdMs) {
    return { healthy: false, reason: "stale-socket" };
}
  3. Log health monitor evaluations:
// Log the health status on every evaluation.
// `result` is assumed to hold the value returned by evaluateChannelHealth().
console.log(`Health monitor evaluation: ${new Date().toISOString()} - ${policy.channelId} is ${result.healthy ? 'healthy' : 'unhealthy'}`);
  4. Add a watchdog for application-level log silence:
// Check periodically whether the application has stopped logging.
const logSilenceTimeout = 10 * 60 * 1000; // 10 minutes
let lastLogTime = Date.now(); // must be refreshed every time a log entry is written
setInterval(() => {
    if (Date.now() - lastLogTime > logSilenceTimeout) {
        console.error('Application-level log silence detected');
        // Take action to restart the channel or alert the administrator
    }
}, 60 * 1000); // check once per minute so silence is noticed soon after the timeout

Verification

To verify that the fix worked, you can:

  • Run the OpenClaw gateway with the modified code
  • Simulate a network interruption or WebSocket disconnect event
  • Check the logs to see if the health monitor detects the issue and restarts the channel
  • Verify that the channel is processing messages again after the restart

Extra Tips

  • Make sure to test the modified code thoroughly to ensure that it works as expected
  • Consider adding additional logging and monitoring to detect similar issues in the future
  • Review the OpenClaw documentation and source code to ensure that the fixes are compatible with the latest version of the software.
