openclaw - ✅(Solved) Fix Mattermost channel health monitor fails to detect and recover from silent WebSocket disconnection [2 pull requests, 1 comments, 1 participants]

ThomasChan · 2026-03-19T02:16:27Z



openclaw/openclaw#50138•Fetched 2026-04-08 00:58:46


Error Message

No error or warning is logged after the disconnect

Root Cause

Looking at the source code in gateway-cli-Ol-vpIk7.js, the evaluateChannelHealth() function has a gap in its detection logic:

// Line ~1596: stale-socket detection is gated on connected === true
if (policy.channelId !== "telegram" && snapshot.mode !== "webhook" 
    && snapshot.connected === true && snapshot.lastEventAt != null) {
    if (now - snapshot.lastEventAt > staleEventThresholdMs) 
        return { healthy: false, reason: "stale-socket" };
}

If the Mattermost WebSocket disconnects and the internal snapshot.connected state becomes false, the stale-socket detection path is skipped entirely. The monitor then falls through to the disconnected branch, which only triggers a restart if reconnectAttempts >= 10. If the plugin's internal reconnection logic stalls (e.g., exponential backoff grows too large, or the reconnect count never reaches 10), the health monitor considers the channel "still trying" and never intervenes.

Additionally, there appears to be a scenario where the channel enters a zombie state — the process is alive, the channel object exists, but no events are processed and no logs are written. This suggests the event loop or message processing pipeline may be blocked/deadlocked, which the health monitor cannot detect because it only checks metadata snapshots, not actual liveness.
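To make the gap concrete, the sketch below wraps the quoted condition in a standalone function and feeds it three snapshots. The `Policy` and `Snapshot` shapes and the function name `isStaleSocket` are assumptions made for illustration; only the condition itself comes from the snippet above, and the 30-minute threshold is the default mentioned in the issue.

```typescript
// Hypothetical shapes: only the fields used by the quoted condition are modeled.
interface Policy {
  channelId: string;
}
interface Snapshot {
  mode?: string;
  connected?: boolean;
  lastEventAt?: number | null;
}

// Mirrors the quoted stale-socket condition from evaluateChannelHealth().
function isStaleSocket(
  policy: Policy,
  snapshot: Snapshot,
  now: number,
  staleEventThresholdMs: number,
): boolean {
  if (
    policy.channelId !== "telegram" &&
    snapshot.mode !== "webhook" &&
    snapshot.connected === true &&
    snapshot.lastEventAt != null
  ) {
    return now - snapshot.lastEventAt > staleEventThresholdMs;
  }
  return false; // the check is skipped entirely
}

const thresholdMs = 30 * 60 * 1000; // 30 minutes
const now = 12 * 60 * 60 * 1000; // 12 hours after an arbitrary start at t = 0
const mattermost: Policy = { channelId: "mattermost" };

// Events stopped long ago and were recorded: the socket is flagged as stale.
const detected = isStaleSocket(mattermost, { connected: true, lastEventAt: 0 }, now, thresholdMs);
// lastEventAt was never set by the channel: the check is skipped.
const neverSet = isStaleSocket(mattermost, { connected: true, lastEventAt: null }, now, thresholdMs);
// connected flipped to false: the check is skipped as well.
const disconnected = isStaleSocket(mattermost, { connected: false, lastEventAt: 0 }, now, thresholdMs);
```

Only the first snapshot is reported as stale. The other two pass through this check untouched, so recovery depends entirely on whatever branch runs next.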

Fix Action

Workaround

Reduce the health check interval to catch issues faster:

openclaw config set gateway.channelHealthCheckMinutes 1
openclaw gateway restart

This does not fix the root cause but reduces the window of undetected failure.

PR fix notes

PR #50143: Fix Mattermost stale-zombie sockets by tracking lastEventAt + health policy fallback

Repository: openclaw/openclaw
Author: ThomasChan
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/50143

Description (problem / solution / changelog)

Summary

This PR improves Mattermost channel liveness detection and recovery in the gateway health monitor.

Changes

1. Mattermost WebSocket monitor now updates lastEventAt on every incoming WS message.
   - File: extensions/mattermost/src/mattermost/monitor-websocket.ts
   - This enables stale-socket detection to work for Mattermost channels.
2. Health policy now detects zombie state when connected === true but lastEventAt is missing for too long.
   - File: src/gateway/channel-health-policy.ts
   - Added fallback: if a non-webhook/non-telegram channel has been connected longer than stale threshold and never produced events, mark as stale-socket.
3. Added tests for the new zombie-state behavior.
   - File: src/gateway/channel-health-policy.test.ts
   - Covers:
     - healthy within threshold
     - stale zombie after threshold
     - telegram exclusion
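
The fallback in item 2 could look roughly like the sketch below. It is not the PR's code: the PR text does not say how connection age is tracked, so the `connectedAt` field, the `ChannelSnapshot` shape, and the function name are assumptions. The sketch covers only the staleness decision and leaves the disconnected branch out.

```typescript
// Hypothetical snapshot shape; `connectedAt` is an assumed field, not confirmed by the PR text.
interface ChannelSnapshot {
  mode?: string;
  connected?: boolean;
  connectedAt?: number | null;
  lastEventAt?: number | null;
}

type HealthResult = { healthy: true } | { healthy: false; reason: "stale-socket" };

function evaluateSocketStaleness(
  channelId: string,
  snapshot: ChannelSnapshot,
  now: number,
  staleEventThresholdMs: number,
): HealthResult {
  // Telegram and webhook-mode channels are excluded, as in the existing policy.
  // Channels that are not connected are left to other checks (out of scope here).
  if (channelId === "telegram" || snapshot.mode === "webhook" || snapshot.connected !== true) {
    return { healthy: true };
  }
  // Prefer the last event time; if no event was ever seen, fall back to the connect time.
  const reference = snapshot.lastEventAt ?? snapshot.connectedAt;
  if (reference != null && now - reference > staleEventThresholdMs) {
    return { healthy: false, reason: "stale-socket" };
  }
  return { healthy: true };
}
```

With this shape, a channel that connected and then never produced an event is flagged once the threshold has elapsed since the connect time, which corresponds to the three test cases listed in item 3.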

Why

In production, Mattermost channels can become silently unresponsive for many hours while the gateway process remains alive. The previous logic depended on lastEventAt, but the Mattermost runtime did not update it, so the stale-socket restart path could be skipped.

This PR ensures Mattermost contributes event timestamps and adds a defensive health-policy fallback for channels that report connected but never emit events.

Notes

- Local test execution in this environment is currently blocked by a missing optional native dependency (@rolldown/binding-darwin-arm64) in toolchain bootstrap. Tests were updated and should run in CI/tooling-complete environments.
- Linked issue: #50138

Changed files

extensions/mattermost/src/mattermost/monitor-websocket.ts (modified, +11/-0)
src/gateway/channel-health-policy.test.ts (modified, +80/-9)
src/gateway/channel-health-policy.ts (modified, +15/-0)

PR #57621: fix(mattermost): add WebSocket ping/pong keepalive to detect silent connection drops

Repository: openclaw/openclaw
Author: JasonWang1124
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/57621

Description (problem / solution / changelog)

Summary

Problem: When TCP silently dies (NAT expiry, proxy timeout, network switch), no WebSocket close/error event fires. The bot appears connected but stops receiving messages — potentially for 12+ hours (#50138) before detection.
Why it matters: Multiple users on macOS arm64 + Node.js v25.8.x report silent WebSocket death with no automatic recovery (#51104, #44160, #41837, #50138).
What changed: Added standard WebSocket ping/pong keepalive (30s interval) alongside the existing getBotUpdateAt health check. On pong timeout, the connection is terminated so runWithReconnect kicks in.
What did NOT change (scope boundary): Existing health check logic (getBotUpdateAt, healthCheckIntervalMs) is untouched. No changes to reconnect logic, message handling, or auth flow.

Change Type (select all)

Bug fix

Scope (select all touched areas)

Integrations

Linked Issue/PR

Related #51104
Related #44160
Related #41837
Related #50138
Supersedes #46345 (closed by author for PR count management, not for code quality — Greptile rated 4/5 "safe to merge")
This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

Root cause: WebSocket connections behind NAT/proxy can enter a half-open state where the TCP connection is dead but no FIN/RST is received. Without application-level keepalive, the ws library never fires close/error events.
Missing detection / guardrail: No ping/pong keepalive was implemented. The existing getBotUpdateAt health check only detects bot account modifications, not dead TCP connections.
Prior context: PR #46345 by @Br1an67 implemented the same fix but was closed for PR count management. Greptile review noted a minor robustness issue with clearTimeout ordering.
Why this regressed now: Not a regression — this has always been the behavior. It surfaces in environments with NAT/proxy layers (reverse proxies, corporate firewalls).
If unknown, what was ruled out: N/A

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- [x] Unit test
- [ ] Seam / integration test
- [ ] End-to-end test
- [ ] Existing coverage already sufficient
Target test or file: monitor-websocket.test.ts
Scenario the test should lock in: When ping is sent and no pong received within the next interval, ws.terminate() should be called.
Why this is the smallest reliable guardrail: Unit test on the FakeWebSocket can verify timer behavior without network dependencies.
Existing test that already covers this: None.
If no new test is added, why not: The existing FakeWebSocket class needs ping() and pong event support. Happy to add tests with maintainer guidance on the preferred mock pattern.

User-visible / Behavior Changes

Bots behind NAT/proxy now automatically reconnect within ~60s of a silent TCP drop (previously could hang indefinitely).
New optional config: pingIntervalMs (default 30000ms, set to 0 to disable).
No change to existing behavior when connections are healthy.

Diagram (if applicable)

Before:
[TCP dies silently] -> [no close/error event] -> [bot hangs forever] -> [manual restart needed]

After:
[TCP dies silently] -> [ping sent, no pong] -> [next ping: awaitingPong=true] -> [ws.terminate()]
  -> [close event fires] -> [runWithReconnect] -> [fresh connection] -> [bot recovers]
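
The "After" flow can be sketched as a small state machine. This is an illustration of the technique under stated assumptions, not the PR's implementation: `PingableSocket` and `PingKeepalive` are hypothetical names, and the class only relies on a socket exposing `ping()` and `terminate()`, which the `ws` library's WebSocket does.

```typescript
// Minimal socket surface the keepalive needs (assumed interface for this sketch).
interface PingableSocket {
  ping(): void;
  terminate(): void;
}

class PingKeepalive {
  private socket: PingableSocket;
  private pingIntervalMs: number;
  private awaitingPong = false;
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(socket: PingableSocket, pingIntervalMs: number = 30_000) {
    this.socket = socket;
    this.pingIntervalMs = pingIntervalMs;
  }

  // Start the periodic check; an interval of 0 disables keepalive.
  start(): void {
    if (this.pingIntervalMs <= 0 || this.timer !== undefined) return;
    this.timer = setInterval(() => this.tick(), this.pingIntervalMs);
  }

  // Call from the socket's "pong" handler.
  onPong(): void {
    this.awaitingPong = false;
  }

  // Call from the socket's "close" handler; clears the timer and resets state.
  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
    this.awaitingPong = false;
  }

  // One keepalive step: terminate if the previous ping went unanswered,
  // otherwise send the next ping and wait for its pong.
  tick(): void {
    if (this.awaitingPong) {
      this.stop();
      this.socket.terminate();
      return;
    }
    this.awaitingPong = true;
    this.socket.ping();
  }
}
```

With the `ws` library this would be wired roughly as `ws.on("pong", () => keepalive.onPong())` and `ws.on("close", () => keepalive.stop())`. A drop is noticed at the tick after the first unanswered ping, which is where the roughly 60s figure at a 30s interval comes from.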

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No — WebSocket ping/pong frames are protocol-level, not application-level HTTP calls.
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: macOS 15.4 (Apple Silicon, Mac mini M4)
Runtime/container: Node.js 25.8.1
Model/provider: openai/gpt-5.4 and vllm/openai/gpt-oss-120b
Integration/channel: Mattermost (self-hosted behind Nginx reverse proxy)

Relevant config (redacted):

{
  "channels": {
    "mattermost": {
      "websocket": {
        "pingIntervalMs": 10000,
        "reconnectIntervalMs": 3000
      }
    }
  }
}

Steps

Deploy OpenClaw with two Mattermost bot accounts behind a reverse proxy.
Simulate silent TCP drop (e.g., firewall rule, proxy timeout, network switch).
Without patch: bot hangs indefinitely, no recovery.
With patch: bot detects pong timeout within ~60s and reconnects.

Expected

Bot automatically reconnects within roughly two ping intervals of a silent TCP drop (about 60s at the default 30s interval).

Actual

Verified on production Mattermost instance with two bot accounts. Both accounts maintain keepalive and would auto-recover on connection loss.

Evidence

Trace/log snippets

Gateway log on successful keepalive (silent — no log output when healthy):

[mattermost] connected as @jason_wang_mbot
[mattermost] connected as @gpt-oss-120b

On pong timeout (would appear):

mattermost websocket pong timeout — terminating connection for reconnect

Prior art: PR #46345 validated the same approach (Greptile 4/5 confidence).

Human Verification (required)

Verified scenarios: Deployed to production with two Mattermost bot accounts (codex + gpt-oss-120b). Both accounts connected successfully with keepalive active. Gateway running stable.
Edge cases checked: Verified clearTimers() cleans up both ping timer and health check timer. Verified awaitingPong is reset on close. Verified pingIntervalMs: 0 disables keepalive.
What you did not verify: Actual silent TCP drop simulation (impractical in production). Relied on code review and prior art (#46345) for correctness.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No — new optional pingIntervalMs defaults to 30000ms.
Migration needed? No

Risks and Mitigations

Risk: ws.ping() not available on all WebSocket implementations.
- Mitigation: The ws library (used by OpenClaw) natively supports ping()/pong. The MattermostWebSocketLike interface is extended to require it, so any custom implementation must provide it.
Risk: Ping/pong traffic adds overhead.
- Mitigation: Ping frames are 2 bytes + framing. At 30s intervals, this is ~4 bytes/minute — negligible.
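
As a rough check on the overhead estimate, the arithmetic below counts WebSocket frame bytes for one ping/pong exchange with an empty payload. It assumes RFC 6455 framing (a 2-byte base header, plus a 4-byte masking key on client-to-server frames) and ignores TCP/IP and TLS overhead, so it is a lower bound and not a measurement.

```typescript
// Frame sizes per RFC 6455 for control frames with an empty payload.
const BASE_HEADER_BYTES = 2; // FIN/opcode byte + mask bit/payload length byte
const MASK_KEY_BYTES = 4; // present only on client-to-server frames

const pingIntervalMs = 30_000;
const pingsPerMinute = 60_000 / pingIntervalMs;

const clientPingBytes = BASE_HEADER_BYTES + MASK_KEY_BYTES; // masked ping from the bot
const serverPongBytes = BASE_HEADER_BYTES; // unmasked pong from the server
const bytesPerMinute = pingsPerMinute * (clientPingBytes + serverPongBytes);
```

At the default interval this comes to 16 bytes per minute at the WebSocket layer, somewhat above the quoted figure but still negligible.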

Changed files

extensions/mattermost/src/mattermost/monitor-websocket.test.ts (modified, +14/-1)
extensions/mattermost/src/mattermost/monitor-websocket.ts (modified, +32/-0)


Bug Description

The Mattermost WebSocket channel can silently disconnect and remain unresponsive for 12+ hours while the gateway process itself stays alive. The channel health monitor fails to detect and recover from this state.

Environment

OpenClaw version: 2026.3.13
OS: macOS 15 (arm64)
Channel: Mattermost (WebSocket mode)
Node.js: v25.8.1

Steps to Reproduce

1. Run OpenClaw gateway with Mattermost channel connected via WebSocket
2. Wait for a network interruption or WebSocket disconnect event
3. Observe that the gateway process remains alive (confirmed via macOS system log processID tracking), but the Mattermost channel stops processing messages
4. No error or warning is logged after the disconnect

Expected Behavior

The channel health monitor should detect the silent disconnection within a few minutes and automatically restart the Mattermost channel.

Actual Behavior

Gateway process (PID 85440) stayed alive from 21:04 to 09:27 next day (12+ hours)
macOS system log confirmed the process was running the entire time (network path checks every minute showed satisfied)
Last OpenClaw application-level log entry was at 21:05:10 (delivered reply to user:...)
Zero log entries from 21:05 to 09:27 — no errors, no warnings, no health monitor output
The health monitor (configured at default 300s interval) did not detect or restart the channel

Root Cause Analysis

Looking at the source code in gateway-cli-Ol-vpIk7.js, the evaluateChannelHealth() function has a gap in its detection logic:

// Line ~1596: stale-socket detection is gated on connected === true
if (policy.channelId !== "telegram" && snapshot.mode !== "webhook" 
    && snapshot.connected === true && snapshot.lastEventAt != null) {
    if (now - snapshot.lastEventAt > staleEventThresholdMs) 
        return { healthy: false, reason: "stale-socket" };
}

Suggested Fixes

1. Add active liveness probing: The health monitor should periodically send a ping/probe through the actual WebSocket connection, not just check metadata snapshots
2. Treat prolonged silence as unhealthy: If a channel has connected === true but lastEventAt has not updated in staleEventThresholdMs (default 30 min), it should be considered unhealthy regardless. The current code does this, but a zombie state where connected is neither clearly true nor false may bypass all checks
3. Log health monitor evaluations: Even when healthy, periodically log the health status so that silent failures are diagnosable (currently zero log output for 12 hours makes debugging very difficult)
4. Add a watchdog for application-level log silence: If the gateway writes no application log entries for N minutes while the process is alive, something is wrong

Workaround

Reduce the health check interval to catch issues faster:

openclaw config set gateway.channelHealthCheckMinutes 1
openclaw gateway restart

This does not fix the root cause but reduces the window of undetected failure.

extent analysis

Fix Plan

To address the silent disconnection issue in the Mattermost WebSocket channel, we will implement the following fixes:

Add active liveness probing to the health monitor
Treat prolonged silence as unhealthy
Log health monitor evaluations
Add a watchdog for application-level log silence

Step-by-Step Solution

Modify the evaluateChannelHealth() function to include active liveness probing:

// Send a ping/probe through the actual WebSocket connection.
// sendPingProbe() is a hypothetical helper, not an existing OpenClaw function.
if (policy.channelId !== "telegram" && snapshot.mode !== "webhook"
    && snapshot.connected === true && snapshot.lastEventAt != null) {
    // Probe the live connection instead of trusting the metadata snapshot
    const pingResult = sendPingProbe(snapshot);
    if (!pingResult) {
        return { healthy: false, reason: "unresponsive-connection" };
    }
    // ...
}

Treat prolonged silence as unhealthy:

// If a channel has `connected === true` but `lastEventAt` has not updated in `staleEventThresholdMs`
if (now - snapshot.lastEventAt > staleEventThresholdMs) {
    return { healthy: false, reason: "stale-socket" };
}

Log health monitor evaluations:

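// Log the health status on every evaluation, including healthy results.
// `result` is the value returned by evaluateChannelHealth().
console.log(`Health monitor evaluation: ${new Date().toISOString()} - ${policy.channelId} is ${result.healthy ? 'healthy' : 'unhealthy'}`);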

Add a watchdog for application-level log silence:

// Track when the application last wrote a log entry
const logSilenceTimeoutMs = 10 * 60 * 1000; // 10 minutes
let lastLogAt = Date.now();

// Call this from the logger whenever an entry is written
function recordLogActivity() {
    lastLogAt = Date.now();
}

// Check once a minute so silence is noticed soon after the timeout elapses
setInterval(() => {
    if (Date.now() - lastLogAt > logSilenceTimeoutMs) {
        console.error('Application-level log silence detected');
        // Take action to restart the channel or alert the administrator
    }
}, 60 * 1000);

Verification

To verify that the fix worked, you can:

Run the OpenClaw gateway with the modified code
Simulate a network interruption or WebSocket disconnect event
Check the logs to see if the health monitor detects the issue and restarts the channel
Verify that the channel is processing messages again after the restart

Extra Tips

Make sure to test the modified code thoroughly to ensure that it works as expected
Consider adding additional logging and monitoring to detect similar issues in the future
Review the OpenClaw documentation and source code to ensure that the fixes are compatible with the latest version of the software.
