openclaw - ✅(Solved) Fix Matrix provider connection failure causes rapid gateway process crash loop [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#62376 (fetched 2026-04-08 03:05:16)
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): cross-referenced ×1

When the Matrix homeserver becomes unreachable (e.g., host server goes down), the gateway process enters a rapid crash loop — spawning a new process every ~2 seconds — rather than gracefully retrying or backing off.

Error Message

The Matrix SDK appears to throw an uncaught exception on connection failure that kills the entire Node.js process. The macOS LaunchAgent (or systemd) immediately restarts it, which tries Matrix again, crashes again, creating a tight loop. The Matrix provider also has an auto-restart attempt N/10 mechanism that does use backoff — this works correctly. The crash loop is something else: an unhandled exception that takes down the entire gateway.

Root Cause

  • 04:16–08:01 BST: Gateway process restarted with a new PID every ~2 seconds for 3.5+ hours
  • PIDs increment by ~23 each time (e.g., 77007, 77030, 77053, 77096...)
  • Each cycle: starts → Matrix connect attempt → process dies → LaunchAgent restarts
  • channelMaxRestartsPerHour had no effect because it's the process crashing, not the health monitor restarting the channel
  • Other channels (webchat) were repeatedly disconnected with code=1012 reason=service restart

Fix Action

Workaround

Setting gateway.channelMaxRestartsPerHour and gateway.channelStaleEventThresholdMinutes helps with the health-monitor-initiated restarts but does not prevent the crash loop.

PR fix notes

PR #62779: fix(matrix): contain sync outage failures

Description (problem / solution / changelog)

Summary

  • Problem: Matrix startup reported success before sync was actually ready, and detached Matrix monitor tasks could reject without an owner.
  • Why it matters: a homeserver outage could escalate from a channel-scoped failure into a process-wide unhandled rejection crash loop, bypassing gateway.channelMaxRestartsPerHour.
  • What changed: Matrix startup now waits for ready sync states, monitor status/fatal sync handling is owned inside the Matrix plugin, and detached monitor work is centrally contained and drained on shutdown.
  • What did NOT change (scope boundary): no core Matrix special-casing in gateway orchestration, no new config surface, no global unhandled-rejection policy change.
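
The "wait for ready sync states" idea can be sketched as follows. This is a minimal illustration, not the PR's code: `SyncSource` is a hypothetical stand-in for the SDK client's event source, and treating PREPARED and SYNCING as ready and STOPPED as fatal during startup is an assumption.

```typescript
// Hypothetical stand-in for the SDK client's sync event source
// (not the real matrix-js-sdk API).
type SyncListener = (state: string) => void;

class SyncSource {
  private readonly listeners = new Set<SyncListener>();

  on(listener: SyncListener): void {
    this.listeners.add(listener);
  }

  off(listener: SyncListener): void {
    this.listeners.delete(listener);
  }

  emit(state: string): void {
    // Copy first so listeners can unsubscribe while being notified.
    Array.from(this.listeners).forEach((listener) => listener(state));
  }

  get listenerCount(): number {
    return this.listeners.size;
  }
}

// Assumption: which state names count as "ready" and "fatal during startup".
const READY_STATES = new Set(["PREPARED", "SYNCING"]);
const FATAL_STARTUP_STATES = new Set(["STOPPED"]);

// Resolves only once a ready state is observed; rejects on a fatal state or timeout.
function waitForReadySync(source: SyncSource, timeoutMs: number): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const cleanup = (): void => {
      clearTimeout(timer);
      source.off(onState);
    };

    const onState: SyncListener = (state) => {
      if (READY_STATES.has(state)) {
        cleanup();
        resolve(state);
      } else if (FATAL_STARTUP_STATES.has(state)) {
        cleanup();
        reject(new Error(`sync stopped before ready: ${state}`));
      }
      // Any other state (for example ERROR during a reconnect) keeps waiting.
    };

    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`sync not ready after ${timeoutMs}ms`));
    }, timeoutMs);

    source.on(onState);
  });
}
```

Because the promise rejects instead of leaving a background failure unowned, the caller that awaits startup becomes the single place where a startup failure surfaces.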

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #62376
  • Related #62168
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: the Matrix plugin treated matrix-js-sdk startup as ready too early, then left background room-message/verification tasks detached from channel lifecycle ownership.
  • Missing detection / guardrail: Matrix sync fatality and detached task rejection never fed back into the Matrix channel task, so global unhandled rejection policy killed the whole gateway before channel restart budgeting could apply.
  • Contributing context (if known): matrix-js-sdk emits long-lived sync state transitions after startClient() returns, and replayed/in-flight events during outage windows made orphaned async failures much easier to hit.
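
One way to give detached background work an owner is a small task runner that attaches a rejection handler to every task and can be drained on shutdown. The sketch below is illustrative only; `DetachedTaskRunner` and its method names are hypothetical and are not taken from the PR's task-runner.ts.

```typescript
type TaskErrorSink = (label: string, error: unknown) => void;

class DetachedTaskRunner {
  private readonly pending = new Set<Promise<void>>();
  private readonly onError: TaskErrorSink;

  constructor(onError: TaskErrorSink) {
    this.onError = onError;
  }

  // Start a task without awaiting it, but never leave its rejection unhandled.
  run(label: string, task: () => Promise<void>): void {
    const tracked: Promise<void> = Promise.resolve()
      .then(task)
      .catch((error: unknown) => {
        // The sink must not throw, otherwise the failure escapes again.
        this.onError(label, error);
      })
      .then(() => {
        this.pending.delete(tracked);
      });
    this.pending.add(tracked);
  }

  // Wait for the tasks that are in flight when drain() is called.
  async drain(): Promise<void> {
    await Promise.all(Array.from(this.pending));
  }

  get size(): number {
    return this.pending.size;
  }
}
```

A caller could pass a sink that reports the failure to the channel task, so that a failing room-message handler ends up in channel restart handling rather than in the process-wide unhandled-rejection path.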

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/matrix/src/matrix/sdk.test.ts, extensions/matrix/src/matrix/monitor/index.test.ts, extensions/matrix/src/matrix/monitor/sync-lifecycle.test.ts
  • Scenario the test should lock in: startup does not resolve before ready sync, startup times out/fails on sync fatal, detached monitor task failures do not escape as unhandled rejections, and fatal sync errors reject the Matrix channel task.
  • Why this is the smallest reliable guardrail: the bug lives at the Matrix SDK/monitor seam, below a full gateway e2e but above pure helper-local logic.
  • Existing test that already covers this (if any): none sufficiently covered the startup-readiness or detached-task ownership seam.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • Matrix startup now waits for real sync readiness before the channel is considered started.
  • Fatal Matrix sync failures now stop/restart the Matrix channel instead of crashing the whole gateway process.
  • Matrix channel runtime status now reflects starting, healthy, error, and stopped transitions more accurately.

Diagram (if applicable)

Before:
[matrix startClient returns] -> [Matrix marked started] -> [background task rejects] -> [global unhandled rejection] -> [gateway exits]

After:
[matrix startClient returns] -> [wait for ready sync] -> [background task or sync fatal] -> [Matrix channel task rejects] -> [gateway channel restart policy]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 24 / local repo workspace
  • Model/provider: N/A
  • Integration/channel (if any): Matrix
  • Relevant config (redacted): Matrix account with unreachable homeserver or fatal sync failure after startup

Steps

  1. Configure Matrix and start the gateway with an unreachable or failing homeserver.
  2. Let Matrix startup or replayed inbound event handling hit a sync/background-task failure.
  3. Observe whether the whole gateway exits or only the Matrix channel fails.

Expected

  • Matrix stays channel-scoped: startup waits for readiness, fatal sync failures reject the Matrix channel task, and gateway restart budgeting applies at the channel layer.

Actual

  • Before this change the gateway could exit from unhandled Matrix monitor rejections, bypassing channel restart controls.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: targeted Matrix tests for startup readiness, startup timeout, unexpected sync fatal, detached room-message failure containment, and intentional shutdown STOPPED handling; local pnpm build.
  • Edge cases checked: fatal sync error after startup, detached task rejection sink, intentional shutdown not misclassified as fatal, startup timeout branch with fake timers.
  • What you did not verify: full repo pnpm check remains blocked by unrelated preexisting tsgo failures in extensions/msteams/src/attachments.graph.test.ts, src/agents/subagent-registry.test.ts, and src/infra/host-env-security.test.ts.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: Matrix sync.state = ERROR remains SDK-owned reconnect behavior and is not automatically escalated into channel restart.
    • Mitigation: this PR only escalates clear fatal paths (sync.unexpected_error, unexpected STOPPED, startup readiness failure) to avoid fighting the SDK on transient reconnects.
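
The escalation boundary described in this risk can be written as a small decision function. This is a sketch under assumptions: the `SyncSignal` shape and the return labels are hypothetical, and only the rules themselves (unexpected error, unexpected STOPPED, and startup readiness failure escalate; ERROR stays with the SDK) come from the text above.

```typescript
type SyncSignal =
  | { kind: "state"; state: string; shuttingDown: boolean }
  | { kind: "unexpected_error" }
  | { kind: "startup_failed" };

type Escalation = "restart-channel" | "leave-to-sdk" | "ignore";

function classifySyncSignal(signal: SyncSignal): Escalation {
  // Clear fatal paths are escalated to the channel layer.
  if (signal.kind === "unexpected_error" || signal.kind === "startup_failed") {
    return "restart-channel";
  }
  // STOPPED is only fatal when nobody asked for a shutdown.
  if (signal.state === "STOPPED") {
    return signal.shuttingDown ? "ignore" : "restart-channel";
  }
  // ERROR is treated as a transient state the SDK reconnects from on its own.
  if (signal.state === "ERROR") {
    return "leave-to-sdk";
  }
  return "ignore";
}
```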

Changed files

  • CHANGELOG.md (modified, +3/-0)
  • extensions/matrix/src/channel.ts (modified, +1/-0)
  • extensions/matrix/src/matrix/client/shared.test.ts (modified, +80/-0)
  • extensions/matrix/src/matrix/client/shared.ts (modified, +24/-16)
  • extensions/matrix/src/matrix/monitor/events.ts (modified, +28/-3)
  • extensions/matrix/src/matrix/monitor/index.test.ts (modified, +257/-18)
  • extensions/matrix/src/matrix/monitor/index.ts (modified, +41/-35)
  • extensions/matrix/src/matrix/monitor/startup.test.ts (modified, +19/-0)
  • extensions/matrix/src/matrix/monitor/startup.ts (modified, +22/-0)
  • extensions/matrix/src/matrix/monitor/status.ts (added, +111/-0)
  • extensions/matrix/src/matrix/monitor/sync-lifecycle.test.ts (added, +224/-0)
  • extensions/matrix/src/matrix/monitor/sync-lifecycle.ts (added, +91/-0)
  • extensions/matrix/src/matrix/monitor/task-runner.ts (added, +38/-0)
  • extensions/matrix/src/matrix/sdk.test.ts (modified, +155/-2)
  • extensions/matrix/src/matrix/sdk.ts (modified, +139/-7)
  • extensions/matrix/src/matrix/sdk/types.ts (modified, +3/-0)
  • extensions/matrix/src/matrix/startup-abort.ts (added, +44/-0)
  • extensions/matrix/src/matrix/sync-state.ts (added, +27/-0)
RAW_BUFFER

Description

When the Matrix homeserver becomes unreachable (e.g., host server goes down), the gateway process enters a rapid crash loop — spawning a new process every ~2 seconds — rather than gracefully retrying or backing off.

Steps to Reproduce

  1. Configure OpenClaw with a Matrix channel pointing to a self-hosted Synapse instance
  2. Take the Matrix homeserver offline (power off the host)
  3. Observe gateway logs

Expected Behavior

The Matrix provider should:

  • Catch connection errors gracefully
  • Use exponential backoff for reconnection attempts
  • Keep the gateway process alive (other channels like webchat should remain functional)
  • Respect channelMaxRestartsPerHour for health-monitor-initiated restarts
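
For reference, exponential backoff can be as small as the function below. It is a generic sketch, not OpenClaw's implementation; the function name, the 1 second base, and the 60 second cap are assumed values.

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  const exponent = Math.max(0, attempt - 1);
  return Math.min(maxMs, baseMs * Math.pow(2, exponent));
}
```

With these defaults the first four delays are 1000, 2000, 4000, and 8000 ms, and the delay is capped at 60000 ms from the seventh attempt on. Production code often adds random jitter so that many clients do not retry in lockstep.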

Actual Behavior

The Matrix SDK appears to throw an uncaught exception on connection failure that kills the entire Node.js process. The macOS LaunchAgent (or systemd) immediately restarts it, which tries Matrix again, crashes again, creating a tight loop.

Evidence from logs (2026-04-07):

  • 04:16–08:01 BST: Gateway process restarted with a new PID every ~2 seconds for 3.5+ hours
  • PIDs increment by ~23 each time (e.g., 77007, 77030, 77053, 77096...)
  • Each cycle: starts → Matrix connect attempt → process dies → LaunchAgent restarts
  • channelMaxRestartsPerHour had no effect because it's the process crashing, not the health monitor restarting the channel
  • Other channels (webchat) were repeatedly disconnected with code=1012 reason=service restart
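
A per-hour restart budget might look like the sliding-window sketch below, which may help explain the observation about channelMaxRestartsPerHour. It assumes the budget is tracked in process memory, which the issue does not confirm; `RestartBudget` is a hypothetical name. Under that assumption, every process crash starts with an empty budget, so the limit cannot slow a loop driven by the process supervisor.

```typescript
class RestartBudget {
  private timestamps: number[] = [];
  private readonly maxPerHour: number;

  constructor(maxPerHour: number) {
    this.maxPerHour = maxPerHour;
  }

  // Returns true and records the restart if the last hour still has room.
  tryConsume(nowMs: number): boolean {
    const cutoff = nowMs - 60 * 60 * 1000;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length >= this.maxPerHour) {
      return false;
    }
    this.timestamps.push(nowMs);
    return true;
  }
}
```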

Separate from health monitor restarts

The Matrix provider also has an auto-restart attempt N/10 mechanism that does use backoff — this works correctly. The crash loop is something else: an unhandled exception that takes down the entire gateway.

Environment

  • OpenClaw version: 2026.3.13
  • Node.js: v22.22.0
  • OS: macOS (arm64)
  • Matrix homeserver: self-hosted Synapse on NixOS
  • Matrix config: allowPrivateNetwork: true

Workaround

Setting gateway.channelMaxRestartsPerHour and gateway.channelStaleEventThresholdMinutes helps with the health-monitor-initiated restarts but does not prevent the crash loop.

Suggestion

The Matrix provider (or the SDK integration layer) needs a top-level try/catch or process-level unhandled rejection handler that prevents connection failures from crashing the gateway process. Connection errors should be caught and retried with backoff, keeping the rest of the gateway operational.
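
The suggestion can be sketched as a retry loop that catches connection failures at the channel boundary. This is an illustration, not OpenClaw code: `runWithRetry` and its parameters are hypothetical, and `start` is assumed to resolve once the channel is connected. The sleep function is injected so the loop can be exercised without real delays.

```typescript
type Sleep = (ms: number) => Promise<void>;

// Returns true once start() succeeds, false after maxAttempts failures.
async function runWithRetry(
  start: () => Promise<void>,
  maxAttempts: number,
  delayForAttempt: (attempt: number) => number,
  sleep: Sleep,
  log: (message: string) => void,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await start();
      return true;
    } catch (error) {
      // The failure stays inside this loop instead of reaching the process.
      log(`attempt ${attempt}/${maxAttempts} failed: ${String(error)}`);
      if (attempt < maxAttempts) {
        await sleep(delayForAttempt(attempt));
      }
    }
  }
  return false;
}
```

A try/catch like this only covers work that is awaited inside `start`; tasks started without an owner can still reject outside it, which is the gap the PR's root-cause notes describe.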

extent analysis

TL;DR

Implement a top-level try/catch or process-level unhandled rejection handler in the Matrix provider to catch connection errors and prevent the gateway process from crashing.

Guidance

  • Identify the specific point in the Matrix provider code where the uncaught exception occurs and wrap it in a try/catch block to handle connection failures.
  • Implement exponential backoff for reconnection attempts to prevent rapid crash loops.
  • Consider adding a process-level unhandled rejection handler to catch any unexpected errors that may not be caught by the try/catch block.
  • Review the auto-restart attempt N/10 mechanism to ensure it is not interfering with the new error handling mechanism.

Example

process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled Rejection at:', promise, 'reason:', reason);
  // Implement backoff and retry logic here
});

Notes

The solution may require modifications to the Matrix provider code or the SDK integration layer, and may not be applicable to all versions of OpenClaw or Node.js.

Recommendation

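The PR fix notes above describe the actual fix (PR #62779), which contains Matrix failures inside the Matrix plugin and does not change the global unhandled-rejection policy. Where that fix is not available, a top-level try/catch or process-level unhandled rejection handler may serve as a stopgap that keeps the gateway process running while the Matrix homeserver is unreachable.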
