openclaw: Matrix provider connection failure causes rapid gateway process crash loop (solved) [1 pull request, 1 participant]

janstenpickle · 2026-04-07T07:57:43Z



Issue: openclaw/openclaw#62376 (fetched 2026-04-08 03:05:16)


When the Matrix homeserver becomes unreachable (e.g., host server goes down), the gateway process enters a rapid crash loop — spawning a new process every ~2 seconds — rather than gracefully retrying or backing off.

Root Cause

The Matrix SDK appears to throw an uncaught exception on connection failure that kills the entire Node.js process. The macOS LaunchAgent (or systemd) immediately restarts the process, which tries Matrix again and crashes again, creating a tight loop. This is separate from the Matrix provider's "auto-restart attempt N/10" mechanism, which does use backoff and works correctly: the crash loop comes from an unhandled exception that takes down the entire gateway.

Evidence from logs (2026-04-07)

04:16–08:01 BST: Gateway process restarted with a new PID every ~2 seconds for 3.5+ hours
PIDs increment by ~23 each time (e.g., 77007, 77030, 77053, 77096...)
Each cycle: starts → Matrix connect attempt → process dies → LaunchAgent restarts
channelMaxRestartsPerHour had no effect because the process itself was crashing, rather than the health monitor restarting the channel
Other channels (webchat) were repeatedly disconnected with code=1012 reason=service restart

Fix Action

Workaround

Setting gateway.channelMaxRestartsPerHour and gateway.channelStaleEventThresholdMinutes helps with the health-monitor-initiated restarts but does not prevent the crash loop.
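To picture why a per-hour restart budget cannot stop a process-level crash loop, here is a minimal sketch of a sliding-window restart limiter. It is an assumption about the general shape of such a guard, not OpenClaw's implementation: `RestartBudget` and `tryConsume` are hypothetical names. If a counter like this lives only in process memory, it starts from zero on every new PID, so it can only limit restarts that the surviving gateway performs itself.

```typescript
// Hypothetical sliding-window restart budget (illustration only, not OpenClaw code).
class RestartBudget {
  private timestamps: number[] = [];
  private maxPerHour: number;

  constructor(maxPerHour: number) {
    this.maxPerHour = maxPerHour;
  }

  // Returns true and records the restart if the budget allows it at time nowMs.
  tryConsume(nowMs: number): boolean {
    const cutoff = nowMs - 60 * 60 * 1000;
    // Forget restarts that happened more than one hour ago.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length >= this.maxPerHour) {
      return false;
    }
    this.timestamps.push(nowMs);
    return true;
  }
}

// Within one process the budget holds: the third restart in the window is refused.
const budget = new RestartBudget(2);
const decisions = [budget.tryConsume(0), budget.tryConsume(1000), budget.tryConsume(2000)];

// A freshly started process begins with an empty history, so nothing is refused.
const afterCrash = new RestartBudget(2);
const firstAfterCrash = afterCrash.tryConsume(3000);
```

This is why the fix moves the failure back into the channel layer, where a guard of this kind can actually see it.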

PR fix notes

PR #62779: fix(matrix): contain sync outage failures

Repository: openclaw/openclaw
Author: gumadeiras
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/62779

Description (problem / solution / changelog)

Summary

Problem: Matrix startup reported success before sync was actually ready, and detached Matrix monitor tasks could reject without an owner.
Why it matters: a homeserver outage could escalate from a channel-scoped failure into a process-wide unhandled rejection crash loop, bypassing gateway.channelMaxRestartsPerHour.
What changed: Matrix startup now waits for ready sync states, monitor status/fatal sync handling is owned inside the Matrix plugin, and detached monitor work is centrally contained and drained on shutdown.
What did NOT change (scope boundary): no core Matrix special-casing in gateway orchestration, no new config surface, no global unhandled-rejection policy change.

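Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [x] Refactor required for the fix
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [x] Integrations
- [x] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra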

Linked Issue/PR

Closes #62376
Related #62168
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: the Matrix plugin treated matrix-js-sdk startup as ready too early, then left background room-message/verification tasks detached from channel lifecycle ownership.
Missing detection / guardrail: Matrix sync fatality and detached task rejection never fed back into the Matrix channel task, so global unhandled rejection policy killed the whole gateway before channel restart budgeting could apply.
Contributing context (if known): matrix-js-sdk emits long-lived sync state transitions after startClient() returns, and replayed/in-flight events during outage windows made orphaned async failures much easier to hit.
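The ownership problem described above can be sketched as follows. The PR lists a new `task-runner.ts`, but its contents are not shown here, so this is only an assumption about the general technique: `DetachedTaskRunner`, `run`, and `drain` are hypothetical names. The idea is that every background task gets a catch handler that reports to its owner, so a rejection never reaches the global unhandled-rejection policy.

```typescript
// Hypothetical container for fire-and-forget work (illustration only).
class DetachedTaskRunner {
  private pending = new Set<Promise<void>>();
  private onError: (error: unknown) => void;

  // onError is the owner's failure sink; it is assumed not to throw.
  constructor(onError: (error: unknown) => void) {
    this.onError = onError;
  }

  // Start a task without awaiting it, but keep ownership of its outcome.
  run(task: () => Promise<void>): void {
    const tracked: Promise<void> = Promise.resolve()
      .then(task)
      .catch((error: unknown) => this.onError(error))
      .then(() => {
        this.pending.delete(tracked);
      });
    this.pending.add(tracked);
  }

  // Wait for all in-flight tasks, for example during channel shutdown.
  async drain(): Promise<void> {
    while (this.pending.size > 0) {
      await Promise.all(Array.from(this.pending));
    }
  }
}

// Usage: a failing room-message handler is reported to the owner instead of
// surfacing as an unhandled rejection.
const containedErrors: unknown[] = [];
const runner = new DetachedTaskRunner((error) => {
  containedErrors.push(error);
});
runner.run(async () => {
  throw new Error("homeserver unreachable");
});
runner.run(async () => {});
const drained = runner.drain();
```

In a real channel the `onError` sink would presumably decide whether the failure is fatal for the channel task; here it only records the error.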

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- [x] Unit test
- [x] Seam / integration test
- [ ] End-to-end test
- [ ] Existing coverage already sufficient
Target test or file: extensions/matrix/src/matrix/sdk.test.ts, extensions/matrix/src/matrix/monitor/index.test.ts, extensions/matrix/src/matrix/monitor/sync-lifecycle.test.ts
Scenario the test should lock in: startup does not resolve before ready sync, startup times out/fails on sync fatal, detached monitor task failures do not escape as unhandled rejections, and fatal sync errors reject the Matrix channel task.
Why this is the smallest reliable guardrail: the bug lives at the Matrix SDK/monitor seam, below a full gateway e2e but above pure helper-local logic.
Existing test that already covers this (if any): none sufficiently covered the startup-readiness or detached-task ownership seam.
If no new test is added, why not: N/A

User-visible / Behavior Changes

Matrix startup now waits for real sync readiness before the channel is considered started.
Fatal Matrix sync failures now stop/restart the Matrix channel instead of crashing the whole gateway process.
Matrix channel runtime status now reflects starting, healthy, error, and stopped transitions more accurately.

Diagram (if applicable)

Before:
[matrix startClient returns] -> [Matrix marked started] -> [background task rejects] -> [global unhandled rejection] -> [gateway exits]

After:
[matrix startClient returns] -> [wait for ready sync] -> [background task or sync fatal] -> [Matrix channel task rejects] -> [gateway channel restart policy]
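The "wait for ready sync" step in the diagram can be sketched roughly as below. This is not the PR's code: `SyncSource` is a hypothetical stand-in for the client, the state names are modeled on the sync states that matrix-js-sdk reports, and the choice of which states count as ready is an assumption.

```typescript
// State names modeled on matrix-js-sdk sync states (assumed set).
type SyncState = "PREPARED" | "SYNCING" | "CATCHUP" | "RECONNECTING" | "ERROR" | "STOPPED";

// Hypothetical minimal client surface: subscribe, get back an unsubscribe function.
interface SyncSource {
  onSync(listener: (state: SyncState) => void): () => void;
}

// Assumption: these states mean the first sync has completed.
const READY_STATES: ReadonlySet<SyncState> = new Set<SyncState>(["PREPARED", "SYNCING"]);

function waitForReadySync(source: SyncSource, timeoutMs: number): Promise<SyncState> {
  return new Promise<SyncState>((resolve, reject) => {
    let settled = false;
    let unsubscribe: () => void = () => {};

    // Settle exactly once and release the timer and the subscription.
    const finish = (action: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      unsubscribe();
      action();
    };

    const timer = setTimeout(() => {
      finish(() => reject(new Error(`Matrix sync not ready after ${timeoutMs} ms`)));
    }, timeoutMs);

    unsubscribe = source.onSync((state) => {
      if (READY_STATES.has(state)) {
        finish(() => resolve(state));
      } else if (state === "STOPPED") {
        finish(() => reject(new Error("Matrix sync stopped before becoming ready")));
      }
      // Other states (ERROR, RECONNECTING, CATCHUP) are treated as transient here.
    });

    // Covers a source that reports a state synchronously while subscribing.
    if (settled) unsubscribe();
  });
}
```

A caller would await this after starting the client and only then mark the channel as started, so a homeserver that never answers produces a startup failure owned by the channel task.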

Security Impact (required)

New permissions/capabilities? (No)
Secrets/tokens handling changed? (No)
New/changed network calls? (No)
Command/tool execution surface changed? (No)
Data access scope changed? (No)
If any Yes, explain risk + mitigation:

Repro + Verification

Environment

OS: macOS
Runtime/container: Node 24 / local repo workspace
Model/provider: N/A
Integration/channel (if any): Matrix
Relevant config (redacted): Matrix account with unreachable homeserver or fatal sync failure after startup

Steps

Configure Matrix and start the gateway with an unreachable or failing homeserver.
Let Matrix startup or replayed inbound event handling hit a sync/background-task failure.
Observe whether the whole gateway exits or only the Matrix channel fails.

Expected

Matrix stays channel-scoped: startup waits for readiness, fatal sync failures reject the Matrix channel task, and gateway restart budgeting applies at the channel layer.

Actual

Before this change the gateway could exit from unhandled Matrix monitor rejections, bypassing channel restart controls.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: targeted Matrix tests for startup readiness, startup timeout, unexpected sync fatal, detached room-message failure containment, and intentional shutdown STOPPED handling; local pnpm build.
Edge cases checked: fatal sync error after startup, detached task rejection sink, intentional shutdown not misclassified as fatal, startup timeout branch with fake timers.
What you did not verify: full repo pnpm check remains blocked by unrelated preexisting tsgo failures in extensions/msteams/src/attachments.graph.test.ts, src/agents/subagent-registry.test.ts, and src/infra/host-env-security.test.ts.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

Backward compatible? (Yes)
Config/env changes? (No)
Migration needed? (No)
If yes, exact upgrade steps:

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

Risk: Matrix sync.state = ERROR remains SDK-owned reconnect behavior and is not automatically escalated into channel restart.
- Mitigation: this PR only escalates clear fatal paths (sync.unexpected_error, unexpected STOPPED, startup readiness failure) to avoid fighting the SDK on transient reconnects.
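The escalation rule in this mitigation can be written down as a small decision function. The sketch below only restates the rule from the PR description; the `SyncSignal` shape, the `decide` function, and the decision labels are hypothetical and do not reflect the actual code in `sync-lifecycle.ts`.

```typescript
// Hypothetical description of what the monitor observed.
type SyncSignal =
  | { kind: "state"; state: string; shutdownRequested: boolean }
  | { kind: "unexpected_error" }
  | { kind: "startup_not_ready" };

type Decision = "reject-channel-task" | "leave-to-sdk" | "ignore";

function decide(signal: SyncSignal): Decision {
  switch (signal.kind) {
    case "unexpected_error":
    case "startup_not_ready":
      // Clear fatal paths: hand the failure to the channel task.
      return "reject-channel-task";
    case "state":
      if (signal.state === "STOPPED") {
        // An intentional shutdown must not be misclassified as fatal.
        return signal.shutdownRequested ? "ignore" : "reject-channel-task";
      }
      if (signal.state === "ERROR") {
        // Transient reconnects stay under SDK control.
        return "leave-to-sdk";
      }
      return "ignore";
  }
}

const onTransientError = decide({ kind: "state", state: "ERROR", shutdownRequested: false });
const onUnexpectedStop = decide({ kind: "state", state: "STOPPED", shutdownRequested: false });
const onPlannedStop = decide({ kind: "state", state: "STOPPED", shutdownRequested: true });
```

Keeping `ERROR` out of the fatal set is the trade-off the risk entry names: the channel may look unhealthy for a while, but it does not fight the SDK's own reconnect loop.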

Changed files

CHANGELOG.md (modified, +3/-0)
extensions/matrix/src/channel.ts (modified, +1/-0)
extensions/matrix/src/matrix/client/shared.test.ts (modified, +80/-0)
extensions/matrix/src/matrix/client/shared.ts (modified, +24/-16)
extensions/matrix/src/matrix/monitor/events.ts (modified, +28/-3)
extensions/matrix/src/matrix/monitor/index.test.ts (modified, +257/-18)
extensions/matrix/src/matrix/monitor/index.ts (modified, +41/-35)
extensions/matrix/src/matrix/monitor/startup.test.ts (modified, +19/-0)
extensions/matrix/src/matrix/monitor/startup.ts (modified, +22/-0)
extensions/matrix/src/matrix/monitor/status.ts (added, +111/-0)
extensions/matrix/src/matrix/monitor/sync-lifecycle.test.ts (added, +224/-0)
extensions/matrix/src/matrix/monitor/sync-lifecycle.ts (added, +91/-0)
extensions/matrix/src/matrix/monitor/task-runner.ts (added, +38/-0)
extensions/matrix/src/matrix/sdk.test.ts (modified, +155/-2)
extensions/matrix/src/matrix/sdk.ts (modified, +139/-7)
extensions/matrix/src/matrix/sdk/types.ts (modified, +3/-0)
extensions/matrix/src/matrix/startup-abort.ts (added, +44/-0)
extensions/matrix/src/matrix/sync-state.ts (added, +27/-0)

Original issue (openclaw/openclaw#62376)

Description

Steps to Reproduce

Configure OpenClaw with a Matrix channel pointing to a self-hosted Synapse instance
Take the Matrix homeserver offline (power off the host)
Observe gateway logs

Expected Behavior

The Matrix provider should:

Catch connection errors gracefully
Use exponential backoff for reconnection attempts
Keep the gateway process alive (other channels like webchat should remain functional)
Respect channelMaxRestartsPerHour for health-monitor-initiated restarts
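For reference, the exponential backoff requested above usually amounts to a capped doubling delay. The following is a generic sketch, not the provider's existing "auto-restart attempt N/10" logic: the function name, base delay, cap, and optional jitter hook are all assumptions.

```typescript
// Delay before retry number `attempt` (0-based): doubles each time, capped at maxMs.
// `jitter` returns a multiplier; the default of 1 disables jitter.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 60000,
  jitter: () => number = () => 1,
): number {
  const capped = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.round(capped * jitter());
}

// Delays for the first eight attempts with the defaults:
// [1000, 2000, 4000, 8000, 16000, 32000, 60000, 60000]
const schedule = [0, 1, 2, 3, 4, 5, 6, 7].map((attempt) => backoffDelayMs(attempt));
```

Compared with the observed restart every ~2 seconds, a schedule like this reaches one attempt per minute after six failures.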

Actual Behavior

Evidence from logs (2026-04-07):

04:16–08:01 BST: Gateway process restarted with a new PID every ~2 seconds for 3.5+ hours
PIDs increment by ~23 each time (e.g., 77007, 77030, 77053, 77096...)
Each cycle: starts → Matrix connect attempt → process dies → LaunchAgent restarts
channelMaxRestartsPerHour had no effect because it's the process crashing, not the health monitor restarting the channel
Other channels (webchat) were repeatedly disconnected with code=1012 reason=service restart

Separate from health monitor restarts

The Matrix provider also has an auto-restart attempt N/10 mechanism that does use backoff — this works correctly. The crash loop is something else: an unhandled exception that takes down the entire gateway.

Environment

OpenClaw version: 2026.3.13
Node.js: v22.22.0
OS: macOS (arm64)
Matrix homeserver: self-hosted Synapse on NixOS
Matrix config: allowPrivateNetwork: true

Workaround

Setting gateway.channelMaxRestartsPerHour and gateway.channelStaleEventThresholdMinutes helps with the health-monitor-initiated restarts but does not prevent the crash loop.

Suggestion

The Matrix provider (or the SDK integration layer) needs a top-level try/catch or process-level unhandled rejection handler that prevents connection failures from crashing the gateway process. Connection errors should be caught and retried with backoff, keeping the rest of the gateway operational.

extent analysis

TL;DR

Implement a top-level try/catch or process-level unhandled rejection handler in the Matrix provider to catch connection errors and prevent the gateway process from crashing.

Guidance

Identify the specific point in the Matrix provider code where the uncaught exception occurs and wrap it in a try/catch block to handle connection failures.
Implement exponential backoff for reconnection attempts to prevent rapid crash loops.
Consider adding a process-level unhandled rejection handler to catch any unexpected errors that may not be caught by the try/catch block.
Review the auto-restart attempt N/10 mechanism to ensure it is not interfering with the new error handling mechanism.

Example

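// Last-resort safety net: log promise rejections that nothing awaited or caught.
process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled Rejection at:', promise, 'reason:', reason);
  // This handler cannot tell which task failed, so retry and backoff logic
  // belongs in the code that owns the task (here, the Matrix provider).
});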

Notes

The solution may require modifications to the Matrix provider code or the SDK integration layer, and may not be applicable to all versions of OpenClaw or Node.js.

Recommendation

A top-level try/catch or process-level unhandled rejection handler can serve as a stopgap that keeps the gateway process alive while the Matrix homeserver is unreachable. Note that the merged fix (PR #62779) takes a different route: it contains failures inside the Matrix plugin and makes no global unhandled-rejection policy change.
