openclaw - ✅(Solved) Fix Matrix plugin: ECONNRESET crash-loop on local Synapse homeserver (v2026.4.1-beta.1) [1 pull requests, 7 comments, 2 participants]

GitHub stats

openclaw/openclaw#60783 (fetched 2026-04-08 02:47:16)

  • Comments: 7
  • Participants: 2
  • Timeline: 16
  • Reactions: 0
  • Timeline (top): commented ×7, subscribed ×4, mentioned ×3, closed ×1

The Matrix plugin enters an infinite crash-restart loop when connecting to a local Synapse homeserver. All three agent accounts crash with:

fetch failed | read ECONNRESET

The plugin hard-exits on ECONNRESET instead of treating it as a transient network error and reconnecting gracefully.

Error Message

fetch failed | read ECONNRESET

  • Line ~3148: throw new Error(exit ${code}) is the hard exit path triggered by ECONNRESET.

The plugin should catch ECONNRESET as a transient network error and reconnect, not crash-exit.

Root Cause

This appears to be a 30s keep-alive race condition: Synapse closes idle long-poll sync connections (timeout=30000ms) and Node.js/undici treats the TCP close as ECONNRESET. The plugin then hard-exits rather than reconnecting.

From the plugin source (matrix-runtime-surface-BQb_BOSl.js):

  • Line ~3421: throw err — the plugin exits on transient network errors rather than catching and reconnecting
  • Line ~3148: throw new Error(exit ${code}) — hard exit path triggered by ECONNRESET
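To make the "transient network error" distinction concrete, here is a minimal sketch of a classifier. It assumes the error shape produced by Node's built-in fetch (undici), which rejects with a TypeError whose message is "fetch failed" and whose cause carries the socket error code. The helper name isTransientNetworkError and the choice of codes in TRANSIENT_CODES are illustrative assumptions, not taken from the plugin source.

```javascript
// Hypothetical helper, not taken from the plugin source.
// Assumption: which codes count as "transient" is a policy choice made here
// for illustration only.
const TRANSIENT_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'EPIPE']);

function isTransientNetworkError(err) {
  // Walk the cause chain (bounded, in case of cycles) looking for a known code.
  let current = err;
  for (let depth = 0; current && depth < 5; depth++) {
    if (typeof current.code === 'string' && TRANSIENT_CODES.has(current.code)) {
      return true;
    }
    current = current.cause;
  }
  return false;
}

// Build an error shaped like the one reported in this issue:
// a "fetch failed" TypeError wrapping a "read ECONNRESET" socket error.
const socketError = Object.assign(new Error('read ECONNRESET'), { code: 'ECONNRESET' });
const fetchError = new TypeError('fetch failed', { cause: socketError });

console.log(isTransientNetworkError(fetchError)); // true
console.log(isTransientNetworkError(new Error('M_UNKNOWN_TOKEN'))); // false
```

Checking only err.code would miss the wrapped case, which is why the sketch follows the cause chain.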

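Fix / Workaround

The reporter asked whether a workaround was available, noted that related issues #42234 and #7474 suggest the problem was already known, and offered to test a fix or beta build. The issue was subsequently closed by PR #61383, described under PR fix notes.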

PR fix notes

PR #61383: fix(matrix): harden startup auth bootstrap

Description (problem / solution / changelog)

Summary

  • Problem: Matrix startup could fail before sync if token auth made a blocking whoami request only to fill optional device metadata, or if startup auth/login hit a transient network reset.
  • Why it matters: affected Matrix accounts entered restart loops before matrix: client started / matrix: logged in as ..., including the local Synapse case from #60783.
  • What changed: only require whoami when token auth is actually missing userId, retry transient startup auth requests, and backfill missing deviceId after successful startup as best-effort.
  • What did NOT change (scope boundary): no /sync reconnect behavior changes and no broader Matrix transport retry policy changes.
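The "only require whoami when token auth is actually missing userId" rule can be sketched as follows. This is an illustration under assumptions: resolveStartupAuth, the shape of the auth object, and the needsDeviceBackfill flag are hypothetical names, and whoami is injected so the logic can be exercised without a homeserver. The user_id and device_id fields follow the Matrix whoami response.

```javascript
// Sketch only: `resolveStartupAuth` and the `auth` shape are assumptions,
// not the plugin's real API.
async function resolveStartupAuth(auth, whoami) {
  // userId is required to start; deviceId is optional metadata, so a missing
  // deviceId alone must not trigger a blocking network call.
  if (auth.userId) {
    return { ...auth, needsDeviceBackfill: !auth.deviceId };
  }
  // Only when userId is missing is the whoami round-trip actually needed.
  const identity = await whoami(auth.accessToken);
  const deviceId = auth.deviceId ?? identity.device_id;
  return {
    ...auth,
    userId: identity.user_id,
    deviceId,
    needsDeviceBackfill: !deviceId,
  };
}
```

With a configured userId and no deviceId, the function returns without calling whoami and flags the deviceId for a later best-effort backfill.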

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #60783
  • Related #42234
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: token-auth Matrix startup treated missing optional deviceId the same as missing required userId, so startup could block on /_matrix/client/v3/account/whoami and crash before sync on transient network failures.
  • Missing detection / guardrail: no test locked in the case where token auth already had userId and only deviceId was missing, and startup auth/login had no transient retry.
  • Contributing context (if known): reporter logs showed restart before matrix: client started and matrix: logged in as ..., which ruled out steady-state /sync as the primary failure path.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/matrix/src/matrix/client.test.ts, extensions/matrix/src/matrix/monitor/index.test.ts
  • Scenario the test should lock in: startup auth retries transient ECONNRESET, token auth does not call whoami when userId is already configured, and best-effort device backfill does not block monitor startup.
  • Why this is the smallest reliable guardrail: the bug is in Matrix auth/bootstrap control flow before sync starts.
  • Existing test that already covers this (if any): N/A
  • If no new test is added, why not: N/A
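The "startup auth retries transient ECONNRESET" scenario can be sketched with hand-rolled test doubles. This is not the repository's test code: flakyRequest, retryStartupAuth, and testRetriesTransientReset are hypothetical names, and the retry loop is a minimal stand-in for the code under test.

```javascript
// Build an error shaped like Node fetch's "fetch failed" with an ECONNRESET cause.
function econnreset() {
  const cause = Object.assign(new Error('read ECONNRESET'), { code: 'ECONNRESET' });
  return new TypeError('fetch failed', { cause });
}

// Returns a fake request function that rejects `failures` times, then resolves.
function flakyRequest(failures, result) {
  const fake = async () => {
    fake.calls++;
    if (fake.calls <= failures) throw econnreset();
    return result;
  };
  fake.calls = 0;
  return fake;
}

// Minimal stand-in for the code under test: retry only on ECONNRESET,
// and give up after `maxAttempts` calls.
async function retryStartupAuth(request, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await request();
    } catch (err) {
      if (err?.cause?.code !== 'ECONNRESET' || attempt >= maxAttempts) throw err;
    }
  }
}

async function testRetriesTransientReset() {
  const login = flakyRequest(2, { user_id: '@page:localhost' });
  const result = await retryStartupAuth(login);
  return { calls: login.calls, userId: result.user_id };
}

testRetriesTransientReset().then((outcome) => console.log(outcome.calls, outcome.userId));
// 3 @page:localhost
```

The fake fails twice and succeeds on the third call, so the call count records that the retry path ran rather than propagating the first reset.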

User-visible / Behavior Changes

Matrix accounts with token auth no longer fail startup just because deviceId is missing, transient startup auth/login resets are retried, and missing device IDs are backfilled after successful startup.

Diagram (if applicable)

Before:
[startup] -> [optional deviceId missing] -> [blocking whoami] -> [transient ECONNRESET] -> [restart loop]

After:
[startup] -> [userId already known] -> [start client] -> [best-effort deviceId backfill]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm
  • Model/provider: N/A
  • Integration/channel (if any): Matrix
  • Relevant config (redacted): Matrix token auth with configured userId and missing deviceId

Steps

  1. Configure Matrix token auth with userId set and no deviceId.
  2. Trigger Matrix startup with a transient auth bootstrap failure on whoami or login.
  3. Observe startup behavior.

Expected

  • Startup should not require whoami just to fill optional deviceId.
  • Transient startup auth failures should retry.
  • Successful startup should backfill missing deviceId without blocking monitor startup.

Actual

  • Before this change, startup could crash before sync and enter a restart loop.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: ran focused Matrix auth + monitor tests covering retry, optional deviceId, and non-blocking backfill; ran full pnpm build.
  • Edge cases checked: transient ECONNRESET during startup whoami, transient ECONNRESET during password login, token auth with configured userId and missing deviceId.
  • What you did not verify: live Matrix/Synapse repro against a real homeserver in this branch.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: startup may now begin without a persisted deviceId until the best-effort backfill finishes.
    • Mitigation: backfill runs immediately after successful startup and existing runtime/device-status paths already tolerate missing auth-level deviceId.
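One way to picture the mitigation is a startup routine that starts the monitor first and runs the deviceId lookup without awaiting it on the startup path. The sketch below assumes both injected functions return promises. The names startWithBackfill, startMonitor, and backfillDeviceId, and the log message, are illustrative rather than the plugin's actual functions.

```javascript
// Sketch under assumptions: all function names here are hypothetical.
async function startWithBackfill(auth, startMonitor, backfillDeviceId, log = console.warn) {
  // Startup proceeds with whatever auth is available; deviceId may be missing.
  const monitor = await startMonitor(auth);

  let backfill = Promise.resolve(auth.deviceId);
  if (!auth.deviceId) {
    // Start the lookup but do not await it here. A failure is logged and
    // swallowed so it cannot take the running monitor down.
    backfill = backfillDeviceId(auth).catch((err) => {
      log(`deviceId backfill failed (ignored): ${err.message}`);
      return undefined;
    });
  }
  // Callers can await `backfill` if they want the result; startup does not.
  return { monitor, backfill };
}
```

Returning the backfill promise instead of awaiting it keeps the monitor's readiness independent of the lookup, which matches the "best-effort" wording in the PR description.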

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/matrix/src/matrix/client.test.ts (modified, +261/-0)
  • extensions/matrix/src/matrix/client.ts (modified, +1/-0)
  • extensions/matrix/src/matrix/client/config.ts (modified, +160/-25)
  • extensions/matrix/src/matrix/client/file-sync-store.test.ts (modified, +25/-0)
  • extensions/matrix/src/matrix/client/file-sync-store.ts (modified, +5/-0)
  • extensions/matrix/src/matrix/client/storage.test.ts (modified, +153/-8)
  • extensions/matrix/src/matrix/client/storage.ts (modified, +104/-14)
  • extensions/matrix/src/matrix/credentials-write.runtime.ts (modified, +8/-0)
  • extensions/matrix/src/matrix/credentials.test.ts (modified, +131/-0)
  • extensions/matrix/src/matrix/credentials.ts (modified, +66/-16)
  • extensions/matrix/src/matrix/monitor/index.test.ts (modified, +26/-0)
  • extensions/matrix/src/matrix/monitor/index.ts (modified, +8/-0)
  • extensions/matrix/src/matrix/thread-bindings.test.ts (modified, +106/-7)
  • extensions/matrix/src/matrix/thread-bindings.ts (modified, +5/-1)


Bug Report

  • OpenClaw version: 2026.4.2
  • Matrix plugin version: @openclaw/matrix v2026.4.1-beta.1
  • Synapse version: v1.123.0 (local Docker, OrbStack)
  • Platform: macOS 26.3.1 (arm64), Mac Mini M-series

Summary

The Matrix plugin enters an infinite crash-restart loop when connecting to a local Synapse homeserver. All three agent accounts crash with:

fetch failed | read ECONNRESET

The plugin hard-exits on ECONNRESET instead of treating it as a transient network error and reconnecting gracefully.

Configuration

{
  "channels": {
    "matrix": {
      "homeserver": "http://localhost:8008",
      "allowPrivateNetwork": true,
      "accounts": {
        "nova": { "userId": "@nova:localhost", ... },
        "page": { "userId": "@page:localhost", ... },
        "junior": { "userId": "@junior:localhost", ... }
      }
    }
  }
}

What works

  • Direct curl sync requests to Synapse succeed perfectly (both IPv4 and IPv6)
  • allowPrivateNetwork: true is set
  • Tokens and deviceIds are valid (verified via curl)
  • Login via Element Web works fine manually
  • Synapse health check passes
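Because curl and Node use different HTTP stacks, a curl success does not by itself show that Node's fetch behaves the same way. A rough Node-side equivalent of the sync check might look like the sketch below. probeSync is a hypothetical helper, the v3 sync path is assumed to match the reporter's curl test, and fetchImpl is injectable so the function can be exercised without a server.

```javascript
// Sketch: a one-shot /sync probe issued from Node rather than curl.
// Assumes Node 18+ for the global fetch used as the default.
async function probeSync(homeserver, accessToken, fetchImpl = fetch) {
  const url = new URL('/_matrix/client/v3/sync', homeserver);
  // timeout=0 asks the server to answer immediately instead of long-polling.
  url.searchParams.set('timeout', '0');
  const response = await fetchImpl(url, {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  return { ok: response.ok, status: response.status };
}
```

Calling probeSync('http://localhost:8008', token) under the gateway's Node version would go through undici, the same client stack that reports ECONNRESET in the logs.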

What fails

Gateway logs show all three accounts in constant restart loop:

[matrix] [nova] channel exited: fetch failed | read ECONNRESET
[matrix] [nova] auto-restart attempt 4/10 in 44s
[matrix] [page] channel exited: fetch failed | read ECONNRESET
[matrix] [page] auto-restart attempt 4/10 in 43s
[matrix] [junior] channel exited: fetch failed | read ECONNRESET
[matrix] [junior] auto-restart attempt 4/10 in 42s
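When comparing restart loops across accounts, it can help to reduce these lines to one record per account. The sketch below does that with two regular expressions inferred only from the six lines shown here, so other gateway log formats may not match. summariseRestarts is a hypothetical helper.

```javascript
// Patterns inferred from the log excerpt in this report only.
const EXIT_RE = /^\[matrix\] \[(\w+)\] channel exited: (.+)$/;
const RESTART_RE = /^\[matrix\] \[(\w+)\] auto-restart attempt (\d+)\/(\d+) in (\d+)s$/;

function summariseRestarts(logText) {
  const accounts = {};
  for (const rawLine of logText.split('\n')) {
    const line = rawLine.trim();
    const exit = EXIT_RE.exec(line);
    const restart = RESTART_RE.exec(line);
    if (exit) {
      // Remember the most recent exit reason for this account.
      (accounts[exit[1]] ??= {}).lastError = exit[2];
    } else if (restart) {
      // Record the latest attempt counter and the announced delay.
      Object.assign((accounts[restart[1]] ??= {}), {
        attempt: Number(restart[2]),
        maxAttempts: Number(restart[3]),
        delaySeconds: Number(restart[4]),
      });
    }
  }
  return accounts;
}

const sample = [
  '[matrix] [nova] channel exited: fetch failed | read ECONNRESET',
  '[matrix] [nova] auto-restart attempt 4/10 in 44s',
].join('\n');
const summary = summariseRestarts(sample);
console.log(summary.nova.attempt, summary.nova.delaySeconds); // 4 44
```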

Root cause analysis

This appears to be a 30s keep-alive race condition: Synapse closes idle long-poll sync connections (timeout=30000ms) and Node.js/undici treats the TCP close as ECONNRESET. The plugin then hard-exits rather than reconnecting.

From the plugin source (matrix-runtime-surface-BQb_BOSl.js):

  • Line ~3421: throw err — the plugin exits on transient network errors rather than catching and reconnecting
  • Line ~3148: throw new Error(exit ${code}) — hard exit path triggered by ECONNRESET

Things tried

  • Updated tokens (fresh login tokens confirmed valid via curl)
  • Cleared credentials and matrix account storage
  • Pointed homeserver at nginx proxy (port 8009) to handle keep-alive
  • Added rc_login rate limit config to Synapse
  • Restarted gateway multiple times

Expected behaviour

The plugin should catch ECONNRESET as a transient network error and reconnect, not crash-exit.

Question

Is there a workaround available? Related issues #42234 and #7474 suggest this was known. Happy to test a fix or beta build.

extent analysis

TL;DR

The Matrix plugin can be modified to catch and handle ECONNRESET errors as transient network errors, allowing it to reconnect instead of crashing.

Guidance

  • Review the plugin source code (matrix-runtime-surface-BQb_BOSl.js) and modify the error handling to catch ECONNRESET errors and implement a reconnect mechanism.
  • Consider increasing the keep-alive timeout in Synapse or using a reverse proxy like nginx to handle keep-alive connections.
  • Verify that the plugin is properly configured to reconnect on transient network errors by testing with a simulated ECONNRESET error.
  • Check related issues #42234 and #7474 for potential workarounds or fixes.

Example

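// Sketch: treat ECONNRESET as transient and retry instead of exiting.
// Node's fetch (undici) rejects with TypeError("fetch failed") and puts the
// socket error on err.cause, so check err.cause.code as well as err.code.
// withReconnect and maxAttempts are illustrative names, not plugin APIs.
async function withReconnect(run, maxAttempts = 5) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await run(); // existing code
    } catch (err) {
      const code = err?.code ?? err?.cause?.code;
      // Rethrow anything that is not a reset, or once attempts run out.
      if (code !== 'ECONNRESET' || attempt >= maxAttempts) throw err;
      // Reconnect logic: back off briefly, then try again.
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
}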

Notes

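The provided code snippet is a minimal example and may require additional modifications to work correctly. The reporter's analysis suggests a 30s keep-alive race condition on /sync, but PR #61383 attributes the restart loop to startup auth instead: the reporter's logs showed restarts before matrix: client started and matrix: logged in as ..., which ruled out steady-state /sync as the primary failure path.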

Recommendation

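PR #61383 closes this issue by retrying transient startup auth requests and by skipping the whoami call when userId is already configured, so a build containing that change is the first thing to try. Modifying the plugin source to catch and handle ECONNRESET errors remains a stopgap only, and the PR explicitly leaves /sync reconnect behavior unchanged. Related issues #42234 and #7474 may provide further context.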
