openclaw - ✅(Solved) Fix [Bug]: Matrix native-approvals handler thrashes loopback gateway with handshake-timeout on multi-account installs [2 pull requests, 1 comment, 1 participant]

openclaw/openclaw#70641 · Fetched 2026-04-24 05:55:17

On a gateway running 2026.4.20 with 11 configured Matrix accounts and no explicit execApprovals config, the loopback gateway enters a persistent 1 Hz native-approvals / handshake-timeout reconnect storm that degrades sessions_list / sessions_send and artificially elevates gateway-process CPU; reproduced for hours on the affected host and confirmed 1:1 in gateway logs.

Error Message

[matrix/native-approvals] connect error: gateway closed (1000): gateway connect failed: Error: gateway closed (1000):
[gateway/ws] closed before connect conn=<uuid> cause=handshake-timeout ... code=1000
[matrix/native-approvals] failed to start native approval handler: Error: gateway closed: 1000

Root Cause

The native-approvals connect errors and the matching handshake-timeout closes repeat at roughly 1 Hz, with handshakeMs frequently exceeding the nominal 10 s configured/default timer. sessions_list / sessions_send / openclaw status intermittently time out. Gateway-process CPU is artificially elevated because the event loop is busy servicing self-inflicted preauth churn.

Fix Action

Fix / Workaround

  • Setting channels.matrix.execApprovals.enabled: false (where it inherits per-account) papers over the bug only for operators who know to set it, does not fix the enabled === undefined → enabled semantics, and does not stop the retry stampede if anything else ever causes a transient failure.
  • Clearing dm.allowFrom is not a valid workaround — it breaks DM access control.
  • Reducing the Matrix account count is not a real fix — it just raises the failure threshold.
  • OPENCLAW_HANDSHAKE_TIMEOUT_MS tuning treats symptoms, not cause.

Operator workaround until a fix ships

PR fix notes

PR #2 (fork PR lukeboyett/openclaw#2, now closed): fix(approvals): coordinate native handler startup to avoid loopback storms

Description (problem / solution / changelog)

Summary

  • Problem: on multi-account installs (e.g. 11 Matrix accounts, 4+ Telegram bots, Discord with many servers), every account independently opens a native-approvals gateway client on the same startup tick. The loopback gateway has to complete N concurrent preauth WebSocket handshakes at once, saturates its own handshake timer, fails them all with handshake-timeout / gateway closed (1000), and each failure retries at 1 Hz — a sustained self-DoS of the process's own loopback.
  • Why it matters: every internal consumer hitting the local gateway (session RPC, tool calls, node messaging, openclaw status) pays the latency tax. On the reproducer host (11 Matrix accounts, headless older Intel Mac mini, 2026.4.20) the gateway logged ~157 handshake-timeout entries and ~129 matrix/native-approvals errors 1:1 correlated in a single 24 h window, with handshakeMs spiking to 13–17 s against a nominal 10 s timer — the event loop was falling behind.
  • What changed: startChannelApprovalHandlerBootstrap now goes through a new process-scoped approval-handler-start-coordinator that applies (1) randomized startup jitter and (2) a FIFO concurrency cap on concurrent handler starts. Applied to the initial startup path only — retries keep their existing 1 s retry-timer cadence.
  • What did NOT change (scope boundary): approver resolution, isChannelExecApprovalClientEnabledFromConfig semantics, the retry-after-failure backoff layer (openclaw#68283 is the in-flight work there), the ChannelApprovalHandler runtime contract, or startChannelApprovalHandlerBootstrap's existing signature/cleanup behavior. The startCoordinator param is optional and documented as a test seam.
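
The two primitives named in the summary can be illustrated with a minimal sketch. The names waitJitter and acquireStartSlot come from the PR text; createStartCoordinator, startWithCoordination, and every function body below are hypothetical and are not taken from the actual approval-handler-start-coordinator.ts source.

```typescript
// Illustrative sketch only: assumed shapes, not the openclaw implementation.
type Release = () => void;

interface StartCoordinator {
  waitJitter(): Promise<void>;
  acquireStartSlot(): Promise<Release>;
}

function createStartCoordinator(opts: {
  jitterMs: number;
  maxConcurrentStarts: number;
  random?: () => number;
}): StartCoordinator {
  const random = opts.random ?? Math.random;
  const waiters: Array<(release: Release) => void> = [];
  let active = 0;

  const makeRelease = (): Release => {
    let released = false;
    return () => {
      if (released) return; // a double release must not free two slots
      released = true;
      const next = waiters.shift();
      if (next) {
        next(makeRelease()); // hand the slot straight to the oldest waiter (FIFO)
      } else {
        active -= 1;
      }
    };
  };

  return {
    waitJitter(): Promise<void> {
      if (opts.jitterMs <= 0) return Promise.resolve();
      const delay = Math.floor(random() * opts.jitterMs);
      return new Promise<void>((resolve) => {
        setTimeout(resolve, delay);
      });
    },
    acquireStartSlot(): Promise<Release> {
      if (active < opts.maxConcurrentStarts) {
        active += 1;
        return Promise.resolve(makeRelease());
      }
      return new Promise<Release>((resolve) => {
        waiters.push(resolve);
      });
    },
  };
}

// Initial-start path: jitter first, then wait for a slot, then start.
async function startWithCoordination(
  coordinator: StartCoordinator,
  start: () => Promise<void>,
): Promise<void> {
  await coordinator.waitJitter();
  const release = await coordinator.acquireStartSlot();
  try {
    await start();
  } finally {
    release(); // a thrown start() still frees the slot
  }
}
```

With maxConcurrentStarts set to 3, a fan-out of 11 account bootstraps through startWithCoordination would have at most three handshakes in flight at once, and the rest would drain in arrival order.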

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes openclaw#70641
  • Related openclaw#68283 (retry-after-failure backoff; complementary)
  • Related openclaw#69936 (target-accountId scoping; complementary)
  • Related openclaw#70568 (Telegram ambiguous-account fan-out; complementary)
  • Related openclaw#68223 (multi-telegram handshake cascade report)
  • Related openclaw#69012 (telegram native-approvals handshake timeout on fresh boot)
  • Related openclaw#67034 (multi-account polling avalanche)
  • Prior art: commit c23ad91a14 fix(matrix): keep DM allowlist out of room commands (merged upstream, same "tighten DM allowlist scope" direction, different layer)
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: there was no coordination between the N per-account invocations of startChannelApprovalHandlerBootstrap. server-channels.ts:304 fans out accounts via Promise.all, each bootstrap awaits startHandlerForContext which in turn synchronously constructs a fresh GatewayClient and starts its preauth handshake. With jitter = 0 and concurrency = ∞, N accounts produce N near-simultaneous loopback handshakes. Under pressure the handshake timer on the server side misses its deadline, each fails, each retries at 1 Hz, and the thrash stays persistent because there is no jitter and no concurrency ceiling.
  • Missing detection / guardrail: no existing test exercised startChannelApprovalHandlerBootstrap under N concurrent bootstraps sharing a loopback — the handler-bootstrap unit suite covers single-account lifecycle only.
  • Contributing context (if known): the bug is latent regardless of channel; it is most visible on multi-account Matrix/Telegram installs because those channels' approver-resolvers can implicitly activate the handler from dm.allowFrom / owner-allowlists. That implicit-activation semantic is a separate discussion (the assertion at lines 121–139 of extensions/matrix/src/exec-approvals.test.ts test-enforces the current behavior) and is intentionally out of scope here.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/infra/approval-handler-bootstrap.test.ts and src/infra/approval-handler-start-coordinator.test.ts.
  • Scenario the test should lock in: (1) jitter delays first handler start by the configured interval; (2) unregister during jitter cancels cleanly with no handler construction; (3) with maxConcurrentStarts: 1, three concurrent bootstraps drain FIFO with only one handler in flight at a time; (4) a slot releases after a thrown handler.start(); (5) retry attempts do not re-apply jitter; (6) env-var parsing rejects partial-number values like "100ms" and "3.5"; (7) jitter sampling clamps to [0, jitterMs) even when an injected RNG returns 1.
  • Why this is the smallest reliable guardrail: the coordinator is a small pure module with a narrow contract (waitJitter + acquireStartSlot); every failure mode of the storm reduces to one of those two primitives behaving wrong, so unit-level coverage is sufficient. The bootstrap tests cover the threading into startHandlerForContext so a regression that accidentally disables the coordinator in production is also caught.
  • Existing test that already covers this (if any): none — the pre-existing 6 tests in approval-handler-bootstrap.test.ts cover single-account lifecycle and are preserved unchanged.
  • If no new test is added, why not: N/A — tests added.
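
Scenarios (6) and (7) are easy to get wrong, so a small sketch may help. It assumes a strict digits-only parser and a floor-then-clamp sampler; parseNonNegativeIntEnv and sampleJitterMs are hypothetical names, not the functions in the PR.

```typescript
// Hypothetical helpers illustrating scenarios (6) and (7); not the PR's code.

// Accept only plain non-negative base-10 integers. Number.parseInt would
// happily turn "100ms" into 100, which is exactly what must be rejected.
function parseNonNegativeIntEnv(raw: string | undefined, fallback: number): number {
  if (raw === undefined || !/^\d+$/.test(raw)) return fallback;
  const value = Number(raw);
  return Number.isSafeInteger(value) ? value : fallback;
}

// Sample an integer delay in [0, jitterMs). The clamp matters when an
// injected RNG returns exactly 1, which Math.random never does.
function sampleJitterMs(jitterMs: number, random: () => number = Math.random): number {
  if (jitterMs <= 0) return 0;
  const sample = Math.floor(random() * jitterMs);
  return Math.max(0, Math.min(sample, jitterMs - 1));
}
```

Under these assumptions, "100ms", "3.5", and " 250 " all fall back to the default, while "0" parses to 0 and disables jitter.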

User-visible / Behavior Changes

  • On fresh boot, the first native-approvals handler start for each account is delayed by a uniformly-random [0, 2000) ms window (configurable). Single-account installs see at most one such delay per channel. Multi-account installs stop self-inflicted handshake storms.
  • Two new env vars, both optional with safe defaults: OPENCLAW_APPROVAL_HANDLER_START_JITTER_MS (default 2000, set 0 to restore pre-patch timing) and OPENCLAW_APPROVAL_HANDLER_MAX_CONCURRENT_STARTS (default 3, set to a large value to restore pre-patch free-for-all).
  • No user-visible config schema change, no UI change, no protocol change.

Diagram (if applicable)

Before (N=4 accounts, same channel):
  t=0ms   account:a -> handshake
  t=0ms   account:b -> handshake    all N handshakes at t=0
  t=0ms   account:c -> handshake    -> server event loop stalls
  t=0ms   account:d -> handshake    -> all time out -> retry at t=1000ms, repeat

After (same N, jitter=2000ms, maxConcurrentStarts=3):
  t=~400ms    account:a acquires slot, handshake starts
  t=~700ms    account:b acquires slot, handshake starts
  t=~1100ms   account:c acquires slot, handshake starts
  t=~1600ms   account:d jittered; slots 1-2 have finished; d acquires, handshake starts
  -> server event loop keeps up, handshakes complete, no storm

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No — same connections, just serialized/jittered.
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS 15.x, older Intel Mac mini (CPU impact figures are anecdotal to this hardware)
  • Runtime/container: Node 22+, launchd-managed headless gateway
  • Model/provider: N/A (storm is in gateway preauth layer, no model calls involved)
  • Integration/channel (if any): Matrix with 11 configured accounts; each account has dm.allowFrom set, no explicit execApprovals config
  • Relevant config (redacted): channels.matrix.accounts.*, channels.matrix.dm.allowFrom: [<owner-mxid>], channels.matrix.execApprovals unset, gateway bind: loopback mode: local token auth

Steps

  1. Configure 6+ Matrix accounts under channels.matrix.accounts.*, each with the same owner in dm.allowFrom and execApprovals unset.
  2. Start the gateway and let it run >5 minutes.
  3. Tail the gateway log — expect repeating [matrix/native-approvals] connect error: gateway closed (1000) plus [gateway/ws] closed before connect ... cause=handshake-timeout at ~1 Hz, with handshakeMs frequently exceeding the configured/default 10 s timer.

Expected

  • On startup, all account handlers complete their preauth handshake once; no subsequent handshake-timeout entries until a real transient failure occurs.

Actual (before this PR)

  • Persistent 1 Hz loop of handshake-timeout / gateway closed (1000) / failed to start native approval handler, 1:1 correlated. openclaw status hangs intermittently; sessions_send times out; gateway-process CPU is artificially elevated for hours.

Evidence

  • Failing test/log before + passing after — see the "Live smoke on 11-account Matrix reproducer" section below for the end-to-end before/after.
  • Trace/log snippets — [matrix/native-approvals] connect error: gateway closed (1000): + matching [gateway/ws] closed before connect conn=... cause=handshake-timeout.
  • Screenshot/recording — N/A (no UI change).
  • Perf numbers (if relevant) — handshakeMs values of 13–17 s against a nominal 10 s timer before the fix; event loop saturated by self-inflicted preauth churn. Post-fix: zero handshake-timeout / native-approvals errors in 7 minutes of runtime against the same config.

Live smoke on 11-account Matrix reproducer

Applied on the original affected host on 2026-04-23 14:37 UTC by cherry-picking the three commits (8270f8322e, f4524ad390, c4d83d79b2) onto v2026.4.20, running pnpm install --frozen-lockfile && pnpm build && pnpm ui:build, and triggering a gateway restart via kill <pid> (launchd KeepAlive=true auto-respawns).

Pre-restart pattern (representative sample from the host's gateway err log, last ~10 min before SIGTERM):

2026-04-23T10:26:09.267  [matrix] connect error: gateway closed (1000):
2026-04-23T10:26:09.268  gateway connect failed: Error: gateway closed (1000):
2026-04-23T10:26:09.302  [matrix] connect error: gateway closed (1000):
2026-04-23T10:26:09.303  gateway connect failed: Error: gateway closed (1000):
2026-04-23T10:26:09.350  [matrix] connect error: gateway closed (1000):
2026-04-23T10:26:09.351  gateway connect failed: Error: gateway closed (1000):
2026-04-23T10:28:07.250  [matrix] connect error: gateway request timeout for connect
2026-04-23T10:28:07.334  [matrix] connect error: gateway request timeout for connect
2026-04-23T10:28:07.383  [matrix] connect error: gateway request timeout for connect
2026-04-23T10:28:07.416  [matrix] connect error: gateway request timeout for connect

Post-restart (all err-log entries from SIGTERM through +7 min, unfiltered):

2026-04-23T10:39:21.278  [secrets] gateway.auth.token is inactive (env var configured)
2026-04-23T10:40:56.740  [bonjour] watchdog detected non-announced service; attempting re-advertise
2026-04-23T10:43:58.166  [bonjour] restarting advertiser (stuck in probing)
2026-04-23T10:43:58.235  [model-pricing] OpenRouter pricing fetch failed (timeout 15s)
2026-04-23T10:43:58.243  [model-pricing] LiteLLM pricing fetch failed (timeout 15s)
2026-04-23T10:44:07.687  [bonjour] watchdog detected non-announced service; attempting re-advertise
2026-04-23T10:44:19.332  [bonjour] restarting advertiser (stuck in unannounced)

All 7 post-restart entries are unrelated to the approvals storm (two benign bonjour advertise cycles, a secrets-config surface hint, two outbound pricing-fetch timeouts to OpenRouter/LiteLLM).

Counts:

Window                                                                | Storm entries (handshake-timeout / gateway closed (1000) / [matrix] connect error)
Pre-restart, last 90 min (09:00–10:37 local, err log)                 | 239
Post-restart, full window (10:38–10:45 local, 7 min, err log)         | 0
Post-restart, JSON main log (14:38+ UTC, 22 total entries in window)  | 0
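
A count like the one above could be reproduced with a small log filter. The sketch below assumes that a line counts once if it contains any of the three markers from the table header; the PR does not say exactly how its counts were produced, and countStormEntries is a hypothetical helper.

```typescript
// Hypothetical helper: counts log lines that contain any storm marker.
// The "a line counts once" rule is an assumption, not taken from the PR.
const STORM_MARKERS: readonly string[] = [
  "handshake-timeout",
  "gateway closed (1000)",
  "[matrix] connect error",
];

function countStormEntries(logText: string): number {
  return logText
    .split("\n")
    .filter((line) => STORM_MARKERS.some((marker) => line.includes(marker)))
    .length;
}
```

Feeding it the err-log text for a given time window gives a number that can be compared before and after the restart.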

Gateway health post-restart: port 18789 rebound (3:03 boot-to-listen), single live PID 24714 under user mercury, openclaw status reports app 2026.4.20. No crash-loop observed.

Human Verification (required)

  • Verified scenarios:
    • local pnpm test src/infra/approval-handler-bootstrap.test.ts src/infra/approval-handler-start-coordinator.test.ts (27/27 passed)
    • pre-commit smart gate pnpm check:changed on both commits — 208 files / 2468 tests passed on the more recent commit
    • pnpm tsgo + pnpm tsgo:test clean; pnpm lint:core 0 warnings 0 errors
    • live smoke on the 11-account Matrix reproducer host: 239 storm entries in the 90 min preceding the cutover → 0 storm entries in the 7 minutes after (see the "Live smoke" section above for the log samples and counts)
  • Edge cases checked: jitter = 0 (regression-safe / opt-out), cancellation during jitter wait, cancellation while queued for a slot, slot release after a thrown handler.start(), replacement-during-jitter stops the previous handler immediately (regression test with jitterMs=5000), retry path does not re-jitter (regression test with jitterMs=10000, retry fires on exact 1 s boundary), env parsing rejection of "100ms" / "3.5" / " 250 " / "7x" / "2.5", jitter clamp when injected random() returns 1.
  • What I did not verify: a multi-hour soak (only 7+ min of live runtime so far); behavior on non-Matrix channels (Telegram/Discord/QQbot) under equivalent multi-account load — reasoning for those is identical at the bootstrap layer, but the live smoke here was Matrix-only.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Codex review on this PR: P1 (discussion_r3130645782, stopHandler ordering vs jitter/slot wait) resolved after commit f4524ad390; P2 (discussion_r3131032761, retry attempts being re-jittered) resolved after commit c4d83d79b2. Both threads marked resolved on the PR.

Compatibility / Migration

  • Backward compatible? Yes — no signature, schema, or protocol change; new optional startCoordinator param on startChannelApprovalHandlerBootstrap defaults to the process-scoped singleton.
  • Config/env changes? Yes — two new optional env vars, both with safe defaults, both accept only pure non-negative integers (invalid values fall back to default).
  • Migration needed? No. Operators who explicitly want pre-patch timing can set OPENCLAW_APPROVAL_HANDLER_START_JITTER_MS=0 and OPENCLAW_APPROVAL_HANDLER_MAX_CONCURRENT_STARTS to a large value.

Risks and Mitigations

  • Risk: startup is slightly slower for installs that were not actually hitting the storm — each account's approval handler may wait up to jitterMs (default 2 s) before the first handshake.
    • Mitigation: default is a one-shot per-account delay bounded by jitterMs; operators can set OPENCLAW_APPROVAL_HANDLER_START_JITTER_MS=0 to disable. The delay only applies to the initial start, not to retries.
  • Risk: the FIFO slot queue could starve a waiter if preceding slots never release (handler's start() / stop() never returns).
    • Mitigation: handler.stop() is best-effort with .catch(() => {}) and release happens in a finally; a thrown or swallowed start() still releases the slot. Tests cover the throw path.
  • Risk: process-scoped singleton state could leak between tests.
    • Mitigation: _resetDefaultApprovalHandlerStartCoordinatorForTests() is called in beforeEach/afterEach in the new bootstrap and coordinator test suites.

AI-assisted PR notice

  • Mark as AI-assisted — this PR was authored via Claude Code (Claude Opus 4.7, 1M context), iterating against real source, real tests, and GitHub Codex review.
  • Degree of testing: fully tested, with 27 targeted unit tests (10 bootstrap + 12 coordinator + 5 added for codex review hardening), a 2468-test pre-commit smart gate, and clean typecheck + lint. Live smoke against the 11-account Matrix reproducer is covered under Human Verification above.
  • Confirm I understand what the code does — yes; rationale for every primitive documented in the coordinator header comment and in-line in startHandlerForContext.
  • GitHub Codex review ran and its P1/P2 findings are addressed (see Review Conversations).

Changed files

  • src/infra/approval-handler-bootstrap.test.ts (modified, +374/-6)
  • src/infra/approval-handler-bootstrap.ts (modified, +70/-27)
  • src/infra/approval-handler-start-coordinator.test.ts (added, +356/-0)
  • src/infra/approval-handler-start-coordinator.ts (added, +209/-0)

PR #70649: fix(approvals): coordinate native handler startup to avoid loopback storms

Description (problem / solution / changelog)

Summary

  • Problem: on multi-account installs (e.g. 11 Matrix accounts, 4+ Telegram bots, Discord with many servers), every account independently opens a native-approvals gateway client on the same startup tick. The loopback gateway has to complete N concurrent preauth WebSocket handshakes at once, saturates its own handshake timer, fails them all with handshake-timeout / gateway closed (1000), and each failure retries at 1 Hz — a sustained self-DoS of the process's own loopback.
  • Why it matters: every internal consumer hitting the local gateway (session RPC, tool calls, node messaging, openclaw status) pays the latency tax. On the reproducer host (11 Matrix accounts, headless older Intel Mac mini, 2026.4.20) the gateway logged ~157 handshake-timeout entries and ~129 matrix/native-approvals errors 1:1 correlated in a single 24 h window, with handshakeMs spiking to 13–17 s against a nominal 10 s timer — the event loop was falling behind.
  • What changed: startChannelApprovalHandlerBootstrap now goes through a new process-scoped approval-handler-start-coordinator that applies (1) randomized startup jitter and (2) a FIFO concurrency cap on concurrent handler starts. Applied to the initial startup path only — retries keep their existing 1 s retry-timer cadence.
  • What did NOT change (scope boundary): approver resolution, isChannelExecApprovalClientEnabledFromConfig semantics, the retry-after-failure backoff layer (openclaw#68283 is the in-flight work there), the ChannelApprovalHandler runtime contract, or startChannelApprovalHandlerBootstrap's existing signature/cleanup behavior. The startCoordinator param is optional and documented as a test seam.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #70641
  • Related openclaw#68283 (retry-after-failure backoff; complementary)
  • Related openclaw#69936 (target-accountId scoping; complementary)
  • Related openclaw#70568 (Telegram ambiguous-account fan-out; complementary)
  • Related openclaw#68223 (multi-telegram handshake cascade report)
  • Related openclaw#69012 (telegram native-approvals handshake timeout on fresh boot)
  • Related openclaw#67034 (multi-account polling avalanche)
  • Prior art: commit c23ad91a14 fix(matrix): keep DM allowlist out of room commands (merged upstream, same "tighten DM allowlist scope" direction, different layer)
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: there was no coordination between the N per-account invocations of startChannelApprovalHandlerBootstrap. server-channels.ts:304 fans out accounts via Promise.all, each bootstrap awaits startHandlerForContext which in turn synchronously constructs a fresh GatewayClient and starts its preauth handshake. With jitter = 0 and concurrency = ∞, N accounts produce N near-simultaneous loopback handshakes. Under pressure the handshake timer on the server side misses its deadline, each fails, each retries at 1 Hz, and the thrash stays persistent because there is no jitter and no concurrency ceiling.
  • Missing detection / guardrail: no existing test exercised startChannelApprovalHandlerBootstrap under N concurrent bootstraps sharing a loopback — the handler-bootstrap unit suite covers single-account lifecycle only.
  • Contributing context (if known): the bug is latent regardless of channel; it is most visible on multi-account Matrix/Telegram installs because those channels' approver-resolvers can implicitly activate the handler from dm.allowFrom / owner-allowlists. That implicit-activation semantic is a separate discussion (the assertion at lines 121–139 of extensions/matrix/src/exec-approvals.test.ts test-enforces the current behavior) and is intentionally out of scope here.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/infra/approval-handler-bootstrap.test.ts and src/infra/approval-handler-start-coordinator.test.ts.
  • Scenario the test should lock in: (1) jitter delays first handler start by the configured interval; (2) unregister during jitter cancels cleanly with no handler construction; (3) with maxConcurrentStarts: 1, three concurrent bootstraps drain FIFO with only one handler in flight at a time; (4) a slot releases after a thrown handler.start(); (5) retry attempts do not re-apply jitter; (6) env-var parsing rejects partial-number values like "100ms" and "3.5"; (7) jitter sampling clamps to [0, jitterMs) even when an injected RNG returns 1.
  • Why this is the smallest reliable guardrail: the coordinator is a small pure module with a narrow contract (waitJitter + acquireStartSlot); every failure mode of the storm reduces to one of those two primitives behaving wrong, so unit-level coverage is sufficient. The bootstrap tests cover the threading into startHandlerForContext so a regression that accidentally disables the coordinator in production is also caught.
  • Existing test that already covers this (if any): none — the pre-existing 6 tests in approval-handler-bootstrap.test.ts cover single-account lifecycle and are preserved unchanged.
  • If no new test is added, why not: N/A — tests added.

User-visible / Behavior Changes

  • On fresh boot, the first native-approvals handler start for each account is delayed by a uniformly-random [0, 2000) ms window (configurable). Single-account installs see at most one such delay per channel. Multi-account installs stop self-inflicted handshake storms.
  • Two new env vars, both optional with safe defaults: OPENCLAW_APPROVAL_HANDLER_START_JITTER_MS (default 2000, set 0 to restore pre-patch timing) and OPENCLAW_APPROVAL_HANDLER_MAX_CONCURRENT_STARTS (default 3, set to a large value to restore pre-patch free-for-all).
  • No user-visible config schema change, no UI change, no protocol change.

Diagram (if applicable)

Before (N=4 accounts, same channel):
  t=0ms   account:a -> handshake
  t=0ms   account:b -> handshake    all N handshakes at t=0
  t=0ms   account:c -> handshake    -> server event loop stalls
  t=0ms   account:d -> handshake    -> all time out -> retry at t=1000ms, repeat

After (same N, jitter=2000ms, maxConcurrentStarts=3):
  t=~400ms    account:a acquires slot, handshake starts
  t=~700ms    account:b acquires slot, handshake starts
  t=~1100ms   account:c acquires slot, handshake starts
  t=~1600ms   account:d jittered; slots 1-2 have finished; d acquires, handshake starts
  -> server event loop keeps up, handshakes complete, no storm

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No — same connections, just serialized/jittered.
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS 15.x, older Intel Mac mini (CPU impact figures are anecdotal to this hardware)
  • Runtime/container: Node 22+, launchd-managed headless gateway
  • Model/provider: N/A (storm is in gateway preauth layer, no model calls involved)
  • Integration/channel (if any): Matrix with 11 configured accounts; each account has dm.allowFrom set, no explicit execApprovals config
  • Relevant config (redacted): channels.matrix.accounts.*, channels.matrix.dm.allowFrom: [<owner-mxid>], channels.matrix.execApprovals unset, gateway bind: loopback mode: local token auth

Steps

  1. Configure 6+ Matrix accounts under channels.matrix.accounts.*, each with the same owner in dm.allowFrom and execApprovals unset.
  2. Start the gateway and let it run >5 minutes.
  3. Tail the gateway log — expect repeating [matrix/native-approvals] connect error: gateway closed (1000) plus [gateway/ws] closed before connect ... cause=handshake-timeout at ~1 Hz, with handshakeMs frequently exceeding the configured/default 10 s timer.

Expected

  • On startup, all account handlers complete their preauth handshake once; no subsequent handshake-timeout entries until a real transient failure occurs.

Actual (before this PR)

  • Persistent 1 Hz loop of handshake-timeout / gateway closed (1000) / failed to start native approval handler, 1:1 correlated. openclaw status hangs intermittently; sessions_send times out; gateway-process CPU is artificially elevated for hours.

Evidence

  • Failing test/log before + passing after — see the "Live smoke on 11-account Matrix reproducer" section below for the end-to-end before/after.
  • Trace/log snippets — [matrix/native-approvals] connect error: gateway closed (1000): + matching [gateway/ws] closed before connect conn=... cause=handshake-timeout.
  • Screenshot/recording — N/A (no UI change).
  • Perf numbers (if relevant) — handshakeMs values of 13–17 s against a nominal 10 s timer before the fix; event loop saturated by self-inflicted preauth churn. Post-fix: zero handshake-timeout / native-approvals errors in 7 minutes of runtime against the same config.

Live smoke on 11-account Matrix reproducer

Applied on the original affected host on 2026-04-23 14:37 UTC by cherry-picking the three commits (8270f8322e, f4524ad390, c4d83d79b2) onto v2026.4.20, running pnpm install --frozen-lockfile && pnpm build && pnpm ui:build, and triggering a gateway restart via kill <pid> (launchd KeepAlive=true auto-respawns).

Pre-restart pattern (representative sample from the host's gateway err log, last ~10 min before SIGTERM):

2026-04-23T10:26:09.267  [matrix] connect error: gateway closed (1000):
2026-04-23T10:26:09.268  gateway connect failed: Error: gateway closed (1000):
2026-04-23T10:26:09.302  [matrix] connect error: gateway closed (1000):
2026-04-23T10:26:09.303  gateway connect failed: Error: gateway closed (1000):
2026-04-23T10:26:09.350  [matrix] connect error: gateway closed (1000):
2026-04-23T10:26:09.351  gateway connect failed: Error: gateway closed (1000):
2026-04-23T10:28:07.250  [matrix] connect error: gateway request timeout for connect
2026-04-23T10:28:07.334  [matrix] connect error: gateway request timeout for connect
2026-04-23T10:28:07.383  [matrix] connect error: gateway request timeout for connect
2026-04-23T10:28:07.416  [matrix] connect error: gateway request timeout for connect

Post-restart (all err-log entries from SIGTERM through +7 min, unfiltered):

2026-04-23T10:39:21.278  [secrets] gateway.auth.token is inactive (env var configured)
2026-04-23T10:40:56.740  [bonjour] watchdog detected non-announced service; attempting re-advertise
2026-04-23T10:43:58.166  [bonjour] restarting advertiser (stuck in probing)
2026-04-23T10:43:58.235  [model-pricing] OpenRouter pricing fetch failed (timeout 15s)
2026-04-23T10:43:58.243  [model-pricing] LiteLLM pricing fetch failed (timeout 15s)
2026-04-23T10:44:07.687  [bonjour] watchdog detected non-announced service; attempting re-advertise
2026-04-23T10:44:19.332  [bonjour] restarting advertiser (stuck in unannounced)

All 7 post-restart entries are unrelated to the approvals storm (two benign bonjour advertise cycles, a secrets-config surface hint, two outbound pricing-fetch timeouts to OpenRouter/LiteLLM).

Counts:

Window                                                                | Storm entries (handshake-timeout / gateway closed (1000) / [matrix] connect error)
Pre-restart, last 90 min (09:00–10:37 local, err log)                 | 239
Post-restart, full window (10:38–10:45 local, 7 min, err log)         | 0
Post-restart, JSON main log (14:38+ UTC, 22 total entries in window)  | 0

Gateway health post-restart: port 18789 rebound (3:03 boot-to-listen), single live PID 24714 under user mercury, openclaw status reports app 2026.4.20. No crash-loop observed.

Human Verification (required)

  • Verified scenarios:
    • local pnpm test src/infra/approval-handler-bootstrap.test.ts src/infra/approval-handler-start-coordinator.test.ts (34/34 passed)
    • pre-commit smart gate pnpm check:changed on each commit — 208 files / 2475 tests passed on the most recent commit
    • pnpm tsgo + pnpm tsgo:test clean; pnpm lint:core 0 warnings 0 errors
    • live smoke on the 11-account Matrix reproducer host: 239 storm entries in the 90 min preceding the cutover → 0 storm entries in the 7 minutes after (see the "Live smoke" section above for the log samples and counts)
  • Edge cases checked: jitter = 0 (regression-safe / opt-out), cancellation during jitter wait, cancellation while queued for a slot, slot release after a thrown handler.start(), replacement-during-jitter stops the previous handler immediately (regression test with jitterMs=5000), retry path does not re-jitter (regression test with jitterMs=10000, retry fires on exact 1 s boundary), env parsing rejection of "100ms" / "3.5" / " 250 " / "7x" / "2.5", jitter clamp when injected random() returns 1.
  • What I did not verify: a multi-hour soak (only 7+ min of live runtime so far); behavior on non-Matrix channels (Telegram/Discord/QQbot) under equivalent multi-account load — reasoning for those is identical at the bootstrap layer, but the live smoke here was Matrix-only.
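
The env-parsing and jitter-clamp edge cases listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `parseNonNegativeIntEnv` and `pickStartJitterMs` are hypothetical, and only the rejected inputs, the fall-back-to-default rule, and the [0, jitterMs) range come from the PR description.

```typescript
// Hypothetical helpers; names and signatures are assumptions for illustration.

/** Accept only a pure non-negative integer; anything else falls back to the default. */
export function parseNonNegativeIntEnv(raw: string | undefined, fallback: number): number {
  // Digits only: no sign, decimal point, unit suffix, or surrounding whitespace.
  if (raw === undefined || !/^\d+$/.test(raw)) {
    return fallback;
  }
  const value = Number(raw);
  return Number.isSafeInteger(value) ? value : fallback;
}

/** Pick a start delay in [0, jitterMs), even if the injected random() returns 1. */
export function pickStartJitterMs(jitterMs: number, random: () => number = Math.random): number {
  if (jitterMs <= 0) {
    return 0; // jitter disabled (the opt-out case)
  }
  const delay = Math.floor(random() * jitterMs);
  // Clamp the upper edge so a random() of exactly 1 cannot yield jitterMs itself.
  return Math.min(delay, Math.max(jitterMs - 1, 0));
}
```

With a 2000 ms default, `parseNonNegativeIntEnv("100ms", 2000)` and `parseNonNegativeIntEnv(" 250 ", 2000)` both return the default, while `"0"` parses to 0 and disables the jitter.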

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Codex review on this PR:

  • P1 (discussion_r3131925898, start slot held through handler.start() could block queued bootstraps indefinitely if start() hangs on unbounded post-handshake work) — addressed in commit 4fcdeb7a06 by adding a per-acquisition hold-timeout watchdog (startSlotMaxHoldMs, default 30 s). Thread resolved.
  • P2 (discussion_r3132157416, acquireStartSlot only checked isCanceled pre-enqueue; stale generations could burn FIFO turns) — addressed in commit 0a4f955d51 by re-evaluating isCanceled at dequeue time and skipping canceled waiters. Thread resolved.
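
The P2 fix (re-checking cancellation at dequeue time) together with release-once semantics can be sketched with a small FIFO slot queue. This is a simplified model under stated assumptions: `StartSlotQueue` and its callback-based API are hypothetical, the real coordinator is presumably promise-based, and the P1 hold-timeout watchdog (`startSlotMaxHoldMs`) is deliberately left out.

```typescript
// Hypothetical, callback-based model of a FIFO start-slot queue.
type Waiter = {
  isCanceled: () => boolean;
  onGranted: (release: () => void) => void;
};

export class StartSlotQueue {
  private active = 0;
  private readonly waiters: Waiter[] = [];
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  acquire(waiter: Waiter): void {
    if (waiter.isCanceled()) {
      return; // pre-enqueue check: never queue an already-stale generation
    }
    this.waiters.push(waiter);
    this.drain();
  }

  private drain(): void {
    while (this.active < this.maxConcurrent && this.waiters.length > 0) {
      const next = this.waiters.shift()!;
      if (next.isCanceled()) {
        continue; // dequeue-time re-check: canceled waiters do not burn a FIFO turn
      }
      this.active += 1;
      let released = false;
      const release = (): void => {
        if (released) {
          return; // idempotent, so a release in `finally` plus a watchdog cannot double-free
        }
        released = true;
        this.active -= 1;
        this.drain();
      };
      next.onGranted(release);
    }
  }
}
```

A caller would invoke `release` from a `finally` block around its start attempt, so a thrown start still frees the slot for the next waiter.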

Prior-iteration codex findings on the fork PR (lukeboyett/openclaw#2, now closed) are carried forward in commits f4524ad390 (stopHandler before jitter/slot wait) and c4d83d79b2 (limit jitter to initial starts; env-parsing hardening; [0, jitterMs) clamp). Those fork threads are not visible on this PR but the fixes are in the commit history here.

Compatibility / Migration

  • Backward compatible? Yes — no signature, schema, or protocol change; new optional startCoordinator param on startChannelApprovalHandlerBootstrap defaults to the process-scoped singleton.
  • Config/env changes? Yes — two new optional env vars, both with safe defaults, both accept only pure non-negative integers (invalid values fall back to default).
  • Migration needed? No. Operators who explicitly want pre-patch timing can set OPENCLAW_APPROVAL_HANDLER_START_JITTER_MS=0 and OPENCLAW_APPROVAL_HANDLER_MAX_CONCURRENT_STARTS to a large value.

Risks and Mitigations

  • Risk: startup is slightly slower for installs that were not actually hitting the storm — each account's approval handler may wait up to jitterMs (default 2 s) before the first handshake.
    • Mitigation: default is a one-shot per-account delay bounded by jitterMs; operators can set OPENCLAW_APPROVAL_HANDLER_START_JITTER_MS=0 to disable. The delay only applies to the initial start, not to retries.
  • Risk: the FIFO slot queue could starve a waiter if preceding slots never release (handler's start() / stop() never returns).
    • Mitigation: handler.stop() is best-effort with .catch(() => {}) and release happens in a finally; a thrown or swallowed start() still releases the slot. Tests cover the throw path.
  • Risk: process-scoped singleton state could leak between tests.
    • Mitigation: _resetDefaultApprovalHandlerStartCoordinatorForTests() is called in beforeEach/afterEach in the new bootstrap and coordinator test suites.

AI-assisted PR notice

  • Mark as AI-assisted — this PR was authored via Claude Code (Claude Opus 4.7, 1M context), iterating against real source, real tests, and GitHub Codex review.
  • Degree of testing — fully tested: 34 targeted unit tests (11 bootstrap + 23 coordinator, including regression tests for each codex finding), 2475-test pre-commit smart gate on every commit, typecheck + lint clean. Live smoke on the 11-account Matrix reproducer host also completed (see "Live smoke" and Human Verification sections).
  • Confirm I understand what the code does — yes; rationale for every primitive documented in the coordinator header comment and in-line in startHandlerForContext.
  • GitHub Codex review ran and its P1/P2 findings are addressed (see Review Conversations).

Changed files

  • src/infra/approval-handler-bootstrap.test.ts (modified, +374/-6)
  • src/infra/approval-handler-bootstrap.ts (modified, +70/-27)
  • src/infra/approval-handler-start-coordinator.test.ts (added, +415/-0)
  • src/infra/approval-handler-start-coordinator.ts (added, +237/-0)

Code Example

[matrix/native-approvals] connect error: gateway closed (1000):
gateway connect failed: Error: gateway closed (1000):
[gateway/ws] closed before connect conn=<uuid> cause=handshake-timeout ... code=1000
[matrix/native-approvals] failed to start native approval handler: Error: gateway closed: 1000

---

[ws] handshake timeout conn=<uuid> peer=127.0.0.1:<a>-><gateway-port> remote=127.0.0.1 handshakeMs=13427
[ws] handshake timeout conn=<uuid> peer=127.0.0.1:<b>-><gateway-port> remote=127.0.0.1 handshakeMs=17104
[matrix/native-approvals] connect error: gateway closed (1000):
gateway connect failed: Error: gateway closed (1000):
[matrix/native-approvals] failed to start native approval handler: Error: gateway closed: 1000

---

export function isChannelExecApprovalClientEnabledFromConfig(params: {
  enabled?: ChannelExecApprovalEnableMode;
  approverCount: number;
}): boolean {
  if (params.approverCount <= 0) {
    return false;
  }
  return params.enabled !== false; // undefined → true
}

Summary

On a gateway running 2026.4.20 with 11 configured Matrix accounts and no explicit execApprovals config, the loopback gateway enters a persistent 1 Hz native-approvals / handshake-timeout reconnect storm that degrades sessions_list / sessions_send and artificially elevates gateway-process CPU; reproduced for hours on the affected host and confirmed 1:1 in gateway logs.

Steps to reproduce

  1. Configure 6+ Matrix accounts under channels.matrix.accounts.* with a shared owner in channels.matrix.dm.allowFrom: ["@<owner>:<hs>"] and channels.matrix.execApprovals unset (no top-level override, no per-account override).
  2. Start the gateway normally (bind: loopback, mode: local, token auth) and let it run for more than 5 minutes.
  3. Tail the gateway log.

Expected behavior

On startup, each account's native-approvals subsystem completes its loopback preauth handshake once, then is silent until a real transient failure or context replacement. sessions_list / sessions_send / openclaw status remain responsive. This is the observed behavior on single-account installs and on prior versions that did not implicitly activate native approvals.

Actual behavior

A persistent 1 Hz loop of

[matrix/native-approvals] connect error: gateway closed (1000):
gateway connect failed: Error: gateway closed (1000):
[gateway/ws] closed before connect conn=<uuid> cause=handshake-timeout ... code=1000
[matrix/native-approvals] failed to start native approval handler: Error: gateway closed: 1000

with handshakeMs frequently exceeding the nominal 10 s configured/default timer. sessions_list / sessions_send / openclaw status intermittently time out. Gateway-process CPU is artificially elevated because the event loop is busy servicing self-inflicted preauth churn.

Environment

  • OpenClaw version: 2026.4.20
  • Operating system: macOS (tested on an older Intel Mac mini running headless). The storm itself is platform-agnostic; the absolute CPU numbers in the evidence are anecdotal to that hardware.
  • Install method: pnpm dev against the installed gateway binary, managed by launchd.
  • Model: N/A — the storm is in the gateway preauth layer and does not involve model inference. Primary agent runtime on the host: anthropic/claude-opus-4-7.
  • Provider / routing chain: N/A — no outbound provider traffic is involved. Traffic is entirely between the OpenClaw process and its own loopback gateway on 127.0.0.1:<gateway-port>.

Relevant non-provider config:

  • channels.matrix.accounts.* — 11 accounts plus a default block.
  • channels.matrix.dm.allowFrom: [<owner-mxid>].
  • channels.matrix.execApprovals: unset at both top-level and per-account.
  • exec-approvals.json: defaults: {}, agents: {}.
  • Gateway: bind: loopback, mode: local, token auth.

Logs, screenshots, and evidence

[ws] handshake timeout conn=<uuid> peer=127.0.0.1:<a>-><gateway-port> remote=127.0.0.1 handshakeMs=13427
[ws] handshake timeout conn=<uuid> peer=127.0.0.1:<b>-><gateway-port> remote=127.0.0.1 handshakeMs=17104
[matrix/native-approvals] connect error: gateway closed (1000):
gateway connect failed: Error: gateway closed (1000):
[matrix/native-approvals] failed to start native approval handler: Error: gateway closed: 1000

Single 24 h gateway log window on the affected instance: ~157 handshake-timeout entries and ~129 matrix/native-approvals error entries, closely correlated 1:1.
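
Counts like these can be reproduced from a log dump with a small tally. The sketch below is illustrative only: `countStormEntries` is a hypothetical helper, and it assumes one `[ws] handshake timeout` line per server-side timeout and one `[matrix/native-approvals] connect error` line per failed client start, matching the samples above.

```typescript
export interface StormCounts {
  handshakeTimeouts: number;
  nativeApprovalErrors: number;
}

/** Tally storm-related lines in raw gateway log text (hypothetical helper). */
export function countStormEntries(logText: string): StormCounts {
  let handshakeTimeouts = 0;
  let nativeApprovalErrors = 0;
  for (const line of logText.split(/\r?\n/)) {
    if (line.includes("[ws] handshake timeout")) {
      handshakeTimeouts += 1;
    }
    // Count only the "connect error" line so each failure is tallied once,
    // not twice via the follow-up "failed to start" line.
    if (line.includes("[matrix/native-approvals]") && line.includes("connect error")) {
      nativeApprovalErrors += 1;
    }
  }
  return { handshakeTimeouts, nativeApprovalErrors };
}
```

Running it over pre-fix and post-fix windows of the same length gives a direct before/after comparison.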

Impact and severity

  • Affected: any install with ≥6 Matrix accounts configured and no explicit execApprovals block. Also observed with similar shape on multi-Telegram-bot installs (see cluster references below).
  • Severity: annoying to high — blocks fast sessions_send, intermittently blocks openclaw status, and keeps the gateway process visibly "busy" on otherwise idle hosts.
  • Frequency: always reproducible on the affected configuration; steady-state once boot completes, does not self-clear.
  • Consequence: real work hitting the local gateway pays latency tax; restart-warning broadcasts time out; internal reliability degrades without any external network cause.

Root cause (analysis)

Three reinforcing issues combine:

1. isChannelExecApprovalClientEnabledFromConfig treats enabled === undefined as enabled

src/plugin-sdk/approval-client-helpers.ts:

export function isChannelExecApprovalClientEnabledFromConfig(params: {
  enabled?: ChannelExecApprovalEnableMode;
  approverCount: number;
}): boolean {
  if (params.approverCount <= 0) {
    return false;
  }
  return params.enabled !== false; // undefined → true
}

Combined with the Matrix approver resolver in extensions/matrix/src/exec-approvals.ts, which falls back to dm.allowFrom when execApprovals.approvers is not explicit, any Matrix account that has dm.allowFrom set implicitly opts into native approval delivery, even when the operator has never configured execApprovals. That turns "DM access policy" into "approval delivery policy" without opt-in. Note: this behavior is currently test-enforced at extensions/matrix/src/exec-approvals.test.ts:121–139 ("auto-enables when approvers resolve"), so reversing it is a cross-channel maintainer decision.
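
The interaction between the two pieces can be made concrete with a small sketch. The gate logic mirrors the function quoted above; everything else is an assumption for illustration: `EnableMode` stands in for `ChannelExecApprovalEnableMode` (whose real definition is not shown here), `resolveApprovers` is a hypothetical stand-in for the Matrix resolver, and `isClientEnabledExplicitOptIn` is one possible shape of an explicit opt-in gate.

```typescript
// Assumption: a plain boolean is enough to illustrate the gate.
type EnableMode = boolean;

// Gate logic as quoted in the issue.
function isClientEnabled(params: { enabled?: EnableMode; approverCount: number }): boolean {
  if (params.approverCount <= 0) {
    return false;
  }
  return params.enabled !== false; // undefined → true
}

// Hypothetical resolver: explicit approvers win, otherwise fall back to dm.allowFrom.
function resolveApprovers(explicit: string[] | undefined, dmAllowFrom: string[]): string[] {
  return explicit !== undefined ? explicit : dmAllowFrom;
}

// Sketch of an explicit opt-in gate: only `enabled === true` starts the client.
function isClientEnabledExplicitOptIn(params: { enabled?: EnableMode; approverCount: number }): boolean {
  return params.approverCount > 0 && params.enabled === true;
}

// An account with dm.allowFrom set and execApprovals entirely unset:
const approvers = resolveApprovers(undefined, ["@owner:example.org"]);
const startsToday = isClientEnabled({ approverCount: approvers.length });
const startsWithOptIn = isClientEnabledExplicitOptIn({ approverCount: approvers.length });
```

Here `startsToday` is `true` and `startsWithOptIn` is `false`: the DM allowlist alone is what activates the approval client under the current gate.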

2. One native-approvals gateway client per Matrix account

startChannelApprovalHandlerBootstrap in src/infra/approval-handler-bootstrap.ts is invoked per configured Matrix account during channel startup (src/gateway/server-channels.ts:304–401, inside a Promise.all over account IDs). Each account creates its own GatewayClient → preauth websocket to the loopback gateway. On an 11-account install that is 11 concurrent preauth websockets to 127.0.0.1:<port> during startup and every subsequent rebuild cycle — all sharing the same loopback preauth budget key.

3. Double-layer retry with no coordination

  • APPROVAL_HANDLER_BOOTSTRAP_RETRY_MS = 1000 in startChannelApprovalHandlerBootstrap reschedules startHandlerForContext on any failure.
  • Each startHandlerForContext constructs a new GatewayClient, whose own scheduleReconnect() also re-attempts with exponential backoff starting at 1 s.
  • No circuit breaker, no jitter.

Under any transient pressure (multi-account startup, upgrade, a single slow handshake), the inner GatewayClient reconnect and the outer startHandlerForContext retry compound into a persistent ~1 Hz reconnect storm across N accounts. The 32-per-IP preauth budget (DEFAULT_MAX_PREAUTH_CONNECTIONS_PER_IP) does not stop the storm — it causes some attempts to be rejected, which feeds right back into the 1 s retry. Once the event loop is saturated, setTimeout(handshakeTimer, 10000) routinely fires well past 10 s, so failing handshakes log handshakeMs values of 13–17+ seconds — itself a symptom of the pressure, not an extra cause.

Why this is not fixable purely via config

  • Setting channels.matrix.execApprovals.enabled: false (where it inherits per-account) papers over the bug only for operators who know to set it, does not fix the enabled === undefined → enabled semantics, and does not stop the retry stampede if anything else ever causes a transient failure.
  • Clearing dm.allowFrom is not a valid workaround — it breaks DM access control.
  • Reducing the Matrix account count is not a real fix — it just raises the failure threshold.
  • OPENCLAW_HANDSHAKE_TIMEOUT_MS tuning treats symptoms, not cause.

Suggested fixes (pick some combination)

  1. Require explicit opt-in for native approval clients. Change isChannelExecApprovalClientEnabledFromConfig and/or the Matrix resolver so the delivery client only starts when execApprovals.enabled === true. Default to off. Keep dm.allowFrom strictly about DM access, not approval approverhood. Note: the current test at exec-approvals.test.ts:121–139 documents the opposite behavior as intended, so this is a maintainer-level scope decision.
  2. Coordinate / deduplicate native-approvals handlers per channel. If multiple accounts share the same approver set, run one handler (or at most one active client) rather than N. Bonus: stagger startup to kill the thundering herd.
  3. Collapse the double retry layer. Either remove the 1 s APPROVAL_HANDLER_BOOTSTRAP_RETRY_MS and rely on GatewayClient.scheduleReconnect(), or bypass the GatewayClient's own reconnect when the bootstrap owns lifecycle. Pick one loop and add jitter + a simple circuit breaker. Partially in flight as #68283.
  4. Cap concurrent preauth reconnects from a single process. A small in-process semaphore around GatewayClient start prevents any single OpenClaw instance from self-DOS-ing its own loopback.
  5. Emit a distinct warning when native-approvals fails N consecutive times, so operators see a named root cause instead of only generic handshake-timeout log spam.
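
Fixes 3 and 5 can be sketched together as a single retry policy: bounded exponential backoff with jitter, plus a counter that flags the Nth consecutive failure exactly once. This is an illustration under assumptions, not the implementation in #68283: all names are hypothetical, the 1000 ms base matches the existing retry constant, and the 30 s cap and "equal jitter" split are arbitrary choices.

```typescript
export interface BackoffOptions {
  baseMs: number; // first retry ceiling, e.g. 1000
  maxMs: number;  // upper bound on any single delay (assumed value, e.g. 30000)
}

/** Delay before the next retry: half fixed, half random, doubling per failure up to maxMs. */
export function nextRetryDelayMs(
  consecutiveFailures: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  const exponent = Math.max(consecutiveFailures - 1, 0);
  const ceiling = Math.min(opts.baseMs * 2 ** exponent, opts.maxMs);
  const half = ceiling / 2;
  const r = Math.min(Math.max(random(), 0), 1); // guard against out-of-range injected values
  return Math.floor(half + r * half);
}

/** Tracks consecutive failures and signals the moment a warning threshold is reached. */
export class ConsecutiveFailureTracker {
  private failures = 0;
  private readonly warnAfter: number;

  constructor(warnAfter: number) {
    this.warnAfter = warnAfter;
  }

  recordFailure(): { failures: number; shouldWarn: boolean } {
    this.failures += 1;
    // Warn exactly once, when the threshold is reached, to avoid adding log spam.
    return { failures: this.failures, shouldWarn: this.failures === this.warnAfter };
  }

  recordSuccess(): void {
    this.failures = 0;
  }
}
```

Because each account draws its own random component, N accounts that fail together no longer retry on the same 1 s boundary.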

Gateway-side observations worth preserving in the fix

  • Server-side handshake timer uses setTimeout(handshakeTimeoutMs). Under the current storm this timer is itself the victim — handshakeMs of ~17 s against a ~10 s configured timeout strongly implies the server event loop is falling behind. Any fix should reduce preauth churn rather than raise the timeout.

Operator workaround until a fix ships

  • Least-intrusive: set channels.matrix.execApprovals.enabled: false on each affected Matrix account (and/or top-level, depending on inheritance) if you do not use Matrix native approval delivery. The matrix/native-approvals noise should stop within a few seconds of hot reload / restart.
  • Fallback: restart the gateway to clear accumulated preauth pressure; temporary, because the storm returns on the next multi-account startup.

Already in flight — different axes of the same cluster

These PRs each address a sibling reliability/scoping problem for native approvals but do not overlap in file scope with each other or with the root causes above:

  • #68283 fix(approvals): back off native handler bootstrap retries. Addresses root cause #3 — bounded exponential backoff for retries, 300 s special case for PAIRING_REQUIRED. Reduces the retry-storm side of the problem. Does not touch root cause #1 (opt-in semantics) or #2 (N-clients-per-N-accounts) or the initial-startup thundering herd.
  • #69936 fix: scope exec/plugin approval delivery to configured target accountId. Tightens request routing so a targeted approval does not leak to other accounts.
  • #70568 fix(telegram): scope ambiguous exec approvals to one account. Closes #69916. Treats Telegram-sourced approvals without a bound accountId as ambiguous in multi-account installs unless exactly one Telegram account is eligible.

Prior art in maintainer direction

  • c23ad91a14 fix(matrix): keep DM allowlist out of room commands (merged 2026-04-23 on main). Different layer — the Matrix room-command access monitor — but the same theme as root cause #1 here: dm.allowFrom should not silently extend into policies it was never configured for. Suggests maintainers are already willing to draw the "DM allowFrom is narrowly about DM access, not a global approver allowlist" line.

extent analysis

TL;DR

The most likely fix for the persistent reconnect storm issue is to require explicit opt-in for native approval clients and coordinate/deduplicate native-approvals handlers per channel.

Guidance

  • Identify and address the three reinforcing issues: isChannelExecApprovalClientEnabledFromConfig treating enabled === undefined as enabled, one native-approvals gateway client per Matrix account, and double-layer retry with no coordination.
  • Consider implementing a combination of suggested fixes, such as requiring explicit opt-in for native approval clients, coordinating/deduplicating native-approvals handlers per channel, collapsing the double retry layer, capping concurrent preauth reconnects, and emitting a distinct warning for consecutive native-approvals failures.
  • Verify the fix by monitoring the gateway logs for handshake-timeout and matrix/native-approvals error entries, and checking the gateway process CPU usage.

Example

No code snippet is provided as the issue is complex and requires a combination of fixes.

Notes

The issue is specific to the OpenClaw gateway with multiple Matrix accounts configured and no explicit execApprovals block. The suggested fixes may not apply to other configurations or versions.

Recommendation

Apply a workaround by setting channels.matrix.execApprovals.enabled: false on each affected Matrix account until a fix ships, as this is the least-intrusive solution that can stop the matrix/native-approvals noise.
