openclaw - ✅(Solved) Fix [Matrix] Lowercased room ID in delivery-recovery causes sync loop crash and permanent message loss [3 pull requests, 1 comments, 1 participants]

GitHub stats
openclaw/openclaw#57321 · Fetched 2026-04-08 01:51:02
Comments: 1 · Participants: 1 · Timeline events: 11 · Reactions: 0
Timeline (top): referenced ×8, cross-referenced ×2, commented ×1

A failed delivery-recovery entry with a lowercased Matrix room ID causes the Matrix sync loop to crash permanently on gateway restart. The gateway stays running but Matrix is completely dead — no inbound messages are processed. The poisoned delivery persists across restarts, causing a crash loop.

Related: #19278 (same root cause — room ID case normalization)

Error Message

Queue 'message' giving up on event ~!iejzdnpucufvklraqc:matrix.lucidpacket.com:m1774818664725.0
[delivery-recovery] Retry failed for delivery 9b8f01a2: MatrixError: [403] User @dax:matrix.lucidpacket.com not in room !iejzdnpucufvklraqc:matrix.lucidpacket.com
[MatrixClient.sync] Sync no longer running: exiting.
[MatrixClient] FetchHttpApi: <-- GET .../sync [76ms AbortError: This operation was aborted]

Root Cause

Session keys normalize Matrix room IDs to lowercase (e.g. !IEjZDNPucuFvKLrAQC:server → !iejzdnpucufvklraqc:server). During normal operation, the Matrix SDK sends messages using the correct-case room ID from sync state, so this is invisible. However, when a message is queued for delivery recovery, the lowercased room ID from the session key is stored as the "to" target. On retry, Synapse returns 403 because the lowercased ID does not match.
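
To make the failure concrete, here is a minimal sketch in Python (the project itself is TypeScript, so this is an illustration only). The `joined_rooms` set and the `is_member` helper are hypothetical stand-ins for the homeserver's membership lookup; the room ID is the example from the paragraph above.

```python
# Room IDs are opaque and case-sensitive, so lowercasing yields a different ID.
original_room_id = "!IEjZDNPucuFvKLrAQC:server"
session_key_room_id = original_room_id.lower()

# Hypothetical stand-in for the homeserver's view of the bot's memberships.
joined_rooms = {original_room_id}


def is_member(room_id: str) -> bool:
    """Exact, case-sensitive lookup, as a homeserver would perform it."""
    return room_id in joined_rooms


print(session_key_room_id)             # !iejzdnpucufvklraqc:server
print(is_member(original_room_id))     # True
print(is_member(session_key_room_id))  # False: the lookup that ends in a 403
```

Because the lowercased string cannot be mapped back to the original, any code path that reads the target from the session key instead of from stored route data is affected.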

Fix Action

Workaround

Manually move the poisoned delivery JSON from ~/.openclaw/delivery-queue/ to delivery-queue/failed/ and restart the gateway.
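
The manual move can be scripted. The sketch below is an assumption-laden helper, not part of OpenClaw: `quarantine_delivery` is a hypothetical name, and the file name of the poisoned entry has to be looked up in the queue directory by hand, since the source does not state how entries are named.

```python
import shutil
from pathlib import Path


def quarantine_delivery(queue_dir: Path, entry_name: str) -> Path:
    """Move one queued delivery file into the failed/ subdirectory."""
    source = queue_dir / entry_name
    failed_dir = queue_dir / "failed"
    # Create failed/ if it does not exist yet.
    failed_dir.mkdir(parents=True, exist_ok=True)
    target = failed_dir / entry_name
    shutil.move(str(source), str(target))
    return target
```

A call might look like `quarantine_delivery(Path.home() / ".openclaw" / "delivery-queue", name)`, where `name` is the JSON file identified as poisoned. Restart the gateway afterwards, as described above.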

PR fix notes

PR #57337: fix(matrix): preserve case in group/channel peer IDs for session keys

Description (problem / solution / changelog)

Summary

Matrix room IDs (e.g. !IEjZDNPucuFvKLrAQC:server) are case-sensitive opaque identifiers per the Matrix spec. The buildAgentPeerSessionKey() function was calling .toLowerCase() on group/channel peer IDs, which destroyed the original case.

This caused a critical failure path:

  1. Session key is built with lowercased room ID
  2. Delivery queue entry stores the lowercased room ID as "to" target
  3. On gateway restart, delivery-recovery retries the send using the lowercased room ID
  4. Homeserver returns 403 (M_FORBIDDEN: User not in room)
  5. The 403 crashes the Matrix sync loop permanently
  6. The poisoned delivery persists across restarts, causing a crash loop

Changes

  • src/routing/session-key.ts: Remove .toLowerCase() from the group/channel peer ID path in buildAgentPeerSessionKey(). Direct/DM peer IDs still lowercase (Matrix user IDs are case-insensitive).
  • src/routing/session-key.continuity.test.ts: Add tests verifying mixed-case room IDs are preserved in session keys.
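
The behavior described for buildAgentPeerSessionKey() can be sketched as follows. This is a Python analogue for illustration only: the function name `build_peer_session_key`, its parameters, and the colon-separated key layout are assumptions, not the project's actual format. Only the rule itself (preserve case for group/channel peers, lowercase direct peers) comes from the PR text.

```python
def build_peer_session_key(agent_id: str, channel: str, peer_kind: str, peer_id: str) -> str:
    """Hypothetical analogue of buildAgentPeerSessionKey() after the fix."""
    if peer_kind in ("group", "channel"):
        # Room IDs are opaque and case-sensitive: keep them untouched.
        normalized_peer = peer_id
    else:
        # Direct/DM peer IDs are still lowercased, as the PR describes.
        normalized_peer = peer_id.lower()
    # The key layout below is an assumption made for this sketch.
    return f"agent:{agent_id}:{channel}:{peer_kind}:{normalized_peer}"


print(build_peer_session_key("main", "matrix", "group", "!IEjZDNPucuFvKLrAQC:server"))
print(build_peer_session_key("main", "matrix", "direct", "@Dax:server"))
```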

Testing

  • All 59 existing + new session-key and delivery-recovery tests pass
  • No existing tests relied on lowercased group/channel peer IDs

Breaking change note

Existing Matrix group/channel session keys were lowercased. After this change, new sessions will use the original-case room ID, creating new session keys. Existing lowercased sessions will become orphaned. This is acceptable since group chat sessions are ephemeral and regularly compacted/reset.

Fixes #57321. Related: #19278, PR #31023.

Changed files

  • src/routing/session-key.continuity.test.ts (modified, +49/-0)
  • src/routing/session-key.ts (modified, +7/-2)

PR #57426: fix(delivery): treat Matrix "User not in room" as permanent delivery error

Description (problem / solution / changelog)

Summary

When delivery-recovery retries a queued message with a lowercased Matrix room ID (from case-normalized session keys), Synapse returns a 403 "User not in room" error. Previously this error was not recognized as permanent, so:

  1. The entry retried up to 5 times across restarts
  2. Each retry propagated the 403 error, which crashed the Matrix sync loop
  3. The gateway stayed running but Matrix was completely dead
  4. The poisoned entry persisted, causing the same crash on every restart

Fix

Add User .* not in room to PERMANENT_ERROR_PATTERNS in delivery-queue-recovery. This causes delivery-recovery to immediately move the poisoned entry to failed/ instead of retrying, preventing the sync loop crash.
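
A minimal sketch of how such a pattern list might be consulted, written in Python for illustration. Only the pattern `User .* not in room` and the list name come from the PR; the helpers `is_permanent_error` and `next_action`, and the retry limit of 5 as a parameter, are assumptions modeled on the description above.

```python
import re

# Only this one pattern is taken from the PR; the real list has other entries.
PERMANENT_ERROR_PATTERNS = [re.compile(r"User .* not in room")]


def is_permanent_error(message: str) -> bool:
    """True if retrying cannot succeed for this error message."""
    return any(p.search(message) for p in PERMANENT_ERROR_PATTERNS)


def next_action(error_message: str, retry_count: int, max_retries: int = 5) -> str:
    """Decide what recovery does with a failed entry (hypothetical helper)."""
    if is_permanent_error(error_message) or retry_count >= max_retries:
        return "move_to_failed"
    return "retry"


log_line = ("MatrixError: [403] User @dax:matrix.lucidpacket.com not in room "
            "!iejzdnpucufvklraqc:matrix.lucidpacket.com")
print(next_action(log_line, retry_count=0))             # move_to_failed
print(next_action("connect ETIMEDOUT", retry_count=0))  # retry
```

The point of the permanent classification is that the entry leaves the queue on the first failure, so it is not retried again on the next restart.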

What this does NOT fix

The root cause — session key normalization lowercasing Matrix room IDs — remains. That requires a broader strategy touching the session store stack (as noted by @vincentkoc in #57337). This PR prevents the catastrophic crash-loop symptom while that work is scoped.

Changes

  • src/infra/outbound/delivery-queue-recovery.ts: Add User .* not in room to permanent error patterns
  • src/infra/outbound/delivery-queue.recovery.test.ts: Add test for Matrix 403 permanent error handling

Testing

All 12 delivery-queue recovery tests pass (11 existing + 1 new).

Fixes the crash-loop symptom of #57321. Supersedes #57337.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/infra/outbound/delivery-queue-recovery.ts (modified, +1/-0)
  • src/infra/outbound/delivery-queue.policy.test.ts (modified, +1/-0)
  • src/infra/outbound/delivery-queue.recovery.test.ts (modified, +22/-0)

PR #64391: fix: preserve canonical restart sentinel routes

Description (problem / solution / changelog)

Summary

  • Problem: restart-sentinel notices could fall back to reconstructing outbound targets from session keys when canonical delivery context was missing after restart.
  • Why it matters: case-sensitive channels like Matrix can carry lossy/lowercased session keys, so that fallback can misroute the post-restart notice even though the wake event is still valid.
  • What changed: extractDeliveryInfo() now synthesizes canonical stored delivery routes via deliveryContextFromSession(...), and restart-sentinel notice delivery no longer falls back to session-key target reconstruction.
  • What did NOT change (scope boundary): this PR does not change normal outbound delivery-queue recovery behavior; it only hardens the restart-sentinel notice path.
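
The idea of synthesizing a route from stored session metadata can be sketched like this. It is a Python illustration under stated assumptions: the field names `lastTo` and `origin.to` appear in this PR's environment notes, but the `deliveryContext.to` field, the lookup order, and the function name `delivery_target_from_session` are hypothetical.

```python
from typing import Optional


def delivery_target_from_session(entry: dict) -> Optional[str]:
    """Return a stored canonical target, or None if nothing canonical survives.

    The session key is never parsed here, since it may be lossy.
    """
    context = entry.get("deliveryContext") or {}
    origin = entry.get("origin") or {}
    # The order of preference is an assumption made for this sketch.
    return context.get("to") or entry.get("lastTo") or origin.get("to")


stored = {"lastTo": "!IEjZDNPucuFvKLrAQC:server"}
print(delivery_target_from_session(stored))  # !IEjZDNPucuFvKLrAQC:server
print(delivery_target_from_session({}))      # None
```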

Linked Issue/PR

  • Closes #
  • Related #57321
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: restart-sentinel notice delivery merged sentinel context with stored session context, then fell back to resolveAnnounceTargetFromKey(...) when canonical route data was missing. That fallback reconstructs a destination from session-key identity, which is unsafe for case-sensitive channels.
  • Missing detection / guardrail: no regression covered the "stored route missing, session key lossy" restart path.
  • Contributing context (if known): normal Matrix queue recovery already prefers canonical stored route metadata; the restart-sentinel path had older fallback behavior.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-restart-sentinel.test.ts, src/config/sessions/delivery-info.test.ts
  • Scenario the test should lock in: restart-sentinel skips outbound notice when no canonical route survives restart, and stored session metadata still yields the canonical Matrix route when deliveryContext is absent.
  • Why this is the smallest reliable guardrail: the bug lives at the boundary between session metadata recovery and restart-sentinel delivery routing.
  • Existing test that already covers this (if any): src/infra/outbound/delivery-queue.recovery.test.ts already covers the permanent-failure recovery side, but not restart-sentinel fallback.
  • If no new test is added, why not:

User-visible / Behavior Changes

  • Restart notices no longer guess a target from a lossy session key when canonical route metadata is unavailable after restart.
  • When canonical route metadata still exists, restart notices continue routing normally.

Diagram (if applicable)

Before:
[restart sentinel] -> [missing canonical route] -> [session-key fallback target] -> [possible misdelivery]

After:
[restart sentinel] -> [missing canonical route] -> [skip outbound notice] -> [wake/system event still delivered]
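
The "After" flow can be sketched as a small decision function. This is a Python illustration, not the project's code: `plan_restart_notice` and the action strings are hypothetical, and the only behavior taken from the PR is that the wake/system event always fires while the outbound notice is sent only when a canonical target exists.

```python
from typing import List, Optional


def plan_restart_notice(session_key: str, canonical_target: Optional[str]) -> List[str]:
    """List the actions the restart sentinel takes (hypothetical helper)."""
    actions = ["deliver wake/system event"]
    if canonical_target is not None:
        actions.append(f"send notice to {canonical_target}")
    # No else branch: session_key is deliberately never parsed for a target.
    return actions


lossy_key = "agent:main:matrix:group:!iejzdnpucufvklraqc:server"
print(plan_restart_notice(lossy_key, "!IEjZDNPucuFvKLrAQC:server"))
print(plan_restart_notice(lossy_key, None))
```

The trade-off is the one listed under Risks and Mitigations: a notice may be suppressed for a stale session, in exchange for never sending it to a target reconstructed from a lossy key.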

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local Node 22 + pnpm
  • Model/provider: N/A
  • Integration/channel (if any): Matrix restart notice routing
  • Relevant config (redacted): session store entries with canonical Matrix lastTo / origin.to

Steps

  1. Persist a Matrix session whose session key is lossy/lowercased.
  2. Remove canonical restart delivery context from the sentinel payload.
  3. Trigger restart-sentinel wake handling.

Expected

  • No outbound restart notice is sent from a reconstructed session-key target.
  • Canonical stored route metadata is used when present.

Actual

  • Previously, restart-sentinel could fall back to session-key-derived routing.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: targeted tests for delivery-info, restart-sentinel, Matrix session-route, and delivery-queue recovery/policy.
  • Edge cases checked: Matrix mixed-case canonical route stored without deliveryContext; restart-sentinel with no surviving canonical route; existing permanent Matrix 403 recovery coverage still green.
  • What you did not verify: full end-to-end gateway restart against a live Matrix server.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: skipping outbound restart notice when canonical route metadata is missing can suppress a notice in some stale-session cases.
    • Mitigation: the wake/system event still fires, and we avoid misdelivery to the wrong conversation.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/config/sessions/delivery-info.test.ts (modified, +54/-0)
  • src/config/sessions/delivery-info.ts (modified, +23/-8)
  • src/gateway/server-restart-sentinel.test.ts (modified, +89/-8)
  • src/gateway/server-restart-sentinel.ts (modified, +13/-8)

Steps to reproduce

  1. Configure Matrix channel with a room that has mixed-case characters in its room ID
  2. Trigger a gateway crash while a message send is in-flight (e.g. via Anthropic API overload → websocket reconnect failure)
  3. A delivery entry is persisted with the lowercased room ID
  4. Restart the gateway
  5. delivery-recovery retries the send with the lowercased room ID
  6. Synapse returns 403 M_FORBIDDEN: User @bot:server not in room !lowercase...

Observed behavior

Queue 'message' giving up on event ~!iejzdnpucufvklraqc:matrix.lucidpacket.com:m1774818664725.0
[delivery-recovery] Retry failed for delivery 9b8f01a2: MatrixError: [403] User @dax:matrix.lucidpacket.com not in room !iejzdnpucufvklraqc:matrix.lucidpacket.com
[MatrixClient.sync] Sync no longer running: exiting.
[MatrixClient] FetchHttpApi: <-- GET .../sync [76ms AbortError: This operation was aborted]

After this, the gateway is running but Matrix sync is permanently dead. The delivery entry persists in delivery-queue/, so every subsequent restart triggers the same crash.

Expected behavior

  1. Room IDs should preserve original case in delivery entries (or be mapped back to the canonical form before retry)
  2. A failed delivery should not crash the Matrix sync loop — it should be moved to failed/ and sync should continue
  3. Delivery entries should expire after N retries instead of retrying indefinitely across restarts

Environment

  • OpenClaw: 2026.3.24
  • Homeserver: Synapse (self-hosted)
  • OS: Manjaro Linux (x64)
  • Install method: npm

extent analysis

Fix Plan

To address the issue, we need to modify the delivery recovery mechanism to handle room IDs with original case and prevent the Matrix sync loop from crashing due to a failed delivery. Here are the steps:

  • Modify the delivery-recovery mechanism to store room IDs in their original case.
  • Implement a retry limit for delivery attempts to prevent indefinite retries.
  • Handle failed deliveries by moving them to a failed queue instead of crashing the sync loop.

Code Changes

import queue

# Illustrative stand-ins only: the real project is TypeScript and these names are hypothetical.
delivery_queue = queue.Queue()
failed_queue = queue.Queue()

class MatrixError(Exception):
    """Stand-in for the error type raised by the Matrix SDK."""

def deliver_message(room_id, message):
    """Placeholder for the real send call; replace with the SDK call."""
    raise NotImplementedError

# Store the room ID in its original case
def store_delivery_entry(room_id, message):
    # Do not lowercase the room ID
    delivery_queue.put({'room_id': room_id, 'message': message})

# Retry up to max_retries times, then move the entry to the failed queue
def retry_delivery(delivery_entry, max_retries=5):
    for _ in range(max_retries):
        try:
            deliver_message(delivery_entry['room_id'], delivery_entry['message'])
            return True  # delivered
        except MatrixError:
            continue  # try again until the limit is reached
    move_to_failed_queue(delivery_entry)
    return False

# Handle failed deliveries by moving them to a failed queue
def move_to_failed_queue(delivery_entry):
    failed_queue.put(delivery_entry)

Verification

To verify that the fix worked, you can test the delivery recovery mechanism with a room ID that has mixed-case characters. The gateway should no longer crash due to a failed delivery, and the delivery entry should be moved to the failed queue after the retry limit is reached.

Extra Tips

  • Make sure to handle errors properly in the delivery recovery mechanism to prevent crashes.
  • Implement a mechanism to clean up the failed queue periodically to prevent it from growing indefinitely.
  • Consider adding logging to track failed deliveries and retries to help with debugging and monitoring.
