1. Room IDs should preserve original case in delivery entries (or be mapped back to the canonical form before retry)
2. A failed delivery should not crash the Matrix sync loop; it should be moved to `failed/` and sync should continue
3. Delivery entries should expire after N retries instead of retrying indefinitely across restarts

openclaw - ✅(Solved) Fix [Matrix] Lowercased room ID in delivery-recovery causes sync loop crash and permanent message loss [3 pull requests, 1 comments, 1 participants]

dlardo · 2026-03-29T22:36:44Z

GitHub stats

openclaw/openclaw#57321•Fetched 2026-04-08 01:51:02

Author

dlardo

Participants

dlardo

Timeline (top)

referenced ×8, cross-referenced ×2, commented ×1

A failed delivery-recovery entry with a lowercased Matrix room ID causes the Matrix sync loop to crash permanently on gateway restart. The gateway stays running but Matrix is completely dead — no inbound messages are processed. The poisoned delivery persists across restarts, causing a crash loop.

Related: #19278 (same root cause — room ID case normalization)

Error Message

Queue 'message' giving up on event ~!iejzdnpucufvklraqc:matrix.lucidpacket.com:m1774818664725.0
[delivery-recovery] Retry failed for delivery 9b8f01a2: MatrixError: [403] User @dax:matrix.lucidpacket.com not in room !iejzdnpucufvklraqc:matrix.lucidpacket.com
[MatrixClient.sync] Sync no longer running: exiting.
[MatrixClient] FetchHttpApi: <-- GET .../sync [76ms AbortError: This operation was aborted]

Root Cause

Session keys normalize Matrix room IDs to lowercase (e.g. !IEjZDNPucuFvKLrAQC:server → !iejzdnpucufvklraqc:server). During normal operation, the Matrix SDK sends messages using the correct-case room ID from sync state, so this is invisible. However, when a message is queued for delivery recovery, the lowercased room ID from the session key is stored as the "to" target. On retry, Synapse returns 403 because the lowercased ID does not match.
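The one-way nature of this normalization can be illustrated with a minimal Python sketch. OpenClaw itself is TypeScript, and the function name `normalize_for_session_key` is a hypothetical stand-in rather than the project's actual code; only the example room ID comes from the issue.

```python
def normalize_for_session_key(room_id: str) -> str:
    """Hypothetical stand-in for the lossy lowercasing described above."""
    return room_id.lower()


canonical_room_id = "!IEjZDNPucuFvKLrAQC:server"
session_key_room_id = normalize_for_session_key(canonical_room_id)

# Lowercasing is one-way: the original casing cannot be recovered from the key,
# so anything that later uses the key as a send target addresses a different room ID.
print(session_key_room_id)                       # !iejzdnpucufvklraqc:server
print(session_key_room_id == canonical_room_id)  # False
```

Because the homeserver compares room IDs case-sensitively, the lowercased value is simply an unknown room from its point of view.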

Fix Action

Workaround

Manually move the poisoned delivery JSON from ~/.openclaw/delivery-queue/ to delivery-queue/failed/ and restart the gateway.
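For anyone who prefers to script the workaround, the move could look roughly like the Python sketch below. The helper name `quarantine_delivery` and the entry file name are placeholders, not part of OpenClaw; the only details taken from the issue are the `~/.openclaw/delivery-queue/` directory and its `failed/` subdirectory.

```python
import shutil
from pathlib import Path


def quarantine_delivery(queue_dir: Path, entry_name: str) -> Path:
    """Move one queued delivery JSON file into the queue's failed/ subdirectory."""
    failed_dir = queue_dir / "failed"
    failed_dir.mkdir(parents=True, exist_ok=True)  # create failed/ if it is missing
    source = queue_dir / entry_name
    destination = failed_dir / entry_name
    shutil.move(str(source), str(destination))
    return destination


# Example call (the file name is a placeholder; use the name of the poisoned entry):
# quarantine_delivery(Path.home() / ".openclaw" / "delivery-queue", "<delivery-id>.json")
```

Stop the gateway before moving the file and restart it afterwards, as the workaround describes.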

PR fix notes

PR #57337: fix(matrix): preserve case in group/channel peer IDs for session keys

Repository: openclaw/openclaw
Author: dlardo
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/57337

Description (problem / solution / changelog)

Summary

Matrix room IDs (e.g. !IEjZDNPucuFvKLrAQC:server) are case-sensitive opaque identifiers per the Matrix spec (https://spec.matrix.org/v1.13/appendices/#room-ids). The buildAgentPeerSessionKey() function was calling .toLowerCase() on group/channel peer IDs, which destroyed the original case.

This caused a critical failure path:

1. Session key is built with lowercased room ID
2. Delivery queue entry stores the lowercased room ID as "to" target
3. On gateway restart, delivery-recovery retries the send using the lowercased room ID
4. Homeserver returns 403 (M_FORBIDDEN: User not in room)
5. The 403 crashes the Matrix sync loop permanently
6. The poisoned delivery persists across restarts, causing a crash loop

Changes

src/routing/session-key.ts: Remove .toLowerCase() from the group/channel peer ID path in buildAgentPeerSessionKey(). Direct/DM peer IDs still lowercase (Matrix user IDs are case-insensitive).
src/routing/session-key.continuity.test.ts: Add tests verifying mixed-case room IDs are preserved in session keys.
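The behavior this PR describes (preserve case for group/channel peers, keep lowercasing for direct peers) might be sketched in Python as follows. This is an illustrative analogue only: the real `buildAgentPeerSessionKey()` is TypeScript, and the parameter names and the `agent:...` key layout used here are assumptions, not the project's actual format.

```python
def build_peer_session_key(agent_id: str, channel: str, peer_kind: str, peer_id: str) -> str:
    """Hypothetical analogue of buildAgentPeerSessionKey(); the key layout is illustrative."""
    if peer_kind == "direct":
        # Direct/DM peer IDs are still lowercased, as described in the PR.
        normalized = peer_id.lower()
    else:
        # Group/channel peer IDs (e.g. Matrix room IDs) keep their original case.
        normalized = peer_id
    return f"agent:{agent_id}:{channel}:{peer_kind}:{normalized}"


print(build_peer_session_key("main", "matrix", "group", "!IEjZDNPucuFvKLrAQC:server"))
print(build_peer_session_key("main", "matrix", "direct", "@Dax:server"))
```

As the breaking change note below points out, a change of this shape alters the keys of existing group sessions, which is part of why the PR was not merged as-is.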

Testing

All 59 existing + new session-key and delivery-recovery tests pass
No existing tests relied on lowercased group/channel peer IDs

Breaking change note

Existing Matrix group/channel session keys were lowercased. After this change, new sessions will use the original-case room ID, creating new session keys. Existing lowercased sessions will become orphaned. This is acceptable since group chat sessions are ephemeral and regularly compacted/reset.

Fixes #57321 Related: #19278, PR #31023

Changed files

src/routing/session-key.continuity.test.ts (modified, +49/-0)
src/routing/session-key.ts (modified, +7/-2)

PR #57426: fix(delivery): treat Matrix "User not in room" as permanent delivery error

Repository: openclaw/openclaw
Author: dlardo
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/57426

Description (problem / solution / changelog)

Summary

When delivery-recovery retries a queued message with a lowercased Matrix room ID (from case-normalized session keys), Synapse returns a 403 "User not in room" error. Previously this error was not recognized as permanent, so:

1. The entry retried up to 5 times across restarts
2. Each retry propagated the 403 error, which crashed the Matrix sync loop
3. The gateway stayed running but Matrix was completely dead
4. The poisoned entry persisted, causing the same crash on every restart

Fix

Add User .* not in room to PERMANENT_ERROR_PATTERNS in delivery-queue-recovery. This causes delivery-recovery to immediately move the poisoned entry to failed/ instead of retrying, preventing the sync loop crash.
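A rough Python sketch of this kind of classification is shown below. Only the pattern text `User .* not in room` and the name `PERMANENT_ERROR_PATTERNS` come from the PR; the helper functions and the returned action strings are hypothetical, and the real list presumably contains other patterns not shown here.

```python
import re

# Only this pattern is taken from the PR description; the actual list likely has more entries.
PERMANENT_ERROR_PATTERNS = [
    re.compile(r"User .* not in room"),
]


def is_permanent_error(message: str) -> bool:
    """Return True when the error text matches a known non-retryable failure."""
    return any(pattern.search(message) for pattern in PERMANENT_ERROR_PATTERNS)


def next_action(error_message: str) -> str:
    """Decide what recovery should do with an entry whose send just failed."""
    return "move_to_failed" if is_permanent_error(error_message) else "retry"


log_line = (
    "MatrixError: [403] User @dax:matrix.lucidpacket.com not in room "
    "!iejzdnpucufvklraqc:matrix.lucidpacket.com"
)
print(next_action(log_line))  # move_to_failed
```

Matching on error text is a pragmatic stopgap: it stops the retries for this specific 403 without addressing why the wrong room ID was stored.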

What this does NOT fix

The root cause — session key normalization lowercasing Matrix room IDs — remains. That requires a broader strategy touching the session store stack (as noted by @vincentkoc in #57337). This PR prevents the catastrophic crash-loop symptom while that work is scoped.

Changes

src/infra/outbound/delivery-queue-recovery.ts: Add User .* not in room to permanent error patterns
src/infra/outbound/delivery-queue.recovery.test.ts: Add test for Matrix 403 permanent error handling

Testing

All 12 delivery-queue recovery tests pass (11 existing + 1 new).

Fixes the crash-loop symptom of #57321 Supersedes #57337

Changed files

CHANGELOG.md (modified, +1/-0)
src/infra/outbound/delivery-queue-recovery.ts (modified, +1/-0)
src/infra/outbound/delivery-queue.policy.test.ts (modified, +1/-0)
src/infra/outbound/delivery-queue.recovery.test.ts (modified, +22/-0)

PR #64391: fix: preserve canonical restart sentinel routes

Repository: openclaw/openclaw
Author: gumadeiras
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/64391

Description (problem / solution / changelog)

Summary

Problem: restart-sentinel notices could fall back to reconstructing outbound targets from session keys when canonical delivery context was missing after restart.
Why it matters: case-sensitive channels like Matrix can carry lossy/lowercased session keys, so that fallback can misroute the post-restart notice even though the wake event is still valid.
What changed: extractDeliveryInfo() now synthesizes canonical stored delivery routes via deliveryContextFromSession(...), and restart-sentinel notice delivery no longer falls back to session-key target reconstruction.
What did NOT change (scope boundary): this PR does not change normal outbound delivery-queue recovery behavior; it only hardens the restart-sentinel notice path.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #
Related #57321
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: restart-sentinel notice delivery merged sentinel context with stored session context, then fell back to resolveAnnounceTargetFromKey(...) when canonical route data was missing. That fallback reconstructs a destination from session-key identity, which is unsafe for case-sensitive channels.
Missing detection / guardrail: no regression covered the "stored route missing, session key lossy" restart path.
Contributing context (if known): normal Matrix queue recovery already prefers canonical stored route metadata; the restart-sentinel path had older fallback behavior.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/gateway/server-restart-sentinel.test.ts, src/config/sessions/delivery-info.test.ts
Scenario the test should lock in: restart-sentinel skips outbound notice when no canonical route survives restart, and stored session metadata still yields the canonical Matrix route when deliveryContext is absent.
Why this is the smallest reliable guardrail: the bug lives at the boundary between session metadata recovery and restart-sentinel delivery routing.
Existing test that already covers this (if any): src/infra/outbound/delivery-queue.recovery.test.ts already covers the permanent-failure recovery side, but not restart-sentinel fallback.
If no new test is added, why not:

User-visible / Behavior Changes

Restart notices no longer guess a target from a lossy session key when canonical route metadata is unavailable after restart.
When canonical route metadata still exists, restart notices continue routing normally.

Diagram (if applicable)

Before:
[restart sentinel] -> [missing canonical route] -> [session-key fallback target] -> [possible misdelivery]

After:
[restart sentinel] -> [missing canonical route] -> [skip outbound notice] -> [wake/system event still delivered]
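
The "After" flow can be approximated with the Python sketch below. The function name, argument names, and action strings are hypothetical, and the precedence between the sentinel route and the stored session route is an assumption; what it takes from the PR is only the rule that a missing canonical route means the notice is skipped while the wake event still fires.

```python
from typing import List, Optional


def plan_restart_actions(sentinel_route: Optional[str], stored_route: Optional[str]) -> List[str]:
    """Plan post-restart actions without ever deriving a target from the session key."""
    actions = ["wake_event"]  # the wake/system event is delivered in every case
    # Assumed precedence: sentinel context first, then canonical stored session metadata.
    target = sentinel_route or stored_route
    if target is not None:
        actions.append(f"notice:{target}")
    # No else branch: when no canonical route survives, the outbound notice is skipped.
    return actions


print(plan_restart_actions(None, "!IEjZDNPucuFvKLrAQC:server"))
print(plan_restart_actions(None, None))
```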

Security Impact (required)

New permissions/capabilities? (No)
Secrets/tokens handling changed? (No)
New/changed network calls? (No)
Command/tool execution surface changed? (No)
Data access scope changed? (No)
If any Yes, explain risk + mitigation:

Repro + Verification

Environment

OS: macOS
Runtime/container: local Node 22 + pnpm
Model/provider: N/A
Integration/channel (if any): Matrix restart notice routing
Relevant config (redacted): session store entries with canonical Matrix lastTo / origin.to

Steps

Persist a Matrix session whose session key is lossy/lowercased.
Remove canonical restart delivery context from the sentinel payload.
Trigger restart-sentinel wake handling.

Expected

No outbound restart notice is sent from a reconstructed session-key target.
Canonical stored route metadata is used when present.

Actual

Previously, restart-sentinel could fall back to session-key-derived routing.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

Verified scenarios: targeted tests for delivery-info, restart-sentinel, Matrix session-route, and delivery-queue recovery/policy.
Edge cases checked: Matrix mixed-case canonical route stored without deliveryContext; restart-sentinel with no surviving canonical route; existing permanent Matrix 403 recovery coverage still green.
What you did not verify: full end-to-end gateway restart against a live Matrix server.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes)
Config/env changes? (No)
Migration needed? (No)
If yes, exact upgrade steps:

Risks and Mitigations

Risk: skipping outbound restart notice when canonical route metadata is missing can suppress a notice in some stale-session cases.
- Mitigation: the wake/system event still fires, and we avoid misdelivery to the wrong conversation.

Changed files

CHANGELOG.md (modified, +1/-0)
src/config/sessions/delivery-info.test.ts (modified, +54/-0)
src/config/sessions/delivery-info.ts (modified, +23/-8)
src/gateway/server-restart-sentinel.test.ts (modified, +89/-8)
src/gateway/server-restart-sentinel.ts (modified, +13/-8)

Summary

Related: #19278 (same root cause — room ID case normalization)

Root cause

Steps to reproduce

1. Configure Matrix channel with a room that has mixed-case characters in its room ID
2. Trigger a gateway crash while a message send is in-flight (e.g. via Anthropic API overload → websocket reconnect failure)
3. A delivery entry is persisted with the lowercased room ID
4. Restart the gateway
5. delivery-recovery retries the send with the lowercased room ID
6. Synapse returns 403 M_FORBIDDEN: User @bot:server not in room !lowercase...

Observed behavior

Queue 'message' giving up on event ~!iejzdnpucufvklraqc:matrix.lucidpacket.com:m1774818664725.0
[delivery-recovery] Retry failed for delivery 9b8f01a2: MatrixError: [403] User @dax:matrix.lucidpacket.com not in room !iejzdnpucufvklraqc:matrix.lucidpacket.com
[MatrixClient.sync] Sync no longer running: exiting.
[MatrixClient] FetchHttpApi: <-- GET .../sync [76ms AbortError: This operation was aborted]

After this, the gateway is running but Matrix sync is permanently dead. The delivery entry persists in delivery-queue/, so every subsequent restart triggers the same crash.

Expected behavior

1. Room IDs should preserve original case in delivery entries (or be mapped back to the canonical form before retry)
2. A failed delivery should not crash the Matrix sync loop; it should be moved to failed/ and sync should continue
3. Delivery entries should expire after N retries instead of retrying indefinitely across restarts
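
One way to picture the second and third expectations together is a retry counter stored on the entry itself, so it survives restarts, combined with a guard that keeps send errors from propagating. The Python sketch below assumes hypothetical field names (`retryCount`, `to`, `payload`) and action strings; the limit of 5 mirrors the retry count mentioned in PR #57426.

```python
from typing import Callable

MAX_RETRIES = 5  # mirrors the "up to 5 times" figure from PR #57426


def record_failed_attempt(entry: dict, max_retries: int = MAX_RETRIES) -> str:
    """Increment the persisted counter and decide whether the entry has expired."""
    entry["retryCount"] = entry.get("retryCount", 0) + 1  # hypothetical field name
    return "move_to_failed" if entry["retryCount"] >= max_retries else "keep_queued"


def attempt_recovery(entry: dict, send: Callable[[str, str], None],
                     max_retries: int = MAX_RETRIES) -> str:
    """Try one send; never let a delivery error escape to the caller (e.g. a sync loop)."""
    try:
        send(entry["to"], entry["payload"])
        return "delivered"
    except Exception:
        return record_failed_attempt(entry, max_retries)
```

The caller would persist the updated entry after each attempt and move it to failed/ once `move_to_failed` is returned.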

Environment

OpenClaw: 2026.3.24
Homeserver: Synapse (self-hosted)
OS: Manjaro Linux (x64)
Install method: npm

extent analysis

Fix Plan

To address the issue, we need to modify the delivery recovery mechanism to handle room IDs with original case and prevent the Matrix sync loop from crashing due to a failed delivery. Here are the steps:

Modify the delivery-recovery mechanism to store room IDs in their original case.
Implement a retry limit for delivery attempts to prevent indefinite retries.
Handle failed deliveries by moving them to a failed queue instead of crashing the sync loop.

Code Changes

import queue

# Hypothetical in-memory stand-ins for the on-disk delivery queue and its failed/ directory
delivery_queue = queue.Queue()
failed_queue = queue.Queue()


class MatrixError(Exception):
    """Placeholder for the error raised when the homeserver rejects a send."""


def deliver_message(room_id, message):
    """Placeholder for the real send call; replace with the actual Matrix client."""
    raise NotImplementedError


# Store the delivery entry with the room ID in its original case
def store_delivery_entry(room_id, message):
    delivery_entry = {
        'room_id': room_id,  # Do not lowercase: Matrix room IDs are case-sensitive
        'message': message,
    }
    delivery_queue.put(delivery_entry)


# Retry up to max_retries, then park the entry in the failed queue
def retry_delivery(delivery_entry, max_retries=5):
    for _ in range(max_retries):
        try:
            deliver_message(delivery_entry['room_id'], delivery_entry['message'])
            return True
        except MatrixError:
            continue  # Swallow the error so it cannot propagate into the sync loop
    move_to_failed_queue(delivery_entry)
    return False


# Move the delivery entry to the failed queue instead of retrying forever
def move_to_failed_queue(delivery_entry):
    failed_queue.put(delivery_entry)

Verification

To verify that the fix worked, you can test the delivery recovery mechanism with a room ID that has mixed-case characters. The gateway should no longer crash due to a failed delivery, and the delivery entry should be moved to the failed queue after the retry limit is reached.

Extra Tips

Make sure to handle errors properly in the delivery recovery mechanism to prevent crashes.
Implement a mechanism to clean up the failed queue periodically to prevent it from growing indefinitely.
Consider adding logging to track failed deliveries and retries to help with debugging and monitoring.
