openclaw - ✅(Solved) Fix BlueBubbles catchup: persistently-failing message wedges cursor (Option C: per-message retry cap) [1 pull requests, 1 participants]

omarshahine · 2026-04-14T23:33:40Z



GitHub stats

openclaw/openclaw#66870•Fetched 2026-04-16 06:37:31


Author: omarshahine
Participants: omarshahine
Assignees: omarshahine
Timeline (top): referenced ×3, cross-referenced ×2, assigned ×1, closed ×1

Error Message

After N consecutive failures on the same GUID (default: 10, configurable), force-advance the cursor past that message and log a WARN (catchup: giving up on guid=<X> after N retries; advancing cursor past timestamp=<T>).

Fix Action

Fixed

Fixed by PR: feat(bluebubbles): replay missed webhook messages after gateway restart (#66721) (https://github.com/openclaw/openclaw/pull/66857)

PR fix notes

PR #66857: feat(bluebubbles): replay missed webhook messages after gateway restart (#66721)

Repository: openclaw/openclaw
Author: omarshahine
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/66857

Description (problem / solution / changelog)

Summary

Fixes #66721. Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered while the gateway was unreachable and replays them through the existing processMessage pipeline.

The hole this closes: BB Server's WebhookService is fire-and-forget on POST failure (no retries) and BB's MessagePoller only re-fires webhooks on BB-side reconnection events (Messages.app / APNs), not on webhook-receiver recovery. Messages delivered while the gateway was down, restarting, or wedged were permanently lost — verified with a controlled experiment on macOS.

This PR supersedes #66853 (which was the stacked follow-up to #66760 / dedupe PR #66816). Same diff, collapsed to a single commit for cleaner review. History of review feedback is preserved in the superseded PR trail; all P1 and P2 findings from Greptile / Codex / Aisle were addressed in-branch before this squash.

Design

New extensions/bluebubbles/src/catchup.ts:
- fetchBlueBubblesMessagesSince(sinceMs, limit, opts) calls /api/v1/message/query with {after, sort:"ASC", with:["chat","chat.participants","attachment"]} so replays carry the same shape normalizeWebhookMessage already handles on live dispatch.
- loadBlueBubblesCatchupCursor / saveBlueBubblesCatchupCursor persist a single {lastSeenMs, updatedAt} per account under <stateDir>/bluebubbles/catchup/<accountId>__<hash>.json, using the plugin-sdk's atomic JSON helpers. File layout mirrors the inbound-dedupe store from #66816, and the resolver is the canonical openclaw/plugin-sdk/state-paths.resolveStateDir (same helper dedupe uses) so the two stores share a single root.
- runBlueBubblesCatchup(target) orchestrates: clamp config, fetch, filter isFromMe and pre-cursor records, dispatch to processMessage, advance cursor.
Modified monitor.ts: fire catchup as a background task after webhook target registers; errors are logged but never block the channel-ready signal.
Modified config-schema.ts: new optional catchup block (enabled, maxAgeMinutes, perRunLimit, firstRunLookbackMinutes); defaults on with 2h lookback / 50 msg cap / 30-min first-run lookback.
Modified accounts.ts: adds catchup to the account-merge nestedObjectKeys list so per-account overrides deep-merge on top of channel-level defaults, mirroring the existing network precedent.
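
The layering described for the catchup block (built-in defaults, then channel-level settings, then per-account overrides) can be sketched as follows. This is a minimal illustration, not the code from accounts.ts or config-schema.ts: the CatchupConfig interface, the CATCHUP_DEFAULTS constant, and the resolveCatchupConfig function are hypothetical names, and only the four field names and the default values come from the PR description.

```typescript
// Hypothetical shape for illustration; the real schema lives in config-schema.ts.
interface CatchupConfig {
  enabled?: boolean;
  maxAgeMinutes?: number;
  perRunLimit?: number;
  firstRunLookbackMinutes?: number;
}

// Defaults as stated in the PR description:
// enabled, 2h lookback, 50-message cap, 30-minute first-run lookback.
const CATCHUP_DEFAULTS: Required<CatchupConfig> = {
  enabled: true,
  maxAgeMinutes: 120,
  perRunLimit: 50,
  firstRunLookbackMinutes: 30,
};

// Merge key by key: an account that overrides a single field keeps every
// other field from the channel level (or from the defaults).
function resolveCatchupConfig(
  channel: CatchupConfig = {},
  account: CatchupConfig = {},
): Required<CatchupConfig> {
  return { ...CATCHUP_DEFAULTS, ...channel, ...account };
}
```

For example, a channel that raises perRunLimit to 100 and an account that lowers maxAgeMinutes to 60 resolve to a config carrying both changes, with enabled and firstRunLookbackMinutes still at their defaults. Without the deep merge, the account-level block would replace the channel-level block wholesale and the perRunLimit change would be lost.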

Why this approach

The fix mirrors a workspace-level shell script that's been running on a real OpenClaw install for ~4 weeks (~100 LoC of bash + python doing the same query/filter/POST flow). Porting it into the BB channel itself means every install gets recovery for free, calls processMessage directly (no re-POST hop), and benefits from #66816's persistent dedupe automatically.

Safety

Goes through the same processMessage path webhooks use, so auth, allowlist, pairing, and downstream agent dispatch all apply unchanged.
Dedupes against #66816's persistent inbound GUID cache: a webhook delivery that already succeeded cannot be reprocessed by catchup.
Never dispatches isFromMe records (double-checked before and after normalization) so the agent's own sends cannot enter the inbound path.
Catchup runs once per gateway startup and does NOT skip on rapid restarts — skipping would permanently lose any messages that arrived during the brief downtime between the two startups.
Cursor only advances to nowMs on fully-successful runs. On processMessage failure, the cursor is held just before the earliest failure timestamp so the next run retries from there. On truncation (fetchedCount === perRunLimit), the cursor advances only to the last-fetched timestamp so the next gateway startup picks up the unfetched tail.
A future-dated cursor (NTP rollback, manual clock adjust) is treated as unusable and falls through to the firstRunLookback path; the cursor is repaired at the end of the run.
First-run lookback clamped to the maxAge ceiling so maxAgeMinutes: 5, firstRunLookbackMinutes: 30 cannot exceed the operator's stated cap.
Hard ceilings: 12h max lookback, 500 messages per run.
Loud WARNING emitted when fetchedCount hits perRunLimit so operators know a single startup didn't drain the full backlog.
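
The window and cursor rules in this list can be sketched as three small pure functions. This is a hedged illustration, not the code in catchup.ts: every function and interface name here is hypothetical, the lower bound of 1 on each setting is an assumption, and so is the way a truncated run and a failed run combine (the sketch takes the more conservative of the two). The 12h and 500-message ceilings, the "minus 1ms" hold, and the future-cursor fallback come from the text.

```typescript
const MINUTE_MS = 60_000;
const MAX_AGE_CEILING_MIN = 12 * 60; // hard ceiling: 12h lookback
const PER_RUN_CEILING = 500;         // hard ceiling: 500 messages per run

interface Limits {
  maxAgeMinutes: number;
  perRunLimit: number;
  firstRunLookbackMinutes: number;
}

// Clamp operator settings to the hard ceilings; first-run lookback may never
// exceed maxAge. The lower bound of 1 is an assumption of this sketch.
function clampLimits(cfg: Limits): Limits {
  const maxAgeMinutes = Math.min(Math.max(cfg.maxAgeMinutes, 1), MAX_AGE_CEILING_MIN);
  return {
    maxAgeMinutes,
    perRunLimit: Math.min(Math.max(cfg.perRunLimit, 1), PER_RUN_CEILING),
    firstRunLookbackMinutes: Math.min(Math.max(cfg.firstRunLookbackMinutes, 1), maxAgeMinutes),
  };
}

// Where the query window starts. A missing or future-dated cursor falls
// through to the first-run lookback; an old cursor is clamped to maxAge.
function resolveWindowStart(cursorMs: number | null, nowMs: number, cfg: Limits): number {
  const limits = clampLimits(cfg);
  if (cursorMs === null || cursorMs > nowMs) {
    return nowMs - limits.firstRunLookbackMinutes * MINUTE_MS;
  }
  return Math.max(cursorMs, nowMs - limits.maxAgeMinutes * MINUTE_MS);
}

interface RunOutcome {
  priorCursorMs: number;            // window start used for this run
  nowMs: number;
  earliestFailureTs: number | null; // earliest failed message, if any
  truncated: boolean;               // fetchedCount === perRunLimit
  lastFetchedTs: number | null;     // timestamp of the last fetched message
}

// Advance to nowMs only on a fully successful, untruncated run.
function decideNextCursor(o: RunOutcome): number {
  let next = o.nowMs;
  if (o.truncated && o.lastFetchedTs !== null) next = Math.min(next, o.lastFetchedTs);
  if (o.earliestFailureTs !== null) next = Math.min(next, o.earliestFailureTs - 1);
  return Math.max(next, o.priorCursorMs); // never move the cursor backwards
}
```

With maxAgeMinutes: 5 and firstRunLookbackMinutes: 30, resolveWindowStart on a first run looks back 5 minutes, not 30, which is the clamp the list describes.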

Validation

Automated

New scoped tests in extensions/bluebubbles/src/catchup.test.ts (21 cases): cursor round-trip, per-account scoping, filesystem-unsafe account IDs, firstRunLookback default and maxAge clamp, enabled: false, rapid-restart-still-runs, isFromMe filter (pre- and post-normalization), query-failure-preserves-cursor, per-message failure isolation, held-cursor-on-retryable-failure, clamp-to-prior-cursor, future-cursor recovery, pre-cursor defense-in-depth, perRunLimit warn / no-warn, and truncation-cursor advances only to page boundary.
Full BlueBubbles suite passes: 410/410.
pnpm check green (madge, tsgo, oxlint, webhook-auth-body-order, no-pairing-store-group, pairing-account-scope).

Live end-to-end (macOS, BB Server 1.9.x, 2026-04-14)

Repeating the original repro from #66721's issue body with the new in-process catchup:

Stopped gateway cleanly. Verified port refused, no process.
Sent 3 distinct iMessages from a second device. BB-server log shows all 3 dispatches failed with connect ECONNREFUSED 127.0.0.1:18789 and never retried.
Started gateway. Webhook target registered; catchup fired in the background.

Gateway log:

[bluebubbles] [default] BlueBubbles catchup:
  replayed=3 skipped_fromMe=0 skipped_preCursor=0 failed=0 fetched=3 window_ms=517184

All 3 messages produced agent replies delivered back via BB outbound. Persistent cursor file appeared at ~/.openclaw/bluebubbles/catchup/<accountId>__<hash>.json. Subsequent gateway restart with no new inbound activity logged replayed=0 fetched=0 (no-op).

Test plan

pnpm test extensions/bluebubbles/src/catchup.test.ts — 21/21
pnpm test extensions/bluebubbles/ — 410/410
pnpm check — green
Live macOS end-to-end repro
Maintainer review

History (for reviewer context)

This PR carries ~11 hours of iterative bot review that happened on the prior PRs (#66760 → #66853). Squashing here for clean review; the findings addressed were:

Greptile P2 — align state-dir with canonical SDK resolver; warn on perRunLimit truncation
Codex P1 — hold cursor on retryable processMessage failures
Codex P1 — always run catchup on startup (no min-interval skip)
Codex P1 — keep cursor behind unfetched pages when perRunLimit is hit
Codex P2 — clamp first-run window to maxAge
Codex P2 — deep-merge catchup overrides at account level
Codex P2 — treat future-dated cursor as unusable
Codex P2 — clock-skew gate precondition (later obviated by removing the gate)
Aisle — 2 of 5 findings apply (password-in-URL and OPENCLAW_STATE_DIR symlink); both are cross-cutting BB-plugin patterns best addressed in separate PRs against the SDK/plugin conventions. Other 3 Aisle findings were in files this PR doesn't touch (stale-SHA scan from pre-rebase).

Changed files

CHANGELOG.md (modified, +1/-0)
extensions/bluebubbles/src/accounts.ts (modified, +1/-1)
extensions/bluebubbles/src/catchup.test.ts (added, +621/-0)
extensions/bluebubbles/src/catchup.ts (added, +430/-0)
extensions/bluebubbles/src/config-schema.ts (modified, +15/-0)
extensions/bluebubbles/src/monitor.ts (modified, +15/-2)


Problem

runBlueBubblesCatchup in extensions/bluebubbles/src/catchup.ts holds the cursor just before the earliest failed message's timestamp so retries pick up where they stopped. This is correct for transient failures (e.g., disk full, network blip, downstream plugin hiccup). But for a persistently-failing message — one whose processMessage call throws every time due to a malformed payload or a schema mismatch — the cursor stays wedged forever at that message's timestamp minus 1ms. Every subsequent gateway startup re-queries the same window, hits the same failure, and advances no further.

This was flagged as a Greptile P2 on PR #66857 ("design note that a persistently-failing message permanently wedges the catchup cursor — a known, documented tradeoff that doesn't introduce incorrect behavior on the normal path").

Why the current design is intentional

Without holding the cursor, a transient failure permanently drops the failed message. That's the more severe failure mode (silent message loss vs. loud replay loop). The current tradeoff favors visibility:

Every run logs processMessage failed: for the wedged message
The catchup log line shows non-zero failed count every restart
Operators see the pattern and can intervene

But "operators see the pattern" is a human-in-the-loop assumption. For unattended installs, the wedge can sit for a long time.

Proposed fix (Option C from #66721's implementation plan)

Add a per-message retry counter. The cursor state evolves from { lastSeenMs, updatedAt } to { lastSeenMs, updatedAt, failureRetries: { [messageGuid]: count } }. On each run:

Count failed processMessage attempts per message GUID.
After N consecutive failures on the same GUID (default: 10, configurable), force-advance the cursor past that message and log a WARN (catchup: giving up on guid=<X> after N retries; advancing cursor past timestamp=<T>).
On successful processing of a message, clear its retry counter.

This preserves the "retry transient failures" behavior while putting a ceiling on "keep retrying forever" behavior.
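
The three steps above can be sketched as a pure function over one run's outcomes. This is an illustrative sketch under stated assumptions, not the proposed implementation: the Attempt interface and the applyRetryCap function are hypothetical names, and the sketch assumes the caller has already run processMessage for each fetched message and recorded whether it succeeded. The cursor-state shape, the default of 10, and the WARN wording follow the proposal.

```typescript
interface CursorState {
  lastSeenMs: number;
  updatedAt: number;
  failureRetries: Record<string, number>; // message GUID -> consecutive failures
}

// Hypothetical record of one processMessage attempt in the current run.
interface Attempt {
  guid: string;
  timestampMs: number;
  ok: boolean;
}

const DEFAULT_MAX_RETRIES = 10;

function applyRetryCap(
  state: CursorState,
  attempts: Attempt[],
  nowMs: number,
  maxRetries: number = DEFAULT_MAX_RETRIES,
  warn: (msg: string) => void = console.warn,
): CursorState {
  const failureRetries: Record<string, number> = { ...state.failureRetries };
  let holdAtMs: number | null = null;

  for (const a of attempts) {
    if (a.ok) {
      delete failureRetries[a.guid]; // success clears the counter
      continue;
    }
    const count = (failureRetries[a.guid] ?? 0) + 1;
    failureRetries[a.guid] = count;
    if (count >= maxRetries) {
      // Give up: this message no longer holds the cursor back.
      warn(
        `catchup: giving up on guid=${a.guid} after ${count} retries; ` +
          `advancing cursor past timestamp=${a.timestampMs}`,
      );
      continue;
    }
    // Still retryable: hold the cursor just before the earliest such failure.
    const candidate = a.timestampMs - 1;
    if (holdAtMs === null || candidate < holdAtMs) holdAtMs = candidate;
  }

  const lastSeenMs = holdAtMs === null ? nowMs : Math.max(holdAtMs, state.lastSeenMs);
  return { lastSeenMs, updatedAt: nowMs, failureRetries };
}
```

In this sketch a message that fails on nine consecutive startups keeps the cursor at its timestamp minus 1ms; on the tenth failure the WARN fires and the cursor advances to nowMs. The sketch keeps the counter after giving up so that a re-fetched copy of the same message is skipped immediately; pruning counters for messages that have fallen behind the cursor is left out and would be needed to keep the state file bounded.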

Alternative (simpler)

Add a maxTotalFailuresPerRun ceiling: if any single sweep produces more than N failures, force-advance to nowMs (treating the run as "too broken to hold"). Less granular but easier to reason about.
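
A minimal sketch of this alternative, assuming the caller collects the timestamps of every message whose processMessage call failed during the sweep. The function name and parameters are hypothetical, and the source gives no default for the ceiling, so it is a required argument here.

```typescript
// Decide the next cursor under a per-run failure ceiling.
function nextCursorWithFailureCeiling(
  priorCursorMs: number,
  nowMs: number,
  failureTimestampsMs: number[],
  maxTotalFailuresPerRun: number,
): number {
  // Clean run: advance to now.
  if (failureTimestampsMs.length === 0) return nowMs;
  // More failures than the ceiling: treat the run as "too broken to hold".
  if (failureTimestampsMs.length > maxTotalFailuresPerRun) return nowMs;
  // Otherwise keep today's behavior: hold just before the earliest failure,
  // without moving the cursor backwards.
  const earliest = Math.min(...failureTimestampsMs);
  return Math.max(earliest - 1, priorCursorMs);
}
```

Read literally, the ceiling counts failures within one sweep, so a single persistently-failing message contributes one failure per run and would trip it only if the ceiling were set to 0. That may be one reason to prefer the per-message counter when the goal is to unwedge a single bad message.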

Out of scope here

The current behavior is a documented tradeoff and not a correctness bug. This issue is a hardening follow-up. It should not block PR #66857 from merging.

Related

Greptile P2 on PR #66857 (deferred here)
extensions/bluebubbles/src/catchup.ts:runBlueBubblesCatchupInner (earliestProcessFailureTs tracking)
Original design discussion in #66721's issue body (Option C)

extent analysis

TL;DR

Implement a per-message retry counter to force-advance the cursor past persistently-failing messages after a configurable number of retries.

Guidance

Introduce a failureRetries object in the cursor state to track the number of failed processMessage attempts per message GUID.
Set a threshold (e.g., 10) for consecutive failures on the same GUID, after which the cursor is force-advanced past that message.
Log a warning when giving up on a message after the retry threshold is reached.
Consider implementing a maxTotalFailuresPerRun ceiling as a simpler alternative, which force-advances to the current time if a single sweep produces more than a certain number of failures.

Example

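// Example of updated cursor state with failureRetries object.
// Timestamps are epoch milliseconds (illustrative values).
const cursorState = {
  lastSeenMs: 1643723400000,
  updatedAt: 1643723400000,
  failureRetries: {
    // message GUID -> consecutive failed processMessage attempts
    'message-guid-1': 5,
    'message-guid-2': 3,
  },
};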

Notes

The proposed fix aims to balance the tradeoff between retrying transient failures and avoiding permanent wedging due to persistently-failing messages. The choice between the per-message retry counter and the maxTotalFailuresPerRun ceiling depends on the specific requirements and constraints of the system.

Recommendation

Apply the per-message retry counter workaround, as it provides a more granular and flexible solution for handling persistently-failing messages. This approach allows for a configurable retry threshold and provides more informative logging when giving up on a message.
