openclaw - ✅(Solved) Fix BlueBubbles catchup: persistently-failing message wedges cursor (Option C: per-message retry cap) [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#66870 · Fetched 2026-04-16 06:37:31
Comments: 0 · Participants: 1 · Timeline events: 8 · Reactions: 0
Timeline (top): referenced ×3, cross-referenced ×2, assigned ×1, closed ×1

Fix Action

Fixed

PR fix notes

PR #66857: feat(bluebubbles): replay missed webhook messages after gateway restart (#66721)

Description (problem / solution / changelog)

Summary

Fixes #66721. Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered while the gateway was unreachable and replays them through the existing processMessage pipeline.

The hole this closes: BB Server's WebhookService is fire-and-forget on POST failure (no retries) and BB's MessagePoller only re-fires webhooks on BB-side reconnection events (Messages.app / APNs), not on webhook-receiver recovery. Messages delivered while the gateway was down, restarting, or wedged were permanently lost — verified with a controlled experiment on macOS.

This PR supersedes #66853 (which was the stacked follow-up to #66760 / dedupe PR #66816). Same diff, collapsed to a single commit for cleaner review. History of review feedback is preserved in the superseded PR trail; all P1 and P2 findings from Greptile / Codex / Aisle were addressed in-branch before this squash.

Design

  • New extensions/bluebubbles/src/catchup.ts:
    • fetchBlueBubblesMessagesSince(sinceMs, limit, opts) calls /api/v1/message/query with {after, sort:"ASC", with:["chat","chat.participants","attachment"]} so replays carry the same shape normalizeWebhookMessage already handles on live dispatch.
    • loadBlueBubblesCatchupCursor / saveBlueBubblesCatchupCursor persist a single {lastSeenMs, updatedAt} per account under <stateDir>/bluebubbles/catchup/<accountId>__<hash>.json, using the plugin-sdk's atomic JSON helpers. File layout mirrors the inbound-dedupe store from #66816, and the resolver is the canonical openclaw/plugin-sdk/state-paths.resolveStateDir (same helper dedupe uses) so the two stores share a single root.
    • runBlueBubblesCatchup(target) orchestrates: clamp config, fetch, filter isFromMe and pre-cursor records, dispatch to processMessage, advance cursor.
  • Modified monitor.ts: fire catchup as a background task after webhook target registers; errors are logged but never block the channel-ready signal.
  • Modified config-schema.ts: new optional catchup block (enabled, maxAgeMinutes, perRunLimit, firstRunLookbackMinutes); defaults on with 2h lookback / 50 msg cap / 30-min first-run lookback.
  • Modified accounts.ts: adds catchup to the account-merge nestedObjectKeys list so per-account overrides deep-merge on top of channel-level defaults, mirroring the existing network precedent.
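
The query described for fetchBlueBubblesMessagesSince can be sketched as a small body builder. This is an illustration only, not the code in catchup.ts: the PR text names just the after, sort, and with fields, so the limit field name, the MessageQueryBody type, and the buildCatchupQuery helper are assumptions.

```typescript
// Hypothetical body type: only `after`, `sort`, and `with` appear in the PR text.
interface MessageQueryBody {
  after: number; // epoch milliseconds; lower bound of the catchup window
  limit: number; // assumed field name for the per-run message cap
  sort: "ASC" | "DESC";
  with: string[];
}

// Hypothetical helper that builds the JSON body for /api/v1/message/query.
function buildCatchupQuery(sinceMs: number, limit: number): MessageQueryBody {
  return {
    after: sinceMs,
    limit,
    sort: "ASC", // oldest first, so replay order matches delivery order
    with: ["chat", "chat.participants", "attachment"],
  };
}

const exampleQuery = buildCatchupQuery(1776000000000, 50);
console.log(JSON.stringify(exampleQuery));
```

Sorting ascending matters for the cursor logic described under Safety: when a run is truncated, the last-fetched record is also the newest one seen, so the cursor can stop exactly at the page boundary.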

Why this approach

The fix mirrors a workspace-level shell script that's been running on a real OpenClaw install for ~4 weeks (~100 LoC of bash + python doing the same query/filter/POST flow). Porting it into the BB channel itself means every install gets recovery for free, calls processMessage directly (no re-POST hop), and benefits from #66816's persistent dedupe automatically.

Safety

  • Goes through the same processMessage path webhooks use, so auth, allowlist, pairing, and downstream agent dispatch all apply unchanged.
  • Dedupes against #66816's persistent inbound GUID cache: a webhook delivery that already succeeded cannot be reprocessed by catchup.
  • Never dispatches isFromMe records (double-checked before and after normalization) so the agent's own sends cannot enter the inbound path.
  • Catchup runs once per gateway startup and does NOT skip on rapid restarts — skipping would permanently lose any messages that arrived during the brief downtime between the two startups.
  • Cursor only advances to nowMs on fully-successful runs. On processMessage failure, the cursor is held just before the earliest failure timestamp so the next run retries from there. On truncation (fetchedCount === perRunLimit), the cursor advances only to the last-fetched timestamp so the next gateway startup picks up the unfetched tail.
  • A future-dated cursor (NTP rollback, manual clock adjust) is treated as unusable and falls through to the firstRunLookback path; the cursor is repaired at the end of the run.
  • First-run lookback clamped to the maxAge ceiling so maxAgeMinutes: 5, firstRunLookbackMinutes: 30 cannot exceed the operator's stated cap.
  • Hard ceilings: 12h max lookback, 500 messages per run.
  • Loud WARNING emitted when fetchedCount hits perRunLimit so operators know a single startup didn't drain the full backlog.
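
The window rules in the list above (future-dated cursor, first-run lookback, maxAge clamp, hard ceilings) can be sketched as one pure function. The defaults and ceilings come from the PR text; the function name, the lower bound of 1, and the choice to clamp an old cursor to the maxAge window are assumptions made for illustration.

```typescript
const MINUTE_MS = 60000;
const HARD_MAX_AGE_MINUTES = 12 * 60; // 12h ceiling from the PR text
const HARD_MAX_PER_RUN = 500;         // 500 messages per run from the PR text

interface CatchupWindowConfig {
  maxAgeMinutes?: number;           // default 120 (2h lookback)
  perRunLimit?: number;             // default 50
  firstRunLookbackMinutes?: number; // default 30
}

function clampNumber(value: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, value));
}

// Hypothetical helper: decides where the catchup query should start.
function resolveCatchupWindow(
  cfg: CatchupWindowConfig,
  cursorMs: number | null,
  nowMs: number,
): { sinceMs: number; perRunLimit: number } {
  const maxAgeMs = clampNumber(cfg.maxAgeMinutes ?? 120, 1, HARD_MAX_AGE_MINUTES) * MINUTE_MS;
  const perRunLimit = clampNumber(cfg.perRunLimit ?? 50, 1, HARD_MAX_PER_RUN);
  // First-run lookback may never exceed the operator's maxAge cap.
  const firstRunMs = Math.min((cfg.firstRunLookbackMinutes ?? 30) * MINUTE_MS, maxAgeMs);
  // A missing or future-dated cursor is unusable and falls back to the first-run path.
  const rawSince =
    cursorMs !== null && cursorMs <= nowMs ? cursorMs : nowMs - firstRunMs;
  // Assumption: a cursor older than maxAge is clamped to the maxAge window.
  const sinceMs = Math.max(rawSince, nowMs - maxAgeMs);
  return { sinceMs, perRunLimit };
}

const windowNow = 1776000000000;
const firstRunWindow = resolveCatchupWindow(
  { maxAgeMinutes: 5, firstRunLookbackMinutes: 30 },
  null,
  windowNow,
);
console.log(windowNow - firstRunWindow.sinceMs); // 300000 (5 minutes, not 30)
```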

Validation

Automated

  • New scoped tests in extensions/bluebubbles/src/catchup.test.ts (21 cases): cursor round-trip, per-account scoping, filesystem-unsafe account IDs, firstRunLookback default and maxAge clamp, enabled: false, rapid-restart-still-runs, isFromMe filter (pre- and post-normalization), query-failure-preserves-cursor, per-message failure isolation, held-cursor-on-retryable-failure, clamp-to-prior-cursor, future-cursor recovery, pre-cursor defense-in-depth, perRunLimit warn / no-warn, and truncation-cursor advances only to page boundary.
  • Full BlueBubbles suite passes: 410/410.
  • pnpm check green (madge, tsgo, oxlint, webhook-auth-body-order, no-pairing-store-group, pairing-account-scope).

Live end-to-end (macOS, BB Server 1.9.x, 2026-04-14)

Repeating the original repro from #66721's issue body with the new in-process catchup:

  1. Stopped gateway cleanly. Verified port refused, no process.
  2. Sent 3 distinct iMessages from a second device. BB-server log shows all 3 dispatches failed with connect ECONNREFUSED 127.0.0.1:18789 and never retried.
  3. Started gateway. Webhook target registered; catchup fired in the background.
  4. Gateway log:
    [bluebubbles] [default] BlueBubbles catchup:
      replayed=3 skipped_fromMe=0 skipped_preCursor=0 failed=0 fetched=3 window_ms=517184
  5. All 3 messages produced agent replies delivered back via BB outbound. Persistent cursor file appeared at ~/.openclaw/bluebubbles/catchup/<accountId>__<hash>.json. Subsequent gateway restart with no new inbound activity logged replayed=0 fetched=0 (no-op).

Test plan

  • pnpm test extensions/bluebubbles/src/catchup.test.ts — 21/21
  • pnpm test extensions/bluebubbles/ — 410/410
  • pnpm check — green
  • Live macOS end-to-end repro
  • Maintainer review

History (for reviewer context)

This PR carries ~11 hours of iterative bot review that happened on the prior PRs (#66760 → #66853). Squashing here for clean review; the findings addressed were:

  • Greptile P2 — align state-dir with canonical SDK resolver; warn on perRunLimit truncation
  • Codex P1 — hold cursor on retryable processMessage failures
  • Codex P1 — always run catchup on startup (no min-interval skip)
  • Codex P1 — keep cursor behind unfetched pages when perRunLimit is hit
  • Codex P2 — clamp first-run window to maxAge
  • Codex P2 — deep-merge catchup overrides at account level
  • Codex P2 — treat future-dated cursor as unusable
  • Codex P2 — clock-skew gate precondition (later obviated by removing the gate)
  • Aisle — 2 of 5 findings apply (password-in-URL and OPENCLAW_STATE_DIR symlink); both are cross-cutting BB-plugin patterns best addressed in separate PRs against the SDK/plugin conventions. Other 3 Aisle findings were in files this PR doesn't touch (stale-SHA scan from pre-rebase).

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/bluebubbles/src/accounts.ts (modified, +1/-1)
  • extensions/bluebubbles/src/catchup.test.ts (added, +621/-0)
  • extensions/bluebubbles/src/catchup.ts (added, +430/-0)
  • extensions/bluebubbles/src/config-schema.ts (modified, +15/-0)
  • extensions/bluebubbles/src/monitor.ts (modified, +15/-2)

Problem

runBlueBubblesCatchup in extensions/bluebubbles/src/catchup.ts holds the cursor just before the earliest failed message's timestamp so retries pick up where they stopped. This is correct for transient failures (e.g., disk full, network blip, downstream plugin hiccup). But for a persistently-failing message — one whose processMessage call throws every time due to a malformed payload or a schema mismatch — the cursor stays wedged forever at that message's timestamp minus 1ms. Every subsequent gateway startup re-queries the same window, hits the same failure, and advances no further.
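
A minimal simulation of the wedge, assuming the hold rule is "earliest failure timestamp minus 1 ms, never moving backwards". The names holdCursor and RunOutcome are hypothetical, and the truncation case is left out for brevity.

```typescript
interface RunOutcome {
  nowMs: number;
  priorCursorMs: number;
  earliestFailureTs: number | null; // timestamp of the earliest failed message, if any
}

// Hypothetical reduction of the cursor rule described above.
function holdCursor(outcome: RunOutcome): number {
  if (outcome.earliestFailureTs === null) return outcome.nowMs; // fully successful run
  // Hold just before the earliest failure, but never move the cursor backwards.
  return Math.max(outcome.priorCursorMs, outcome.earliestFailureTs - 1);
}

// Three gateway startups with one message at t=5000 that throws every time.
const poisonTs = 5000;
let wedgedCursor = 1000;
const cursorHistory: number[] = [];
for (const nowMs of [10000, 20000, 30000]) {
  const stillInWindow = poisonTs > wedgedCursor; // the message is re-fetched and fails again
  wedgedCursor = holdCursor({
    nowMs,
    priorCursorMs: wedgedCursor,
    earliestFailureTs: stillInWindow ? poisonTs : null,
  });
  cursorHistory.push(wedgedCursor);
}
console.log(cursorHistory); // [4999, 4999, 4999]
```

The cursor reaches 4999 on the first run and stays there, which is the wedge this issue describes.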

This was flagged as a Greptile P2 on PR #66857 ("design note that a persistently-failing message permanently wedges the catchup cursor — a known, documented tradeoff that doesn't introduce incorrect behavior on the normal path").

Why the current design is intentional

Without holding the cursor, a transient failure permanently drops the failed message. That's the more severe failure mode (silent message loss vs. loud replay loop). The current tradeoff favors visibility:

  • Every run logs processMessage failed: for the wedged message
  • The catchup log line shows non-zero failed count every restart
  • Operators see the pattern and can intervene

But "operators see the pattern" is a human-in-the-loop assumption. For unattended installs, the wedge can sit for a long time.

Proposed fix (Option C from #66721's implementation plan)

Add a per-message retry counter. The cursor state evolves from { lastSeenMs, updatedAt } to { lastSeenMs, updatedAt, failureRetries: { [messageGuid]: count } }. On each run:

  1. Count failed processMessage attempts per message GUID.
  2. After N consecutive failures on the same GUID (default: 10, configurable), force-advance the cursor past that message and log a WARN (catchup: giving up on guid=<X> after N retries; advancing cursor past timestamp=<T>).
  3. On successful processing of a message, clear its retry counter.

This preserves the "retry transient failures" behavior while putting a ceiling on "keep retrying forever" behavior.
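
The three steps above can be sketched as follows. This is a simplified model under stated assumptions, not the actual runBlueBubblesCatchup: the Msg type, the handler signature, the runWithRetryCap name, and the choice to rebuild the counter map on every run are all hypothetical.

```typescript
interface RetryCursorState {
  lastSeenMs: number;
  updatedAt: number;
  failureRetries: Record<string, number>; // message GUID -> consecutive failures
}

interface Msg {
  guid: string;
  timestamp: number; // epoch milliseconds
}

// Hypothetical run loop; `handle` stands in for processMessage and throws on failure.
function runWithRetryCap(
  state: RetryCursorState,
  messages: Msg[], // assumed sorted ascending by timestamp
  handle: (m: Msg) => void,
  nowMs: number,
  maxRetries = 10,
  warn: (line: string) => void = console.warn,
): RetryCursorState {
  const failureRetries: Record<string, number> = {}; // only still-failing GUIDs are carried over
  let earliestHeldTs: number | null = null;
  for (const m of messages) {
    if (m.timestamp <= state.lastSeenMs) continue; // pre-cursor record
    try {
      handle(m); // success: the counter is cleared by not copying it forward
    } catch {
      const count = (state.failureRetries[m.guid] ?? 0) + 1;
      if (count >= maxRetries) {
        warn(`catchup: giving up on guid=${m.guid} after ${count} retries; advancing cursor past timestamp=${m.timestamp}`);
        continue; // no longer holds the cursor
      }
      failureRetries[m.guid] = count;
      if (earliestHeldTs === null || m.timestamp < earliestHeldTs) earliestHeldTs = m.timestamp;
    }
  }
  const lastSeenMs =
    earliestHeldTs === null ? nowMs : Math.max(state.lastSeenMs, earliestHeldTs - 1);
  return { lastSeenMs, updatedAt: nowMs, failureRetries };
}

// One healthy message and one that always throws, with a cap of 3 for brevity.
const retryWarnings: string[] = [];
let retryState: RetryCursorState = { lastSeenMs: 1000, updatedAt: 0, failureRetries: {} };
const inbox: Msg[] = [
  { guid: "ok-1", timestamp: 2000 },
  { guid: "bad-1", timestamp: 3000 },
];
for (const nowMs of [10000, 20000, 30000]) {
  retryState = runWithRetryCap(
    retryState,
    inbox,
    (m) => { if (m.guid === "bad-1") throw new Error("malformed payload"); },
    nowMs,
    3,
    (line) => retryWarnings.push(line),
  );
}
console.log(retryState.lastSeenMs, retryWarnings.length); // 30000 1
```

In this model the cursor holds at 2999 for the first two runs and is released on the third. Messages that succeeded after a held failure are fetched again on the next run; the sketch relies on the persistent GUID dedupe from #66816 to keep them from being processed twice.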

Alternative (simpler)

Add a maxTotalFailuresPerRun ceiling: if any single sweep produces more than N failures, force-advance to nowMs (treating the run as "too broken to hold"). Less granular but easier to reason about.
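
A rough sketch of this alternative, assuming "more than N failures" is a strict comparison and that runs at or below the ceiling keep the existing hold rule. The cursorAfterSweep name and its options object are hypothetical.

```typescript
// Hypothetical cursor rule for the maxTotalFailuresPerRun alternative.
function cursorAfterSweep(opts: {
  nowMs: number;
  priorCursorMs: number;
  failureTimestamps: number[]; // timestamps of messages whose processMessage call threw
  maxTotalFailuresPerRun: number;
}): number {
  const { nowMs, priorCursorMs, failureTimestamps, maxTotalFailuresPerRun } = opts;
  if (failureTimestamps.length === 0) return nowMs; // clean run
  if (failureTimestamps.length > maxTotalFailuresPerRun) {
    return nowMs; // "too broken to hold": force-advance past everything
  }
  // Otherwise hold just before the earliest failure, never moving backwards.
  return Math.max(priorCursorMs, Math.min(...failureTimestamps) - 1);
}

const heldCursor = cursorAfterSweep({
  nowMs: 10000,
  priorCursorMs: 1000,
  failureTimestamps: [5000],
  maxTotalFailuresPerRun: 5,
});
console.log(heldCursor); // 4999
```

Under this reading, a lone persistently-failing message produces one failure per sweep and would still hold the cursor unless the ceiling is 0, so the ceiling mainly covers sweeps where many messages fail at once.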

Out of scope here

The current behavior is a documented tradeoff and not a correctness bug. This issue is a hardening follow-up. It should not block PR #66857 from merging.

Related

  • Greptile P2 on PR #66857 — deferred here
  • extensions/bluebubbles/src/catchup.ts: runBlueBubblesCatchupInner, earliestProcessFailureTs tracking
  • Original design discussion in #66721's issue body (Option C)

Extended analysis

TL;DR

Implement a per-message retry counter to force-advance the cursor past persistently-failing messages after a configurable number of retries.

Guidance

  • Introduce a failureRetries object in the cursor state to track the number of failed processMessage attempts per message GUID.
  • Set a threshold (e.g., 10) for consecutive failures on the same GUID, after which the cursor is force-advanced past that message.
  • Log a warning when giving up on a message after the retry threshold is reached.
  • Consider implementing a maxTotalFailuresPerRun ceiling as a simpler alternative, which force-advances to the current time if a single sweep produces more than a certain number of failures.

Example

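// Example of updated cursor state with a failureRetries map.
// Timestamps are epoch milliseconds, matching the "Ms" suffix.
const cursorState = {
  lastSeenMs: 1643723400000,
  updatedAt: 1643723400000,
  failureRetries: {
    // message GUID -> consecutive failed processMessage attempts
    'message-guid-1': 5,
    'message-guid-2': 3,
  },
};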
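
Because cursor files written before this change would lack failureRetries, a loader probably needs to tolerate both shapes. The sketch below assumes updatedAt is a numeric timestamp and uses a hypothetical parseCursorState helper; the real persistence goes through the plugin-sdk's atomic JSON helpers, which are not modelled here.

```typescript
interface CursorStateV2 {
  lastSeenMs: number;
  updatedAt: number;
  failureRetries: Record<string, number>;
}

// Hypothetical parser: accepts the old { lastSeenMs, updatedAt } shape and the new one.
function parseCursorState(raw: unknown): CursorStateV2 | null {
  if (typeof raw !== "object" || raw === null) return null;
  const record = raw as Record<string, unknown>;
  const lastSeenMs = record["lastSeenMs"];
  if (typeof lastSeenMs !== "number" || !Number.isFinite(lastSeenMs)) return null;
  const updatedAtRaw = record["updatedAt"];
  const updatedAt = typeof updatedAtRaw === "number" ? updatedAtRaw : lastSeenMs;
  const failureRetries: Record<string, number> = {};
  const retriesRaw = record["failureRetries"];
  if (typeof retriesRaw === "object" && retriesRaw !== null) {
    const retries = retriesRaw as Record<string, unknown>;
    for (const guid of Object.keys(retries)) {
      const count = retries[guid];
      // Keep only positive integer counters; drop anything malformed.
      if (typeof count === "number" && Number.isInteger(count) && count > 0) {
        failureRetries[guid] = count;
      }
    }
  }
  return { lastSeenMs, updatedAt, failureRetries };
}

const legacyCursor = parseCursorState(JSON.parse('{"lastSeenMs":1643723400000,"updatedAt":1643723400000}'));
console.log(legacyCursor);
```

Defaulting the map to empty means an upgraded gateway starts every counter at zero, so no message is given up on because of state it never recorded.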

Notes

The proposed fix keeps retries for transient failures while ensuring a persistently-failing message cannot wedge the cursor forever. The per-message retry counter gives up only on the specific GUID that keeps failing; the maxTotalFailuresPerRun ceiling is coarser, force-advancing to nowMs past every failure in a sweep, but is easier to reason about.

Recommendation

Prefer the per-message retry counter: it is more granular than the per-run ceiling, its threshold is configurable, and the WARN it logs names the GUID and timestamp of the message it gave up on.
