openclaw - ✅ (Solved) Fix WhatsApp: reconnect drain replays all pending deliveries, causing 7-12x message duplication on cron sends [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#70386 (fetched 2026-04-23 07:25:31)
Comments: 0, Participants: 1, Timeline: 4, Reactions: 0
Timeline (top): referenced ×2, closed ×1, cross-referenced ×1

Every WhatsApp socket reconnect now re-delivers all pending delivery-queue entries in full, including entries that are currently being delivered for the first time. On installations where the 30-minute inbound-silence watchdog fires regularly, this causes every outbound cron message to be sent 7–12 times.

Root Cause

In login-BmQuD5Uw.js (v2026.4.14), drainReconnectQueue was replaced by drainPendingDeliveries with this selectEntry:

// v2026.4.14 — NEW (broken)
drainPendingDeliveries({
  drainKey: `whatsapp:${normalizedAccountId}`,
  selectEntry: (entry) => ({
    match: entry.channel === "whatsapp" &&
      normalizeReconnectAccountId(entry.accountId) === normalizedAccountId,
    bypassBackoff: isNoListenerReconnectError(entry.lastError)
  })
})

The old drainReconnectQueue (now a deprecated shim) required entry.lastError to match "No active WhatsApp Web listener" before replaying an entry. A fresh entry with no lastError would not match and was left alone.

The new code matches any pending WhatsApp entry. Combined with isEntryEligibleForRecoveryRetry returning { eligible: true } for fresh entries (retryCount === 0 && lastAttemptAt === undefined), a delivery that was just enqueued milliseconds ago is immediately picked up and re-delivered by the drain — while the original deliverOutboundPayloadsCore call is still running. There is no in-memory claim held by the original delivery path, so claimRecoveryEntry does not protect against this.
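The difference between the two selectors can be illustrated with a minimal sketch. The entry shape, the helper names oldSelectorMatches / newSelectorMatches, and the substring check on lastError are assumptions made for illustration; account-id normalization is omitted, and the real implementation may differ.

```typescript
// Hypothetical, simplified entry shape; the real queue entry has more fields.
interface PendingEntry {
  id: string;
  channel: string;
  accountId: string;
  retryCount: number;
  lastAttemptAt?: number;
  lastError?: string;
}

const NO_LISTENER = "No active WhatsApp Web listener";

// Old behaviour as described above: only entries that already failed with the
// no-listener error were replayed on reconnect (substring match is assumed).
function oldSelectorMatches(entry: PendingEntry, accountId: string): boolean {
  return (
    entry.channel === "whatsapp" &&
    entry.accountId === accountId &&
    (entry.lastError ?? "").includes(NO_LISTENER)
  );
}

// New behaviour: any pending WhatsApp entry for the account matches.
function newSelectorMatches(entry: PendingEntry, accountId: string): boolean {
  return entry.channel === "whatsapp" && entry.accountId === accountId;
}

// A delivery enqueued moments ago whose first send is still in flight.
const freshEntry: PendingEntry = {
  id: "q1",
  channel: "whatsapp",
  accountId: "default",
  retryCount: 0,
};

// An entry that already failed because the socket had no listener.
const failedEntry: PendingEntry = {
  ...freshEntry,
  id: "q2",
  retryCount: 1,
  lastAttemptAt: Date.now(),
  lastError: `Error: ${NO_LISTENER}`,
};

console.log(oldSelectorMatches(freshEntry, "default"));  // false
console.log(newSelectorMatches(freshEntry, "default"));  // true
console.log(oldSelectorMatches(failedEntry, "default")); // true
```

In this sketch the fresh entry is invisible to the old selector but matched by the new one, which is the window in which the drain can re-deliver a send that is still running.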

Fix Action

Workaround

Pin to v2026.3.13-1.

PR fix notes

PR #70428: fix(outbound): hold active-delivery claim so reconnect drain skips live sends

Description (problem / solution / changelog)

Summary

Fixes #70386. Reconnect drain was re-driving the same queue entry while the original deliverOutboundPayloads was still mid-send, producing 7-12x outbound duplication on WhatsApp cron sends whenever the 30-minute inbound-silence watchdog fired during a delivery window.

Root cause

drainPendingDeliveries is intentionally allowed to replay fresh entries (retryCount === 0 && lastAttemptAt === undefined) to preserve crash-replay. That contract was safe while the reconnect drain required a specific "No active ... listener" lastError on the entry. Once the drain selector was widened to match any pending entry for the reconnecting account, nothing prevented it from firing against an entry whose original deliverOutboundPayloads caller was still executing. The live delivery path persisted the entry via enqueueDelivery but never held an in-memory claim on queueId during the send. A concurrent reconnect therefore passed claimRecoveryEntry, passed isEntryEligibleForRecoveryRetry (fresh entries are drain-eligible by design), and invoked deliver(...) a second (or Nth) time: once per reconnect that occurred inside the live send window.

Cron run records showed only one delivered: true per scheduled run because the drain's deliver() call enqueues and acks a new entry of its own, leaving no visible audit trail in the scheduler.

Fix

Claim queueId against the existing entriesInProgress set immediately after enqueueDelivery, and release in the finally block that wraps ack/fail in deliverOutboundPayloads. Drain already consults the same set via claimRecoveryEntry and skips claimed ids with an "already being recovered" log, so no drain-side logic change is needed.

Two new thin exports on delivery-queue.ts:

  • tryClaimActiveDelivery(entryId) — returns false if already claimed.
  • releaseActiveDelivery(entryId) — pair in finally.
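A minimal sketch of how such a claim/release pair could look. The names tryClaimActiveDelivery, releaseActiveDelivery, and entriesInProgress come from the PR description; the function bodies and the sendWithClaim wrapper are assumptions for illustration, not the code from the PR.

```typescript
// Process-local set of queue ids that currently have an owner.
// The name mirrors the PR text; the implementation here is an assumption.
const entriesInProgress = new Set<string>();

// Returns false if some other owner (live send or drain) already holds the id.
function tryClaimActiveDelivery(entryId: string): boolean {
  if (entriesInProgress.has(entryId)) return false;
  entriesInProgress.add(entryId);
  return true;
}

// Drops the claim so later drains may pick the entry up again.
function releaseActiveDelivery(entryId: string): void {
  entriesInProgress.delete(entryId);
}

// Hypothetical wrapper showing the claim/finally pairing around a live send.
async function sendWithClaim(
  queueId: string,
  send: () => Promise<void>,
): Promise<void> {
  const claimed = tryClaimActiveDelivery(queueId);
  try {
    await send(); // ack/fail handling would sit here in the real code
  } finally {
    // Release only a claim this caller actually took.
    if (claimed) releaseActiveDelivery(queueId);
  }
}
```

Because the set lives only in process memory, a crash drops every claim, which is consistent with the crash-replay behaviour the PR says it preserves.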

Why the fix is safe

  • The claim is a process-local Set. A crashed owner leaves no claim behind, so recoverPendingDeliveries on the next startup reclaims any orphaned entries exactly as before — crash replay for fresh entries is preserved intentionally.
  • Drain already had the skip path ("already being recovered") for in-progress recovery callers; this reuses it and adds a second legitimate owner: the live delivery path.
  • No queue file format change, no migration, no persistence change.
  • The shared concurrency guard claimRecoveryEntry / releaseRecoveryEntry is unchanged; the new helpers are a stable public wrapper over the same set so other callers (live sends) can participate without importing a private symbol.
  • Regression test covers exactly the race: claim the entry (simulating a live send in flight), run the drain, assert deliver is not called and the "already being recovered" skip log is emitted; after release, a follow-up drain delivers exactly once.
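The race covered by the regression test can be sketched roughly as follows. The drain loop, the DrainEntry type, and the log format are simplified stand-ins and should be read as assumptions; only the "already being recovered" skip message is taken from the PR description.

```typescript
type DrainEntry = { id: string };

// Stand-in for the shared process-local claim set.
const inProgress = new Set<string>();

// Simplified drain: skips ids that already have an owner, delivers the rest.
function drainOnce(
  pending: DrainEntry[],
  deliver: (entry: DrainEntry) => void,
  log: string[],
): void {
  for (const entry of pending) {
    if (inProgress.has(entry.id)) {
      log.push(`${entry.id}: already being recovered`);
      continue;
    }
    inProgress.add(entry.id);
    try {
      deliver(entry);
    } finally {
      inProgress.delete(entry.id);
    }
  }
}

const delivered: string[] = [];
const skipLog: string[] = [];
const queue: DrainEntry[] = [{ id: "q1" }];

inProgress.add("q1");                                   // live send in flight
drainOnce(queue, (e) => delivered.push(e.id), skipLog); // reconnect drain runs
console.log(delivered.length, skipLog.length);          // 0 1

inProgress.delete("q1");                                // claim released
drainOnce(queue, (e) => delivered.push(e.id), skipLog); // follow-up drain
console.log(delivered.length, skipLog.length);          // 1 1
```

In this sketch the entry stays in the queue after release, mirroring the test scenario; in a real successful send the entry would presumably be acked and removed before any later drain.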

Security / runtime controls unchanged

  • No prompt-derived policy surface is involved. Concurrency is enforced structurally via the in-memory claim set, not by message text, metadata, or any LLM-facing behavior.
  • No config defaults, no help metadata, no generated artifacts change.
  • Plugin SDK public surface (src/plugin-sdk/infra-runtime.ts) selectively re-exports only drainPendingDeliveries; the two new helpers are deliberately not re-exposed there, so third-party plugin contracts and docs/.generated/plugin-sdk-api-baseline.* are intentionally untouched.
  • Outbound delivery adapters, channel plugins, session/policy keys, transcript mirroring, and best-effort / abort behavior are all unmodified — the claim wraps the existing ack/fail flow in a finally without changing any of its branches.

Tests

Added regression test in src/infra/outbound/delivery-queue.reconnect-drain.test.ts: "skips entries that an in-flight live delivery has actively claimed".

Exact commands run locally:

  • pnpm test src/infra/outbound/delivery-queue.reconnect-drain.test.ts — 14 passed (includes new case).
  • pnpm test src/infra/outbound — 452 passed across 35 files.
  • pnpm test src/gateway/server-restart-sentinel.test.ts src/gateway/server-runtime-services.test.ts src/plugin-sdk/infra-runtime.test.ts — the other consumers of the ./delivery-queue.js mock, 27 passed.
  • pnpm test src/cron/isolated-agent — cron delivery-dispatch consumers of deliverOutboundPayloads, 222 passed across 20 files.
  • pnpm check:changed — scoped typecheck/lint/guards green.
  • pnpm tsgo — core prod typecheck green.

The pnpm tsgo:prod failures on unrelated qa-lab, qqbot, and tokenjuice extensions reproduce on origin/main at the same SHA and are not introduced by this change. Same for the pre-existing plugin-sdk-api-baseline hash drift — this PR does not add any public plugin-sdk surface.

Checklist

  • Noted as AI-assisted
  • Fully tested (new regression test plus full outbound/cron/infra lanes)
  • Understand what the code does
  • Will resolve/reply to bot review conversations after addressing them

Made with Cursor

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/infra/outbound/deliver.test.ts (modified, +37/-0)
  • src/infra/outbound/deliver.ts (modified, +25/-1)
  • src/infra/outbound/delivery-queue-recovery.ts (modified, +29/-11)
  • src/infra/outbound/delivery-queue.reconnect-drain.test.ts (modified, +67/-0)
  • src/infra/outbound/delivery-queue.ts (modified, +2/-0)

Original issue report

Version

v2026.4.14 (regression from v2026.3.13-1)

Steps to reproduce

  1. Configure a cron job with announce or isolated-agent delivery targeting WhatsApp.
  2. Ensure the WhatsApp watchdog fires at its 30-minute timeout during a cron delivery window (i.e., no inbound messages are received to reset the timer).
  3. When the cron fires at its scheduled time, observe that the outbound message is sent N+1 times, where N is the number of reconnects that occur while the original delivery is in-flight.

Evidence

Railway logs for Apr 19–21 show two distinct burst patterns per cron window for the same destination JID hash:

13:00:20.860  Sending message → sha256:3e25bf2f15db   ← original delivery
13:00:21.267  Sending message → sha256:3e25bf2f15db   ← drain burst (4 sends in ~80ms)
13:00:21.289  Sending message → sha256:3e25bf2f15db
13:00:21.304  Sending message → sha256:3e25bf2f15db
13:00:21.320  Sending message → sha256:3e25bf2f15db
13:00:31.257  Sending message → sha256:3e25bf2f15db   ← second drain (reconnect #2)
13:00:31.272  Sending message → sha256:3e25bf2f15db
13:00:31.287  Sending message → sha256:3e25bf2f15db

Each burst has its own correlationId, confirming they are separate sendMessageWhatsApp invocations, not retries within a single send.

The cron/runs/*.jsonl records show only one delivered: true entry per run — the duplicate sends are invisible to the cron scheduler because the drain's deliver() call enqueues and acks a new entry, leaving no trace in run records.
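Since the run records do not show the duplicates, the send log is the only place to count them. The following is a small sketch for doing that, assuming log lines in the "Sending message → sha256:…" shape of the excerpt above; the function name countSendsByHash is hypothetical.

```typescript
// Count "Sending message" lines per destination hash in a log excerpt.
function countSendsByHash(logText: string): Map<string, number> {
  const counts = new Map<string, number>();
  const pattern = /Sending message → (sha256:[0-9a-f]+)/;
  for (const line of logText.split("\n")) {
    const hash = pattern.exec(line)?.[1];
    if (hash !== undefined) {
      counts.set(hash, (counts.get(hash) ?? 0) + 1);
    }
  }
  return counts;
}

const sampleLog = [
  "13:00:20.860  Sending message → sha256:3e25bf2f15db",
  "13:00:21.267  Sending message → sha256:3e25bf2f15db",
].join("\n");

console.log(countSendsByHash(sampleLog).get("sha256:3e25bf2f15db")); // 2
```

A count above one for the same hash inside a single cron window would suggest the duplication is still occurring.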

Proposed fix

Restore the guard that prevents fresh entries from being drain-eligible:

selectEntry: (entry) => ({
  match: entry.channel === "whatsapp" &&
    normalizeReconnectAccountId(entry.accountId) === normalizedAccountId &&
    (entry.retryCount > 0 || entry.lastAttemptAt !== undefined),  // ← don't drain in-flight entries
  bypassBackoff: isNoListenerReconnectError(entry.lastError)
})

Or add a minimum-age guard in isEntryEligibleForRecoveryRetry so entries younger than a configurable threshold (e.g. 10s) are not yet drain-eligible.
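A minimum-age guard of that kind might look like the sketch below. The enqueuedAt field, the AgedEntry type, and the function name isOldEnoughToDrain are assumptions for illustration; the real entry may record its enqueue time differently, and the real eligibility check returns a richer result.

```typescript
interface AgedEntry {
  retryCount: number;
  lastAttemptAt?: number;
  enqueuedAt: number; // hypothetical field: epoch ms when the entry was queued
}

// Example threshold from the text above (10s); a real value would be configurable.
const MIN_DRAIN_AGE_MS = 10000;

function isOldEnoughToDrain(
  entry: AgedEntry,
  now: number,
  minAgeMs: number = MIN_DRAIN_AGE_MS,
): boolean {
  const isFresh = entry.retryCount === 0 && entry.lastAttemptAt === undefined;
  // Entries that were already attempted keep their existing eligibility rules.
  if (!isFresh) return true;
  // Fresh entries must have sat in the queue for at least minAgeMs.
  return now - entry.enqueuedAt >= minAgeMs;
}

const now = 1_000_000;
console.log(isOldEnoughToDrain({ retryCount: 0, enqueuedAt: now - 500 }, now));   // false
console.log(isOldEnoughToDrain({ retryCount: 0, enqueuedAt: now - 15000 }, now)); // true
```

One trade-off of this approach: a live send that runs longer than the threshold could presumably still be replayed by a reconnect, so the threshold would need to exceed the slowest expected send.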

extent analysis

TL;DR

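  • The issue report's proposed fix is to restore the guard that prevents fresh entries from being drain-eligible by modifying the selectEntry function passed to drainPendingDeliveries. PR #70428 takes a different route: the live delivery path holds an in-memory claim on the entry, so the drain skips sends that are still in flight.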

Guidance

  • Identify the selectEntry function in login-BmQuD5Uw.js and modify it to include a check for entry.retryCount > 0 || entry.lastAttemptAt !== undefined to prevent fresh entries from being drain-eligible.
  • Alternatively, consider adding a minimum-age guard in isEntryEligibleForRecoveryRetry to prevent entries younger than a configurable threshold from being drain-eligible.
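  • Verify the fix by checking the Railway logs for repeated "Sending message" lines with the same destination hash inside one cron window. The cron/runs/*.jsonl records are not a reliable signal on their own, because they show only one delivered: true entry per run even when duplicates are sent.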
  • If modifying the code is not feasible, consider pinning to v2026.3.13-1 as a temporary workaround.

Example

selectEntry: (entry) => ({
  match: entry.channel === "whatsapp" &&
    normalizeReconnectAccountId(entry.accountId) === normalizedAccountId &&
    (entry.retryCount > 0 || entry.lastAttemptAt !== undefined),
  bypassBackoff: isNoListenerReconnectError(entry.lastError)
})

Notes

  • The proposed fix assumes that the issue is caused by the replacement of drainReconnectQueue with drainPendingDeliveries, and its wider selectEntry match, in v2026.4.14.
  • The minimum-age guard approach may require additional configuration and testing to determine the optimal threshold value.

Recommendation

  • Apply the proposed fix by modifying the selectEntry function to prevent fresh entries from being drain-eligible, as this approach directly addresses the root cause of the issue.
