openclaw - ✅(Solved) Fix WhatsApp: reconnect drain replays all pending deliveries, causing 7-12x message duplication on cron sends [1 pull requests, 1 participants]

natekot · 2026-04-22T23:21:47Z


GitHub stats

openclaw/openclaw#70386•Fetched 2026-04-23 07:25:31

Every WhatsApp socket reconnect now re-delivers all pending delivery-queue entries in full, including entries that are currently being delivered for the first time. On installations where the 30-minute inbound-silence watchdog fires regularly, this causes every outbound cron message to be sent 7–12 times.

Root Cause

In login-BmQuD5Uw.js (v2026.4.14), drainReconnectQueue was replaced by drainPendingDeliveries with this selectEntry:

// v2026.4.14 — NEW (broken)
drainPendingDeliveries({
  drainKey: `whatsapp:${normalizedAccountId}`,
  selectEntry: (entry) => ({
    match: entry.channel === "whatsapp" &&
      normalizeReconnectAccountId(entry.accountId) === normalizedAccountId,
    bypassBackoff: isNoListenerReconnectError(entry.lastError)
  })
})

The old drainReconnectQueue (now a deprecated shim) required entry.lastError to match "No active WhatsApp Web listener" before replaying an entry. A fresh entry with no lastError would not match and was left alone.

The new code matches any pending WhatsApp entry. Combined with isEntryEligibleForRecoveryRetry returning { eligible: true } for fresh entries (retryCount === 0 && lastAttemptAt === undefined), a delivery that was just enqueued milliseconds ago is immediately picked up and re-delivered by the drain — while the original deliverOutboundPayloadsCore call is still running. There is no in-memory claim held by the original delivery path, so claimRecoveryEntry does not protect against this.
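The difference between the two selectors can be illustrated with a minimal sketch. The `QueueEntry` shape and the `legacyMatch` / `widenedMatch` helpers are hypothetical names introduced here; only the field names and the error text come from the issue. Account-id normalization is omitted, and the substring check is an assumption about how the old match worked.

```typescript
// Hypothetical entry shape; field names follow the issue text,
// the rest is assumed for illustration.
interface QueueEntry {
  channel: string;
  accountId: string;
  retryCount: number;
  lastAttemptAt?: number;
  lastError?: string;
}

// Old behaviour as described: replay only entries that previously failed
// because no listener was active (substring match is an assumption).
function legacyMatch(entry: QueueEntry, accountId: string): boolean {
  return (
    entry.channel === "whatsapp" &&
    entry.accountId === accountId &&
    (entry.lastError ?? "").includes("No active WhatsApp Web listener")
  );
}

// v2026.4.14 behaviour as described: any pending entry for the account matches.
function widenedMatch(entry: QueueEntry, accountId: string): boolean {
  return entry.channel === "whatsapp" && entry.accountId === accountId;
}

// A delivery enqueued moments ago whose original send is still running.
const freshEntry: QueueEntry = {
  channel: "whatsapp",
  accountId: "acct-1",
  retryCount: 0,
};

console.log(legacyMatch(freshEntry, "acct-1"));  // false: left alone
console.log(widenedMatch(freshEntry, "acct-1")); // true: picked up by the drain
```

Under these assumptions the fresh entry is ignored by the old selector and matched by the new one, which is the window in which the duplicate send occurs.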

Fix Action

Workaround

Pin to v2026.3.13-1.

PR fix notes

PR #70428: fix(outbound): hold active-delivery claim so reconnect drain skips live sends

Repository: openclaw/openclaw
Author: neeravmakwana
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/70428

Description (problem / solution / changelog)

Summary

Fixes #70386. Reconnect drain was re-driving the same queue entry while the original deliverOutboundPayloads was still mid-send, producing 7-12x outbound duplication on WhatsApp cron sends whenever the 30-minute inbound-silence watchdog fired during a delivery window.

Root cause

drainPendingDeliveries is intentionally allowed to replay fresh entries (retryCount === 0 && lastAttemptAt === undefined) to preserve crash-replay. That contract was safe while the reconnect drain previously required a specific "No active ... listener" lastError on the entry; once the drain selector was widened to match any pending entry for the reconnecting account, nothing prevented it from firing against an entry whose original deliverOutboundPayloads caller was still executing. The live delivery path persisted via enqueueDelivery but never held an in-memory claim on queueId during the send, so a concurrent reconnect passed claimRecoveryEntry, passed isEntryEligibleForRecoveryRetry (fresh entries are drain-eligible by design), and invoked deliver(...) a second (or Nth) time — once per reconnect that occurred inside the live send window.

Cron run records showed only one delivered: true per scheduled run because the drain's deliver() enqueues and acks a new entry path, leaving no visible audit in the scheduler.

Fix

Claim queueId against the existing entriesInProgress set immediately after enqueueDelivery, and release in the finally block that wraps ack/fail in deliverOutboundPayloads. Drain already consults the same set via claimRecoveryEntry and skips claimed ids with an "already being recovered" log, so no drain-side logic change is needed.

Two new thin exports on delivery-queue.ts:

tryClaimActiveDelivery(entryId) — returns false if already claimed.
releaseActiveDelivery(entryId) — pair in finally.
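A minimal sketch of how such a claim pair could look, assuming a module-level `Set` like the `entriesInProgress` set the PR mentions. The helper bodies and the `deliverWithClaim` wrapper are illustrative assumptions; the real implementation in `delivery-queue.ts` and `deliver.ts` is not shown in the source.

```typescript
// Process-local claim set (stands in for the existing entriesInProgress set).
const entriesInProgress = new Set<string>();

// Returns false if the entry is already claimed by another owner.
function tryClaimActiveDelivery(entryId: string): boolean {
  if (entriesInProgress.has(entryId)) return false;
  entriesInProgress.add(entryId);
  return true;
}

// Pair with tryClaimActiveDelivery in a finally block.
function releaseActiveDelivery(entryId: string): void {
  entriesInProgress.delete(entryId);
}

// Hypothetical live-send wrapper: claim right after enqueue, release in finally
// so the claim is dropped whether the send acks or fails.
async function deliverWithClaim(
  queueId: string,
  send: () => Promise<void>,
): Promise<void> {
  const claimed = tryClaimActiveDelivery(queueId);
  try {
    await send(); // the real code wraps its ack/fail handling here
  } finally {
    if (claimed) releaseActiveDelivery(queueId);
  }
}
```

Because the set lives only in memory, a crashed process leaves no claim behind, which matches the crash-replay behaviour the PR says it preserves.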

Why the fix is safe

The claim is a process-local Set. A crashed owner leaves no claim behind, so recoverPendingDeliveries on the next startup reclaims any orphaned entries exactly as before — crash replay for fresh entries is preserved intentionally.
Drain already had the skip path ("already being recovered") for in-progress recovery callers; this reuses it and adds a second legitimate owner: the live delivery path.
No queue file format change, no migration, no persistence change.
The shared concurrency guard claimRecoveryEntry / releaseRecoveryEntry is unchanged; the new helpers are a stable public wrapper over the same set so other callers (live sends) can participate without importing a private symbol.
Regression test covers exactly the race: claim the entry (simulating a live send in flight), run the drain, assert deliver is not called and the "already being recovered" skip log is emitted; after release, a follow-up drain delivers exactly once.
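The race that the regression test covers can be sketched as a small synchronous simulation. Everything here (the `queue` map, the `drain` function, the log strings) is a hypothetical stand-in rather than the project's test code; it only mirrors the described sequence of claim, drain, release, drain.

```typescript
const inProgress = new Set<string>();                      // stands in for entriesInProgress
const queue = new Map<string, string>([["q1", "hello"]]);  // pending entries by id
const delivered: string[] = [];                            // ids the drain delivered
const skipLog: string[] = [];                              // skip messages emitted by the drain

// Simplified stand-in for the reconnect drain.
function drain(): void {
  for (const id of Array.from(queue.keys())) {
    if (inProgress.has(id)) {
      skipLog.push(`${id}: already being recovered`);
      continue; // a live send or another recovery owns this entry
    }
    inProgress.add(id);
    try {
      delivered.push(id);
      queue.delete(id); // ack: remove the entry once delivered
    } finally {
      inProgress.delete(id);
    }
  }
}

inProgress.add("q1");    // simulate a live send in flight
drain();                 // reconnect #1: skipped
drain();                 // reconnect #2: skipped
inProgress.delete("q1"); // claim released without an ack, as in the test scenario
drain();                 // follow-up drain delivers the entry
drain();                 // nothing left to deliver

console.log(delivered.length); // 1
console.log(skipLog.length);   // 2
```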

Security / runtime controls unchanged

No prompt-derived policy surface is involved. Concurrency is enforced structurally via the in-memory claim set, not by message text, metadata, or any LLM-facing behavior.
No config defaults, no help metadata, no generated artifacts change.
Plugin SDK public surface (src/plugin-sdk/infra-runtime.ts) selectively re-exports only drainPendingDeliveries; the two new helpers are deliberately not re-exposed there, so third-party plugin contracts and docs/.generated/plugin-sdk-api-baseline.* are intentionally untouched.
Outbound delivery adapters, channel plugins, session/policy keys, transcript mirroring, and best-effort / abort behavior are all unmodified — the claim wraps the existing ack/fail flow in a finally without changing any of its branches.

Tests

Added regression test: src/infra/outbound/delivery-queue.reconnect-drain.test.ts → "skips entries that an in-flight live delivery has actively claimed".

Exact commands run locally:

pnpm test src/infra/outbound/delivery-queue.reconnect-drain.test.ts — 14 passed (includes new case).
pnpm test src/infra/outbound — 452 passed across 35 files.
pnpm test src/gateway/server-restart-sentinel.test.ts src/gateway/server-runtime-services.test.ts src/plugin-sdk/infra-runtime.test.ts — the other consumers of the ./delivery-queue.js mock, 27 passed.
pnpm test src/cron/isolated-agent — cron delivery-dispatch consumers of deliverOutboundPayloads, 222 passed across 20 files.
pnpm check:changed — scoped typecheck/lint/guards green.
pnpm tsgo — core prod typecheck green.

The pnpm tsgo:prod failures on unrelated qa-lab, qqbot, and tokenjuice extensions reproduce on origin/main at the same SHA and are not introduced by this change. Same for the pre-existing plugin-sdk-api-baseline hash drift — this PR does not add any public plugin-sdk surface.

Checklist

Noted as AI-assisted
Fully tested (new regression test plus full outbound/cron/infra lanes)
Understand what the code does
Will resolve/reply to bot review conversations after addressing them

Made with Cursor

Changed files

CHANGELOG.md (modified, +1/-0)
src/infra/outbound/deliver.test.ts (modified, +37/-0)
src/infra/outbound/deliver.ts (modified, +25/-1)
src/infra/outbound/delivery-queue-recovery.ts (modified, +29/-11)
src/infra/outbound/delivery-queue.reconnect-drain.test.ts (modified, +67/-0)
src/infra/outbound/delivery-queue.ts (modified, +2/-0)

Version

v2026.4.14 (regression from v2026.3.13-1)

Summary

Steps to reproduce

Configure a cron job with announce or isolated-agent delivery targeting WhatsApp.
Ensure the WhatsApp watchdog fires at its 30-minute timeout during a cron delivery window (i.e., no inbound messages are received to reset the timer).
Observe that when the cron fires at its scheduled time, the outbound message is sent N+1 times, where N is the number of reconnects that occur while the original delivery is in flight.

Root cause

In login-BmQuD5Uw.js (v2026.4.14), drainReconnectQueue was replaced by drainPendingDeliveries with this selectEntry:

// v2026.4.14 — NEW (broken)
drainPendingDeliveries({
  drainKey: `whatsapp:${normalizedAccountId}`,
  selectEntry: (entry) => ({
    match: entry.channel === "whatsapp" &&
      normalizeReconnectAccountId(entry.accountId) === normalizedAccountId,
    bypassBackoff: isNoListenerReconnectError(entry.lastError)
  })
})

Evidence

Railway logs for Apr 19–21 show two distinct burst patterns per cron window for the same destination JID hash:

13:00:20.860  Sending message → sha256:3e25bf2f15db   ← original delivery
13:00:21.267  Sending message → sha256:3e25bf2f15db   ← drain burst (4 sends in ~80ms)
13:00:21.289  Sending message → sha256:3e25bf2f15db
13:00:21.304  Sending message → sha256:3e25bf2f15db
13:00:21.320  Sending message → sha256:3e25bf2f15db
13:00:31.257  Sending message → sha256:3e25bf2f15db   ← second drain (reconnect #2)
13:00:31.272  Sending message → sha256:3e25bf2f15db
13:00:31.287  Sending message → sha256:3e25bf2f15db

Each burst has its own correlationId, confirming they are separate sendMessageWhatsApp invocations, not retries within a single send.

The cron/runs/*.jsonl records show only one delivered: true entry per run — the duplicate sends are invisible to the cron scheduler because the drain's deliver() call enqueues and acks a new entry, leaving no trace in run records.
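Since the run records hide the duplicates, counting send lines in the transport logs is one way to check for them. The sketch below assumes log lines shaped like the excerpt above (`Sending message → sha256:<hex>`); the `countSendsByHash` helper is a hypothetical name, and real log lines may carry extra fields.

```typescript
// Count "Sending message" lines per destination hash.
function countSendsByHash(logText: string): Map<string, number> {
  const counts = new Map<string, number>();
  const pattern = /Sending message → (sha256:[0-9a-f]+)/;
  for (const line of logText.split("\n")) {
    const hash = pattern.exec(line)?.[1];
    if (hash === undefined) continue; // not a send line
    counts.set(hash, (counts.get(hash) ?? 0) + 1);
  }
  return counts;
}

// The excerpt from the issue, used as sample input.
const sampleLog = [
  "13:00:20.860  Sending message → sha256:3e25bf2f15db   ← original delivery",
  "13:00:21.267  Sending message → sha256:3e25bf2f15db   ← drain burst (4 sends in ~80ms)",
  "13:00:21.289  Sending message → sha256:3e25bf2f15db",
  "13:00:21.304  Sending message → sha256:3e25bf2f15db",
  "13:00:21.320  Sending message → sha256:3e25bf2f15db",
  "13:00:31.257  Sending message → sha256:3e25bf2f15db   ← second drain (reconnect #2)",
  "13:00:31.272  Sending message → sha256:3e25bf2f15db",
  "13:00:31.287  Sending message → sha256:3e25bf2f15db",
].join("\n");

const counts = countSendsByHash(sampleLog);
console.log(counts.get("sha256:3e25bf2f15db")); // 8
```

A count above one for a single cron window would indicate duplicate sends; the sketch counts over the whole input, so the logs would need to be sliced per window first.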

Proposed fix

Restore the guard that prevents fresh entries from being drain-eligible:

selectEntry: (entry) => ({
  match: entry.channel === "whatsapp" &&
    normalizeReconnectAccountId(entry.accountId) === normalizedAccountId &&
    (entry.retryCount > 0 || entry.lastAttemptAt !== undefined),  // ← don't drain in-flight entries
  bypassBackoff: isNoListenerReconnectError(entry.lastError)
})

Or add a minimum-age guard in isEntryEligibleForRecoveryRetry so entries younger than a configurable threshold (e.g. 10s) are not yet drain-eligible.
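The minimum-age alternative could look roughly like the sketch below. It assumes each entry records when it was enqueued; the `enqueuedAt` field, the `isOldEnoughToDrain` helper, and the constant name are hypothetical, and the 10-second value is only the example threshold from the text above.

```typescript
// Assumed entry fields; enqueuedAt (epoch milliseconds) is hypothetical.
interface RecoveryEntry {
  retryCount: number;
  lastAttemptAt?: number;
  enqueuedAt: number;
}

const MIN_DRAIN_AGE_MS = 10000; // example threshold (10s)

// Fresh entries become drain-eligible only after they reach the minimum age.
function isOldEnoughToDrain(
  entry: RecoveryEntry,
  now: number,
  minAgeMs: number = MIN_DRAIN_AGE_MS,
): boolean {
  const isFresh = entry.retryCount === 0 && entry.lastAttemptAt === undefined;
  if (!isFresh) return true; // previously attempted entries keep their normal retry rules
  return now - entry.enqueuedAt >= minAgeMs;
}

const justEnqueued: RecoveryEntry = { retryCount: 0, enqueuedAt: 1000 };
console.log(isOldEnoughToDrain(justEnqueued, 1500));  // false: 0.5s old
console.log(isOldEnoughToDrain(justEnqueued, 12000)); // true: 11s old
```

A time threshold only narrows the window: a live send that runs longer than the threshold could still be replayed by a reconnect drain.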

Workaround

Pin to v2026.3.13-1.

extent analysis

TL;DR

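The fix proposed in the issue is to restore the guard that keeps fresh entries from being drain-eligible, by modifying the selectEntry function passed to drainPendingDeliveries. The merged PR #70428 resolved the bug a different way: the live delivery path holds an in-memory claim on the queue entry during the send, so the reconnect drain skips it.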

Guidance

Identify the selectEntry function in login-BmQuD5Uw.js and modify it to include a check for entry.retryCount > 0 || entry.lastAttemptAt !== undefined to prevent fresh entries from being drain-eligible.
Alternatively, consider adding a minimum-age guard in isEntryEligibleForRecoveryRetry to prevent entries younger than a configurable threshold from being drain-eligible.
Verify the fix by checking the Railway logs for duplicate sends and ensuring that the cron/runs/*.jsonl records show only one delivered: true entry per run.
If modifying the code is not feasible, consider pinning to v2026.3.13-1 as a temporary workaround.

Example

selectEntry: (entry) => ({
  match: entry.channel === "whatsapp" &&
    normalizeReconnectAccountId(entry.accountId) === normalizedAccountId &&
    (entry.retryCount > 0 || entry.lastAttemptAt !== undefined),
  bypassBackoff: isNoListenerReconnectError(entry.lastError)
})

Notes

The proposed fix assumes that the issue is caused by the replacement of drainReconnectQueue with drainPendingDeliveries, and its wider selectEntry match, in v2026.4.14.
The minimum-age guard approach may require additional configuration and testing to determine the optimal threshold value.

Recommendation

Apply the proposed fix by modifying the selectEntry function to prevent fresh entries from being drain-eligible, as this approach directly addresses the root cause of the issue.
