hermes: WhatsApp bridge drops messages on bridge/gateway restart (no durable queue)

Summary

The WhatsApp bridge ACKs to WhatsApp the moment Baileys hands it a message, then holds it in a 100-slot in-memory array. Any bridge restart, gateway restart, or queue overflow loses messages with no recovery path. WhatsApp's server has already been told "delivered, you can forget" — so once they're gone from our process memory, they're gone for good.

This mirrors a real-world failure I hit: the gateway did a "stale code" graceful restart, the bridge process was killed (-15, i.e. SIGTERM), and messages received in that window never reached the downstream consumer. The user-visible symptom: chats appear incomplete with no error logged.

Current drop holes

All line refs against scripts/whatsapp-bridge/bridge.js and gateway/platforms/whatsapp.py on main:

  1. In-memory queue + silent overflow (bridge.js:182, 585-588). MAX_QUEUE_SIZE=100; oldest entries are shift()-ed off without a log.
  2. Destructive poll (bridge.js:629-632). GET /messages splices the entire queue out and ships it. If the gateway's HTTP response or message iteration fails, those events exist nowhere.
  3. Bridge restart wipes everything. No on-disk queue; messageQueue is rebuilt empty on every process start.
  4. Gateway has no replay path either. whatsapp.py:1121 polls, iterates handle_message(event), no ack call back to the bridge. _pending_messages in base.py is for mid-turn fragment merging, not crash recovery.
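
For illustration, the first two holes can be reproduced in a few lines. This is a reconstruction of the pattern described above, not the actual bridge.js code: only MAX_QUEUE_SIZE, messageQueue, shift() and splice come from the issue, while enqueue and pollMessages are hypothetical names.

```javascript
// Hypothetical reconstruction of the pattern; not copied from bridge.js.
const MAX_QUEUE_SIZE = 100; // value quoted in the issue
const messageQueue = [];

// Called once per inbound message, after WhatsApp has already been ACKed.
function enqueue(event) {
  messageQueue.push(event);
  while (messageQueue.length > MAX_QUEUE_SIZE) {
    messageQueue.shift(); // oldest event is dropped and nothing is logged
  }
}

// Destructive poll: the queue is emptied before the caller confirms anything.
function pollMessages() {
  return messageQueue.splice(0, messageQueue.length);
}

// Hole 1: overflow. 105 events arrive, only the newest 100 survive.
for (let i = 1; i <= 105; i++) enqueue({ id: i });
const oldestSurvivor = messageQueue[0].id;

// Hole 2: suppose the HTTP response carrying this batch is lost in transit.
const lostBatch = pollMessages();
// A retry finds an empty queue, so the batch exists nowhere.
const retryBatch = pollMessages();

console.log(oldestSurvivor, lostBatch.length, retryBatch.length); // 6 100 0
```

Events 1 through 5 vanish on overflow, and the 100 events in the lost batch vanish on the failed poll. Neither case produces a log line.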

Proposed design

One durable buffer at the bridge — same pattern WhatsApp's phone app uses (write to local SQLite before acking):

  • ~/.hermes/whatsapp/queue.jsonl — append-only, fsync per append. Each line: {seq, event_uid, ...event} where seq is a bridge-generated monotonic integer and event_uid is chatId:messageId(:participant) for dedupe.
  • ~/.hermes/whatsapp/queue.offset — single integer acked_up_to_seq, written via temp + rename + dir fsync.
  • GET /messages?limit=N — non-destructive, returns events with seq > acked_up_to_seq.
  • POST /ack {up_to_seq: N} — gateway calls after handle_message(event) completes. Bridge advances offset, compacts the JSONL on every Nth ack.
  • Delivery semantics: at-least-once. Gateway crash after handle completes but before ack lands → event replays on next poll. Documented behavior.
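
A minimal sketch of the buffer described above, assuming synchronous Node fs calls are acceptable on the bridge's hot path. The class and method names (DurableQueue, append, pending, ack) are hypothetical; only the file names, the record shape, and the temp + rename + directory fsync sequence come from the proposal. The demo uses a temporary directory rather than ~/.hermes/whatsapp.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// A torn final line (crash mid-write) parses to null and is skipped.
function parseLine(line) {
  try { return JSON.parse(line); } catch { return null; }
}

// Hypothetical class; the issue specifies the file layout, not this code.
class DurableQueue {
  constructor(dir) {
    this.dir = dir;
    fs.mkdirSync(dir, { recursive: true });
    this.logPath = path.join(dir, "queue.jsonl");
    this.offsetPath = path.join(dir, "queue.offset");
    this.ackedUpToSeq = fs.existsSync(this.offsetPath)
      ? parseInt(fs.readFileSync(this.offsetPath, "utf8"), 10) || 0
      : 0;
    // Recover the highest seq from disk so numbering stays monotonic across restarts.
    const last = this.readAll().pop();
    this.nextSeq = Math.max(last ? last.seq : 0, this.ackedUpToSeq) + 1;
    this.fd = fs.openSync(this.logPath, "a");
  }

  readAll() {
    if (!fs.existsSync(this.logPath)) return [];
    return fs.readFileSync(this.logPath, "utf8").split("\n").map(parseLine).filter(Boolean);
  }

  // Append and fsync BEFORE the caller acknowledges the message upstream.
  append(event) {
    const eventUid = [event.chatId, event.messageId, event.participant].filter(Boolean).join(":");
    const record = { ...event, seq: this.nextSeq++, event_uid: eventUid };
    fs.writeSync(this.fd, JSON.stringify(record) + "\n");
    fs.fsyncSync(this.fd);
    return record.seq;
  }

  // Non-destructive read: returns unacked events and leaves the log untouched.
  pending(limit = 100) {
    return this.readAll().filter((r) => r.seq > this.ackedUpToSeq).slice(0, limit);
  }

  // Cumulative ack: persist the offset via temp file + rename + directory fsync.
  ack(upToSeq) {
    const target = Math.min(upToSeq, this.nextSeq - 1);
    if (target <= this.ackedUpToSeq) return;
    const tmp = this.offsetPath + ".tmp";
    const tmpFd = fs.openSync(tmp, "w");
    fs.writeSync(tmpFd, String(target));
    fs.fsyncSync(tmpFd);
    fs.closeSync(tmpFd);
    fs.renameSync(tmp, this.offsetPath);
    try {
      const dirFd = fs.openSync(this.dir, "r");
      fs.fsyncSync(dirFd);
      fs.closeSync(dirFd);
    } catch {
      // Directory fsync is not available on every platform; skipped there.
    }
    this.ackedUpToSeq = target;
  }

  close() { fs.closeSync(this.fd); }
}

// Demo: two events arrive, one is acked, then the process "restarts".
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "wa-queue-"));
let q = new DurableQueue(dir);
q.append({ chatId: "c1", messageId: "m1", body: "hello" });
q.append({ chatId: "c1", messageId: "m2", body: "world" });
q.ack(1);
q.close();

q = new DurableQueue(dir); // state is rebuilt from disk, not from memory
const afterRestart = q.pending();
console.log(afterRestart.map((r) => r.event_uid)); // [ 'c1:m2' ]
q.close();
```

Re-reading the whole file on every poll keeps the sketch short. A real implementation would more likely keep an in-memory index and touch the disk only on append and ack. Because delivery is at-least-once, the consumer is expected to dedupe on event_uid.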

Scope: ~80 lines in bridge.js, ~10 in whatsapp.py, one new test.

Open questions for maintainers

  1. Baileys' built-in makeInMemoryStore / message-store. Worth using instead of a custom JSONL? Last I checked it's still in-memory by default; happy to follow whatever shape you prefer for the on-disk part.
  2. Ack granularity. Per-event up_to_seq seems sufficient for a personal-bridge use case. Any reason to support batch ack or range ack?
  3. Compaction strategy. Every-N-acks vs. on graceful shutdown only — either is fine; preference?
  4. Outbound /send durability. I'd scope this PR to inbound only. Outbound (gateway → bridge → WhatsApp) has its own gap if the bridge dies mid-send, but it's a separate problem with different tradeoffs.
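
To make question 3 concrete, here is one possible shape for the every-N-acks variant. The names compactLog and makeAckCounter and the value of N are assumptions for illustration; nothing here is existing bridge code.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: rewrite the log, keeping only records newer than the acked offset.
function compactLog(logPath, ackedUpToSeq) {
  const kept = fs.readFileSync(logPath, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .filter((line) => JSON.parse(line).seq > ackedUpToSeq);
  const tmp = logPath + ".compact";
  const fd = fs.openSync(tmp, "w");
  fs.writeSync(fd, kept.length ? kept.join("\n") + "\n" : "");
  fs.fsyncSync(fd);
  fs.closeSync(fd);
  fs.renameSync(tmp, logPath); // replaces the old log in a single step
  return kept.length;
}

// Hypothetical trigger: run onCompact once every n acks.
function makeAckCounter(n, onCompact) {
  let count = 0;
  return () => {
    count += 1;
    if (count % n === 0) onCompact();
  };
}

// Demo: five records on disk, seq 1..3 already acked.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "wa-compact-"));
const demoLog = path.join(demoDir, "queue.jsonl");
const lines = [1, 2, 3, 4, 5].map((seq) => JSON.stringify({ seq, body: "msg " + seq }));
fs.writeFileSync(demoLog, lines.join("\n") + "\n");

const remaining = compactLog(demoLog, 3);

let compactions = 0;
const onAck = makeAckCounter(2, () => { compactions += 1; });
for (let i = 0; i < 5; i++) onAck();

console.log(remaining, compactions); // 2 2
```

One caveat with this approach: an append file descriptor opened before the rename still points at the old file, so the bridge would need to reopen the log after each compaction. Compacting only on graceful shutdown avoids that, at the cost of a log that grows for the lifetime of the process.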

Happy to open a draft PR if the design direction is acceptable. Don't want to burn time on a shape you'd want different.


Repro context: hermes-agent gateway with self-chat mode, default config, no exotic flags.
