openclaw - ✅(Solved) Fix [Bug]: Telegram update_id is acked before outbound sendMessage confirms — restart-during-write loses forum-topic replies permanently [2 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#77000 (fetched 2026-05-04 04:59:34)
Comments: 2 · Participants: 3 · Timeline events: 8 · Reactions: 2
Timeline (top): cross-referenced ×4, commented ×2, mentioned ×1, subscribed ×1


Root Cause

openclaw acks the Telegram update_id (advances lastUpdateId in the offset store) when the embedded run starts processing the message, not when the assistant's final outbound sendMessage confirms delivery. A gateway restart between those two events drops the pending reply, and because the offset has already advanced, Telegram does not redeliver the update.

Fix / Workaround

The linked PR moves acceptUpdateId from beginUpdate to finishUpdate (when completed=true), so the offset advances only after the handler finishes. An update interrupted by a restart is then redelivered by Telegram and reprocessed. The reply may be duplicated, which the PR treats as preferable to silent loss.


PR fix notes

PR #1: fix(telegram): defer update_id offset ack until after handler completes (#77000)

Description (problem / solution / changelog)

Fix: Telegram update_id acked before outbound sendMessage confirms

Problem

Telegram update_id was being acknowledged (and the offset persisted to disk) in beginUpdate, before the handler ran and before sendMessage confirmed delivery. If the gateway restarted between these events, the offset had already advanced past the message, so Telegram would not resend it. The transcript showed that the assistant had written a reply, but the reply was never delivered and was permanently lost.

Timeline (before fix)

  1. 16:01:46 - User message arrives (update_id 79669836)
  2. 16:02:43 - Gateway acks offset (advances past 79669836), session recorded
  3. 16:03:55 - Assistant final text written to session.jsonl
  4. 16:03:45 - Gateway restart (queue lost, offset already past message)
  5. User never sees reply

Solution

Move acceptUpdateId from beginUpdate to finishUpdate (when completed=true). Now the offset only advances after the handler finishes. If the gateway restarts before finishUpdate runs, the update is re-delivered by Telegram and reprocessed (reply may duplicate, which is preferable to silent loss).
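
The change described above can be sketched roughly as follows. This is a minimal illustration, not the contents of bot-update-tracker.ts: the class name UpdateTracker, the persistOffset callback, and the field names are hypothetical, and the sketch assumes updates are handled one at a time. Only the method names beginUpdate, finishUpdate, and acceptUpdateId come from the PR description.

```typescript
// Hypothetical sketch; the real bot-update-tracker.ts may be structured differently.
type PersistOffset = (lastUpdateId: number) => void;

class UpdateTracker {
  private lastAcceptedId: number | null = null;
  private readonly inFlight = new Set<number>();
  private readonly persistOffset: PersistOffset;

  constructor(persistOffset: PersistOffset) {
    this.persistOffset = persistOffset;
  }

  // Called when an update is handed to its handler.
  beginUpdate(updateId: number): void {
    // Before the fix, the offset was accepted and persisted at this point.
    this.inFlight.add(updateId);
  }

  // Called when the handler returns; `completed` is false if it was aborted.
  finishUpdate(updateId: number, completed: boolean): void {
    this.inFlight.delete(updateId);
    if (completed) {
      this.acceptUpdateId(updateId);
    }
  }

  private acceptUpdateId(updateId: number): void {
    if (this.lastAcceptedId !== null && updateId <= this.lastAcceptedId) {
      return; // never move the offset backwards
    }
    this.lastAcceptedId = updateId;
    this.persistOffset(updateId);
  }
}
```

With this ordering, a process that dies after beginUpdate but before finishUpdate leaves the persisted offset untouched, so the update is fetched again on the next start.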

Fixes #77000

Changed files

  • extensions/telegram/src/bot-update-tracker.ts (modified, +10/-1)

PR #77075: fix(telegram): defer update_id offset ack until after handler completes (#77000)

Description, timeline, solution, and changed files are identical to PR #1 above.


Summary

Telegram forum-topic messages can be permanently lost in a specific narrow window: openclaw acknowledges the inbound update_id to Telegram before the assistant's final outbound sendMessage has confirmed delivery. If the gateway is interrupted (watchdog restart, OOM kill, intentional systemctl restart, crash) between those two events, the new gateway never reprocesses the message — Telegram considers it delivered (offset advanced past it) and the user never sees a reply.

This is essentially the same symptom @bubucilo reported in #76554 (closed by maintainer 2026-05-03 16:01 UTC without an explicit resolution comment), but I'd like to file it again with a fresh production reproduction and a focused architectural ask, because the acked-vs-delivered race appears to be a structural issue independent of the 5.2-specific Telegram polling regression that's already merged in main (per #76735 / #76388).

Reproduction (verbatim from production)

  • openclaw 2026.4.29 (a448042) — pinned here because the 5.2 upgrade path is blocked by externalized-plugin install issues (see #76586).
  • Linode VPS, Singapore region.
  • Telegram supergroup -1003888407906, Invest topic topic:31. Session e856c600-78b6-4035-8eff-d792002e3ba2-topic-31.
  • The trigger is any gateway restart that interrupts the window between assistant-final-write and outbound-send. The specific trigger in our case happened to be a systemctl --user restart openclaw-gateway, but the same race fires on openclaw doctor --fix, openclaw update, OOM kill, segfault, or any other restart path. Restart frequency varies by environment, but the race itself is environment-independent.

Timeline (2026-05-03)

16:01:46  user → topic:31  "分析前面三支 etf 的价值" [Analyze the value of the first three ETFs]  (Telegram update_id 79669836)
16:02:43  session.jsonl: user message recorded; embedded run dispatched
            → at this point Telegram offset has already been advanced
              past 79669836 (openclaw's poller acks long-poll batches eagerly)
16:03:55  session.jsonl: assistant final text written
            "Got it — the original post by @tychozzz lists 5 ETFs he's DCA-ing,
             and 前面三支 [the first three] = SMH, DRAM, BOTZ. Let me pull current data on all three."
            (no sendMessage logged yet — outbound queue still pending)
16:03:45  systemctl --user restart openclaw-gateway issued
            (in our case from a periodic health check; could equally be
             openclaw update, openclaw doctor --fix, OOM, or operator action)
16:04:20  Old gateway shutdown completes; new gateway PID 391548 starts
            → outbound queue from old PID is gone
            → Telegram offset is past 79669836, so new gateway does not see this update again
16:10:50  New gateway processes a different topic (topic:1) — 8 sendMessage events,
            none for topic:31

User waits, sees nothing in topic:31, eventually re-sends the prompt manually.

Diagnostic confirmation

session.jsonl for topic:31 contains the assistant text written at 16:03:55. journalctl --user -u openclaw-gateway.service from 16:03 to current contains zero [telegram] sendMessage ok chat=-1003888407906 message=... whose message_thread_id resolves to topic 31. The offset file ~/.openclaw/telegram/update-offset-default.json advanced past 79669836, so the new gateway will never re-fetch that user message.
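
The offset check in this step could be scripted along these lines. This is a sketch under assumptions: the report names lastUpdateId as the value stored in the offset file, but the exact JSON shape is not shown, and the sketch assumes lastUpdateId holds the highest update_id already acked. The function name willBeRefetched is hypothetical.

```typescript
// Assumed shape of the offset store; only lastUpdateId is named in the report.
interface OffsetStore {
  lastUpdateId: number;
}

// Returns true if the next poll would still return this update, assuming
// lastUpdateId is the highest update_id that has already been acked.
function willBeRefetched(offsetStoreJson: string, updateId: number): boolean {
  const store = JSON.parse(offsetStoreJson) as OffsetStore;
  if (typeof store.lastUpdateId !== "number") {
    throw new Error("offset store has no numeric lastUpdateId");
  }
  return updateId > store.lastUpdateId;
}
```

For the incident above, passing the contents of update-offset-default.json and 79669836 would be expected to return false, which matches the observation that the new gateway never saw the message again.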

This is a structural data loss: the user prompt was processed, the assistant generated a reply, the reply was written to the local transcript, and the user has no way to receive it.

What's distinct from already-closed reports

  • #76554 (closed 2026-05-03 16:01:59 UTC by maintainer, no resolution comment): same end-state symptom (transcript written, channel silent). That report was on 2026.5.2 and didn't tie the failure to a specific interruption trigger; OP @bubucilo described "many topics" silently failing without correlating to gateway lifecycle. Our report ties the loss to a deterministic trigger (any gateway restart between assistant-final-write and outbound-send) which is reproducible on demand.
  • #76388 (closed 2026-05-03, "fixed in main" via #76735): addresses Telegram polling startup not entering long-poll on high-RTT hosts. That's an upstream bug (different code path); this issue is about the downstream delivery handoff.
  • #50716 / #51659 / #66459 (older, all open or closed without fix): describe the same symptom historically, which suggests a long-standing race that has not been structurally addressed.

Suggested fix direction

The root issue is that openclaw acks the Telegram update_id (advances lastUpdateId in the offset store) when the embedded run starts processing the message, not when the assistant's final outbound sendMessage confirms delivery. Two structural fixes that would prevent the loss:

Option A: defer offset advancement until sendMessage confirms

Hold the inbound update_id un-acked until the corresponding outbound sendMessage returns 200 OK. If the run is interrupted, the next gateway boot's getUpdates will receive the same update again and re-process it. Cost: 1 extra round-trip on slow turns; user might get a duplicate response if a turn does deliver but the offset write fails (need idempotency at the channel-send layer).
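
A minimal sketch of Option A, assuming a polling loop with four injected functions. The interface PollDeps and the function pollOnce are hypothetical and do not reflect openclaw's actual poller. The sketch relies on the documented Bot API behavior that an update counts as confirmed once getUpdates is called with an offset higher than its update_id.

```typescript
interface Update {
  update_id: number;
  text: string;
}

// All four functions are hypothetical seams for this sketch.
interface PollDeps {
  getUpdates(offset: number): Promise<Update[]>;
  handle(update: Update): Promise<string>;
  sendMessage(reply: string): Promise<void>; // must reject if delivery is not confirmed
  saveOffset(offset: number): Promise<void>;
}

// Processes one batch and returns the offset to use for the next getUpdates call.
async function pollOnce(offset: number, deps: PollDeps): Promise<number> {
  const updates = await deps.getUpdates(offset);
  let next = offset;
  for (const update of updates) {
    const reply = await deps.handle(update);
    await deps.sendMessage(reply); // a failure here leaves the saved offset untouched
    next = update.update_id + 1;
    await deps.saveOffset(next);
  }
  return next;
}
```

If the process stops anywhere before saveOffset, the next start polls from the old offset and receives the same update again, which is the at-least-once behavior this option asks for.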

Option B: persist the outbound queue across restart

After the assistant final text is written to the transcript, queue the channel-send to a durable disk-backed queue keyed on session+turn. On gateway restart, drain the queue before resuming polling. Cost: more state to manage; potential duplicate sends if a previous gateway both queued AND sent before crash. Idempotency on the channel side (Telegram + a turn-id watermark) is needed.
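
Option B could look roughly like the sketch below. Everything here is hypothetical: the DurableStore interface stands in for a disk-backed queue, InMemoryStore exists only so the example runs without file I/O, and the key format is one possible reading of "keyed on session+turn".

```typescript
interface PendingSend {
  sessionId: string;
  turnId: string;
  text: string;
}

// Hypothetical storage seam; a real implementation would write to disk and fsync.
interface DurableStore {
  put(key: string, value: PendingSend): void;
  remove(key: string): void;
  list(): PendingSend[];
}

// Stand-in used only to keep this sketch self-contained.
class InMemoryStore implements DurableStore {
  private readonly items = new Map<string, PendingSend>();
  put(key: string, value: PendingSend): void { this.items.set(key, value); }
  remove(key: string): void { this.items.delete(key); }
  list(): PendingSend[] { return Array.from(this.items.values()); }
}

const queueKey = (m: PendingSend): string => `${m.sessionId}:${m.turnId}`;

type Send = (m: PendingSend) => Promise<void>;

// Persist first, then send; remove the entry only after the send resolves.
async function enqueueAndSend(store: DurableStore, send: Send, msg: PendingSend): Promise<void> {
  store.put(queueKey(msg), msg);
  await send(msg);
  store.remove(queueKey(msg));
}

// Run on startup, before polling resumes, to deliver anything left behind.
async function drain(store: DurableStore, send: Send): Promise<number> {
  let delivered = 0;
  for (const msg of store.list()) {
    await send(msg);
    store.remove(queueKey(msg));
    delivered += 1;
  }
  return delivered;
}
```

A crash after send resolves but before remove runs would cause drain to send the same message again, which is the duplicate-send case the option's cost note mentions.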

Option A is cheaper to implement and matches how most polling consumers handle at-least-once delivery (don't ack until processed). Option B is more general (handles other channels too) but requires more infrastructure.

The current behavior is at-most-once, which is the wrong default for this class of payload (user-visible AI replies to prompts that the user may have spent significant time formulating).

Environment

  • openclaw: 2026.4.29 (a448042)
  • OS: Linux (Linode VPS, Singapore region)
  • Telegram channel mode: long polling
  • Affected topic types: forum topics specifically (DM + non-forum group chats less affected because they don't have the same offset-vs-thread-id state to lose)
  • Restart triggers that reproduce the race in production: openclaw doctor --fix (which itself does a systemctl restart), openclaw update (without --no-restart), manual systemctl restart for hook reloads, OOM, and any process-level health-check that wraps systemctl restart — the bug is in openclaw's ack-vs-deliver ordering, not in any of these triggers individually.

Happy to provide stability bundle, full session JSONL for the 2026-05-03 16:01 incident, the offset store before/after, or a CPU profile.

extent analysis

TL;DR

Defer offset advancement until sendMessage confirms delivery to prevent message loss during gateway restarts.

Guidance

  1. Verify the issue: Check the session.jsonl file for the affected topic to confirm that the assistant text was written but not sent due to the gateway restart.
  2. Understand the root cause: The issue arises from openclaw acknowledging the Telegram update_id before the assistant's final outbound sendMessage confirms delivery, leading to lost messages during gateway restarts.
  3. Consider Option A: Defer offset advancement until sendMessage confirms delivery to ensure at-least-once delivery and prevent message loss.
  4. Evaluate Option B: Persist the outbound queue across restarts as an alternative solution, which requires more infrastructure but handles other channels as well.

Example

The issue itself does not include a code snippet; the fix is a structural change to the openclaw implementation.

Notes

The chosen solution should consider the trade-offs between the two options, including the potential for duplicate responses and the need for idempotency at the channel-send layer.
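
One way to picture idempotency at the channel-send layer is a ledger of delivered turns that is consulted before each send. The names DeliveryLedger and sendOnce are hypothetical, and the sketch assumes the ledger contents are persisted and reloaded across restarts, which is not shown here.

```typescript
// Hypothetical ledger of turns already delivered to the channel.
class DeliveryLedger {
  private readonly delivered: Set<string>;

  constructor(previouslyDelivered: string[] = []) {
    this.delivered = new Set(previouslyDelivered);
  }

  has(key: string): boolean { return this.delivered.has(key); }
  mark(key: string): void { this.delivered.add(key); }
}

// Sends at most once per key for as long as the ledger survives.
function sendOnce(ledger: DeliveryLedger, key: string, send: () => void): boolean {
  if (ledger.has(key)) {
    return false; // duplicate from a redelivered update; skip it
  }
  send();
  ledger.mark(key); // a crash between send() and mark() can still duplicate
  return true;
}
```

This narrows the duplicate window but does not close it, since the send and the ledger write are not atomic.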

Recommendation

Apply Option A (defer offset advancement until sendMessage confirms): it is cheaper to implement and matches the at-least-once delivery approach used by most polling consumers.
