openclaw - 💡(How to fix) Fix Telegram delivery reliability: polling stalls can lead to silent outbound message loss [3 comments, 3 participants]

GitHub stats: openclaw/openclaw#50040 (fetched 2026-04-08 00:59:56) · 3 comments · 3 participants · 4 timeline events (commented ×3, cross-referenced ×1) · 0 reactions

On OpenClaw 2026.3.12, Telegram Bot API connectivity can remain generally healthy while the gateway's Telegram polling loop intermittently stalls and restarts. During those recovery windows, outbound sendMessage calls can fail, and the runtime recovery path is not strong enough to redeliver them, which leads to silent or operator-visible message loss.

This appears to be a gap between:

  • polling restart/recovery behavior, and
  • outbound delivery recovery behavior for non-idempotent Telegram sends.




Observed behavior

In production use, logs repeatedly showed patterns like:

  • Polling stall detected
  • sendChatAction failed
  • sendMessage failed: Network request for 'sendMessage' failed!
  • polling runner stop/restart cycles

At the same time:

  • direct short HTTPS/Bot API probes to api.telegram.org succeeded,
  • DNS and IPv4 routing looked healthy,
  • the failure pattern was intermittent rather than a full Telegram outage.

This suggests the issue is not simply "Telegram unreachable", but rather that the long-poll / recovery path can degrade and outbound delivery is not fully protected when that happens.
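A short probe like the ones described above can be scripted with the Python standard library. The sketch below makes a single getMe call (a cheap Bot API method that returns basic information about the bot) with a short timeout and reports success and latency. The function names and the 5-second default timeout are assumptions for illustration. A probe that passes while the gateway is logging Polling stall detected points at the long-poll/recovery path rather than at basic reachability.

```python
import json
import time
import urllib.request

API_BASE = "https://api.telegram.org"


def bot_api_url(token: str, method: str) -> str:
    # Bot API methods are served at https://api.telegram.org/bot<token>/<method>.
    return f"{API_BASE}/bot{token}/{method}"


def probe_bot_api(token: str, timeout: float = 5.0) -> tuple[bool, float, str]:
    """Make one short getMe call; return (ok, latency_seconds, error_detail)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(bot_api_url(token, "getMe"), timeout=timeout) as resp:
            body = json.load(resp)
        return bool(body.get("ok")), time.monotonic() - started, ""
    except Exception as exc:  # DNS, TLS, timeout, HTTP 4xx/5xx, or bad JSON
        return False, time.monotonic() - started, repr(exc)
```

Running it in a loop from the gateway host during a stall, and lining the results up against the gateway log timestamps, separates "Telegram unreachable" from "polling path degraded".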

Why this is harmful

A message can be prepared by the assistant yet still fail to reach Telegram during a polling/recovery disruption. From the operator's perspective, this looks like silent message loss or a partially lost reply.

Suspected design gap

There is already a disk-backed outbound delivery queue and startup recovery, but runtime delivery recovery appears insufficient for this failure mode.

The practical gap seems to be:

  1. polling stalls or restarts,
  2. outbound Telegram send fails during that window,
  3. recovery is not strong enough as a continuous runtime mechanism (the existing recovery pass runs at gateway startup),
  4. result: the delivery may be stuck, dropped, or ambiguous from the operator's perspective.
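To check whether a given deployment matches this sequence, one can correlate the log lines quoted under Observed behavior. A minimal sketch, assuming the gateway log has already been parsed into time-sorted (timestamp, message) pairs; that parsing step, the function name, and the 60-second window are assumptions, and only the two marker substrings come from the logs in this issue.

```python
from datetime import datetime, timedelta

STALL_MARKER = "Polling stall detected"
SEND_FAILURE_MARKER = "sendMessage failed"


def sends_failed_during_stalls(entries, window=timedelta(seconds=60)):
    """Return sendMessage failures logged within `window` after a polling stall.

    `entries` is an iterable of (datetime, message) pairs sorted by time.
    """
    last_stall = None
    hits = []
    for ts, message in entries:
        if STALL_MARKER in message:
            last_stall = ts  # remember the most recent stall
        elif SEND_FAILURE_MARKER in message and last_stall is not None:
            if ts - last_stall <= window:
                hits.append((ts, message))
    return hits
```

A non-empty result means outbound sends failed inside a stall window, which is exactly the case the runtime recovery path needs to cover.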

Proposed direction

A robust fix would combine:

  1. Runtime outbound delivery recovery worker

    • periodically scan pending deliveries
    • retry only safe-to-retry entries
    • run without requiring gateway restart
    • trigger an immediate recovery pass after Telegram polling restart/recovery
  2. Delivery failure classification

    • safe_to_retry
    • ambiguous
    • permanent
  3. Stateful delivery entries

    • pending
    • retryable
    • ambiguous
    • delivered
    • failed

This would reduce silent message loss while avoiding blind retries for ambiguous non-idempotent send outcomes.
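The failure classes and delivery states above fit together as a small state machine. The Python sketch below is one way to encode it. The enum names, the class-to-state mapping, and the rule that only pending/retryable entries may be sent automatically are assumptions read off the lists above, not OpenClaw's actual types.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    SAFE_TO_RETRY = "safe_to_retry"  # request provably never reached Telegram
    AMBIGUOUS = "ambiguous"          # Telegram may or may not have applied it
    PERMANENT = "permanent"          # retrying cannot help (e.g. chat not found)


class DeliveryState(Enum):
    PENDING = "pending"
    RETRYABLE = "retryable"
    AMBIGUOUS = "ambiguous"
    DELIVERED = "delivered"
    FAILED = "failed"


# Where a failed send attempt leaves the entry, per failure class.
STATE_AFTER_FAILURE = {
    FailureClass.SAFE_TO_RETRY: DeliveryState.RETRYABLE,
    FailureClass.AMBIGUOUS: DeliveryState.AMBIGUOUS,
    FailureClass.PERMANENT: DeliveryState.FAILED,
}

# Only these states are sent automatically; ambiguous entries are held.
AUTO_SEND_STATES = {DeliveryState.PENDING, DeliveryState.RETRYABLE}


def after_send_attempt(
    state: DeliveryState, failure: Optional[FailureClass] = None
) -> DeliveryState:
    """Next state after an automatic send attempt (failure=None means success)."""
    if state not in AUTO_SEND_STATES:
        raise ValueError(f"{state.value} entries must not be sent automatically")
    return DeliveryState.DELIVERED if failure is None else STATE_AFTER_FAILURE[failure]
```

With this encoding, an ambiguous sendMessage outcome can never be re-sent by the automatic path; resolving it to delivered or failed is left to an explicit operator or reconciliation step.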

Expected behavior

When Telegram polling experiences a transient stall/restart:

  • outbound deliveries should not be silently lost,
  • safe transient failures should be retried automatically,
  • ambiguous failures should be preserved/held rather than blindly retried,
  • gateway restart should not be required to recover eligible deliveries.

Version notes

  • Reproduced on: 2026.3.12
  • A local comparison against 2026.3.13 did not reveal an obvious upstream runtime recovery worker / failure-classification / stateful-outbox implementation for this specific Telegram delivery gap.

Optional note

A local prototype patch implementing the runtime worker, failure classes, and stateful delivery entries significantly improves the recovery model. It can serve as a more concrete direction for upstreaming if maintainers want one.


Fix Plan

To address silent message loss caused by polling stalls and insufficient runtime delivery recovery, the plan is to add a runtime outbound delivery recovery worker, backed by delivery failure classification and stateful delivery entries. The worker periodically scans pending deliveries, retries only safe-to-retry entries, and runs without requiring a gateway restart.

Step-by-Step Solution:

  1. Implement Runtime Outbound Delivery Recovery Worker:
    • Create a worker that periodically scans the pending deliveries queue.
    • Use a scheduling library (e.g., schedule in Python) to run the worker at regular intervals.
    • Example Python code snippet:

```python
import time

import schedule  # third-party scheduler: pip install schedule


def get_pending_deliveries():
    # Hypothetical helper: load pending/retryable entries from the outbound queue.
    return []


def recovery_worker():
    # Retry only entries whose last failure was classified safe_to_retry (step 2).
    # retry_delivery() is sketched in step 3.
    for delivery in get_pending_deliveries():
        if delivery.get('failure_class') == 'safe_to_retry':
            retry_delivery(delivery)


schedule.every(1).minutes.do(recovery_worker)  # run every minute

while True:
    schedule.run_pending()
    time.sleep(1)
```

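  2. Implement Delivery Failure Classification:
    • Introduce failure classes: safe_to_retry, ambiguous, and permanent.
    • Update the delivery entry with the corresponding failure class when a failure occurs.
    • Example Python code snippet:

```python
# Illustrative failure-reason labels. For a non-idempotent sendMessage, only
# failures that prove the request never reached Telegram are safe to retry;
# a generic network error may already have been delivered, so hold it as ambiguous.
SAFE_TO_RETRY_REASONS = {'connect_error', 'dns_error'}
AMBIGUOUS_REASONS = {'network_error', 'timeout', 'ambiguous_error'}


def classify_failure(delivery, failure_reason):
    if failure_reason in SAFE_TO_RETRY_REASONS:
        delivery['failure_class'] = 'safe_to_retry'
    elif failure_reason in AMBIGUOUS_REASONS:
        delivery['failure_class'] = 'ambiguous'
    else:
        delivery['failure_class'] = 'permanent'
    return delivery['failure_class']
```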
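  3. Implement Stateful Delivery Entries:
    • Introduce states: pending, retryable, ambiguous, delivered, and failed.
    • Update the delivery entry state based on the failure class and retry outcome.
    • Example Python code snippet:

```python
# Next state for each failure class (see step 2).
STATE_FOR_CLASS = {'safe_to_retry': 'retryable', 'ambiguous': 'ambiguous', 'permanent': 'failed'}


def update_delivery_state(delivery, new_state):
    delivery['state'] = new_state


def retry_delivery(delivery):
    # send_message() is a hypothetical wrapper around the Telegram sendMessage call:
    # it returns None on success or a failure reason string on error.
    failure_reason = send_message(delivery)
    if failure_reason is None:
        update_delivery_state(delivery, 'delivered')
        return
    # Re-classify the new failure so an ambiguous outcome is held, not blindly retried.
    classify_failure(delivery, failure_reason)
    update_delivery_state(delivery, STATE_FOR_CLASS[delivery['failure_class']])
```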
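The schedule-based loop in step 1 only runs on a fixed interval, while the proposed direction also asks for an immediate recovery pass after a Telegram polling restart/recovery. Below is a minimal standard-library sketch of a worker that does both. RecoveryWorker, kick(), and the idea of calling kick() from a polling-restart hook are assumptions; the issue does not name the gateway's actual hook.

```python
import logging
import threading
from typing import Callable

log = logging.getLogger("delivery-recovery")


class RecoveryWorker:
    """Run `recovery_pass` every `interval` seconds, or at once when kicked."""

    def __init__(self, recovery_pass: Callable[[], None], interval: float = 60.0):
        self._recovery_pass = recovery_pass
        self._interval = interval
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def kick(self) -> None:
        # Intended to be called from a (hypothetical) polling-restart hook.
        self._wake.set()

    def stop(self) -> None:
        self._stop.set()
        self._wake.set()  # interrupt the current wait
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self._recovery_pass()
            except Exception:
                # One failing pass must not kill the worker.
                log.exception("delivery recovery pass failed")
            # Sleep until the next interval, or until kick()/stop() wakes us.
            self._wake.wait(self._interval)
            self._wake.clear()
```

Extra kicks are harmless as long as the pass itself only resends pending/retryable entries: ambiguous entries stay held no matter how often the worker runs.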


Verification
To verify the fix, monitor the pending deliveries queue and the delivery failure rates. After a polling stall or restart, eligible entries should drain without a gateway restart, safe-to-retry failures should be resent automatically by the recovery worker, and ambiguous entries should be held for review rather than resent.
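Silent losses are hard to count directly, so a practical proxy is the queue itself: entries left unsent long after a polling restart, and ambiguous entries waiting for review. A minimal sketch, assuming each entry is a dict with hypothetical state and enqueued_at (epoch seconds) fields; the 300-second staleness threshold is arbitrary.

```python
import time
from collections import Counter
from typing import Optional


def queue_health(
    deliveries: list, now: Optional[float] = None, stale_after: float = 300.0
) -> dict:
    """Summarize an outbound-queue snapshot for dashboards or alerts."""
    now = time.time() if now is None else now
    by_state = Counter(d["state"] for d in deliveries)
    # Unsent entries older than `stale_after` seconds suggest recovery is not running.
    stale = sum(
        1
        for d in deliveries
        if d["state"] in ("pending", "retryable") and now - d["enqueued_at"] > stale_after
    )
    return {
        "by_state": dict(by_state),
        "held_ambiguous": by_state["ambiguous"],
        "stale_unsent": stale,
    }
```

Alerting on a non-zero stale_unsent and on a growing held_ambiguous turns would-be silent losses into operator-visible signals.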

Extra Tips
  • Ensure the recovery worker is properly configured and running at regular intervals.
  • Monitor the delivery failure rates and adjust the failure classification and retry logic as needed.
  • Consider implementing a maximum retry limit to prevent infinite retries.
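For the retry-limit tip, one common approach is capped exponential backoff with jitter plus a maximum attempt count. A minimal sketch, assuming each delivery is a dict with hypothetical attempts and state fields; the defaults (5 attempts, 2 s base, 300 s cap) are illustrative. When Telegram rejects a send with HTTP 429, its response parameters include retry_after, which the sketch lets the caller pass in so it takes precedence over the computed delay.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def record_retry_failure(
    delivery: dict, max_attempts: int = 5, retry_after: Optional[float] = None
) -> Optional[float]:
    """Count a failed safe-to-retry attempt.

    Returns the delay before the next attempt, or None once the entry is given up.
    """
    delivery["attempts"] = delivery.get("attempts", 0) + 1
    if delivery["attempts"] >= max_attempts:
        delivery["state"] = "failed"  # stop retrying; surface to the operator
        return None
    delivery["state"] = "retryable"
    if retry_after is not None:
        return float(retry_after)  # honour Telegram's flood-control hint
    return backoff_delay(delivery["attempts"])
```

Marking exhausted entries as failed, rather than dropping them, keeps the give-up decision visible in the queue.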
