hermes - ✅(Solved) Fix Telegram Updater goes silent forever after a single network blip; reconnect ladder swallows the cleanup TimedOut and never re-fires [1 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#18086 (fetched 2026-05-01 05:53:57)
Comments: 0 · Participants: 1 · Timeline events: 5 (labeled ×4, cross-referenced ×1) · Reactions: 0

In gateway/platforms/telegram.py, _handle_polling_network_error exhausts a 10-attempt exponential ladder before declaring the adapter retryable-fatal. In practice, a single transient getUpdates 502 / Bad Gateway can trip the FIRST reconnect attempt into a state where:

  1. await self._app.updater.stop() raises telegram.error.TimedOut from python-telegram-bot's _get_updates_cleanup (the "Suppressing error to ensure graceful shutdown" path).
  2. The surrounding try / except Exception: pass in _handle_polling_network_error swallows the exception and proceeds to await self._app.updater.start_polling(...).
  3. start_polling() returns without raising — but the underlying long-poll never actually resumes consuming updates.
  4. No further error callback fires (polling is "alive" per Updater.running but quietly stuck), so the reconnect ladder never advances past attempt 1, the fatal-error path is never reached, and the gateway process keeps running indefinitely without any working Telegram polling.

The container looks healthy at every external surface — process is up, logs report only the initial "reconnecting in 5s, attempt 1/10" line, and Updater.running is True — but Telegram messages are silently dropped at the polling layer for hours/days until manually restarted.
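The ladder's timing can be sketched from the two delays quoted in the logs (5s on attempt 1, 10s on attempt 2). This is a minimal illustration that assumes plain doubling from a 5-second base. `reconnect_delay` and its `max_delay` cap are hypothetical names, and the real constants in gateway/platforms/telegram.py may differ.

```python
from typing import Optional

MAX_NETWORK_RETRIES = 10  # from "attempt 1/10" in the log
BASE_DELAY = 5.0          # assumed: inferred from "reconnecting in 5s" on attempt 1


def reconnect_delay(attempt: int, max_delay: Optional[float] = None) -> float:
    """Seconds to wait before reconnect attempt `attempt` (1-based), doubling each time."""
    if not 1 <= attempt <= MAX_NETWORK_RETRIES:
        raise ValueError(f"attempt must be in 1..{MAX_NETWORK_RETRIES}")
    delay = BASE_DELAY * 2 ** (attempt - 1)
    # max_delay is a hypothetical cap; the source does not say whether one exists.
    return delay if max_delay is None else min(delay, max_delay)


# Delays for the first three attempts under the doubling assumption.
schedule = [reconnect_delay(n) for n in range(1, 4)]
```

The point of the sketch is that each step is only taken when the error callback fires again. If polling is stuck rather than erroring, attempt 2 is never scheduled, no matter what the delay would have been.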

Error Message

<TS+0:00> WARNING gateway.platforms.telegram: [Telegram] Telegram network error, scheduling reconnect: Bad Gateway
<TS+0:00> WARNING gateway.platforms.telegram: [Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: Bad Gateway
<TS+0:25> ERROR telegram.ext.Updater: Error while calling `get_updates` one more time to mark all fetched updates. Suppressing error to ensure graceful shutdown. When polling for updates is restarted, updates may be fetched again. Please adjust timeouts via `ApplicationBuilder` or the parameter `get_updates_request` of `Bot`.
<TS+0:25> telegram.error.TimedOut: Timed out
[silence — no further log lines for 11 hours, until the container is manually restarted]

Root Cause

The reconnect path at gateway/platforms/telegram.py:295-310 (paraphrased):

try:
    if self._app and self._app.updater and self._app.updater.running:
        await self._app.updater.stop()
except Exception:
    pass

try:
    await self._app.updater.start_polling(
        allowed_updates=Update.ALL_TYPES,
        drop_pending_updates=False,
        error_callback=self._polling_error_callback_ref,
    )
    logger.info("[%s] Telegram polling resumed after network error ...")
    self._polling_network_error_count = 0
except Exception as retry_err:
    ...

Two issues compound:

  1. Cleanup-failure invisibility. When stop() raises (e.g. _get_updates_cleanup hits a TimedOut), the bare except Exception: pass discards information that the cleanup didn't complete. PTB's Updater may be left in a state where running is internally true but its consumer task is wedged.
  2. No probe that polling is actually working. start_polling() is treated as a confirmation of recovery. There is no subsequent verification that getUpdates calls are succeeding (e.g. an offset watermark advancing, a last_get_updates_at heartbeat, or a synthetic probe of the Updater's queue). When start_polling succeeds-on-paper but the polling task is wedged, no further error callback fires and the reconnect ladder never advances. The fatal-error path that an external supervisor relies on is never reached.

Independently — the same wedging is also a problem for operators running without an external supervisor (Docker without a restart policy, dev machines, etc.), since even the eventual fatal-error path depends on _polling_error_callback firing again, which it doesn't when polling is stuck rather than erroring.
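The first issue (cleanup-failure invisibility) could be narrowed by recording the failure instead of discarding it. The sketch below is illustrative only: `FakeUpdater` stands in for PTB's Updater, the built-in `TimeoutError` stands in for `telegram.error.TimedOut`, and `stop_updater_recording_failure` is a hypothetical helper, not code from the repository.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.platforms.telegram")


class FakeUpdater:
    """Stand-in for PTB's Updater; stop() can fail the way the issue describes."""

    def __init__(self, stop_error=None):
        self.running = True
        self._stop_error = stop_error

    async def stop(self):
        if self._stop_error is not None:
            # Cleanup failed: `running` stays True, mirroring the suspected wedge.
            raise self._stop_error
        self.running = False


async def stop_updater_recording_failure(updater) -> bool:
    """Stop the updater. Return True on a clean stop, False if cleanup raised."""
    if updater is None or not updater.running:
        return True
    try:
        await updater.stop()
    except Exception as exc:
        # Log instead of `pass`, so the caller knows the cleanup did not complete.
        logger.warning("Updater.stop() failed during reconnect: %r", exc)
        return False
    return True


clean = asyncio.run(stop_updater_recording_failure(FakeUpdater()))
dirty = asyncio.run(
    stop_updater_recording_failure(FakeUpdater(TimeoutError("Timed out")))
)
```

A caller could use the returned flag to decide whether a plain `start_polling()` is enough or whether the attempt should count as failed. This does not address the second issue, where `stop()` succeeds and `start_polling()` is the silent failure point.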

PR fix notes

PR #18088: fix(telegram): probe polling liveness after reconnect to detect wedged Updater

Description (problem / solution / changelog)

Closes #18086

Detect wedged Telegram polling after reconnect via heartbeat probe

Summary

_handle_polling_network_error currently treats Updater.start_polling() returning successfully as proof that polling has resumed. In practice, the underlying long-poll task can be left wedged on a stale httpx connection — Updater.running is True but no getUpdates calls actually progress, no error callback fires, and the reconnect ladder sits at attempt 1 forever. The fatal-error path is never reached, so the gateway runs indefinitely with no working Telegram polling.

This PR adds a deferred heartbeat probe scheduled after each successful start_polling() in the reconnect path. The probe verifies that the bot endpoint is reachable through the same client a healthy long-poll would use; on failure it re-enters the existing reconnect ladder so the proven escalation path (eventually _set_fatal_error(retryable=True) after MAX_NETWORK_RETRIES) can fire.

Why

See the linked issue for the full repro and log evidence: a single transient Bad Gateway from getUpdates triggered the customer-side wedge — Updater.stop() raised a TimedOut from _get_updates_cleanup (logged as "Suppressing error to ensure graceful shutdown"), the surrounding try / except Exception: pass swallowed it, then start_polling() returned without raising but polling never actually consumed any further updates. No "polling resumed" log, no "reconnecting in 10s, attempt 2/10", no fatal-error path — just silence for 11 hours until the container was manually restarted.

The first reconnect attempt looks like recovery in the logs; only the absence of subsequent activity reveals the wedge. The existing safeguards (MAX_NETWORK_RETRIES, fatal-error retryable=True with supervisor restart) are sound but unreachable when polling silently wedges.

Approach

Two-part fix, both inside the existing reconnect abstraction (no PTB-internal coupling, no Application rebuild):

  1. After a successful start_polling() in _handle_polling_network_error, schedule _verify_polling_after_reconnect() as a background task.

  2. The probe waits HEARTBEAT_PROBE_DELAY (60s, comfortably above one healthy long-poll cycle), then verifies Updater.running is still True and Bot.get_me() returns within PROBE_TIMEOUT (10s) via asyncio.wait_for. Either failure feeds back into _handle_polling_network_error so the reconnect ladder advances.

This is a minimal additive layer — no behavior change on the happy path, and on the wedged path the existing MAX_NETWORK_RETRIES ladder eventually escalates to fatal-error so external supervisors (systemd Restart=on-failure) can do their job.
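The two steps above can be sketched as a standalone coroutine. This is not the PR's implementation: the function signature, the `on_failure` and `is_fatal` callables, and the fake application objects are assumptions made so the example runs without python-telegram-bot installed. Only the 60s delay, the 10s timeout, the `Updater.running` check, and the `get_me()` call wrapped in `asyncio.wait_for` come from the description.

```python
import asyncio
from types import SimpleNamespace

HEARTBEAT_PROBE_DELAY = 60.0  # seconds, per the PR description
PROBE_TIMEOUT = 10.0          # seconds, per the PR description


async def verify_polling_after_reconnect(
    app, on_failure, is_fatal=lambda: False,
    delay=HEARTBEAT_PROBE_DELAY, timeout=PROBE_TIMEOUT,
) -> bool:
    """Return True if polling looks alive; otherwise pass an error to on_failure."""
    await asyncio.sleep(delay)
    if is_fatal():
        return False  # adapter already declared fatal; nothing more to do
    if not app.updater.running:
        await on_failure(RuntimeError("Updater not running after reconnect"))
        return False
    try:
        await asyncio.wait_for(app.bot.get_me(), timeout=timeout)
    except Exception as exc:  # includes the timeout raised by wait_for
        await on_failure(exc)
        return False
    return True


async def _demo():
    failures = []

    async def on_failure(exc):
        failures.append(exc)

    async def healthy_get_me():
        return {"ok": True}

    async def hung_get_me():
        await asyncio.sleep(3600)  # simulates a call stuck on a stale connection

    def fake_app(get_me):
        return SimpleNamespace(
            updater=SimpleNamespace(running=True),
            bot=SimpleNamespace(get_me=get_me),
        )

    ok = await verify_polling_after_reconnect(
        fake_app(healthy_get_me), on_failure, delay=0, timeout=0.05)
    bad = await verify_polling_after_reconnect(
        fake_app(hung_get_me), on_failure, delay=0, timeout=0.05)
    return ok, bad, failures


ok, bad, failures = asyncio.run(_demo())
```

In the adapter, `on_failure` would presumably be `_handle_polling_network_error`, so a failed probe advances the ladder. One caveat worth checking: the log message quoted in the issue names a separate `get_updates_request` parameter of `Bot`, so whether `get_me()` exercises the same connection pool as the long poll may depend on how the Application was built.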

Considered alternatives

  • Full Application rebuild on every reconnect. Heavier, more places to break, requires re-binding handlers; this PR is the conservative additive change.
  • Don't swallow Updater.stop() exceptions. Worth doing on its own, but doesn't help in the case where stop() cleans up successfully and start_polling() is the silent failure point.
  • Watermark counter incremented per dispatched update. Doesn't differentiate "wedged" from "legitimately idle" — a healthy bot with no inbound DMs has the same signature as a wedged one.

Bot.get_me() was chosen as the probe because it shares the bot's httpx client (so a wedged pool fails the probe) and doesn't conflict with Updater.getUpdates() (no 409).
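The objection to the watermark alternative can be shown with a toy model. Everything here is hypothetical (`PollingState`, `dispatched_updates`, `watermark_says_stalled`). The sketch only demonstrates that comparing a dispatch counter before and after a wait gives the same answer for an idle bot and a wedged one.

```python
from dataclasses import dataclass


@dataclass
class PollingState:
    dispatched_updates: int = 0   # hypothetical watermark, bumped per dispatched update
    long_poll_alive: bool = True  # ground truth the counter cannot observe


def watermark_says_stalled(before: int, after: int) -> bool:
    """The watermark check: no new dispatched updates means 'stalled'."""
    return after == before


idle = PollingState()                         # healthy, but nobody messaged the bot
wedged = PollingState(long_poll_alive=False)  # long-poll task is stuck
busy = PollingState()                         # healthy and receiving traffic

before = {name: s.dispatched_updates
          for name, s in (("idle", idle), ("wedged", wedged), ("busy", busy))}
busy.dispatched_updates += 3  # only the busy bot hands updates to the dispatcher

verdicts = {
    "idle": watermark_says_stalled(before["idle"], idle.dispatched_updates),
    "wedged": watermark_says_stalled(before["wedged"], wedged.dispatched_updates),
    "busy": watermark_says_stalled(before["busy"], busy.dispatched_updates),
}
```

The idle and wedged cases both come out as stalled, so a watermark alone would trigger needless reconnects on quiet bots. An active probe such as `get_me()` does not depend on inbound traffic.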

Test plan

tests/platforms/test_polling_heartbeat.py (added in this PR) covers:

  • Healthy reconnect path: Updater is running, get_me returns → probe is a no-op.
  • Updater non-running after delay: probe re-enters the reconnect ladder with a synthetic RuntimeError.
  • get_me times out: probe re-enters the reconnect ladder with the timeout exception.
  • get_me raises (NetworkError / ConnectionError): probe re-enters the reconnect ladder with the original exception.
  • Adapter already fatal-error'd: probe bails without further action.

Tests use unittest.mock.AsyncMock for the PTB Application/Updater/Bot and patch asyncio.sleep/asyncio.wait_for so the suite runs in milliseconds.
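A rough idea of how such tests might look, assuming a simplified probe. The `probe` function below is a hypothetical stand-in for `_verify_polling_after_reconnect`, and the helper names are invented for the example. It covers three of the listed cases and patches `asyncio.sleep` so the 60s delay costs nothing.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock, patch


async def probe(app, reenter, delay=60.0, timeout=10.0):
    """Hypothetical stand-in for _verify_polling_after_reconnect."""
    await asyncio.sleep(delay)
    if not app.updater.running:
        await reenter(RuntimeError("updater not running"))
        return
    try:
        await asyncio.wait_for(app.bot.get_me(), timeout=timeout)
    except Exception as exc:
        await reenter(exc)


def make_app(running=True, get_me=None):
    """Build a mocked Application with an Updater and a Bot."""
    app = MagicMock()
    app.updater.running = running
    app.bot.get_me = get_me or AsyncMock(return_value={"id": 1})
    return app


def run_case(app):
    """Run the probe with asyncio.sleep patched out; return the mocks to inspect."""
    reenter = AsyncMock()
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        asyncio.run(probe(app, reenter))
    return reenter, fake_sleep


# Healthy path: updater running and get_me returns.
healthy_reenter, healthy_sleep = run_case(make_app())

# Updater stopped during the delay.
stopped_reenter, _ = run_case(make_app(running=False))

# get_me raises a network-style error.
boom = ConnectionError("boom")
failing_reenter, _ = run_case(make_app(get_me=AsyncMock(side_effect=boom)))
```

Each case then reduces to an assertion on `reenter`: not awaited on the healthy path, awaited with a `RuntimeError` when the updater has stopped, and awaited with the original exception when `get_me` fails.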

End-to-end repro (operator-side, optional): boot the gateway against a transparent proxy that returns 502 Bad Gateway for getUpdates for 30s and then 200; observe the heartbeat probe firing and the reconnect ladder progressing past attempt 1 instead of going silent.

Files changed

  • gateway/platforms/telegram.py — schedules _verify_polling_after_reconnect() after a successful reconnect; defines the new method.
  • tests/platforms/test_polling_heartbeat.py — unit tests for the probe.

Open questions for maintainers

  • Constants HEARTBEAT_PROBE_DELAY and PROBE_TIMEOUT are currently hardcoded; happy to move them next to MAX_NETWORK_RETRIES / BASE_DELAY as module-level constants or env knobs if there's a preference.
  • Whether to also remove the bare except Exception: pass around Updater.stop() in the reconnect path. That's a behavioral change worth its own discussion, kept out of this PR for scope.
  • Whether the probe should be reused outside the reconnect path (e.g. periodic liveness) — that would close the loop on wedges that occur without an antecedent network error.

Changed files

  • gateway/platforms/telegram.py (modified, +55/-0)
  • tests/gateway/test_telegram_network_reconnect.py (modified, +189/-0)

Telegram Updater goes silent forever after a single network blip; reconnect ladder swallows the cleanup TimedOut and never re-fires

Repro

Operator-context: running Hermes inside a Docker container with hermes gateway run as PID 1. No external supervisor; the container's restart policy is no. Pinned to tag v2026.4.16.

Observed in the wild after a transient Telegram API blip ~5 hours into a container's lifetime. Synthetic repro:

  1. Boot the gateway with Telegram polling enabled.
  2. Briefly fault Telegram egress: route the bot's outbound traffic through a transparent proxy that returns 502 Bad Gateway for getUpdates for ~30 seconds, then restore.
  3. Observe the log sequence below.
  4. Wait several minutes. Updater.running is True. Send a DM to the bot. The bot never replies, and no further log lines appear.

Observed log (sanitized)

Times are offsets in minutes from a container-local epoch; <TS> is a placeholder for the base timestamp.

<TS+0:00> WARNING gateway.platforms.telegram: [Telegram] Telegram network error, scheduling reconnect: Bad Gateway
<TS+0:00> WARNING gateway.platforms.telegram: [Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: Bad Gateway
<TS+0:25> ERROR telegram.ext.Updater: Error while calling `get_updates` one more time to mark all fetched updates. Suppressing error to ensure graceful shutdown. When polling for updates is restarted, updates may be fetched again. Please adjust timeouts via `ApplicationBuilder` or the parameter `get_updates_request` of `Bot`.
<TS+0:25> telegram.error.TimedOut: Timed out
[silence — no further log lines for 11 hours, until the container is manually restarted]

Notably absent (and expected if the ladder were progressing):

  • No [Telegram] Telegram polling resumed after network error (attempt 1) (would indicate clean recovery).
  • No [Telegram] Telegram polling reconnect failed: ... followed by reconnecting in 10s, attempt 2/10 (would indicate the ladder advancing).
  • No Telegram polling could not reconnect after 10 network error retries. Restarting gateway. (the fatal-error path, which the ladder is supposed to land on if recovery genuinely fails).

This means start_polling() returned without raising on attempt 1, but the resumed polling task is stuck — likely waiting on a TCP read against an httpx connection pool that holds a stale connection from the failed long-poll, or PTB's Updater is in a state where running is True but the consumer task isn't actually fetching.

Suggested fix shape

In _handle_polling_network_error, after start_polling() returns successfully, schedule a deferred liveness check that runs ~HEARTBEAT_SECONDS later (suggest 60s) and verifies that polling has made progress since the reconnect. Concretely:

  • Capture progress_marker_before = self._polling_progress_marker (an offset / counter incremented by the gateway each time it hands a real update off to the dispatcher).
  • After HEARTBEAT_SECONDS, compare to progress_marker_after.
  • If unchanged AND no new error callback has fired in the interim, treat the reconnect as failed: increment _polling_network_error_count, do a hard rebuild of the Updater (re-instantiate the underlying Application / Bot / httpx client to drop stale connections), and reschedule _handle_polling_network_error with the next attempt.

This stays inside the existing reconnect abstraction (no PTB-internal coupling), turns silent wedging into a recovered-or-fatal outcome, and makes the existing MAX_NETWORK_RETRIES exhaustion path actually reachable when polling is stuck.

A test would cover two scenarios in one repro:

  • 502 then recovered: confirms recovery happens normally and the heartbeat doesn't false-positive on healthy polling.
  • 502 then quietly-wedged-Updater: confirms the heartbeat detects the wedge and the ladder advances, eventually firing the fatal-error path.

Happy to send a PR with the heartbeat probe + tests if maintainers prefer that shape; want to validate the approach before investing in the implementation.

Environment

  • Hermes tag: v2026.4.16
  • Python: 3.11 (slim image)
  • python-telegram-bot: bundled with Hermes installer
  • Runtime: containerized, hermes gateway run as PID 1, no external supervisor.
  • Triggered by: a transient Telegram API 502 / Bad Gateway against getUpdates.

What we've ruled out

  • Token validity: getMe returns 200 throughout the silent window.
  • Webhook conflict: getWebhookInfo shows no webhook and pending_update_count: 0 (i.e. no other consumer is stealing updates; the polling task is just not consuming).
  • Container-level network: outbound curl https://api.telegram.org/bot.../getMe from inside the still-running container returns 200 in <1s during the silent window — egress is healthy, the wedge is at the Updater layer.
  • Container OOM / restart: container shows running, 0 restarts, OOMKilled false.

extent analysis

TL;DR

Implement a deferred liveness check after start_polling() to verify that polling has made progress, and treat the reconnect as failed if no progress is made.

Guidance

  • Identify the root cause of the issue: the try/except block in _handle_polling_network_error swallows the TimedOut exception, and the start_polling() method returns without raising an error, but the underlying polling task is stuck.
  • Implement a liveness check after start_polling() to verify that polling has made progress, such as by comparing the progress_marker_before and progress_marker_after values.
  • If no progress is made, treat the reconnect as failed, increment the _polling_network_error_count, and reschedule _handle_polling_network_error with the next attempt.
  • Consider adding a test to simulate the failure modes and validate the approach.

Example

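import asyncio

# Hypothetical names throughout: _polling_progress_marker, HEARTBEAT_SECONDS and
# _error_callback_count are illustrative, not confirmed attributes of the adapter.

# Capture the progress marker and error-callback count before reconnecting
progress_marker_before = self._polling_progress_marker
errors_before = self._error_callback_count

# Restart polling, then wait one heartbeat interval (the sleep must be awaited)
await self._app.updater.start_polling(...)
await asyncio.sleep(HEARTBEAT_SECONDS)

# No dispatched updates and no new error callback: treat the reconnect as failed
progress_marker_after = self._polling_progress_marker
new_error_callback_fired = self._error_callback_count != errors_before
if progress_marker_after == progress_marker_before and not new_error_callback_fired:
    self._polling_network_error_count += 1
    # Assumes the handler accepts the triggering exception as its argument
    await self._handle_polling_network_error(
        RuntimeError("polling made no progress after reconnect")
    )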

Notes

  • The suggested fix shape stays inside the existing reconnect abstraction and turns silent wedging into a recovered-or-fatal outcome.
  • The approach assumes that the progress_marker is incremented by the gateway each time it hands a real update off to the dispatcher.

Recommendation

Apply the suggested fix shape with a deferred liveness check to verify that polling has made progress after start_polling(). This approach addresses the root cause of
