hermes: Fix persistent Telegram 409 polling conflicts caused by PTB network_retry_loop racing with _handle_polling_conflict

Summary

Telegram polling enters a persistent ~31-second cycle of 409 Conflict → retry → resume → Conflict. The existing retry mechanism (_handle_polling_conflict) keeps the gateway alive, but polling is interrupted every ~30 seconds, and updates can be missed during each 20-second retry window.

Root Cause

Two independent retry paths race against each other:

1. PTB's internal network_retry_loop (max_retries=-1)

PTB v22.x wraps the polling getUpdates call in a network_retry_loop with max_retries=-1 (infinite retries). When a TelegramError (including 409 Conflict) occurs, the loop runs this handler:

except TelegramError as telegram_exc:
    if on_err_cb:
        on_err_cb(telegram_exc)   # calls our handler
    if check_max_retries_and_log(retries):
        raise
    cur_interval = 1 if cur_interval == 0 else min(30, 1.5 * cur_interval)
    await asyncio.sleep(cur_interval)   # PTB retries on its own!

After calling our error callback, PTB independently retries getUpdates with exponential backoff (1s → 1.5s → ... → 30s max).
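To make the backoff schedule concrete, here is a minimal sketch that replays only the interval update rule from the excerpt above. The function name backoff_intervals and its parameters are illustrative and not part of PTB.

```python
from itertools import islice
from typing import Iterator


def backoff_intervals(cap: float = 30.0, factor: float = 1.5) -> Iterator[float]:
    """Yield successive sleep intervals using the update rule quoted above."""
    cur_interval = 0.0
    while True:
        # Same rule as the excerpt: start at 1s, then grow by 1.5x up to the cap.
        cur_interval = 1.0 if cur_interval == 0 else min(cap, factor * cur_interval)
        yield cur_interval


first_ten = list(islice(backoff_intervals(), 10))
print(first_ten)
```

Under this rule the ninth interval is about 25.6s and the tenth is the first to hit the 30s cap.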

2. Our _handle_polling_conflict handler

Our error callback schedules the handler as a task (via loop.create_task). The task:

  1. Stops the updater (await updater.stop()) — sets running=False, awaits the polling task
  2. Sleeps 20 seconds
  3. Drains connection pools
  4. Restarts polling via start_polling()

The Race

Because PTB's retry loop and our handler run as concurrent asyncio tasks, the following sequence occurs:

  1. Polling gets 409 Conflict
  2. network_retry_loop calls on_err_cb → creates our handler task (returns immediately)
  3. network_retry_loop sleeps for cur_interval (1s on first retry)
  4. Our handler task runs, calls updater.stop() which sets running=False and sets the stop event
  5. network_retry_loop wakes from sleep, checks effective_is_running() → False → exits
  6. After 20s, our handler calls start_polling() → new polling session starts
  7. New polling runs for ~10s (PTB default timeout=10s for getUpdates long poll)
  8. On the next getUpdates cycle, PTB's fresh request briefly overlaps with a stale httpx connection → 409 again
  9. Cycle repeats every ~31 seconds (20s retry sleep + ~10s polling + ~1s overlap)
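Steps 2 to 5 above can be reproduced with a toy asyncio simulation. This is a sketch under stated assumptions: FakeUpdater, retry_loop, and conflict_handler are hypothetical stand-ins and not PTB or hermes code, and the sleep is shortened to 10ms. It shows only the ordering: the handler task starts after the synchronous callback returns and the loop has gone to sleep.

```python
import asyncio
from typing import Callable, List


class FakeUpdater:
    """Hypothetical stand-in for the updater; not PTB's real class."""

    def __init__(self) -> None:
        self.running = True
        self.events: List[str] = []


async def retry_loop(updater: FakeUpdater, on_err_cb: Callable[[], None], sleep_s: float) -> None:
    """Toy retry loop: report the error, sleep, then re-check the running flag."""
    updater.events.append("loop: 409 Conflict")
    on_err_cb()  # synchronous callback, returns immediately
    updater.events.append("loop: callback returned, sleeping")
    await asyncio.sleep(sleep_s)  # yields control, so the handler task can run
    if not updater.running:
        updater.events.append("loop: woke up, not running, exiting")
        return
    updater.events.append("loop: woke up, retrying getUpdates")


async def conflict_handler(updater: FakeUpdater) -> None:
    """Toy handler: only the 'stop' step, which flips the running flag."""
    updater.events.append("handler: started, stopping updater")
    updater.running = False


async def simulate() -> List[str]:
    updater = FakeUpdater()
    tasks: List[asyncio.Task] = []

    def on_err_cb() -> None:
        # Mirrors the create_task pattern: schedule the handler, do not await it.
        tasks.append(asyncio.get_running_loop().create_task(conflict_handler(updater)))
        updater.events.append("callback: handler task scheduled")

    await retry_loop(updater, on_err_cb, sleep_s=0.01)
    await asyncio.gather(*tasks)
    return updater.events


events = asyncio.run(simulate())
for line in events:
    print(line)
```

The simulation does not model the 20-second sleep, the pool drain, or the restart, so it illustrates the scheduling order only, not the recurring 409.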

Timing

From production logs (each conflict arrives ~30.7s after the preceding resume):

18:57:26 — resume
18:57:56 — conflict  (30.7s later)
18:58:17 — resume    (21.1s later)
18:58:48 — conflict  (30.7s later)
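The gaps can be recomputed from the timestamps with a short script. This is a sketch that uses only the four log entries above. Because the printed timestamps are truncated to whole seconds, the computed gaps (30s, 21s, 31s) differ slightly from the sub-second values annotated in the log.

```python
from datetime import datetime

# The four log entries quoted above, as (timestamp, event) pairs.
LOG_EVENTS = [
    ("18:57:26", "resume"),
    ("18:57:56", "conflict"),
    ("18:58:17", "resume"),
    ("18:58:48", "conflict"),
]

parsed = [(datetime.strptime(stamp, "%H:%M:%S"), event) for stamp, event in LOG_EVENTS]

# Gap in seconds between each event and the one before it.
gaps = [
    (event, (current - previous).total_seconds())
    for (previous, _), (current, event) in zip(parsed, parsed[1:])
]

for event, seconds in gaps:
    print(f"{event}: {seconds:.0f}s after previous event")
```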

Contributing Factors

  • read_timeout=20s — The HTTPX client's read_timeout (configured via HERMES_TELEGRAM_HTTP_READ_TIMEOUT) is set to 20s, which is close to the server-side long poll timeout of 10s. Tight timeouts increase the chance of stale connections overlapping with new requests.
  • No timeout passed to start_polling() — PTB defaults to timeout=timedelta(seconds=10) for the getUpdates long poll parameter. This means Telegram holds the request for only 10 seconds, creating frequent polling cycles.
  • poll_interval=0 — PTB polls immediately after each response, so there's no gap between cycles.

Proposed Fix

Two complementary changes (A and B), plus an alternative (C):

A. Increase server-side long poll timeout

Pass timeout=timedelta(seconds=60) (or a similar value) to start_polling() in both calls:

  • Initial startup:
    await self._app.updater.start_polling(
        allowed_updates=Update.ALL_TYPES,
        drop_pending_updates=True,
        error_callback=_polling_error_callback,
        timeout=timedelta(seconds=60),
    )
  • Conflict retry:
    await self._app.updater.start_polling(
        allowed_updates=Update.ALL_TYPES,
        drop_pending_updates=False,
        error_callback=self._polling_error_callback_ref,
        timeout=timedelta(seconds=60),
    )

This reduces the polling frequency from one getUpdates cycle every ~10s to one every ~60s, which gives far fewer opportunities for a new request to overlap with a stale connection.
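One way to keep the two call sites from drifting apart is to build their keyword arguments in one place. The helper below is a hypothetical sketch: build_polling_kwargs and LONG_POLL_TIMEOUT are illustrative names that do not exist in hermes or PTB, and the placeholder values stand in for Update.ALL_TYPES and the real callbacks.

```python
from datetime import timedelta
from typing import Any, Callable, Dict

# Assumed value taken from the proposal above; tune as needed.
LONG_POLL_TIMEOUT = timedelta(seconds=60)


def build_polling_kwargs(
    *,
    allowed_updates: Any,
    error_callback: Callable[[Exception], None],
    drop_pending_updates: bool,
) -> Dict[str, Any]:
    """Return start_polling keyword arguments with a shared long-poll timeout."""
    return {
        "allowed_updates": allowed_updates,
        "drop_pending_updates": drop_pending_updates,
        "error_callback": error_callback,
        "timeout": LONG_POLL_TIMEOUT,
    }


def _placeholder_callback(exc: Exception) -> None:
    """Stand-in for the real polling error callback."""


# Placeholder allowed_updates; the real calls pass Update.ALL_TYPES.
startup_kwargs = build_polling_kwargs(
    allowed_updates=["message"],
    error_callback=_placeholder_callback,
    drop_pending_updates=True,
)
retry_kwargs = build_polling_kwargs(
    allowed_updates=["message"],
    error_callback=_placeholder_callback,
    drop_pending_updates=False,
)
```

Each call site would then reduce to await self._app.updater.start_polling(**startup_kwargs) or the retry equivalent, so only drop_pending_updates and the callback differ.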

B. Increase HTTP read_timeout

Set HERMES_TELEGRAM_HTTP_READ_TIMEOUT=120 (or pass in request_kwargs) so the HTTP client doesn't time out while waiting for a long-held getUpdates response.
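The relationship between changes A and B can be checked at startup. The sketch below is an assumption-laden illustration: how hermes actually parses HERMES_TELEGRAM_HTTP_READ_TIMEOUT is not shown in this report, resolve_read_timeout and read_timeout_covers_long_poll are hypothetical helpers, and the 10-second margin is an arbitrary choice. How PTB combines the two timeouts internally is also outside this report, so treat the check as conservative.

```python
import os
from typing import Mapping, Optional

DEFAULT_READ_TIMEOUT_S = 20.0  # default reported in the Environment section
ENV_VAR = "HERMES_TELEGRAM_HTTP_READ_TIMEOUT"


def resolve_read_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the timeout from the environment, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None or not raw.strip():
        return DEFAULT_READ_TIMEOUT_S
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_READ_TIMEOUT_S
    # Reject zero or negative values rather than passing them on.
    return value if value > 0 else DEFAULT_READ_TIMEOUT_S


def read_timeout_covers_long_poll(
    read_timeout_s: float, long_poll_s: float, margin_s: float = 10.0
) -> bool:
    """True if the read timeout outlasts the long poll by at least margin_s."""
    return read_timeout_s >= long_poll_s + margin_s


proposed = resolve_read_timeout({ENV_VAR: "120"})
print(read_timeout_covers_long_poll(proposed, long_poll_s=60.0))
print(read_timeout_covers_long_poll(DEFAULT_READ_TIMEOUT_S, long_poll_s=60.0))
```

With the proposed values (120s read timeout, 60s long poll) the check passes. Applying change A alone with the 20s default would fail it, which is why the two changes are paired.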

C. (Alternative) Disarm PTB's internal retry

Instead of racing, we could prevent PTB's network_retry_loop from retrying by ensuring updater.stop() is called before PTB's retry logic executes. The current loop.create_task pattern means our handler runs only after on_err_cb returns, which gives PTB time to schedule its own retry. Awaiting the handler in the error callback instead of using create_task would block PTB's loop until our handler finishes, but PTB's documentation requires error_callback to be a regular (non-coroutine) function, so it cannot await directly, and blocking the callback could cause other issues.

The timeout approach (A+B) is safer and sufficient.

Environment

  • hermes-agent version: 0.14.0 (latest main)
  • PTB: python-telegram-bot 22.7
  • httpx: 0.28.1
  • read_timeout: 20s (default)
  • getUpdates timeout: 10s (PTB default)
  • poll_interval: 0 (PTB default)
