hermes - 💡(How to fix) Fix Persistent Telegram 409 polling conflicts caused by PTB network_retry_loop racing with _handle_polling_conflict

StepCodex · 2026-05-22T00:39:55Z


Summary

Telegram polling enters a persistent ~31-second cycle of 409 Conflict → retry → resume → Conflict. The existing retry mechanism (_handle_polling_conflict) keeps the gateway alive but polling is interrupted every ~30 seconds, causing missed updates during the 20-second retry window.

Root Cause

Two independent retry paths race against each other:

1. PTB's internal `network_retry_loop` (`max_retries=-1`)

PTB v22.x wraps the polling getUpdates call in a network_retry_loop with max_retries=-1 (infinite retries). When a TelegramError (including 409 Conflict) occurs, the loop:

except TelegramError as telegram_exc:
    if on_err_cb:
        on_err_cb(telegram_exc)   # calls our handler
    if check_max_retries_and_log(retries):
        raise
    cur_interval = 1 if cur_interval == 0 else min(30, 1.5 * cur_interval)
    await asyncio.sleep(cur_interval)   # PTB retries on its own!

After calling our error callback, PTB independently retries getUpdates with exponential backoff (1s → 1.5s → ... → 30s max).
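To make the backoff schedule concrete, the following minimal sketch replays only the interval update rule from the snippet above. The starting value of 0 is an assumption (the quoted code does not show how `cur_interval` is initialised), and `backoff_intervals` is a hypothetical helper name, not part of PTB.

```python
def backoff_intervals(n: int) -> list[float]:
    """Return the first n sleep intervals produced by the quoted update rule."""
    intervals = []
    cur_interval = 0.0  # assumed starting value; not shown in the quoted snippet
    for _ in range(n):
        # Same expression as in the quoted except-block.
        cur_interval = 1 if cur_interval == 0 else min(30, 1.5 * cur_interval)
        intervals.append(cur_interval)
    return intervals


schedule = backoff_intervals(11)
print(schedule)
```

Under that assumption the interval grows by a factor of 1.5 per failure and reaches the 30-second cap on the tenth consecutive error.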

2. Our `_handle_polling_conflict` handler

Our handler creates a task (via loop.create_task), which:

1. Stops the updater (await updater.stop()), which sets running=False and awaits the polling task
2. Sleeps 20 seconds
3. Drains connection pools
4. Restarts polling via start_polling()
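The four steps can be sketched as follows. This is not the actual hermes handler: `FakeUpdater`, `drain_connection_pools` and `handle_polling_conflict` are hypothetical stand-ins so the sequence can run without a Telegram connection, and the 20-second delay is a parameter so the example finishes quickly.

```python
import asyncio


class FakeUpdater:
    """Stand-in for the real updater; records calls instead of contacting Telegram."""

    def __init__(self) -> None:
        self.running = True
        self.calls: list[str] = []

    async def stop(self) -> None:
        self.running = False
        self.calls.append("stop")

    async def start_polling(self) -> None:
        self.running = True
        self.calls.append("start_polling")


async def drain_connection_pools(updater: FakeUpdater) -> None:
    # Hypothetical placeholder: the real drain logic is not shown in this report.
    updater.calls.append("drain")


async def handle_polling_conflict(updater: FakeUpdater, retry_delay: float = 20.0) -> None:
    await updater.stop()                   # step 1: stop the updater
    await asyncio.sleep(retry_delay)       # step 2: wait before retrying
    await drain_connection_pools(updater)  # step 3: drop stale connections
    await updater.start_polling()          # step 4: resume polling


updater = FakeUpdater()
asyncio.run(handle_polling_conflict(updater, retry_delay=0.01))
print(updater.calls)
```

No updates are fetched between step 1 and step 4, which is the window in which the report says updates are missed.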

The Race

Since PTB's retry loop and our handler run as concurrent asyncio tasks:

1. Polling gets 409 Conflict
2. network_retry_loop calls on_err_cb → creates our handler task (returns immediately)
3. network_retry_loop sleeps for cur_interval (1s on first retry)
4. Our handler task runs, calls updater.stop() which sets running=False and sets the stop event
5. network_retry_loop wakes from sleep, checks effective_is_running() → False → exits
6. After 20s, our handler calls start_polling() → new polling session starts
7. New polling runs for ~10s (PTB default timeout=10s for getUpdates long poll)
8. On the next getUpdates cycle, PTB's fresh request briefly overlaps with a stale httpx connection → 409 again
9. Cycle repeats every ~31 seconds (20s retry sleep + ~10s polling + ~1s overlap)
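The ordering in steps 1 to 6 can be reproduced with a toy asyncio model. Everything here is a simplified assumption rather than PTB code: `state["running"]` stands in for the updater's running flag, `retry_loop` stands in for `network_retry_loop`, and the sleeps are scaled down from 1 s and 20 s.

```python
import asyncio

events: list[str] = []
state = {"running": True}


async def handler(retry_delay: float) -> None:
    state["running"] = False          # stands in for updater.stop()
    events.append("handler: stopped updater")
    await asyncio.sleep(retry_delay)  # 20 s in the report, shortened here
    state["running"] = True           # stands in for start_polling()
    events.append("handler: restarted polling")


def on_err_cb(retry_delay: float) -> asyncio.Task:
    # Synchronous callback: it only schedules the handler and returns at once.
    return asyncio.get_running_loop().create_task(handler(retry_delay))


async def retry_loop(first_interval: float, retry_delay: float) -> asyncio.Task:
    events.append("loop: 409 Conflict")
    task = on_err_cb(retry_delay)
    events.append("loop: callback returned")
    await asyncio.sleep(first_interval)  # the handler task gets to run here
    if not state["running"]:
        events.append("loop: not running, exit")
    return task


async def main() -> None:
    task = await retry_loop(first_interval=0.01, retry_delay=0.05)
    await task


asyncio.run(main())
print("\n".join(events))
```

The handler only starts once the retry loop yields at its sleep, which is why the callback returns before the updater is stopped.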

Timing

From production logs (every ~30.7s):

18:57:26 — resume
18:57:56 — conflict  (30.7s later)
18:58:17 — resume    (21.1s later)
18:58:48 — conflict  (30.7s later)
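As a quick check, the gaps can be recomputed from the four timestamps above. The log lines carry whole seconds only, so the results can differ from the annotated 30.7 s and 21.1 s by up to a second.

```python
from datetime import datetime

log = [
    ("18:57:26", "resume"),
    ("18:57:56", "conflict"),
    ("18:58:17", "resume"),
    ("18:58:48", "conflict"),
]

times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in log]

# Seconds between each consecutive pair of log entries.
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

# Seconds between the two conflict entries.
conflict_times = [t for t, (_, kind) in zip(times, log) if kind == "conflict"]
conflict_period = int((conflict_times[1] - conflict_times[0]).total_seconds())

print(gaps, conflict_period)
```

On these four entries, polling runs for roughly 30 seconds between a resume and the next conflict, and the two conflicts are roughly 52 seconds apart. The sample is too small to say more than that.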

Contributing Factors

- read_timeout=20s: The HTTPX client's read_timeout (configured via HERMES_TELEGRAM_HTTP_READ_TIMEOUT) is set to 20s, which is close to the server-side long poll timeout of 10s. Tight timeouts increase the chance of stale connections overlapping with new requests.
- No timeout passed to start_polling(): PTB defaults to timeout=timedelta(seconds=10) for the getUpdates long poll parameter. This means Telegram holds the request for only 10 seconds, creating frequent polling cycles.
- poll_interval=0: PTB polls immediately after each response, so there's no gap between cycles.

Proposed Fix

Two complementary changes (A and B), plus one alternative (C) that is not recommended:

A. Increase server-side long poll timeout

Pass timeout=timedelta(seconds=60) (or similar) to start_polling() at both call sites:

Initial startup:

await self._app.updater.start_polling(
    allowed_updates=Update.ALL_TYPES,
    drop_pending_updates=True,
    error_callback=_polling_error_callback,
    timeout=timedelta(seconds=60),
)

Conflict retry:

await self._app.updater.start_polling(
    allowed_updates=Update.ALL_TYPES,
    drop_pending_updates=False,
    error_callback=self._polling_error_callback_ref,
    timeout=timedelta(seconds=60),
)

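This lengthens each long-poll cycle from ~10s to ~60s, so getUpdates requests are issued about six times less often and there are correspondingly fewer chances for a fresh request to overlap with a stale connection.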
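Because the same timeout has to appear in both calls, one option is to build the shared keyword arguments in a single place. The sketch below is an assumption about how that could look, not existing hermes code: `LONG_POLL_TIMEOUT` and `polling_kwargs` are hypothetical names, and `allowed_updates` is left out so the example runs without importing `telegram`.

```python
from datetime import timedelta
from typing import Any, Callable

# Long-poll timeout proposed in section A.
LONG_POLL_TIMEOUT = timedelta(seconds=60)


def polling_kwargs(
    *,
    drop_pending_updates: bool,
    error_callback: Callable[[Exception], None],
) -> dict[str, Any]:
    """Build the keyword arguments shared by the startup and retry calls."""
    return {
        "drop_pending_updates": drop_pending_updates,
        "error_callback": error_callback,
        "timeout": LONG_POLL_TIMEOUT,
    }


def _noop_callback(exc: Exception) -> None:
    """Placeholder for the real polling error callback."""


startup = polling_kwargs(drop_pending_updates=True, error_callback=_noop_callback)
retry = polling_kwargs(drop_pending_updates=False, error_callback=_noop_callback)
print(startup["timeout"], retry["timeout"])
```

Each call site would then pass `allowed_updates=Update.ALL_TYPES` plus `**polling_kwargs(...)` to `start_polling()`, so the two timeouts cannot drift apart.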

B. Increase HTTP read_timeout

Set HERMES_TELEGRAM_HTTP_READ_TIMEOUT=120 (or pass in request_kwargs) so the HTTP client doesn't time out while waiting for a long-held getUpdates response.
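A small guard can catch a read timeout that is shorter than the long poll it is supposed to outlast. This is a sketch under stated assumptions: `resolve_read_timeout` and the 10-second margin are hypothetical, the 120-second default is the value proposed here rather than the current default, and how hermes hands the result to its HTTP client is not shown.

```python
from typing import Mapping

LONG_POLL_TIMEOUT_S = 60.0    # long-poll timeout proposed in section A
READ_TIMEOUT_MARGIN_S = 10.0  # hypothetical safety margin


def resolve_read_timeout(env: Mapping[str, str], default: float = 120.0) -> float:
    """Read the timeout from the environment and reject values that are too small."""
    raw = env.get("HERMES_TELEGRAM_HTTP_READ_TIMEOUT")
    value = float(raw) if raw else default
    minimum = LONG_POLL_TIMEOUT_S + READ_TIMEOUT_MARGIN_S
    if value < minimum:
        raise ValueError(
            f"read timeout {value}s is below the {minimum}s needed "
            f"for a {LONG_POLL_TIMEOUT_S}s long poll"
        )
    return value


read_timeout = resolve_read_timeout({"HERMES_TELEGRAM_HTTP_READ_TIMEOUT": "120"})
print(read_timeout)
```

In a real process the mapping would be `os.environ`. With the current 20-second value and a 60-second long poll, the guard raises instead of letting the client time out mid-poll.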

C. (Alternative) Disarm PTB's internal retry

Instead of racing, we could prevent PTB's network_retry_loop from retrying by ensuring updater.stop() is called synchronously before PTB's retry logic executes. The current loop.create_task pattern means our handler runs after on_err_cb returns, giving PTB time to schedule its own retry. Using await instead of create_task in the error callback would block PTB's loop until our handler finishes. However, the snippet above shows on_err_cb being invoked as a plain synchronous call, so the callback cannot simply await the handler, and blocking inside the callback could cause other issues.

The timeout approach (A+B) is safer and sufficient.

Environment

- hermes-agent version: 0.14.0 (latest main)
- PTB: python-telegram-bot 22.7
- httpx: 0.28.1
- read_timeout: 20s (default)
- getUpdates timeout: 10s (PTB default)
- poll_interval: 0 (PTB default)
