hermes - 💡(How to fix) Fix Analysis: Telegram reconnect pyramid — 6 layers chasing symptoms, root cause is self-conflict with PTB internal retry



Telegram reconnect: the growing 6-layer retry pyramid — and the foundation that shouldn't exist

Why this exists

I maintain a fork of hermes-agent and spent significant time diagnosing repeated Telegram gateway disconnections. My conclusion runs counter to the direction of recent upstream "fixes": the problem is not that PTB retries too little, but that Hermes stacks a full recovery pyramid on top of PTB's internal retry loop, creating a self-conflict race.

In the past two weeks, upstream merged 6+ Telegram reconnection-related commits. This isn't a complaint — it's an architectural observation: every symptom you're fixing shares a single root cause — Hermes shouldn't be adding retry layers on top of PTB.


The growing pyramid

Before I started, telegram.py already had this recovery chain:

PTB internal network_retry_loop (max_retries=-1, exponential backoff)
  ← _handle_polling_network_error (stop→drain→start_polling→verify)
    ← _verify_polling_after_reconnect (60s probe, re-enters the ladder on detection)
      ← _platform_reconnect_watcher (outer watchdog)
        ← systemd Restart=always (last-resort)

Recent upstream added 4 more layers:

  1. af381ef12 "retry wrapped connect timeouts": yet another try/except retry around connect
  2. f260aa6dc "recover from post-update polling conflict without entering limbo": manual conflict recovery
  3. 5c4b43ced "reset sticky fallback IP on connect failure, retry primary DNS": Hermes-layer DNS intervention
  4. 6be579f62 "preserve can_edit after transient network errors in progress edits": display policy compensating for network instability

Pyramid now:

PTB network_retry_loop (max_retries=-1)
  ← new connect timeout wrapper retry [af381ef12]
    ← _handle_polling_network_error
      ← _verify_polling_after_reconnect
        ← new polling conflict recovery [f260aa6dc]
          ← _platform_reconnect_watcher
            ← systemd

Every commit patches the same file (gateway/platforms/telegram.py), stacking on the same _handle_polling_* architecture.


The internal contradictions

1. af381ef12: "retry wrapped connect timeouts"

PTB's Updater.start_polling() already has a full network_retry_loop:

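# PTB internals, approximately (a paraphrase, not the actual source)
retries = 0
while True:
    try:
        updates = await api.get_updates(...)
        process(updates)
    except NetworkError:
        # A negative max_retries (-1) means "retry forever", so only a
        # non-negative limit can ever let the error propagate.
        if 0 <= max_retries <= retries:
            raise
        await asyncio.sleep(backoff(retries))
        retries += 1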

PTB defaults to max_retries=-1 → infinite retry with exponential backoff. Wrapping another retry on top creates a race:

T1: PTB retry #3 waiting on backoff
T2: Hermes retry wrapper decides "timeout", calls stop_polling()
T3: PTB's retry just restarted polling, receives stop from Hermes
T4: Hermes calls start_polling()
T5: PTB's stale retry also reaches restart
→ Two polling sessions coexist
→ Telegram returns 409 Conflict

This is exactly what I observed in production: disconnection doesn't recover silently — it enters a 409 Conflict → stop→restart → 409 → stop→restart death spiral until manual intervention.
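The overlap in the timeline above can be reproduced in miniature. The following is a minimal sketch, not PTB or Hermes code: `FakeTelegram` and `simulate` are hypothetical names, and the fake server only models the one rule that matters here, namely that a second getUpdates long poll opened while another is still in flight is answered with 409 Conflict.

```python
import asyncio


class FakeTelegram:
    """Hypothetical stand-in for the Bot API: one long poll at a time."""

    def __init__(self) -> None:
        self.active_pollers = 0
        self.conflicts = 0

    async def get_updates(self, owner: str) -> str:
        # A second poller arriving while one is in flight gets a conflict.
        if self.active_pollers >= 1:
            self.conflicts += 1
            return f"{owner}: 409 Conflict"
        self.active_pollers += 1
        try:
            await asyncio.sleep(0.05)  # simulated long poll
            return f"{owner}: ok"
        finally:
            self.active_pollers -= 1


async def simulate(overlap: bool) -> list[str]:
    api = FakeTelegram()
    if overlap:
        # The library's own retry and an outer restart poll concurrently.
        results = await asyncio.gather(
            api.get_updates("ptb-retry"),
            api.get_updates("hermes-restart"),
        )
        return list(results)
    # A single owner polling sequentially never conflicts with itself.
    return [
        await api.get_updates("ptb-retry"),
        await api.get_updates("ptb-retry"),
    ]


if __name__ == "__main__":
    print(asyncio.run(simulate(overlap=True)))
    print(asyncio.run(simulate(overlap=False)))
```

The only variable between the two runs is whether a second party starts polling, which is the point of the argument: the conflict is produced by the overlap, not by the network error that preceded it.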

2. f260aa6dc: "recover from post-update polling conflict without entering limbo"

This commit acknowledges that conflicts do happen, then adds logic to recover from them. But if the conflict never happened, none of these 93 lines would be needed.

Proposal: every time conflict fires, ask "who triggered a concurrent polling session?" — not "how do we gracefully recover from conflict?"
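One way to answer "who triggered it?" is to record which code path currently owns the polling session and log any second start attempt together with its call stack. This is a rough sketch under assumptions: `PollingOwnerLog` and the caller labels are hypothetical and are not part of Hermes or PTB.

```python
import logging
import traceback
from typing import List, Optional, Tuple

log = logging.getLogger("telegram.polling.owner")


class PollingOwnerLog:
    """Hypothetical helper: tracks which code path owns the polling session."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None
        self.violations: List[Tuple[str, str]] = []

    def start(self, caller: str) -> bool:
        """Return True if the caller acquired ownership, False on overlap."""
        if self.owner is not None:
            # Someone already polls: record both parties and where we are.
            self.violations.append((self.owner, caller))
            log.error(
                "concurrent start_polling by %r while %r is active\n%s",
                caller,
                self.owner,
                "".join(traceback.format_stack(limit=5)),
            )
            return False
        self.owner = caller
        return True

    def stop(self, caller: str) -> None:
        log.info("stop_polling by %r (owner was %r)", caller, self.owner)
        self.owner = None
```

Calling `start("ptb-retry")` and then `start("hermes-wrapper")` without a `stop` in between leaves one `("ptb-retry", "hermes-wrapper")` entry in `violations`, which names both sides of the race instead of only its 409 symptom.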

3. 5c4b43ced: "reset sticky fallback IP on connect failure, retry primary DNS"

This one is worth keeping: it fixes a lower-layer HTTP transport issue (the fallback IP staying sticky after a connect failure). It lives in telegram_network.py, orthogonal to the recovery pyramid.

4. 6be579f62: "preserve can_edit after transient network errors in progress edits"

A symptom patch again. If Hermes didn't create these "transient network errors" (i.e. manual stop→start causing brief unavailability), the can_edit false-positive wouldn't trigger. The logic is correct but it's one swing in an endless whack-a-mole game.


The correct fix

I landed on this commit:

fix(telegram): delegate ALL network error recovery to PTB, remove self-conflict pyramid

Core changes:

  1. Delete _verify_polling_after_reconnect (~58 lines) — unnecessary; PTB's retry loop auto-recovers after disconnection
  2. Shrink _handle_polling_network_error from 114 lines to 10 lines: a single log line, no more stop→drain→start_polling→probe
  3. Remove _polling_network_error_count, _pending_probe_task and related fields
  4. Conflict handler counts only, never restarts polling — 3 strikes/60s → fatal, and fatal is marked retryable=False to prevent _platform_reconnect_watcher from auto-restarting the adapter
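The "count only, never restart" rule in item 4 could look roughly like the sliding-window counter below. This is a sketch under assumptions, not the code from the fork: `ConflictStrikeCounter` and `FatalAdapterError` are hypothetical names, and the injectable `clock` parameter exists only to make the window testable.

```python
import time
from collections import deque
from typing import Callable, Deque


class FatalAdapterError(Exception):
    """Hypothetical fatal error carrying the retryable flag from item 4."""

    def __init__(self, message: str, retryable: bool) -> None:
        super().__init__(message)
        self.retryable = retryable


class ConflictStrikeCounter:
    """Counts 409 conflicts in a sliding window and never restarts polling."""

    def __init__(
        self,
        max_strikes: int = 3,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_strikes = max_strikes
        self._window = window_seconds
        self._clock = clock
        self._strikes: Deque[float] = deque()

    def record_conflict(self) -> int:
        """Record one conflict; raise once the window holds max_strikes."""
        now = self._clock()
        self._strikes.append(now)
        # Forget strikes that have aged out of the window.
        while self._strikes and now - self._strikes[0] > self._window:
            self._strikes.popleft()
        if len(self._strikes) >= self._max_strikes:
            # retryable=False tells an outer watcher not to restart the adapter.
            raise FatalAdapterError(
                f"{len(self._strikes)} polling conflicts within {self._window:.0f}s",
                retryable=False,
            )
        return len(self._strikes)
```

The handler has no path that calls stop or start, so it cannot introduce a second polling session of its own; the only thing it can do is escalate.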

Net: 429 lines removed. After deployment, disconnection recovery is stable with zero 409 loops.

The hard part is trust: accepting "doing nothing is the safest approach" goes against intuition. But PTB is designed this way — network_retry_loop is meant to absorb momentary network glitches. Hermes's intervention breaks it.


Recommendations

  1. Freeze any new retry/reconnect/conflict-recovery code in telegram.py. These are symptom fixes, not root-cause fixes.
  2. Seriously consider dismantling the entire recovery pyramid. With 5-6 layers stacked, every new layer is a potential new race condition source.
  3. If the team decides not to fully remove it, at minimum add end-to-end trace logging across all layers. Currently each layer independently decides whether to recover, with no correlated timeline across layers — diagnosis is still guesswork.
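For recommendation 3, a correlated timeline could be built by tagging every log record with one incident ID shared across layers. This is a minimal sketch using only the standard library; the `incident_id` variable, `begin_incident`, `format_event`, and the layer names are all hypothetical.

```python
import contextvars
import logging
import uuid

# One ID per disconnection incident, visible to every layer that logs.
incident_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "incident_id", default="-"
)


def begin_incident() -> str:
    """Start a new incident; call this where the first error is observed."""
    new_id = uuid.uuid4().hex[:8]
    incident_id.set(new_id)
    return new_id


def format_event(layer: str, event: str) -> str:
    """Render one timeline entry that can be grepped by incident ID."""
    return f"incident={incident_id.get()} layer={layer} event={event}"


class IncidentFilter(logging.Filter):
    """Attach the current incident ID to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.incident = incident_id.get()
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(IncidentFilter())
    handler.setFormatter(
        logging.Formatter("%(asctime)s incident=%(incident)s %(name)s %(message)s")
    )
    logging.basicConfig(level=logging.INFO, handlers=[handler])
    begin_incident()
    logging.getLogger("layer.network_error").info("network error observed")
    logging.getLogger("layer.watcher").info("watcher considered restart")
```

Because asyncio tasks copy the current context when they are created, a probe or watcher task spawned after `begin_incident()` inherits the same ID, so one grep reconstructs which layer acted and in what order.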

Appendix: commit comparison

| Commit | Delta | Relation to root cause |
| --- | --- | --- |
| 8804d3364 (this fork) | -429 lines, remove entire recovery pyramid | root cause fix |
| af381ef12 | +99 lines, wrap connect timeout with retry | ❌ symptom patch, creates more races |
| f260aa6dc | +109 lines, manual conflict recovery | ❌ symptom patch, conflict shouldn't happen |
| 5c4b43ced | +10 lines, sticky IP reset | ✅ valuable PCR at transport layer |
| 6be579f62 | +221 lines, keep can_edit on transient errors | ⚠️ symptom patch, but logically sound |
| 1b3c51bcc | +11 lines, preserve edit after flood control | ⚠️ symptom patch, but logically sound |
| e2a1a2bf1 | +33 lines, pre-mark resume_pending before drain | ✅ session layer, orthogonal |
