hermes - 💡(How to fix) Fix Analysis: Telegram reconnect pyramid — 6 layers chasing symptoms, root cause is self-conflict with PTB internal retry

hermes 2026-05-23 08:31:40

Root Cause

In the past two weeks, upstream merged 6+ Telegram reconnection-related commits. This isn't a complaint — it's an architectural observation: every symptom you're fixing shares a single root cause — Hermes shouldn't be adding retry layers on top of PTB.

Fix Action

Fix / Workaround

With the exception of 5c4b43ced (which lives in telegram_network.py), these commits all patch the same file (gateway/platforms/telegram.py), stacking on the same _handle_polling_* architecture.

6be579f62 (preserve can_edit after transient network errors) is another symptom patch. If Hermes didn't create these "transient network errors" itself (a manual stop→start causes brief unavailability), the can_edit false positive wouldn't trigger. The logic is correct, but it's one swing in an endless whack-a-mole game.

Telegram reconnect: the growing 6-layer retry pyramid — and the foundation that shouldn't exist

Why this exists

I maintain a fork of hermes-agent and spent significant time diagnosing repeated Telegram gateway disconnections. My conclusion runs counter to the direction of recent upstream "fixes": the problem is not that PTB retries too little, but that Hermes stacks a full recovery pyramid on top of PTB's internal retry loop, creating a self-conflict race.

The growing pyramid

Before I started, telegram.py already had this recovery chain:

PTB internal network_retry_loop (max_retries=-1, exponential backoff)
  ← _handle_polling_network_error (stop→drain→start_polling→verify)
    ← _verify_polling_after_reconnect (60s probe, re-enters the ladder on detection)
      ← _platform_reconnect_watcher (outer watchdog)
        ← systemd Restart=always (last-resort)

Recent upstream commits added four more changes on top:

af381ef12 — retry wrapped connect timeouts: yet another try/except retry around connect
f260aa6dc — recover from post-update polling conflict without entering limbo: manual conflict recovery
5c4b43ced — reset sticky fallback IP on connect failure, retry primary DNS: Hermes-layer DNS intervention
6be579f62 — preserve can_edit after transient network errors in progress edits: display policy compensating for network instability

Pyramid now:

PTB network_retry_loop (max_retries=-1)
  ← new connect timeout wrapper retry [af381ef12]
    ← _handle_polling_network_error
      ← _verify_polling_after_reconnect
        ← new polling conflict recovery [f260aa6dc]
          ← _platform_reconnect_watcher
            ← systemd

With the exception of 5c4b43ced (which lives in telegram_network.py), these commits all patch the same file (gateway/platforms/telegram.py), stacking on the same _handle_polling_* architecture.

The internal contradictions

1. af381ef12: "retry wrapped connect timeouts"

PTB's Updater.start_polling() already has a full network_retry_loop:

# PTB internals, approximately
retries = 0
while True:
    try:
        updates = await api.get_updates(...)
        process(updates)
    except NetworkError:
        # A negative max_retries (the -1 default here) means retry forever
        if 0 <= max_retries <= retries:
            raise
        await asyncio.sleep(backoff(retries))
        retries += 1
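
For readers who want to run the loop, here is a minimal self-contained sketch of the same retry shape. It is not PTB's code: `NetworkError`, `retry_loop`, `make_flaky`, and the delay values are stand-ins invented for this illustration, and the only behavior taken from the text above is that a negative `max_retries` means "retry forever" with exponential backoff.

```python
import asyncio


class NetworkError(Exception):
    """Stand-in for a transient network failure (hypothetical, not PTB's class)."""


async def retry_loop(action, *, max_retries=-1, base_delay=0.01, max_delay=0.1):
    """Await `action` until it succeeds; a negative max_retries retries forever."""
    retries = 0
    while True:
        try:
            return await action()
        except NetworkError:
            # Give up only when a non-negative limit has been reached.
            if 0 <= max_retries <= retries:
                raise
            # Exponential backoff, capped so the wait never grows unbounded.
            await asyncio.sleep(min(base_delay * 2 ** retries, max_delay))
            retries += 1


def make_flaky(failures):
    """Build an action that fails `failures` times, then returns its call count."""
    state = {"calls": 0}

    async def action():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise NetworkError("simulated drop")
        return state["calls"]

    return action
```

With the default `max_retries=-1`, `asyncio.run(retry_loop(make_flaky(3)))` absorbs the three simulated drops and succeeds on the fourth call without any outside help, which is the property the rest of this issue relies on.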

PTB defaults to max_retries=-1 → infinite retry with exponential backoff. Wrapping another retry on top creates a race:

T1: PTB retry #3 waiting on backoff
T2: Hermes retry wrapper decides "timeout", calls stop_polling()
T3: PTB's retry just restarted polling, receives stop from Hermes
T4: Hermes calls start_polling()
T5: PTB's stale retry also reaches restart
→ Two polling sessions coexist
→ Telegram returns 409 Conflict

This is exactly what I observed in production: disconnection doesn't recover silently — it enters a 409 Conflict → stop→restart → 409 → stop→restart death spiral until manual intervention.
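
The interleaving can be reproduced in a toy model. The sketch below is an assumption-heavy simulation, not PTB or Hermes code: `FakeServer`, `Poller`, `Conflict`, and the shared `running` flag are all hypothetical, chosen only to show how an outer stop→start issued while an inner retry is asleep in backoff leaves two loops polling at once.

```python
import asyncio


class Conflict(Exception):
    """Stand-in for Telegram's 409 Conflict (hypothetical class)."""


class FakeServer:
    """Accepts one long poll at a time; a second concurrent poll is a conflict."""

    def __init__(self):
        self.active = 0
        self.conflicts = 0

    async def get_updates(self):
        if self.active:
            self.conflicts += 1
            raise Conflict("another poller is active")
        self.active += 1
        try:
            await asyncio.sleep(0.2)  # simulated long poll
        finally:
            self.active -= 1


class Poller:
    def __init__(self, server):
        self.server = server
        self.running = False
        self.tasks = []

    async def _loop(self, initial_backoff):
        await asyncio.sleep(initial_backoff)  # T1: inner retry waiting on backoff
        while self.running:
            try:
                await self.server.get_updates()
            except Conflict:
                await asyncio.sleep(0.01)

    def start(self, initial_backoff=0.0):
        self.running = True
        self.tasks.append(asyncio.create_task(self._loop(initial_backoff)))

    def stop(self):
        # Cooperative stop: a loop asleep in backoff has not observed it yet.
        self.running = False


async def simulate(outer_restart):
    server = FakeServer()
    poller = Poller(server)
    poller.start(initial_backoff=0.05)
    if outer_restart:
        await asyncio.sleep(0.02)
        poller.stop()   # T2: outer wrapper decides "timeout"
        poller.start()  # T4: outer wrapper starts a second loop
    await asyncio.sleep(0.5)
    poller.stop()
    await asyncio.gather(*poller.tasks)
    return server.conflicts
```

In this model `simulate(False)` finishes with no conflicts because a single loop owns the poll, while `simulate(True)` records repeated conflicts because the stale loop wakes from backoff after the restart has flipped `running` back on.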

2. f260aa6dc: "recover from post-update polling conflict without entering limbo"

This commit acknowledges that conflict does happen, then adds logic to recover from it. But if the conflict never happened, none of these 93 lines would be needed.

Proposal: every time conflict fires, ask "who triggered a concurrent polling session?" — not "how do we gracefully recover from conflict?"
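
One way to answer "who triggered it?" is to record the caller every time polling is started and fail loudly on a second start. The class below is a hypothetical diagnostic helper (the name `PollingOwnerLog` and the owner labels are invented here); it does not hook into PTB and would have to be called from whatever code paths start and stop polling.

```python
import traceback


class PollingOwnerLog:
    """Records who started polling so a conflict can be traced to its trigger."""

    def __init__(self):
        self.active_owner = None
        self.history = []  # (owner, "function:line") for every start attempt

    def start(self, owner):
        # The first of the last two frames is the code that called start().
        site = traceback.extract_stack(limit=2)[0]
        self.history.append((owner, f"{site.name}:{site.lineno}"))
        if self.active_owner is not None:
            raise RuntimeError(
                f"{owner} started polling while {self.active_owner} still owns it"
            )
        self.active_owner = owner

    def stop(self, owner):
        # Only the current owner may release; stale stops are ignored.
        if self.active_owner == owner:
            self.active_owner = None
```

If a second layer calls `start` before the first has released ownership, the exception message and `history` name both parties, which turns a 409 from a symptom to recover from into a bug report with a call site.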

3. 5c4b43ced: "reset sticky fallback IP on connect failure, retry primary DNS"

This one is worth keeping — it fixes a PTB lower-layer HTTP transport issue (sticky fallback IP stuck). It lives in telegram_network.py, orthogonal to the recovery pyramid.
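
The commit's code is not reproduced in this issue, so the following is only a guess at the general idea, with every name (`EndpointSelector`, `on_connect_failure`, the example addresses) made up for illustration: once a fallback address has been pinned, a connect failure on it should unpin it so the next attempt goes back through primary DNS instead of staying stuck.

```python
class EndpointSelector:
    """Toy model of a sticky fallback that is reset when it stops working."""

    def __init__(self, primary, fallbacks):
        self.primary = primary
        self.fallbacks = list(fallbacks)
        self.sticky = None  # fallback currently pinned, if any

    def current(self):
        return self.sticky or self.primary

    def on_connect_failure(self):
        if self.sticky is not None:
            # The pinned fallback failed: drop it and retry the primary.
            self.sticky = None
        elif self.fallbacks:
            # The primary failed: pin the first fallback.
            self.sticky = self.fallbacks[0]
        return self.current()
```

The point of the sketch is that this logic sits below the polling loop and never starts or stops polling, which is why it does not interact with the recovery pyramid.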

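4. 6be579f62: "preserve can_edit after transient network errors in progress edits"

A symptom patch again. If Hermes didn't create these "transient network errors" itself (a manual stop→start causes brief unavailability), the can_edit false positive wouldn't trigger. The logic is correct, but it's one swing in an endless whack-a-mole game.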

The correct fix

I landed on this commit:

fix(telegram): delegate ALL network error recovery to PTB, remove self-conflict pyramid

Core changes:

Delete _verify_polling_after_reconnect (~58 lines) — unnecessary; PTB's retry loop auto-recovers after disconnection
Shrink _handle_polling_network_error from 114 lines to 10: a single log line, with no more stop→drain→start_polling→probe
Remove _polling_network_error_count, _pending_probe_task and related fields
Conflict handler counts only, never restarts polling — 3 strikes/60s → fatal, and fatal is marked retryable=False to prevent _platform_reconnect_watcher from auto-restarting the adapter
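
The "3 strikes/60s → fatal" rule from the list above can be sketched as a sliding-window counter. This is an illustration rather than the code in the fork: `ConflictStrikes`, `FatalAdapterError`, and `on_conflict` are hypothetical names, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time
from collections import deque


class FatalAdapterError(Exception):
    """Hypothetical fatal error carrying the retryable flag the watcher checks."""

    def __init__(self, message, retryable):
        super().__init__(message)
        self.retryable = retryable


class ConflictStrikes:
    def __init__(self, limit=3, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = deque()

    def record(self):
        """Record one conflict; return True once `limit` hits fall inside `window`."""
        now = self.clock()
        self.hits.append(now)
        # Drop hits that are older than the window.
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) >= self.limit


def on_conflict(strikes):
    """Count the conflict; never restart polling from here."""
    if strikes.record():
        # retryable=False tells an outer watchdog not to auto-restart the adapter.
        raise FatalAdapterError("repeated polling conflict", retryable=False)
```

Isolated conflicts are absorbed silently, and only a burst inside the window escalates, at which point the error is explicitly marked as not retryable so no outer layer re-enters the loop.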

Net: 429 lines removed. After deployment, disconnection recovery is stable with zero 409 loops.

The hard part is trust: accepting "doing nothing is the safest approach" goes against intuition. But PTB is designed this way — network_retry_loop is meant to absorb momentary network glitches. Hermes's intervention breaks it.
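
In code, "doing nothing" amounts to a handler that logs and returns. The sketch below assumes a logger name and function name that are not from the source; to my knowledge PTB's `Updater.start_polling` accepts a synchronous `error_callback` where a function like this could be passed, but treat that wiring as an assumption to verify against the PTB version in use.

```python
import logging

logger = logging.getLogger("gateway.telegram")


def handle_polling_network_error(error: Exception) -> None:
    """Log the error and return; recovery is left to the library's retry loop."""
    # Deliberately no stop, drain, restart, or probe here.
    logger.warning("Telegram polling network error, leaving retry to PTB: %s", error)
```

Everything the old handler did beyond this line is what competed with the inner retry loop.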

Recommendations

Freeze any new retry/reconnect/conflict-recovery code in telegram.py. These are symptom fixes, not root-cause fixes.
Seriously consider dismantling the entire recovery pyramid. With 5-6 layers stacked, every new layer is a potential new race condition source.
If the team decides not to fully remove it, at minimum add end-to-end trace logging across all layers. Currently each layer independently decides whether to recover, with no correlated timeline across layers — diagnosis is still guesswork.
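
A correlated timeline could be built with a shared incident ID that every layer stamps onto its log lines. The following is a minimal sketch using the standard library only; the logger names, `new_incident`, and the `incident` record field are assumptions for illustration, not existing Hermes code.

```python
import contextvars
import logging
import uuid

# One ID per disconnect incident, visible to every layer in the same context.
incident_id = contextvars.ContextVar("incident_id", default="-")


class IncidentFilter(logging.Filter):
    """Attach the current incident ID to every record passing through."""

    def filter(self, record):
        record.incident = incident_id.get()
        return True


def make_logger(layer, handler):
    """Create a per-layer logger whose records carry the incident ID."""
    logger = logging.getLogger(f"telegram.recovery.{layer}")
    logger.setLevel(logging.INFO)
    logger.addFilter(IncidentFilter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def new_incident():
    """Start a new incident and return its short ID."""
    token = uuid.uuid4().hex[:8]
    incident_id.set(token)
    return token
```

With a handler formatted as `"%(incident)s %(name)s %(message)s"`, lines from the error handler, the probe, and the watchdog can be grouped by ID, so it becomes visible which layer acted first and which one acted on stale information.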

Appendix: commit comparison

Commit	delta	relation to root cause
`8804d3364` (this fork)	-429 lines, remove entire recovery pyramid	✅ root cause fix
`af381ef12`	+99 lines, wrap connect timeout with retry	❌ symptom patch, creates more races
`f260aa6dc`	+109 lines, manual conflict recovery	❌ symptom patch, conflict shouldn't happen
`5c4b43ced`	+10 lines, sticky IP reset	✅ valuable fix at the transport layer
`6be579f62`	+221 lines, keep can_edit on transient errors	⚠️ symptom patch, but logically sound
`1b3c51bcc`	+11 lines, preserve edit after flood control	⚠️ symptom patch, but logically sound
`e2a1a2bf1`	+33 lines, pre-mark resume_pending before drain	✅ session layer, orthogonal
