hermes: Telegram adapter leaks httpx general-pool connections through HTTP proxy (CLOSED sockets accumulate, fd limit hit after ~2 days)

Root Cause

gateway/platforms/telegram.py::_drain_polling_connections (added in #17015) mitigates the connection leak for _request[0] (getUpdates) only, with the rationale stated explicitly at lines 822–824:

# We reset ONLY _request[0] (the getUpdates request) — the general
# request (_request[1]) is left untouched so concurrent
# send_message / edit_message calls are never interrupted.

That is reasonable for short outages. Over many days of flaky-proxy operation, however, the general pool accumulates half-closed connections faster than httpx evicts them (they show up as CLOSED in lsof), because the proxy=… HTTPXRequest construction goes through httpcore's tunnel-proxy path, which does not always release the underlying socket on ConnectError.

After enough cycles, every general-pool slot holds a dead connection and new sends can't acquire one → httpx.ConnectError: All connection attempts failed.

Problem

After ~2 days of continuous operation behind a local HTTP proxy (xray on 127.0.0.1:10808), the gateway's Telegram adapter accumulates hundreds of half-closed sockets in the httpx general-request pool. The OS-level fd count exceeds the macOS launchd default maxfiles=256, after which every subsequent bot.send_message() / set_my_commands() fails:

telegram.error.NetworkError: httpx.ConnectError: All connection attempts failed

Simultaneously, kanban dispatcher and channel-directory writes start failing with [Errno 24] Too many open files and sqlite3.OperationalError: unable to open database file.

gateway_state.json continues to report platforms.telegram.state = "connected" (stale — last updated when the pool was still healthy), so external monitoring does not detect the wedge.
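
A minimal sketch of how an external monitor could refuse to trust a stale "connected" value. It assumes gateway_state.json exposes platforms.telegram.state and platforms.telegram.updated_at, and that updated_at is a Unix timestamp in seconds; the timestamp format is a guess and the function name telegram_looks_healthy is hypothetical. A check like this only helps once updated_at is refreshed by something that exercises the send path, as proposed under Suggested fixes.

```python
import time


def telegram_looks_healthy(state, max_age_s=300, now=None):
    """Trust "connected" only if the heartbeat timestamp is recent.

    `state` is the parsed gateway_state.json dict. The epoch-seconds
    format of `updated_at` is an assumption, not confirmed by hermes.
    """
    tg = state.get("platforms", {}).get("telegram", {})
    if tg.get("state") != "connected":
        return False
    updated_at = tg.get("updated_at")
    if updated_at is None:
        return False  # no heartbeat recorded at all
    now = time.time() if now is None else now
    return (now - updated_at) <= max_age_s
```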

Why this is NOT a duplicate of #30230 or #5729 / #21548

This was the first thing I checked. The leak vector here is distinct:

#30230 blames MCP subprocess pipes/sockets in multi-profile setups. In my case there are 0 MCP servers and 1 profile, but the gateway still reaches 287 open fds after 2 days. See the lsof breakdown below: 280 of the 287 fds are httpx-through-proxy sockets, not MCP pipes.
#5729 / PR #21548 describe a cold-boot wedge of the polling pool (_request[0]) while the general pool is healthy and getMe works. My case is the opposite: the polling pool is fine and reconnects via _drain_polling_connections work; the general pool (_request[1]) is the one accumulating dead connections, and eventually bot.send_message() (which routes through _request[1]) fails.

Evidence

Captured from a wedged gateway (uptime ~2 days, single profile, no MCP servers configured):

$ lsof -p <gateway_pid> | wc -l
287                                      # vs launchctl limit maxfiles soft = 256

$ lsof -p <gateway_pid> | awk '{print $5}' | sort | uniq -c | sort -rn
  235 IPv4
   42 REG
    3 unix
    ...

$ lsof -a -p <gateway_pid> -iTCP | awk '{print $NF}' | sort | uniq -c | sort -rn
  267 (CLOSED)
  117 (ESTABLISHED)
    4 (CLOSE_WAIT)

$ lsof -a -p <gateway_pid> -iTCP | awk '{print $9}' | sed 's/.*->//' | sort | uniq -c | sort -rn | head -3
  280  localhost:10808     ← local xray HTTP proxy
   12  216.38.168.230:45979
   10  localhost:13580

280 of the 287 fds terminate at the local proxy port. Persistent log pattern in the days leading up to the wedge:

[Telegram] Telegram network error, scheduling reconnect: httpx.ConnectError:
[Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: httpx.ConnectError:
[Telegram] Telegram polling reconnect failed: httpx.ConnectError:
[Telegram] Telegram polling resumed after network error (attempt N)

In other words: the proxy hiccups, the reconnect ladder fires, and the polling pool gets drained correctly, but each cycle also leaks 1–2 connections in the general pool (which set_my_commands, send_message, and the resolver-fallback HTTPXRequest all use).
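
To track the leak over time without re-running the awk pipelines by hand, the same tallies can be computed in Python. This is a sketch only: tally_tcp is a hypothetical helper, and it assumes the default lsof column layout used above, where the socket name is the second-to-last field and the TCP state is the last.

```python
from collections import Counter


def tally_tcp(lsof_lines):
    """Count TCP states and remote endpoints.

    Expects lines from `lsof -a -p <pid> -iTCP`. Returns two Counters:
    one keyed by state (e.g. "CLOSED"), one keyed by the endpoint after
    "->" (listening sockets have no "->" and are keyed by local address).
    """
    states, remotes = Counter(), Counter()
    for line in lsof_lines:
        fields = line.split()
        # The header row has 9 fields; TCP rows with a state have 10.
        if len(fields) < 10:
            continue
        state, name = fields[-1], fields[-2]
        if not (state.startswith("(") and state.endswith(")")):
            continue
        states[state.strip("()")] += 1
        remotes[name.rpartition("->")[2]] += 1
    return states, remotes
```

Feeding it the output of subprocess.run(["lsof", "-a", "-p", str(pid), "-iTCP"], capture_output=True, text=True).stdout.splitlines() on a schedule would show whether the CLOSED count against the proxy port grows with each reconnect cycle.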


Reproduction

1. Configure the system HTTP/HTTPS proxy to point at a local proxy that occasionally drops connections (xray / clash / v2ray are typical on macOS in restricted-network environments).
2. Start the gateway with Telegram enabled, single profile, no MCP servers.
3. Let it run 24–48h; observe periodic "Telegram network error, scheduling reconnect: httpx.ConnectError" entries in gateway.log.
4. After enough cycles, lsof -p <gateway_pid> | wc -l exceeds the launchctl limit maxfiles soft limit and all sends fail.
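
For step 4, an in-process equivalent of the lsof | wc -l check might look like the sketch below. It is not part of hermes: fd_headroom and should_warn are hypothetical names, and /dev/fd is assumed to exist (true on macOS and Linux). The count will not match lsof exactly, since lsof also lists non-descriptor rows such as cwd and txt.

```python
import os
import resource


def fd_headroom(fd_dir="/dev/fd"):
    """Return (open_fds, soft_limit) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Listing the directory briefly opens one descriptor of its own,
    # so the count can be off by one.
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft


def should_warn(open_fds, soft_limit, ratio=0.8):
    """True once usage reaches `ratio` of the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return open_fds >= soft_limit * ratio
```

With the numbers from this report, should_warn(287, 256) is true and should_warn(54, 256) (the count after a restart) is false.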

Workaround (confirmed)

hermes gateway restart clears the leaked sockets (fd 287 → 54, Telegram resumes). Recurs in 1–2 days.

Suggested fixes

In rough order of impact:

1. Bound the general pool when a proxy is configured: pass limits=httpx.Limits(max_connections=20, max_keepalive_connections=10) into the HTTPXRequest(..., proxy=proxy_url) construction at gateway/platforms/telegram.py:1424–1425. This caps the leak and makes it surface immediately instead of after days.
2. Periodically drain _request[1]: e.g., on a low-frequency schedule (hourly), gracefully drain the general request with a brief grace period for in-flight sends. This is symmetrical with the existing polling-pool drain and is the targeted fix.
3. Heartbeat on the send path, not just polling: update platforms.telegram.updated_at from a probe that exercises _request[1], so a wedged-but-still-polling state is observable externally instead of being silently reported as connected.
4. (Cross-ref #30230) Detect launchd maxfiles < 1024 at startup and emit a single WARN.
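
The graceful drain in the second suggestion could be structured roughly as follows. This is a sketch under assumptions, not the hermes or python-telegram-bot implementation: DrainablePool and its factory argument are hypothetical, and the client is only assumed to expose an async aclose() (the name is borrowed from httpx.AsyncClient). The idea is to swap in a fresh client first, so new sends are never blocked, then give in-flight sends on the old client a bounded grace period before closing it.

```python
import asyncio


class _Slot:
    """One client plus bookkeeping for the calls currently using it."""

    def __init__(self, client):
        self.client = client
        self.inflight = 0
        self.idle = asyncio.Event()
        self.idle.set()


class DrainablePool:
    """Hypothetical wrapper; construct it inside a running event loop."""

    def __init__(self, factory):
        self._factory = factory  # zero-argument callable returning a client
        self._slot = _Slot(factory())

    async def run(self, op):
        slot = self._slot  # pin the client this call started with
        slot.inflight += 1
        slot.idle.clear()
        try:
            return await op(slot.client)
        finally:
            slot.inflight -= 1
            if slot.inflight == 0:
                slot.idle.set()

    async def drain(self, grace=5.0):
        # New calls go to the fresh client from this point on.
        old, self._slot = self._slot, _Slot(self._factory())
        try:
            await asyncio.wait_for(old.idle.wait(), timeout=grace)
        except asyncio.TimeoutError:
            pass  # grace period elapsed; close the old client anyway
        await old.client.aclose()
```

A send that outlives the grace period is still interrupted when the old client closes, so the grace value trades a bounded leak against the "never interrupt sends" rationale quoted above.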

I'm happy to send a PR for fix (1) if a maintainer can confirm the approach — it's a 2-line change at telegram.py:1414–1425 and the failure mode it prevents is well-bounded.

Environment

macOS 15 (Darwin 25.5.0, Apple Silicon)
hermes-agent 0.14.0 (commit 7f1b2b4)
Python 3.11.15
httpx 0.28.1, httpcore 1.0.9, python-telegram-bot 22.6
Single profile, no MCP servers
Local HTTP proxy on 127.0.0.1:10808 (xray)
launchd maxfiles: 256 (default)

Related

#30230: same hit-the-wall symptom, different leak vector (MCP subprocesses + multi-profile)
#5729 / PR #21548: polling-pool wedge on cold boot; complementary fix on the other pool
#17015: merged fix that added _drain_polling_connections for the polling pool only
#25666: SIGSEGV on aarch64 during httpx.ReadError reconnect; same code path, different platform
