hermes - 💡(How to fix) Fix [Bug]: WeChat zombie connection — _get_updates swallows asyncio.TimeoutError as empty success, never detects network drop [1 pull requests]


Fix Action

Fixed

Bug Description

The WeChat (Weixin) gateway connection silently becomes a zombie after a network interruption (WiFi reconnect, macOS sleep/wake, etc.). The gateway health endpoint keeps reporting weixin.state=connected even though the iLink long-poll socket is dead. No outbound messages can be sent, no inbound messages are received, and the condition persists until the gateway is restarted manually.

Root Cause

gateway/platforms/weixin.py, lines 428-429:

except asyncio.TimeoutError:
    return {"ret": 0, "msgs": [], "get_updates_buf": sync_buf}

The _get_updates() function is a long-polling HTTP POST with a 35-second timeout (LONG_POLL_TIMEOUT_MS = 35_000). When the network drops and the TCP connection hangs:

  1. After 35 seconds, asyncio.TimeoutError is raised by _api_post()
  2. Line 428 catches it and returns {"ret": 0} — converting a network timeout into an "empty success" response
  3. The _poll_loop() receives ret=0, resets consecutive_failures to 0, and starts the next poll cycle
  4. The next poll also hangs → same 35s timeout → same {"ret": 0} → infinite loop
  5. The adapter's _connected flag (set once in connect()) is never cleared
  6. The health endpoint (/health/detailed) reports weixin.state: "connected" forever
  7. The weixin.updated_at timestamp freezes at the time of the last successful poll

The connection is dead, but the gateway never detects it.
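The sequence above can be reproduced in isolation. The sketch below is not the gateway's code: `fake_api_post`, `get_updates_buggy`, and `poll_loop` are hypothetical stand-ins, and the dead link is simulated by raising the timeout immediately rather than waiting 35 seconds. It only illustrates that once the timeout is mapped to `ret: 0`, the failure counter can never grow.

```python
import asyncio


async def fake_api_post():
    """Hypothetical stand-in for _api_post(): a dead link always times out."""
    raise asyncio.TimeoutError


async def get_updates_buggy(sync_buf=""):
    """Mirrors the reported handler: a timeout is returned as an empty success."""
    try:
        return await fake_api_post()
    except asyncio.TimeoutError:
        return {"ret": 0, "msgs": [], "get_updates_buf": sync_buf}


async def poll_loop(get_updates, cycles):
    """Simplified loop: count non-zero ret values, reset the count on ret == 0."""
    consecutive_failures = 0
    max_seen = 0
    for _ in range(cycles):
        resp = await get_updates()
        if resp.get("ret") == 0:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
        max_seen = max(max_seen, consecutive_failures)
    return max_seen


# Every poll "succeeds", so the loop never observes a single failure.
max_failures = asyncio.run(poll_loop(get_updates_buggy, cycles=5))
print(max_failures)
```

Because every cycle looks like a success, any reconnect logic keyed on `consecutive_failures` is unreachable while the link is down.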

Why This Happens on macOS

On Linux servers (VPS) with stable wired Ethernet, TCP connections almost never enter a half-open state. On macOS with WiFi and sleep/wake cycles, connections silently break daily. The macOS TCP keepalive default does not help either:

  • net.inet.tcp.keepidle = 7_200_000ms (2 hours) — TCP keepalive takes 2+ hours to detect a dead connection
  • This is far longer than the 35-second LONG_POLL_TIMEOUT_MS, so keepalive never helps
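For reference, keepalive timing can also be shortened per socket instead of system-wide. The sketch below is an assumption-laden illustration: `enable_fast_keepalive` is a hypothetical helper, the 20/5/3 values are examples, and the issue does not say whether the gateway's HTTP client exposes its underlying socket. Keepalive probes are only sent at all when `SO_KEEPALIVE` is enabled on the socket.

```python
import socket


def enable_fast_keepalive(sock, idle_s=20, interval_s=5, probes=3):
    """Enable TCP keepalive with a short idle time (values are illustrative)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux exposes the idle time as TCP_KEEPIDLE; on macOS, Python 3.10+
    # exposes the equivalent option as TCP_KEEPALIVE.
    idle_opt = getattr(socket, "TCP_KEEPIDLE", None)
    if idle_opt is None:
        idle_opt = getattr(socket, "TCP_KEEPALIVE", None)
    if idle_opt is not None:
        sock.setsockopt(socket.IPPROTO_TCP, idle_opt, idle_s)
    # Probe interval and probe count are optional and platform-dependent.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


sock = enable_fast_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

This is a mitigation only; it does not change the fact that the timeout handler reports a dead connection as a success.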

Symptom Detection

Check /health/detailed:

{
  "platforms": {
    "weixin": {
      "state": "connected",
      "error_code": null,
      "updated_at": "2026-05-10T19:02:36Z"  ← stale (hours old)
    }
  }
}

If weixin.updated_at is more than 5 minutes old while the gateway is otherwise healthy, the connection is a zombie.
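The 5-minute rule can be scripted against the health payload. This is a minimal sketch: `is_zombie` is a hypothetical helper, the payload shape is taken from the sample above, and fetching `/health/detailed` is left out so the example stays self-contained.

```python
from datetime import datetime, timedelta, timezone


def is_zombie(health, platform="weixin", max_age=timedelta(minutes=5), now=None):
    """Return True if the platform claims 'connected' but updated_at is stale."""
    entry = health["platforms"][platform]
    if entry.get("state") != "connected":
        return False
    # fromisoformat() accepts a trailing "Z" only from Python 3.11, so normalize it.
    updated = datetime.fromisoformat(entry["updated_at"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now - updated > max_age


sample = {
    "platforms": {
        "weixin": {
            "state": "connected",
            "error_code": None,
            "updated_at": "2026-05-10T19:02:36Z",
        }
    }
}

# Roughly three hours after the last update: reported as a zombie.
later = datetime(2026, 5, 10, 22, 0, 0, tzinfo=timezone.utc)
print(is_zombie(sample, now=later))  # True
```

Passing `now` explicitly keeps the check deterministic; in a monitoring script it would be omitted so the current UTC time is used.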

Steps to Reproduce

  1. Start gateway with WeChat adapter connected
  2. Force a network interruption (toggle Wi-Fi off and on, or sleep/wake the machine)
  3. Wait 5+ minutes
  4. Check /health/detailed: weixin.state shows "connected", updated_at is frozen
  5. Send a message from WeChat; the gateway never receives it
  6. Send a message via the send_message tool; it fails silently or with a stale-session error

Proposed Fix

Code fix (preferred) — gateway/platforms/weixin.py

Option A: Don't swallow the timeout as success. Change lines 428-429 to return an error indicator instead of {"ret": 0}:

except asyncio.TimeoutError:
    return {"ret": -3, "errcode": -3, "errmsg": "connection timeout"}

This lets _poll_loop() detect consecutive timeouts and take corrective action (reconnect).
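How the loop might act on the new error code is sketched below. Everything here is illustrative: the threshold of 3, the `reconnect` callback, and the function names are assumptions, since the issue does not show how `_poll_loop()` is actually implemented.

```python
import asyncio

MAX_CONSECUTIVE_FAILURES = 3  # illustrative threshold, not taken from the issue


async def get_updates_fixed(api_post, sync_buf=""):
    """Option A: surface the timeout as an error instead of an empty success."""
    try:
        return await api_post(sync_buf)
    except asyncio.TimeoutError:
        return {"ret": -3, "errcode": -3, "errmsg": "connection timeout"}


async def poll_loop_with_reconnect(api_post, reconnect, cycles):
    """Reconnect after MAX_CONSECUTIVE_FAILURES non-zero results in a row."""
    consecutive_failures = 0
    for _ in range(cycles):
        resp = await get_updates_fixed(api_post)
        if resp.get("ret") == 0:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            await reconnect()
            consecutive_failures = 0


async def demo():
    reconnects = 0

    async def dead_api_post(sync_buf):
        # Simulates a hung connection without waiting for a real timeout.
        raise asyncio.TimeoutError

    async def reconnect():
        nonlocal reconnects
        reconnects += 1

    await poll_loop_with_reconnect(dead_api_post, reconnect, cycles=7)
    return reconnects


reconnect_count = asyncio.run(demo())
print(reconnect_count)  # 2: after the 3rd and the 6th consecutive timeout
```

A threshold above 1 is a design choice worth considering: it avoids tearing down the session on a single slow poll while still bounding how long a dead link goes unnoticed.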

Option B: Add a liveness heartbeat to _poll_loop(). Track a _last_successful_poll timestamp; if no poll has succeeded for more than N minutes, force a disconnect() followed by connect().
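A minimal sketch of the heartbeat bookkeeping follows. `PollLiveness` and its method names are hypothetical, and the 300-second limit stands in for the unspecified N minutes. Note that this only works if `mark_success()` is called for real server responses; if a client-side timeout is still counted as a success, the timestamp keeps advancing and the watchdog never fires.

```python
import time


class PollLiveness:
    """Tracks the last successful poll and reports when it is overdue."""

    def __init__(self, max_silence_s=300.0, clock=time.monotonic):
        # A monotonic clock is unaffected by wall-clock adjustments.
        self._clock = clock
        self._max_silence_s = max_silence_s
        self._last_successful_poll = clock()

    def mark_success(self):
        """Call only when the server actually answered the poll."""
        self._last_successful_poll = self._clock()

    def is_stale(self):
        """True when no successful poll has been seen within the limit."""
        return self._clock() - self._last_successful_poll > self._max_silence_s


# Demo with an injected fake clock so no real waiting is needed.
fake_now = [0.0]
liveness = PollLiveness(max_silence_s=300.0, clock=lambda: fake_now[0])
fake_now[0] = 301.0
stale_after_silence = liveness.is_stale()
liveness.mark_success()
stale_after_success = liveness.is_stale()
print(stale_after_silence, stale_after_success)  # True False
```

In the poll loop, a stale result would be the trigger for the disconnect() and connect() cycle described above.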

Related

  • The macOS default net.inet.tcp.keepidle = 7_200_000ms (2 hours) is a second contributor. The code fix alone would reconnect after each timeout, but tuning keepalive to e.g. 20s would surface the dead connection faster and reduce the 35s timeout overhead.
  • This is NOT related to issue #23389 (which is about launchd gui/502 domain being unsupported on macOS 26).
