hermes: Gateway needs periodic platform liveness watchdog to catch zombie connections across all adapters [1 pull request]


Fix Action

Fixed


Summary

The gateway has no generic mechanism to detect when a platform adapter becomes a zombie — process alive, _running=True, but the underlying connection is dead. This is a class of bugs affecting multiple platforms:

  • Discord (#26656): discord.py exhausts reconnect backoff after DNS outage, silently stops. _bot_task completes with no observer.
  • Feishu (#23491): WebSocket disconnect leaves gateway as zombie, cron ticker stops.
  • Slack (#25476): Socket Mode silently drops without auto-reconnect.
  • WeChat (#23523): _get_updates swallows timeout as empty success, never detects network drop.
  • QQ (#19648, #18221): WebSocket disconnects/hangs freeze gateway.
  • Any plugin adapter (#28919): _notify_fatal_error() is opt-in; plugin authors who forget it leave gateway zombied.

Real-world impact (May 25-26 incident)

DNS went down for ~2 hours on a host running two gateway instances. After DNS recovered:

  • Daimon (Discord) was a zombie for 17 hours — systemd showed active (running), ss -tnp showed TCP socket still open, but bot was dead. Zero log output after discord.py gave up reconnecting.
  • Main gateway (Telegram) self-healed because Telegram polling is resilient to transient DNS failures.

Proposed: gateway-level platform liveness watchdog

Add a periodic watchdog task in GatewayRunner (not per-adapter) that applies to all platforms:

async def _platform_liveness_watchdog(self):
    """Periodically check that connected adapters are actually alive."""
    while self._running:
        await asyncio.sleep(self._liveness_interval)  # e.g. 120s
        # Iterate over a snapshot so adapters can be added or removed mid-check.
        for platform, adapter in list(self.adapters.items()):
            if adapter.is_alive():  # new method on BasePlatformAdapter
                continue
            logger.error(
                "Liveness watchdog: %s appears dead, triggering reconnect",
                platform.value,
            )
            adapter._set_fatal_error(
                "liveness_timeout",
                "watchdog detected zombie connection",
                retryable=True,
            )
            await adapter._notify_fatal_error()

is_alive() on BasePlatformAdapter

Each adapter implements is_alive() with platform-appropriate checks:

| Platform | is_alive() check |
| --- | --- |
| Discord | self._client and not self._client.is_closed() and self._bot_task and not self._bot_task.done() |
| Telegram | self._polling_task and not self._polling_task.done() |
| WebSocket-based (Feishu, Slack, QQ, WeChat) | WS connection state + last-message-received timestamp |
| Webhook/API server | HTTP listener still bound |
| Default fallback | return self._running (no-op, backwards compatible) |
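
To make the per-platform checks concrete, here is a minimal sketch of how the default fallback, a polling-task check, and a WebSocket freshness check might look. The class bodies, the `_ws_open` flag, `mark_message_received()`, and the `max_silence` parameter are assumptions made for illustration; only `is_alive()`, `_running`, and `_polling_task` come from the proposal above.

```python
import asyncio
import time
from typing import Optional


class BasePlatformAdapter:
    """Hypothetical stand-in for the gateway's base adapter class."""

    def __init__(self) -> None:
        self._running = False

    def is_alive(self) -> bool:
        # Default fallback: trust the adapter's own flag (backwards compatible).
        return self._running


class PollingAdapter(BasePlatformAdapter):
    """Telegram-style check: the polling task must exist and not be finished."""

    def __init__(self) -> None:
        super().__init__()
        self._polling_task: Optional[asyncio.Task] = None

    def is_alive(self) -> bool:
        return self._polling_task is not None and not self._polling_task.done()


class WebSocketAdapter(BasePlatformAdapter):
    """WebSocket-style check: connection flag plus a freshness window."""

    def __init__(self, max_silence: float = 300.0) -> None:
        super().__init__()
        self._ws_open = False             # assumed to be maintained by the adapter
        self._max_silence = max_silence   # hypothetical tuning knob, in seconds
        self._last_message_at = time.monotonic()

    def mark_message_received(self) -> None:
        # Call this from the receive loop for every frame, including heartbeats.
        self._last_message_at = time.monotonic()

    def is_alive(self) -> bool:
        silent_for = time.monotonic() - self._last_message_at
        return self._ws_open and silent_for <= self._max_silence
```

A timestamp check like this only works if the platform sends something regularly (heartbeats, pings, or acks). On a channel that can be legitimately silent for long periods, the window would need to be generous or the check would report false zombies.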

Configuration

gateway:
  liveness_watchdog:
    enabled: true          # default true
    interval: 120          # seconds between checks
    failure_threshold: 2   # consecutive failures before triggering reconnect
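
The watchdog loop shown earlier reacts to the first failed check, while the configuration describes a `failure_threshold` of consecutive failures. One way the two could fit together is sketched below. `LivenessConfig`, `LivenessTracker`, `run_watchdog`, and the `on_dead` / `should_run` callbacks are hypothetical names introduced for this example, not existing gateway APIs.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict

logger = logging.getLogger("gateway.watchdog")


@dataclass
class LivenessConfig:
    """Mirrors the gateway.liveness_watchdog keys from the config above."""
    enabled: bool = True
    interval: float = 120.0
    failure_threshold: int = 2


class LivenessTracker:
    """Counts consecutive failed checks per platform (hypothetical helper)."""

    def __init__(self, failure_threshold: int) -> None:
        self._threshold = failure_threshold
        self._failures: Dict[str, int] = {}

    def record(self, platform: str, alive: bool) -> bool:
        """Return True when the platform has just reached the threshold."""
        if alive:
            self._failures.pop(platform, None)
            return False
        count = self._failures.get(platform, 0) + 1
        if count >= self._threshold:
            self._failures[platform] = 0  # start counting again after firing
            return True
        self._failures[platform] = count
        return False


async def run_watchdog(
    adapters: Dict[str, object],
    config: LivenessConfig,
    on_dead: Callable[[str, object], Awaitable[None]],
    should_run: Callable[[], bool],
) -> None:
    if not config.enabled:
        return
    tracker = LivenessTracker(config.failure_threshold)
    while should_run():
        await asyncio.sleep(config.interval)
        for name, adapter in list(adapters.items()):
            try:
                alive = bool(adapter.is_alive())
            except Exception:
                # A crashing check is treated as a failed check, not a watchdog crash.
                logger.exception("is_alive() raised for %s", name)
                alive = False
            if tracker.record(name, alive):
                logger.error("Liveness watchdog: %s appears dead", name)
                await on_dead(name, adapter)
```

With `interval: 120` and `failure_threshold: 2`, a zombie would be reported after roughly four minutes in this sketch. In the gateway, `on_dead` would presumably wrap `_set_fatal_error(...)` followed by `_notify_fatal_error()`.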

Why gateway-level, not per-adapter

  1. Catches the #28919 gap: Even if an adapter forgets _notify_fatal_error(), the watchdog catches it.
  2. Single implementation: No need to duplicate liveness loops in every adapter plugin.
  3. Consistent behavior: Same backoff, logging, and reconnect-watcher integration for all platforms.
  4. Complementary to _bot_task monitoring: The watchdog catches slow-death zombies; a _bot_task.add_done_callback() catches instant-death (discord.py gave up).

Quick wins (can land independently)

  1. _bot_task done-callback on Discord adapter — detect when client.start() coroutine finishes and call _set_fatal_error + _notify_fatal_error. This alone would have caught the May 25-26 incident.
  2. Make _set_fatal_error() auto-call _notify_fatal_error() — fixes #28919 class of bugs where adapter authors forget to notify.
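
Both quick wins can be illustrated together in a small, self-contained sketch. `SketchAdapter`, its `notify` hook, and the `"bot_task_exited"` error code are made up for this example; the real adapter's `_set_fatal_error` and `_notify_fatal_error` may differ in signature and behavior. The only library calls relied on are standard `asyncio` ones (`create_task`, `Task.add_done_callback`, `Task.cancelled`, `Task.exception`).

```python
import asyncio
from typing import Awaitable, Callable, Coroutine, Optional, Tuple


class SketchAdapter:
    """Illustrative adapter, not the real Discord adapter."""

    def __init__(self, notify: Callable[[str, str], Awaitable[None]]) -> None:
        self._notify = notify  # hypothetical hook standing in for _notify_fatal_error
        self._fatal_error: Optional[Tuple[str, str, bool]] = None
        self._bot_task: Optional[asyncio.Task] = None
        self._notify_task: Optional[asyncio.Task] = None
        self._stopping = False

    def _set_fatal_error(self, code: str, message: str, *, retryable: bool = True) -> None:
        # Quick win 2: recording the error always schedules the notification,
        # so an adapter author cannot forget the second call.
        self._fatal_error = (code, message, retryable)
        loop = asyncio.get_running_loop()
        self._notify_task = loop.create_task(self._notify(code, message))

    def start(self, client_coro: Coroutine) -> None:
        # Quick win 1: observe the task that runs the client's start() coroutine.
        self._bot_task = asyncio.create_task(client_coro)
        self._bot_task.add_done_callback(self._on_bot_task_done)

    def _on_bot_task_done(self, task: asyncio.Task) -> None:
        if self._stopping or task.cancelled():
            return  # an intentional shutdown is not a zombie
        exc = task.exception()
        reason = repr(exc) if exc else "client coroutine returned"
        self._set_fatal_error("bot_task_exited", reason, retryable=True)
```

The `_stopping` guard matters: without it, a normal shutdown that lets the client coroutine return would be reported as a fatal error and could trigger an unwanted reconnect.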

Related

  • #26656 — Discord zombie after network outage (incident report added)
  • #28919 — _notify_fatal_error is opt-in → gateway zombies
  • #23491 — Feishu zombie
  • #25476 — Slack zombie
  • #23523 — WeChat zombie
  • #19648, #18221 — QQ zombie/hang
