hermes - Gateway needs periodic platform liveness watchdog to catch zombie connections across all adapters [1 pull request]

The gateway has no generic mechanism to detect when a platform adapter becomes a zombie: the process is alive and _running=True, but the underlying connection is dead. This class of bug affects multiple platforms:

Discord (#26656): discord.py exhausts reconnect backoff after DNS outage, silently stops. _bot_task completes with no observer.
Feishu (#23491): WebSocket disconnect leaves gateway as zombie, cron ticker stops.
Slack (#25476): Socket Mode silently drops without auto-reconnect.
WeChat (#23523): _get_updates swallows timeout as empty success, never detects network drop.
QQ (#19648, #18221): WebSocket disconnects/hangs freeze gateway.
Any plugin adapter (#28919): _notify_fatal_error() is opt-in; plugin authors who forget it leave gateway zombied.

Real-world impact (May 25-26 incident)

DNS went down for ~2 hours on a host running two gateway instances. After DNS recovered:

Daimon (Discord) was a zombie for 17 hours: systemd showed active (running) and ss -tnp showed the TCP socket still open, but the bot was dead. There was zero log output after discord.py gave up reconnecting.
Main gateway (Telegram) self-healed because Telegram polling is resilient to transient DNS failures.

Proposed: gateway-level platform liveness watchdog

Add a periodic watchdog task in GatewayRunner (not per-adapter) that applies to all platforms:

async def _platform_liveness_watchdog(self):
    """Periodic check that connected adapters are actually alive."""
    while self._running:
        await asyncio.sleep(self._liveness_interval)  # e.g. 120s
        for platform, adapter in list(self.adapters.items()):
            if not adapter.is_alive():  # new method on BasePlatformAdapter
                logger.error(
                    "Liveness watchdog: %s appears dead, triggering reconnect",
                    platform.value,
                )
                adapter._set_fatal_error("liveness_timeout", "watchdog detected zombie connection", retryable=True)
                await adapter._notify_fatal_error()

`is_alive()` on `BasePlatformAdapter`

Each adapter implements is_alive() with platform-appropriate checks:

| Platform | `is_alive()` check |
| --- | --- |
| Discord | `self._client and not self._client.is_closed() and self._bot_task and not self._bot_task.done()` |
| Telegram | `self._polling_task and not self._polling_task.done()` |
| WebSocket-based (Feishu, Slack, QQ, WeChat) | WS connection state + last-message-received timestamp |
| Webhook/API server | HTTP listener still bound |
| Default fallback | `return self._running` (no-op, backwards compatible) |
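
To make the WebSocket row and the default fallback concrete, here is a minimal sketch. The class bodies, `mark_connected()`, `on_message()`, and the `stale_after` parameter are hypothetical names chosen for illustration; only `is_alive()`, `_running`, and `BasePlatformAdapter` come from the proposal, and the real base class certainly carries more state than shown.

```python
import time
from typing import Optional


class BasePlatformAdapter:
    """Stand-in for the gateway's adapter base class (assumed shape)."""

    def __init__(self) -> None:
        self._running = False

    def is_alive(self) -> bool:
        # Default fallback: trust the _running flag, so existing
        # adapters keep their current behavior.
        return self._running


class WebSocketAdapter(BasePlatformAdapter):
    """Hypothetical adapter for a WebSocket-based platform."""

    def __init__(self, stale_after: float = 300.0) -> None:
        super().__init__()
        self._ws_open = False
        self._stale_after = stale_after  # assumed tunable, in seconds
        self._last_message_at: Optional[float] = None

    def mark_connected(self) -> None:
        self._running = True
        self._ws_open = True
        self._last_message_at = time.monotonic()

    def on_message(self) -> None:
        # Call for every received frame, heartbeats included, so a quiet
        # but healthy channel is not mistaken for a dead one.
        self._last_message_at = time.monotonic()

    def is_alive(self, now: Optional[float] = None) -> bool:
        if not (self._running and self._ws_open):
            return False
        if self._last_message_at is None:
            return False
        now = time.monotonic() if now is None else now
        # Connection state alone is not enough: a half-open socket still
        # looks connected, so also require recent inbound traffic.
        return (now - self._last_message_at) <= self._stale_after
```

The timestamp check only works if the platform sends periodic traffic (heartbeats or pings); for a platform that can legitimately stay silent, `stale_after` would need to be generous or the adapter would need to send its own probe.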

Configuration

gateway:
  liveness_watchdog:
    enabled: true          # default true
    interval: 120          # seconds between checks
    failure_threshold: 2   # consecutive failures before triggering reconnect
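
The watchdog loop shown earlier reacts to the first failed check, while this configuration asks for consecutive failures. One way the two could be reconciled is sketched below, under stated assumptions: `LivenessConfig`, `LivenessTracker`, and `check_once()` are hypothetical names, not part of the gateway, and the caller is assumed to invoke `_set_fatal_error` and `_notify_fatal_error` for each adapter name that `check_once()` returns.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List

logger = logging.getLogger("gateway.watchdog")


@dataclass
class LivenessConfig:
    # Mirrors the keys in the configuration above.
    enabled: bool = True
    interval: float = 120.0
    failure_threshold: int = 2


@dataclass
class LivenessTracker:
    config: LivenessConfig
    failures: Dict[str, int] = field(default_factory=dict)

    def record(self, name: str, alive: bool) -> bool:
        """Return True when this check pushes the adapter to the threshold."""
        if alive:
            self.failures.pop(name, None)  # any success resets the streak
            return False
        count = self.failures.get(name, 0) + 1
        self.failures[name] = count
        if count >= self.config.failure_threshold:
            # Reset so the reconnected adapter starts with a clean count.
            self.failures.pop(name, None)
            return True
        return False


def check_once(adapters: Dict[str, Any], tracker: LivenessTracker) -> List[str]:
    """Run one watchdog pass and return the names of adapters to reconnect."""
    if not tracker.config.enabled:
        return []
    flagged = []
    for name, adapter in list(adapters.items()):
        try:
            alive = bool(adapter.is_alive())
        except Exception:
            # A crashing liveness check is treated as a failed check.
            logger.exception("is_alive() raised for %s", name)
            alive = False
        if tracker.record(name, alive):
            flagged.append(name)
    return flagged
```

With interval 120 and failure_threshold 2, a dead adapter would be flagged on the second consecutive failed check, roughly two to four minutes after the connection dies, depending on where in the interval the failure occurs.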

Why gateway-level, not per-adapter

Catches the #28919 gap: Even if an adapter forgets _notify_fatal_error(), the watchdog catches it.
Single implementation: No need to duplicate liveness loops in every adapter plugin.
Consistent behavior: Same backoff, logging, and reconnect-watcher integration for all platforms.
Complementary to _bot_task monitoring: The watchdog catches slow-death zombies; a _bot_task.add_done_callback() catches instant-death (discord.py gave up).

Quick wins (can land independently)

_bot_task done-callback on Discord adapter: detect when the client.start() coroutine finishes and call _set_fatal_error + _notify_fatal_error. This alone would have caught the May 25-26 incident.
Make _set_fatal_error() auto-call _notify_fatal_error(): fixes the #28919 class of bugs where adapter authors forget to notify.
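
Both quick wins can be sketched together. This is an illustration under assumptions, not the adapter's actual code: `AdapterSketch`, `on_fatal`, `start()`, and the `"bot_task_exited"` code are hypothetical, and `_set_fatal_error` is assumed to be called from inside a running event loop. Only `_bot_task`, `_set_fatal_error`, `_notify_fatal_error`, and the `(code, message, retryable=...)` call shape come from the issue.

```python
import asyncio
from typing import Any, Awaitable, Callable, Coroutine, Optional


class AdapterSketch:
    """Hypothetical stand-in that models only the two quick wins."""

    def __init__(self, on_fatal: Callable[["AdapterSketch"], Awaitable[None]]) -> None:
        self._on_fatal = on_fatal  # assumed hook into the gateway's reconnect path
        self.fatal_error: Optional[tuple] = None
        self._bot_task: Optional[asyncio.Task] = None
        self._notify_task: Optional[asyncio.Task] = None
        self._stopping = False

    def _set_fatal_error(self, code: str, message: str, retryable: bool = True) -> None:
        self.fatal_error = (code, message, retryable)
        # Quick win 2: schedule the notification here so adapter authors
        # cannot forget it. Keep a reference so the task is not collected.
        loop = asyncio.get_running_loop()
        self._notify_task = loop.create_task(self._notify_fatal_error())

    async def _notify_fatal_error(self) -> None:
        await self._on_fatal(self)

    def start(self, run_client: Callable[[], Coroutine[Any, Any, None]]) -> None:
        self._bot_task = asyncio.get_running_loop().create_task(run_client())
        # Quick win 1: observe the task so a silent exit is never lost.
        self._bot_task.add_done_callback(self._on_bot_task_done)

    def _on_bot_task_done(self, task: asyncio.Task) -> None:
        if self._stopping or task.cancelled():
            return  # intentional shutdown, nothing to report
        exc = task.exception()
        reason = repr(exc) if exc else "client coroutine returned"
        self._set_fatal_error("bot_task_exited", reason, retryable=True)
```

If the real `_set_fatal_error` auto-notifies like this, call sites that still call `_notify_fatal_error()` explicitly would notify twice, so the notification would need to be idempotent or the explicit calls removed.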

Related

#26656: Discord zombie after network outage (incident report added)
#28919: _notify_fatal_error is opt-in → gateway zombies
#23491: Feishu zombie
#25476: Slack zombie
#23523: WeChat zombie
#19648, #18221: QQ zombie/hang
