hermes - Gateway: Multi-platform WebSockets share single event loop, causing cascading disconnections

Description

When running multiple messaging platforms simultaneously (WeCom + Feishu + QQBot), the Hermes Gateway experiences cascading WebSocket disconnections. All platform connections share a single Python asyncio event loop. When the agent is processing a message (calling LLM API, executing tools, etc.), the event loop becomes occupied, and WebSocket keepalive pings for other platforms are not serviced in time. This causes the remote servers to drop the connections.

Impact

  • Messages sent on one platform can cause other platforms to disconnect
  • During the reconnect window (15-30 seconds), messages sent to the affected platform are lost
  • Users find the gateway unresponsive and must wait for the reconnect to complete
  • Cascading effect: one platform's disconnect can trigger another's during reconnect processing
  • This prevents users from relying on the gateway for remote access

Root Cause

In gateway/run.py, all platform adapters and the agent processing loop run within the same asyncio event loop. Each platform maintains its own WebSocket connection, and each server expects keepalive traffic on schedule. Because asyncio tasks are scheduled cooperatively, any single operation that holds the loop for more than a few seconds without yielding (e.g., an LLM API call or tool execution) delays the WebSocket keepalive/ping tasks for ALL other platforms, so their respective servers time out and close the connections.
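The mechanism can be reproduced outside Hermes with a few lines of asyncio. The sketch below is not gateway code: `heartbeat`, `handle_message`, and `measure_max_gap` are hypothetical stand-ins, and the intervals are shortened so it runs in about a second. It measures the longest gap between heartbeat ticks while a "message" is handled, once with a synchronous call and once with a cooperative one.

```python
import asyncio
import time


async def heartbeat(interval: float, gaps: list) -> None:
    """Stand-in for a platform keepalive task: record the real gap between ticks."""
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def handle_message(blocking: bool, duration: float) -> None:
    """Stand-in for agent processing (LLM call, tool execution)."""
    if blocking:
        time.sleep(duration)           # synchronous: freezes every task on this loop
    else:
        await asyncio.sleep(duration)  # cooperative: other tasks keep running


async def measure_max_gap(blocking: bool, interval: float = 0.05,
                          duration: float = 0.5) -> float:
    """Return the longest observed heartbeat gap while one message is handled."""
    gaps: list = []
    task = asyncio.create_task(heartbeat(interval, gaps))
    await asyncio.sleep(interval * 2)        # let the heartbeat tick normally first
    await handle_message(blocking, duration)
    await asyncio.sleep(interval * 2)        # give a delayed tick the chance to land
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(gaps)


if __name__ == "__main__":
    print("max gap, blocking handler:   ", asyncio.run(measure_max_gap(True)))
    print("max gap, cooperative handler:", asyncio.run(measure_max_gap(False)))
```

With the blocking handler, the longest gap grows to roughly the handler's duration. With the cooperative handler it stays close to the heartbeat interval. Scaled up to a 30-second heartbeat and a multi-second stall, this is the delay that lets a server-side keepalive timeout expire.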

Reproduction Steps

  1. Configure at least 2 platforms (e.g., WeCom + Feishu)
  2. Start the gateway
  3. Send a message on one platform that triggers agent processing
  4. Observe: while the first message is being processed, other platforms disconnect

Environment

  • Hermes Agent v0.12.0 (2026.4.30)
  • macOS (launchd)
  • Platforms: Feishu + WeCom (and optionally QQBot)
  • All platforms use WebSocket long-connection mode

Suggested Solutions

Option A: Isolate each platform's WebSocket in a separate asyncio event loop

Run each platform adapter's WebSocket connection in its own asyncio event loop (using asyncio.run() or loop.run_forever() in a dedicated thread). This would ensure that one platform's reconnect or network delay never affects another's keepalive.
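A minimal sketch of Option A follows, assuming nothing about the real adapter classes. `LoopThread`, `keepalive`, and `stalled_adapter` are hypothetical names introduced for illustration. Each `LoopThread` owns one event loop running in a dedicated daemon thread, and coroutines are handed to it with `asyncio.run_coroutine_threadsafe`.

```python
import asyncio
import threading
import time


class LoopThread:
    """Owns one asyncio event loop running in a dedicated daemon thread."""

    def __init__(self, name: str) -> None:
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, name=name, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()

    def submit(self, coro):
        """Schedule a coroutine on this loop; returns a concurrent.futures.Future."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

    def stop(self) -> None:
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


async def keepalive(ticks: list, interval: float, count: int) -> None:
    """Stand-in for one platform's ping task."""
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def stalled_adapter(duration: float) -> None:
    """Stand-in for another platform stalling its own loop."""
    time.sleep(duration)


def demo() -> float:
    """Stall one loop and return the largest keepalive gap seen on the other."""
    loop_a = LoopThread("platform-a-loop")
    loop_b = LoopThread("platform-b-loop")
    ticks: list = []
    ka = loop_b.submit(keepalive(ticks, 0.05, 8))
    stall = loop_a.submit(stalled_adapter(0.3))
    ka.result(timeout=5)
    stall.result(timeout=5)
    loop_a.stop()
    loop_b.stop()
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("largest keepalive gap on the unaffected loop:", demo())
```

Two caveats apply to this design. asyncio objects are not thread-safe, so any handoff between loops (for example, delivering an inbound message to the agent) has to go through `run_coroutine_threadsafe` or `loop.call_soon_threadsafe`. Separate threads also still share the GIL, so long CPU-bound Python code in one thread can slow the others, although blocking I/O and `time.sleep` release it.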

Option B: Run agent processing in a thread pool

Move the agent's main processing loop (_handle_message_with_agent) to loop.run_in_executor() so agent processing doesn't block the main event loop that services WebSocket connections.
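Option B might look like the sketch below. It assumes the expensive part of agent processing can be expressed as a synchronous function. `run_agent_blocking` and `handle_message` are hypothetical stand-ins, not the real `_handle_message_with_agent`. Note that `loop.run_in_executor()` expects a plain callable: if the real handler is a coroutine function (the issue does not say), only its blocking portions could be moved this way.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def run_agent_blocking(text: str) -> str:
    """Hypothetical synchronous agent call (LLM request, tool execution)."""
    time.sleep(0.3)
    return f"reply to {text!r}"


async def handle_message(text: str, executor: ThreadPoolExecutor) -> str:
    """Offload the blocking work so the event loop stays free for keepalives."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, run_agent_blocking, text)


async def demo():
    """Handle one message and count keepalive ticks that fired meanwhile."""
    ticks = 0

    async def keepalive() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.05)
            ticks += 1

    with ThreadPoolExecutor(max_workers=4, thread_name_prefix="agent") as executor:
        ka = asyncio.create_task(keepalive())
        reply = await handle_message("hello", executor)
        ka.cancel()
        try:
            await ka
        except asyncio.CancelledError:
            pass
    return reply, ticks


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

On Python 3.9 and later, `asyncio.to_thread(run_agent_blocking, text)` is a shorter equivalent that uses the loop's default executor. Either way, the agent code then runs off the loop thread, so any state it shares with the adapters needs to be accessed in a thread-safe way.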

Option C: Per-platform dedicated heartbeat thread

At minimum, ensure each platform's heartbeat/keepalive mechanism runs in a way that is isolated from the main processing loop, perhaps using a dedicated thread per platform.
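One possible shape for Option C is a small thread that calls a ping function on a fixed interval, independent of whatever the main loop is doing. `HeartbeatThread` and the `send_ping` callable are hypothetical. The sketch assumes the ping can be sent through a thread-safe synchronous call, which the issue does not confirm for any of the platform SDKs.

```python
import threading
import time
from typing import Callable


class HeartbeatThread(threading.Thread):
    """Calls send_ping every `interval` seconds until stop() is called."""

    def __init__(self, send_ping: Callable[[], None], interval: float,
                 name: str = "heartbeat") -> None:
        super().__init__(name=name, daemon=True)
        self._send_ping = send_ping
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Event.wait() returns False on timeout, True once stop() sets the event.
        while not self._stop_event.wait(self._interval):
            self._send_ping()

    def stop(self) -> None:
        self._stop_event.set()
        self.join()


def demo() -> int:
    """Keep the main thread busy and return how many pings still went out."""
    pings: list = []
    hb = HeartbeatThread(lambda: pings.append(time.monotonic()), interval=0.05)
    hb.start()
    time.sleep(0.3)  # stand-in for a busy main thread
    hb.stop()
    return len(pings)


if __name__ == "__main__":
    print("pings sent while the main thread was busy:", demo())
```

This only helps if the socket write itself does not depend on the blocked loop. If the WebSocket client is asyncio-based, the ping would have to be scheduled onto that loop (for example with `asyncio.run_coroutine_threadsafe`), and a stalled loop would delay it just the same.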

Current Workarounds (for affected users)

  1. Set busy_input_mode: queue in config.yaml to serialize incoming messages (prevents cascading from multiple concurrent messages)
  2. Reduce WeCom heartbeat interval from 30s to 15s (gateway/platforms/wecom.py: HEARTBEAT_INTERVAL_SECONDS = 15)
  3. Configure Feishu ping interval explicitly via extra.ws_ping_interval: 10 in config.yaml

These workarounds only reduce how often disconnections occur; they do not eliminate the root cause.

Additional Context

This issue manifests as correlated disconnections across platforms - timestamps within 1-11 seconds of each other. For example, when a user sends a message on WeCom at T+0s, Feishu's WebSocket drops at T+1s (keepalive ping timeout), and WeCom may also drop at T+11s as it tries to recover.
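To check whether disconnects in a gateway log follow this pattern, the events can be grouped by time proximity. The helper below is a rough sketch: the `Disconnect` record and `correlated_groups` function are hypothetical, the default 11-second window is taken from the range quoted above, and the timestamps would have to be parsed from the real log format, which is not shown in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disconnect:
    platform: str
    at: float  # seconds since an arbitrary reference point


def correlated_groups(events, window: float = 11.0):
    """Chain disconnects that occur within `window` seconds of the previous one,
    then keep only the chains that involve more than one platform."""
    groups = []
    current = []
    for event in sorted(events, key=lambda e: e.at):
        if current and event.at - current[-1].at > window:
            groups.append(current)
            current = []
        current.append(event)
    if current:
        groups.append(current)
    return [g for g in groups if len({e.platform for e in g}) > 1]


if __name__ == "__main__":
    # First two events mirror the example above; the third is an illustrative
    # isolated drop much later.
    events = [
        Disconnect("feishu", 1.0),
        Disconnect("wecom", 11.0),
        Disconnect("feishu", 1500.0),
    ]
    for group in correlated_groups(events):
        print([(e.platform, e.at) for e in group])
```

Groups that span several platforms point at a shared cause such as a stalled event loop, while isolated single-platform drops are more likely ordinary network or server-side timeouts.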

The Feishu connection (made through the lark_oapi SDK) appears to be subject to a server-side session timeout of approximately 19-29 minutes, which exacerbates the issue: even idle connections eventually drop, and the reconnect phase briefly blocks the event loop.
