hermes: [Bug] Feishu WebSocket disconnect leaves gateway as zombie process (cron ticker stops, no auto-reconnect) [1 comment, 2 participants]

NousResearch/hermes-agent#23491 (fetched 2026-05-11 03:29:15)
Bug Description

The Feishu gateway process remains alive after the WebSocket connection silently drops, but becomes completely unresponsive: no new messages are received, the cron ticker stops firing, and no log output is produced. The process looks healthy (ps shows it running) but is functionally dead. It is a "zombie" in this informal sense, not a defunct process in the Unix sense.

This is not caused by macOS sleep. The machine stays awake, but the Feishu WebSocket connection drops (likely due to network fluctuation, VPN reconnection, or server-side timeout) and the gateway never recovers.

Steps to Reproduce

  1. Start the gateway with Feishu in WebSocket mode: "hermes gateway"
  2. Verify that the Feishu WS connects: the logs show "[Feishu] Connected in websocket mode"
  3. Leave gateway running overnight (~8+ hours)
  4. At some point the Feishu WS connection drops silently
  5. No further log output appears after the disconnect
  6. Cron jobs scheduled for 09:00 do not fire
  7. No inbound messages are received from Feishu
  8. Process is still alive (ps aux | grep gateway) but has no active WS connections

Expected Behavior

  • Gateway should detect WebSocket disconnect and automatically reconnect
  • Cron ticker should continue operating independently of WS state
  • If reconnect fails after retries, the gateway should log the failure clearly and either restart or exit with a non-zero status
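
The expected behavior above can be sketched as a bounded reconnect loop. This is a minimal, stdlib-only illustration, not the adapter's actual code: `listen_once`, `run_with_reconnect`, `ReconnectFailed`, and the logger name are hypothetical, and `listen_once` stands in for whatever coroutine connects to Feishu and reads until the connection ends.

```python
import asyncio
import logging
from typing import Awaitable, Callable

log = logging.getLogger("gateway.feishu.reconnect")  # hypothetical logger name


class ReconnectFailed(RuntimeError):
    """Raised when every reconnect attempt has been used up."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


async def run_with_reconnect(
    listen_once: Callable[[], Awaitable[None]],
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> None:
    """Call `listen_once` in a loop, reconnecting when it returns or raises.

    Consecutive failures are counted; after `max_attempts` of them the loop
    logs the failure and raises ReconnectFailed instead of going silent.
    """
    failures = 0
    while True:
        try:
            await listen_once()
            # A clean return means the connection was up and then closed.
            log.warning("WebSocket closed; reconnecting")
            failures = 0
            delay = base
        except Exception as exc:  # CancelledError is not caught here
            failures += 1
            if failures >= max_attempts:
                log.error("giving up after %d failed attempts: %s", failures, exc)
                raise ReconnectFailed(str(exc)) from exc
            delay = backoff_delay(failures - 1, base, cap)
            log.warning("WebSocket error (%s); retry %d in %.1fs", exc, failures, delay)
        await sleep(delay)
```

A caller could catch `ReconnectFailed` at the top level and exit with a non-zero status so that a process supervisor restarts the gateway. Adding random jitter to the delay is a common refinement that is left out here to keep the sketch deterministic.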

Actual Behavior

  • Gateway process stays alive but becomes fully unresponsive
  • No error or warning is logged when the WS drops
  • Cron ticker silently stops (no tick logs after the disconnect)
  • No live TCP connection to Feishu remains: the 2 connections to Feishu servers are in the CLOSED state, and the only ESTABLISHED connection is unrelated
  • Killing and restarting the process is the only recovery method

Evidence from Logs

Last normal activity (2026-05-11 04:04:09):

2026-05-11 04:04:09,357 INFO gateway.run: Session expiry: 3 sessions to finalize (feishu:3)
2026-05-11 04:04:09,367 INFO gateway.run: Session expiry done: 3 finalized

No further logs after 04:04 — the process ran for another ~30 hours with zero log output.

TCP connections at time of discovery:

  • 1 ESTABLISHED (unrelated)
  • 2 CLOSED (to Feishu WS servers — connection dropped, never reconnected)

Cron ticker log shows gap:

2026-05-09 19:47:44 Cron ticker started
                           ← 30+ hours of silence, ticker stopped
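
A gap like this can be detected mechanically instead of by eye. The sketch below is a stdlib-only helper that assumes the two timestamp layouts seen in the log excerpts above; the function names and the staleness threshold are hypothetical and not part of Hermes.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

# (prefix length, strptime format) for the two layouts quoted in the report:
# "2026-05-11 04:04:09,357 ..." and "2026-05-09 19:47:44 ...".
_LAYOUTS = (
    (23, "%Y-%m-%d %H:%M:%S,%f"),
    (19, "%Y-%m-%d %H:%M:%S"),
)


def parse_log_time(line: str) -> Optional[datetime]:
    """Return the timestamp at the start of a log line, or None if absent."""
    for length, fmt in _LAYOUTS:
        try:
            return datetime.strptime(line[:length], fmt)
        except ValueError:
            continue
    return None


def silence_since_last_log(lines: Iterable[str], now: datetime) -> Optional[timedelta]:
    """Time elapsed between the newest timestamped line and `now`."""
    stamps = [t for t in map(parse_log_time, lines) if t is not None]
    return now - max(stamps) if stamps else None


def is_stale(lines: Iterable[str], now: datetime, limit: timedelta) -> bool:
    """True when the log has been silent for longer than `limit`."""
    silence = silence_since_last_log(lines, now)
    return silence is not None and silence > limit
```

An external check (for example a periodic job outside the gateway process) could call `is_stale` on the tail of the log file and raise an alert or restart the gateway. It has to run outside the gateway, since the gateway's own scheduler is what stopped.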

Affected Component

  • Gateway (Feishu adapter, WebSocket lifecycle)
  • Agent Core (cron scheduler dependency on event loop)

Messaging Platform

Feishu (WebSocket mode)

Debug Report

Environment

  • OS: macOS 15.4 (Darwin 25.4.0 arm64, Apple Silicon)
  • Python: 3.11.15
  • Hermes Version: 0.13.0 (2026.5.7) [commit 96dc2726]
  • Feishu connection mode: WebSocket (FEISHU_CONNECTION_MODE=websocket)
  • Provider: zai (GLM-5.1)

Root Cause Analysis

The Feishu WebSocket adapter uses aiohttp for the WS connection. When the connection drops (network timeout, VPN switch, server-side close), the event loop appears to hang rather than raising an exception that the reconnect handler can catch.

Key observations:

  1. The "Cron ticker started" log appears only once, at startup. There is no "Cron ticker stopped" log before the silence, which suggests that the event loop itself froze rather than stopping gracefully
  2. The v0.13.0 release notes mention session auto-recovery after gateway restart, but this only covers process restart, not mid-run WS reconnection
  3. The gateway/platforms/feishu.py WebSocket handler likely needs a heartbeat/keepalive mechanism and an explicit reconnect loop with backoff
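
One way to make a silent drop surface as an exception is to bound every receive with an idle timeout. The sketch below is stdlib-only and hypothetical: `receive` stands in for whatever coroutine the adapter uses to read one frame, and `ConnectionIdle` and `receive_or_fail` are names invented for illustration. If the adapter does use aiohttp as stated above, `ClientSession.ws_connect` also accepts a `heartbeat` argument for ping/pong keepalive; whether the Feishu adapter sets it is not stated in this report.

```python
import asyncio
from typing import Any, Awaitable, Callable


class ConnectionIdle(ConnectionError):
    """No frame arrived within the allowed idle window."""


async def receive_or_fail(
    receive: Callable[[], Awaitable[Any]],
    idle_timeout: float,
) -> Any:
    """Await one message; raise ConnectionIdle if none arrives in time.

    Turning prolonged silence into an exception gives a reconnect handler
    something to catch, instead of awaiting a dead socket forever.
    """
    try:
        return await asyncio.wait_for(receive(), timeout=idle_timeout)
    except asyncio.TimeoutError:
        raise ConnectionIdle(f"no data for {idle_timeout}s") from None
```

The idle window has to be longer than the longest quiet period expected on a healthy connection, so this works best when the server or the client sends periodic pings. It also only helps when the loop is still running; it cannot fire if the event loop itself is blocked.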

Proposed Fix

  1. Add WS health monitoring: Periodic heartbeat check in the Feishu WS handler (ping/pong or a lightweight API call)
  2. Auto-reconnect on disconnect: Wrap the WS listener in a reconnect loop with exponential backoff, similar to how Discord handles reconnections
  3. Decouple cron ticker from WS: The cron ticker should operate independently of any platform WS state — it currently appears to freeze when the event loop stalls
  4. Watchdog timeout: If no log output is produced for N minutes, trigger an automatic gateway restart
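
Items 3 and 4 both depend on noticing that the event loop has stalled. A rough sketch of one approach is a watchdog that runs in a separate thread and checks a timestamp that only the loop can refresh. `LoopWatchdog` and its parameters are hypothetical, and the report does not establish whether the loop is actually blocked, so treat this as an illustration of the idea rather than a fix for the confirmed cause.

```python
import asyncio
import threading
import time
from typing import Callable


class LoopWatchdog:
    """Detect a stalled asyncio loop from a thread outside the loop."""

    def __init__(
        self,
        limit: float,
        on_stall: Callable[[float], None],
        check_every: float = 1.0,
    ) -> None:
        self.limit = limit              # seconds without a beat before firing
        self.on_stall = on_stall        # called once with the observed age
        self.check_every = check_every  # polling period of the watcher thread
        self._last_beat = time.monotonic()
        self._stop = threading.Event()

    async def beat_forever(self, interval: float) -> None:
        """Run as a task inside the loop; beats only while the loop runs."""
        while True:
            self._last_beat = time.monotonic()
            await asyncio.sleep(interval)

    def _watch(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self.check_every):
            age = time.monotonic() - self._last_beat
            if age > self.limit:
                self.on_stall(age)
                return

    def start(self) -> threading.Thread:
        thread = threading.Thread(target=self._watch, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()
```

In a gateway, `on_stall` might log the stall and terminate the process so that a supervisor restarts it. Because the watcher is a thread, it keeps running when a synchronous call blocks the loop, which an asyncio-based ticker cannot do.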

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR
