hermes - ✅ (Solved) Fix gateway/feishu: WS thread exit (ping timeout) is silently swallowed, should trigger in-process reconnect [1 pull request, 1 participant]

GitHub stats
NousResearch/hermes-agent#24807 (fetched 2026-05-14 03:51:38)
Comments: 0
Participants: 1
Timeline events: 5
Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×1

When the Feishu WebSocket connection drops (e.g. keepalive ping timeout), lark_oapi's internal reconnect may eventually give up and let the exception escape. In feishu.py, _run_official_feishu_ws_client wraps ws_client.start() in a bare except Exception: pass, silently swallowing the error. No fatal error is raised, so the adapter stays registered but goes deaf — no new events arrive, and the gateway doesn't know.

The current workaround relies on systemd restarting the entire gateway process (exit code 75 / TEMPFAIL), but this kills any in-flight agent tasks and triggers the gateway drain timeout.

Error Message

  • gateway/platforms/feishu.py _run_official_feishu_ws_client (~line 1282) — exception swallowed with bare except Exception: pass
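
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the hermes-agent code: start_ws_client is a hypothetical stand-in for ws_client.start(), and the thread is run through a standard ThreadPoolExecutor as an assumption about how the WS thread is hosted. Because the exception is discarded inside the worker, the future completes normally and carries no error.

```python
from concurrent.futures import ThreadPoolExecutor


def start_ws_client():
    # Hypothetical stand-in for ws_client.start(): the library's own
    # reconnect loop has given up and the error escapes.
    raise ConnectionError("keepalive ping timeout")


def run_ws_client_swallowing():
    # Mirrors the pattern described in the issue: the error is discarded.
    try:
        start_ws_client()
    except Exception:
        pass


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(run_ws_client_swallowing)
    future.result()  # returns None; nothing is raised

# The worker has exited, yet the future reports no failure.
thread_exited = future.done()
reported_error = future.exception()
```

Anything that only inspects the future for an exception sees a clean exit, which is why the adapter looks healthy while receiving no events.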

Root Cause

Two gaps combine. In gateway/platforms/feishu.py, _run_official_feishu_ws_client wraps ws_client.start() in a bare except Exception: pass, so an exception that escapes lark_oapi's internal reconnect (e.g. after a keepalive ping timeout) is discarded. In the same file, _connect_websocket creates _ws_future with no done-callback, so nothing observes the WS thread exiting. No fatal error is raised, the adapter stays registered but goes deaf, and the reconnect infrastructure that already exists in gateway/run.py is never triggered.

PR fix notes

PR #24813: fix(gateway/feishu): in-process WS reconnect + fallback send strips thread_id

Description (problem / solution / changelog)

Closes #24807, closes #24808

Problems

1. WS thread exit is silently swallowed (no in-process reconnect)

When the Feishu WebSocket thread exits unexpectedly (keepalive ping timeout escaping lark_oapi's internal reconnect loop), the exception was caught by a bare except Exception: pass in _run_official_feishu_ws_client. No fatal error was raised, so the adapter stayed registered but went deaf. Recovery required a full gateway process restart via systemd (exit 75 / TEMPFAIL), which killed any in-flight agent tasks and triggered the 60s drain timeout.

2. Fallback send fails with [99992402] field validation failed

_send_with_retry's plain-text fallback passed the original metadata unchanged. If metadata contained a stale thread_id, Feishu returned [99992402] field validation failed, making the fallback fail too.

Fixes

1. Add a done-callback on _ws_future in _connect_websocket. When the WS thread exits while the adapter is still running, the callback calls _set_fatal_error(retryable=True) and schedules _notify_fatal_error(). This wires into the existing _failed_platforms reconnect infrastructure in run.py (exponential backoff, up to _MAX_ATTEMPTS) — reconnect happens in-process without restarting the gateway.

2. Before the fallback send() call, build a copy of metadata with thread_id and reply_to_message_id stripped, so the fallback degrades to a plain top-level chat message.
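
The second fix can be sketched as a small helper. This is an illustration under assumptions: strip_threading_metadata is a hypothetical name, the chat_type key and the ID values are made up for the example, and the actual change in base.py may be written inline rather than as a function.

```python
from typing import Any, Mapping, Optional

# Keys that tie a message to a thread or to a parent message.
_THREADING_KEYS = ("thread_id", "reply_to_message_id")


def strip_threading_metadata(metadata: Optional[Mapping[str, Any]]) -> dict:
    """Return a copy of metadata without thread/reply keys (hypothetical helper)."""
    if not metadata:
        return {}
    return {k: v for k, v in metadata.items() if k not in _THREADING_KEYS}


# Example values are invented for illustration only.
original = {
    "thread_id": "omt_stale",
    "reply_to_message_id": "om_123",
    "chat_type": "group",
}
fallback_metadata = strip_threading_metadata(original)
```

Building a copy rather than deleting keys in place leaves the caller's metadata untouched, so a later retry of the original send still sees the thread information.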

Changes

  • gateway/platforms/feishu.py: add _on_ws_thread_done done-callback on _ws_future
  • gateway/platforms/base.py: strip thread_id/reply_to_message_id from metadata before fallback send

Changed files

  • gateway/platforms/base.py (modified, +11/-2)
  • gateway/platforms/feishu.py (modified, +13/-0)

Expected behavior

When _ws_future resolves unexpectedly (WS thread died), a done-callback should call self._set_fatal_error(..., retryable=True) and schedule _notify_fatal_error(). The existing _failed_platforms reconnect infrastructure in run.py (with exponential backoff, up to _MAX_ATTEMPTS) would then handle reconnecting in-process without restarting the gateway process.
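
A minimal sketch of the done-callback wiring follows. It is not the adapter's real code: AdapterSketch is a hypothetical stand-in, the signatures of _set_fatal_error and _notify_fatal_error are assumed, _ws_future is modelled as a concurrent.futures.Future, and the notification is called directly instead of being scheduled on an event loop.

```python
from concurrent.futures import Future


class AdapterSketch:
    """Hypothetical stand-in for the Feishu adapter; not the real class."""

    def __init__(self):
        self._running = True
        self.fatal_errors = []  # records (message, retryable) pairs
        self.notified = 0

    def _set_fatal_error(self, message, retryable):
        # Assumed signature; the source only shows retryable=True.
        self.fatal_errors.append((message, retryable))

    def _notify_fatal_error(self):
        # Called directly here to keep the sketch synchronous.
        self.notified += 1

    def _on_ws_thread_done(self, future):
        if not self._running:
            return  # expected exit during a normal disconnect
        exc = None if future.cancelled() else future.exception()
        reason = repr(exc) if exc else "exited without error"
        self._set_fatal_error(f"Feishu WS thread ended: {reason}", retryable=True)
        self._notify_fatal_error()

    def connect(self, ws_future):
        self._ws_future = ws_future
        ws_future.add_done_callback(self._on_ws_thread_done)


adapter = AdapterSketch()
ws_future = Future()
adapter.connect(ws_future)
ws_future.set_result(None)  # simulate the WS thread returning silently
```

The callback treats any completion while the adapter is still running as fatal, including a clean return, because the swallowed exception means the future never carries the original error.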

Relevant code

  • gateway/platforms/feishu.py _run_official_feishu_ws_client (~line 1282) — exception swallowed with bare except Exception: pass
  • gateway/platforms/feishu.py _connect_websocket (~line 4400) — _ws_future created with no done-callback
  • gateway/run.py _handle_adapter_fatal_error + _failed_platforms reconnect loop — infrastructure already exists, just not wired to WS thread exit
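
For readers unfamiliar with the reconnect pattern named in the last bullet, here is a generic sketch of retry with exponential backoff. It does not reproduce run.py: reconnect_with_backoff and flaky_connect are hypothetical, and the attempt limit and base delay are assumed values, since the source names _MAX_ATTEMPTS without giving a number.

```python
import asyncio

MAX_ATTEMPTS = 5    # assumed; the source does not state the real limit
BASE_DELAY = 0.01   # seconds; assumed, kept tiny so the sketch runs quickly


async def reconnect_with_backoff(connect, max_attempts=MAX_ATTEMPTS,
                                 base_delay=BASE_DELAY):
    """Retry connect() with doubling delays.

    Returns the attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await connect()
            return attempt
        except Exception:
            if attempt == max_attempts:
                return None
            # Delay doubles after each failure: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


calls = {"n": 0}


async def flaky_connect():
    # Hypothetical platform connect: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("still down")


succeeded_on = asyncio.run(reconnect_with_backoff(flaky_connect))
```

Because the retry runs inside the gateway's own process, other adapters and in-flight agent tasks keep running while the failed platform is reconnected.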
