hermes - ✅(Solved) Fix gateway/feishu: WS thread exit (ping timeout) is silently swallowed, should trigger in-process reconnect [1 pull requests, 1 participants]

Q: Expected behavior

When `_ws_future` resolves unexpectedly (WS thread died), a done-callback should call `self._set_fatal_error(..., retryable=True)` and schedule `_notify_fatal_error()`. The existing `_failed_platforms` reconnect infrastructure in `run.py` (with exponential backoff, up to `_MAX_ATTEMPTS`) would then handle reconnecting in-process without restarting the gateway process.
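The wiring described above could look roughly like the following. This is a minimal sketch, not the project's code: `SketchAdapter`, `connect`, `dying_ws_start`, and the `_running` flag are hypothetical stand-ins, and it assumes `_ws_future` is an asyncio future obtained from `loop.run_in_executor`. Only the names `_ws_future`, `_set_fatal_error`, `_notify_fatal_error`, and `_on_ws_thread_done` come from the issue and PR text.

```python
import asyncio


class SketchAdapter:
    """Hypothetical stand-in for the Feishu adapter; not the real class."""

    def __init__(self):
        self._running = True
        self.fatal_error = None   # (message, retryable) once set
        self.notified = False
        self._notify_task = None

    def _set_fatal_error(self, message, retryable):
        self.fatal_error = (message, retryable)

    async def _notify_fatal_error(self):
        # The real method would hand the error to the gateway; here we only record it.
        self.notified = True

    def _on_ws_thread_done(self, future):
        # During a clean shutdown _running is already False, so the exit is expected.
        if not self._running:
            return
        exc = None if future.cancelled() else future.exception()
        self._set_fatal_error(f"WS thread exited unexpectedly: {exc!r}", retryable=True)
        # Done-callbacks of asyncio futures run on the event loop,
        # so it is safe to schedule a coroutine from here.
        loop = asyncio.get_running_loop()
        self._notify_task = loop.create_task(self._notify_fatal_error())

    async def connect(self, blocking_ws_start):
        loop = asyncio.get_running_loop()
        # Run the blocking WS client in a worker thread and watch for its exit.
        self._ws_future = loop.run_in_executor(None, blocking_ws_start)
        self._ws_future.add_done_callback(self._on_ws_thread_done)


def dying_ws_start():
    # Simulates a blocking start() call that returns after the connection is lost.
    return None


async def demo():
    adapter = SketchAdapter()
    await adapter.connect(dying_ws_start)
    await adapter._ws_future      # the worker thread has exited
    await asyncio.sleep(0.05)     # give the scheduled notification time to run
    return adapter
```

Note that the callback does not rely on the future carrying an exception: the thread finishing at all while the adapter is still meant to be running is treated as the failure signal.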

hermes · 2026-05-13 04:18:16


GitHub stats

NousResearch/hermes-agent#24807•Fetched 2026-05-14 03:51:38


Author

waynehuu

Participants

waynehuu

Timeline (top)

labeled ×4, cross-referenced ×1

When the Feishu WebSocket connection drops (e.g. keepalive ping timeout), `lark_oapi`'s internal reconnect may eventually give up and let the exception escape. In `feishu.py`, `_run_official_feishu_ws_client` wraps `ws_client.start()` in a bare `except Exception: pass`, silently swallowing the error. No fatal error is raised, so the adapter stays registered but goes deaf: no new events arrive, and the gateway doesn't know.
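To see why the failure is invisible, here is a small self-contained illustration using only the standard library. The function names (`start_ws`, `run_swallowing`, `run_propagating`) are hypothetical and only mimic the shape of the pattern described above; the real code calls into `lark_oapi`.

```python
from concurrent.futures import ThreadPoolExecutor


def start_ws():
    # Hypothetical stand-in for a blocking WS client whose reconnect gave up.
    raise ConnectionError("keepalive ping timeout")


def run_swallowing():
    try:
        start_ws()
    except Exception:
        pass  # the failure disappears here; the thread just returns


def run_propagating():
    start_ws()  # the exception is stored on the future instead


with ThreadPoolExecutor(max_workers=1) as pool:
    swallowed = pool.submit(run_swallowing)
    propagated = pool.submit(run_propagating)

# Leaving the with-block waits for both jobs, so both futures are done here.
print("swallowing wrapper:", swallowed.exception())
print("propagating wrapper:", repr(propagated.exception()))
```

With the swallowing wrapper, the future completes "successfully", so nothing that inspects the result can tell the connection is gone. The only remaining signal is that the thread finished at all.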

The current workaround relies on systemd to restart the entire gateway process (exit code 75 / TEMPFAIL), but this kills any in-flight agent tasks and triggers the gateway drain timeout.

Error Message

`gateway/platforms/feishu.py`, `_run_official_feishu_ws_client` (~line 1282): exception swallowed with a bare `except Exception: pass`

Root Cause

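`_run_official_feishu_ws_client` swallows the exception from `ws_client.start()` with a bare `except Exception: pass`, and `_ws_future` in `_connect_websocket` is created with no done-callback. The WS thread can therefore exit without any fatal error being raised, so the existing reconnect infrastructure in `run.py` is never triggered.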

Fix / Workaround

Workaround: let systemd restart the whole gateway process (exit code 75 / TEMPFAIL), at the cost of killing in-flight agent tasks and hitting the gateway drain timeout. Fix: PR #24813 adds a done-callback on `_ws_future` so that the reconnect happens in-process.

PR fix notes

PR #24813: fix(gateway/feishu): in-process WS reconnect + fallback send strips thread_id

Repository: NousResearch/hermes-agent
Author: waynehuu
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/24813

Description (problem / solution / changelog)

Closes #24807, closes #24808

Problems

1. WS thread exit is silently swallowed (no in-process reconnect)

When the Feishu WebSocket thread exits unexpectedly (keepalive ping timeout escaping `lark_oapi`'s internal reconnect loop), the exception was caught by a bare `except Exception: pass` in `_run_official_feishu_ws_client`. No fatal error was raised, so the adapter stayed registered but went deaf. Recovery required a full gateway process restart via systemd (exit 75 / TEMPFAIL), which killed any in-flight agent tasks and triggered the 60s drain timeout.

2. Fallback send fails with [99992402] field validation failed

The plain-text fallback in `_send_with_retry` passed the original `metadata` unchanged. If `metadata` contained a stale `thread_id`, Feishu returned `[99992402] field validation failed`, making the fallback fail too.

Fixes

1. Add a done-callback on `_ws_future` in `_connect_websocket`. When the WS thread exits while the adapter is still running, the callback calls `_set_fatal_error(retryable=True)` and schedules `_notify_fatal_error()`. This wires into the existing `_failed_platforms` reconnect infrastructure in `run.py` (exponential backoff, up to `_MAX_ATTEMPTS`), so the reconnect happens in-process without restarting the gateway.

2. Before the fallback `send()` call, build a copy of `metadata` with `thread_id` and `reply_to_message_id` stripped, so the fallback degrades to a plain top-level chat message.
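The second fix can be sketched as a small helper. The function name `strip_thread_fields` is hypothetical and the PR may structure this differently; the sketch only assumes `metadata` is a dict (or `None`) and that the two keys named above are the ones to drop.

```python
def strip_thread_fields(metadata):
    """Return a copy of metadata without thread/reply routing keys.

    Hypothetical helper: the original dict is left untouched so the caller
    can still use it for logging or later retries.
    """
    if not metadata:
        return metadata
    dropped = ("thread_id", "reply_to_message_id")
    return {key: value for key, value in metadata.items() if key not in dropped}


original = {"thread_id": "stale-thread", "reply_to_message_id": "msg-1", "chat_type": "group"}
fallback_metadata = strip_thread_fields(original)
print(fallback_metadata)
```

Building a copy rather than deleting keys in place keeps the first send attempt and the fallback independent of each other.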

Changes

`gateway/platforms/feishu.py`: add `_on_ws_thread_done` done-callback on `_ws_future`
`gateway/platforms/base.py`: strip `thread_id`/`reply_to_message_id` from `metadata` before fallback send

Changed files

gateway/platforms/base.py (modified, +11/-2)
gateway/platforms/feishu.py (modified, +13/-0)

Relevant code

`gateway/platforms/feishu.py`, `_run_official_feishu_ws_client` (~line 1282): exception swallowed with a bare `except Exception: pass`
`gateway/platforms/feishu.py`, `_connect_websocket` (~line 4400): `_ws_future` created with no done-callback
`gateway/run.py`, `_handle_adapter_fatal_error` + `_failed_platforms` reconnect loop: infrastructure already exists, just not wired to WS thread exit
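For readers unfamiliar with the pattern, a reconnect loop with exponential backoff and an attempt cap generally looks like the sketch below. This is not the code in `run.py`: the function name, the parameters (`max_attempts`, `base_delay`, `max_delay`), and their default values are all assumptions chosen for illustration.

```python
import asyncio


async def reconnect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                                 max_delay=60.0, sleep=asyncio.sleep):
    """Call connect() until it succeeds or max_attempts is reached.

    Hypothetical sketch: returns the attempt number that succeeded and
    re-raises the last error once the attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await connect()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base, 2*base, 4*base, ... capped at max_delay.
            await sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


async def demo():
    delays = []
    state = {"calls": 0}

    async def fake_sleep(seconds):
        delays.append(seconds)  # record instead of actually waiting

    async def flaky_connect():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("still down")

    attempt = await reconnect_with_backoff(flaky_connect, sleep=fake_sleep)
    return attempt, delays
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting; a production loop would normally add jitter as well.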
