hermes - ✅ (Solved) Adapter fatal-error notify is opt-in, leading to gateway zombies if a plugin skips it [1 pull request, 1 participant]

GitHub stats

NousResearch/hermes-agent#28919 · Fetched 2026-05-20 04:01:07

Comments: 0 · Participants: 1 · Timeline events: 4 · Reactions: 0

Timeline (top): labeled ×3, cross-referenced ×1

BasePlatformAdapter._set_fatal_error() updates the runtime status and flips _running to False, but it does not call _notify_fatal_error(). The notify-the-gateway step is left up to each adapter author, who has to remember to do it from every mid-life failure path.

If an adapter sets a fatal error without notifying — or if its disconnect handler short-circuits on is_connected (== self._running) after _set_fatal_error already flipped _running to False — the gateway never sees the failure, _failed_platforms never gets an entry, _platform_reconnect_watcher does nothing, the python process stays alive (systemd Restart= doesn't fire), and you get a fully running gateway whose only messaging platform is dead.

Fix / Workaround

On the adapter side, the reporter fixed both bugs in their own plugin: they dropped the failed_auth handler (the 45 s session_start timeout in connect() already covers real auth failures cleanly) and re-gated _on_disconnected on a _session_established flag instead of _running. (patch)

PR fix notes

PR #28930: fix(gateway): auto-notify fatal errors from _set_fatal_error

Description (problem / solution / changelog)

Summary

_set_fatal_error() flips _running=False and writes gateway_state.json, but does not call _notify_fatal_error(). If an adapter's disconnect handler short-circuits on is_connected after _set_fatal_error already set it to False, the gateway never learns about the failure — no retry, no log entry, zombie gateway.

Fix: auto-schedule _notify_fatal_error() from inside _set_fatal_error() so the two-step API collapses into a single call. Adapters cannot forget the second step.

Related Issue

Fixes #28919

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/platforms/base.py — added _schedule_fatal_notify() helper called at the end of _set_fatal_error(). Uses asyncio.get_running_loop().create_task() wrapped in try/except RuntimeError for sync init paths.
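A minimal sketch of what such a helper could look like, assuming a stripped-down stand-in class rather than the real `BasePlatformAdapter`. The class name `SketchAdapter` and the `_notify_task` attribute are hypothetical; the actual 14-line change in `gateway/platforms/base.py` may differ in detail.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SketchAdapter:
    """Hypothetical stand-in for BasePlatformAdapter; not the real class."""

    def __init__(self) -> None:
        self._running = True
        self._fatal_error_code: Optional[str] = None
        self._fatal_error_handler: Optional[
            Callable[["SketchAdapter"], Awaitable[None]]
        ] = None
        # Assumption: keep a reference so the task is not garbage-collected early.
        self._notify_task: Optional["asyncio.Task[None]"] = None

    async def _notify_fatal_error(self) -> None:
        if self._fatal_error_handler is not None:
            await self._fatal_error_handler(self)

    def _schedule_fatal_notify(self) -> None:
        """Schedule the notify step if a handler exists and a loop is running."""
        if self._fatal_error_handler is None:
            return
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            return  # no running loop, e.g. a synchronous init path
        self._notify_task = loop.create_task(self._notify_fatal_error())

    def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
        self._running = False
        self._fatal_error_code = code
        self._schedule_fatal_notify()  # the second step can no longer be forgotten
```

Holding the task in an attribute is a design choice of this sketch: the event loop only keeps weak references to tasks, so a fire-and-forget `create_task()` call without a saved reference can in principle be collected before it runs.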

How to Test

  1. Unit test: create a mock adapter, call _set_fatal_error(), verify _notify_fatal_error() is invoked (handler called).
  2. Integration: simulate a plugin that skips notify — gateway should still detect the fatal error and trigger reconnect watcher.
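The unit test in step 1 could be sketched roughly as follows. `FakeAdapter` is a hypothetical minimal class that mimics the auto-notify behaviour described in the PR; a real test would instead subclass or instantiate the project's `BasePlatformAdapter`, whose constructor is not shown in this issue.

```python
import asyncio
from unittest.mock import AsyncMock


class FakeAdapter:
    """Hypothetical adapter that auto-schedules notify, as the PR describes."""

    def __init__(self, handler) -> None:
        self._running = True
        self._fatal_error_code = None
        self._fatal_error_handler = handler
        self._pending = None  # assumption: handle kept so the test can await it

    async def _notify_fatal_error(self) -> None:
        await self._fatal_error_handler(self)

    def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
        self._running = False
        self._fatal_error_code = code
        try:
            loop = asyncio.get_running_loop()
            self._pending = loop.create_task(self._notify_fatal_error())
        except RuntimeError:
            pass  # called outside a running loop


async def check_auto_notify() -> AsyncMock:
    handler = AsyncMock()
    adapter = FakeAdapter(handler)
    adapter._set_fatal_error("connection_lost", "peer closed", retryable=True)
    await adapter._pending  # let the scheduled notify run to completion
    handler.assert_awaited_once_with(adapter)
    return handler


def test_set_fatal_error_notifies_handler() -> None:
    handler = asyncio.run(check_auto_notify())
    assert handler.await_count == 1
```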

Changed files

  • gateway/platforms/base.py (modified, +14/-0)

Summary

BasePlatformAdapter._set_fatal_error() updates the runtime status and flips _running to False, but it does not call _notify_fatal_error(). The notify-the-gateway step is left up to each adapter author, who has to remember to do it from every mid-life failure path.

If an adapter sets a fatal error without notifying — or if its disconnect handler short-circuits on is_connected (== self._running) after _set_fatal_error already flipped _running to False — the gateway never sees the failure, _failed_platforms never gets an entry, _platform_reconnect_watcher does nothing, the python process stays alive (systemd Restart= doesn't fire), and you get a fully running gateway whose only messaging platform is dead.

How I hit it (slixmpp XMPP adapter)

  1. slixmpp fires failed_auth per SASL mechanism the server rejects (see slixmpp/features/feature_mechanisms/mechanisms.py:_handle_fail — it fires the event and then calls _send_auth() to try the next mechanism).
  2. ejabberd advertises SCRAM-SHA-512, SCRAM-SHA-256, PLAIN, ...; our adapter's _on_failed_auth runs on the rejected first mech, calls _set_fatal_error('auth_failed', retryable=False), which flips _running=False and writes `gateway_state.json` to `state=fatal, error_code=auth_failed`.
  3. SCRAM-SHA-512 then succeeds, session_start fires, _mark_connected() flips _running=True and writes `state=connected` ... but the next per-mechanism rejection (or, in our case, a later failed_auth arriving after gateway logged ✓ xmpp connected) re-poisons _running=False.
  4. Adapter keeps serving messages (sends/receives only check self._client, not _running) for ~17 hours.
  5. ejabberd restarts → slixmpp disconnected → adapter's _on_disconnected checks if self.is_connected: → sees _running=False → skips _set_fatal_error('connection_lost', retryable=True) + _notify_fatal_error().
  6. Gateway logs zero error lines. _failed_platforms empty. _platform_reconnect_watcher idle. systemd thinks everything is fine. Bridge silently dead.

Concrete timeline from our run (UTC):

| Time | Event |
|---|---|
| 23:46:55.941 | gateway.log: ✓ xmpp connected |
| 23:46:56.033 | gateway_state.json overwritten by adapter: state=fatal, error_code=auth_failed |
| next 17h | sends/receives flowing normally |
| 17:08:51 (next day) | ejabberd graceful shutdown; slixmpp disconnected fires; adapter handler bails on is_connected=False |
| next 2h 42m | zero log lines, zero retry attempts, until manual systemctl restart |

The two adapter-side bugs are ours to fix

Yes — and we fixed them: dropped the failed_auth handler (the 45 s session_start timeout in connect() already covers real auth failures cleanly), and re-gated _on_disconnected on a _session_established flag instead of _running. (patch)
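To illustrate the re-gating, here is a rough sketch that contrasts the old gate (on `is_connected`, i.e. `_running`) with a gate on a separate `_session_established` flag. `XmppAdapterSketch`, `_on_disconnected_old`, and the `reported` list are hypothetical names for illustration only; the reporter's actual patch is not reproduced here.

```python
from typing import List, Tuple


class XmppAdapterSketch:
    """Hypothetical stand-in for the reporter's adapter; only gating is modelled."""

    def __init__(self) -> None:
        self._running = False
        self._session_established = False
        self.reported: List[Tuple[str, bool]] = []  # (code, retryable) pairs

    @property
    def is_connected(self) -> bool:
        return self._running

    def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
        self._running = False
        self.reported.append((code, retryable))

    def _on_session_start(self) -> None:
        self._running = True
        self._session_established = True

    def _on_disconnected_old(self) -> None:
        # Old gate: skipped whenever something already flipped _running to False.
        if self.is_connected:
            self._set_fatal_error("connection_lost", "disconnected", retryable=True)

    def _on_disconnected(self) -> None:
        # New gate: fires once per established session, regardless of _running.
        if self._session_established:
            self._session_established = False
            self._set_fatal_error("connection_lost", "disconnected", retryable=True)


adapter = XmppAdapterSketch()
adapter._on_session_start()
adapter._running = False  # simulate the stray failed_auth poisoning the flag
adapter._on_disconnected_old()
reports_from_old_gate = list(adapter.reported)  # old gate stays silent
adapter._on_disconnected()
reports_from_new_gate = list(adapter.reported)  # new gate reports the loss
```

The point of the separate flag is that it is written only by the session lifecycle, so an unrelated code path that clears `_running` cannot suppress the disconnect report.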

What I'd ask upstream to change

The class invariant should make it impossible to cause this kind of bug silently. Two complementary asks, either of which would have surfaced the problem, plus an optional third as defense in depth:

1. Auto-notify on _set_fatal_error

Schedule `_notify_fatal_error()` from inside `_set_fatal_error` whenever a handler is registered, so adapters can't forget:

```python
def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
    self._running = False
    self._fatal_error_code = code
    self._fatal_error_message = message
    self._fatal_error_retryable = retryable
    self._write_runtime_status_safe(...)
    if self._fatal_error_handler is not None:
        try:
            asyncio.get_running_loop().create_task(self._notify_fatal_error())
        except RuntimeError:
            pass  # called outside loop (e.g. lock_conflict during sync init)
```

This collapses a two-step API (`set` + `notify`) into one, and means a future plugin can't leave the gateway in the dark by forgetting one half.

2. Periodic reconciliation in the gateway

Today the gateway only consults _failed_platforms for retries. Add a slow loop (every N s) that also walks self.adapters and queues any adapter where has_fatal_error is True or is_connected is False — basically "trust the adapter's own state more than its notify discipline."
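One way the reconciliation loop might look, as a sketch under stated assumptions: `AdapterState` is a hypothetical stand-in exposing only the two properties named above, and `_failed_platforms` is modelled as a plain set of platform names, which may not match the gateway's real data structure. The 30-second default interval is likewise arbitrary.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class AdapterState:
    """Hypothetical stand-in exposing the two properties named in the issue."""

    is_connected: bool = True
    has_fatal_error: bool = False


def reconcile_once(adapters: Dict[str, AdapterState], failed_platforms: Set[str]) -> Set[str]:
    """Queue every unhealthy adapter not already queued; return the new names."""
    newly_failed = {
        name
        for name, adapter in adapters.items()
        if (adapter.has_fatal_error or not adapter.is_connected)
        and name not in failed_platforms
    }
    failed_platforms |= newly_failed
    return newly_failed


async def reconcile_forever(
    adapters: Dict[str, AdapterState],
    failed_platforms: Set[str],
    interval_s: float = 30.0,
) -> None:
    """Slow background loop: trust adapter state over notify discipline."""
    while True:
        reconcile_once(adapters, failed_platforms)
        await asyncio.sleep(interval_s)
```

Splitting the single pass (`reconcile_once`) from the loop keeps the decision logic testable without running an event loop.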

3. (Defense in depth, optional) systemd watchdog

`Type=notify` + `WatchdogSec=120` with periodic `sd_notify(WATCHDOG=1)` gated on "at least one required platform connected" would have killed and restarted the zombie regardless of the in-process bug. There's already a Stale systemd unit detected warning surfacing the gateway's awareness of its unit config, so this seems within scope.
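The watchdog ping could be sketched with only the standard library, since the notify protocol is a datagram sent to the Unix socket named in the `NOTIFY_SOCKET` environment variable. The function names `sd_notify` and `pet_watchdog_if_healthy` below are this sketch's own (hypothetical), and how the gateway decides which platforms are "required" is an assumption left to the caller.

```python
import os
import socket
from typing import Iterable


def sd_notify(state: str) -> bool:
    """Send one notification datagram to systemd; False if not under systemd."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        target = b"\0" + addr[1:].encode()
    else:
        target = os.fsencode(addr)
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), target)
    return True


def pet_watchdog_if_healthy(required_connected: Iterable[bool]) -> bool:
    """Ping the watchdog only if at least one required platform is connected."""
    if any(required_connected):
        return sd_notify("WATCHDOG=1")
    return False  # stay silent so systemd restarts the unit after WatchdogSec
```

With `WatchdogSec=120`, calling `pet_watchdog_if_healthy` from a periodic task at roughly half that interval would leave headroom for a delayed ping; a process that uses `Type=notify` would also need to send `READY=1` once at startup.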

(1) is by far the simplest and has the highest leverage.

Environment

  • hermes-agent 0.13.0 (release v2026.5.7), Python 3.12.13
  • slixmpp 1.13.2
  • NixOS systemd unit (no Type=notify)
  • Plugin vendored locally; no upstream xmpp-platform in this repo

Happy to send a PR for (1) if it sounds reasonable.
