hermes - ✅(Solved) Fix Adapter fatal-error notify is opt-in, leading to gateway zombies if a plugin skips it [1 pull requests, 1 participants]

hermes 2026-05-19 20:31:20


GitHub stats

NousResearch/hermes-agent#28919•Fetched 2026-05-20 04:01:07


Author

andr-ec

Participants

andr-ec

Timeline (top)

labeled ×3, cross-referenced ×1

BasePlatformAdapter._set_fatal_error() updates the runtime status and flips _running to False, but it does not call _notify_fatal_error(). The notify-the-gateway step is left up to each adapter author, who has to remember to do it from every mid-life failure path.

If an adapter sets a fatal error without notifying — or if its disconnect handler short-circuits on is_connected (== self._running) after _set_fatal_error already flipped _running to False — the gateway never sees the failure, _failed_platforms never gets an entry, _platform_reconnect_watcher does nothing, the python process stays alive (systemd Restart= doesn't fire), and you get a fully running gateway whose only messaging platform is dead.
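The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the real `BasePlatformAdapter`: `SketchAdapter`, `gateway_handler` and `demo_zombie` are hypothetical names, and only the fields the issue mentions are modelled.

```python
import asyncio


class SketchAdapter:
    """Hypothetical stand-in for an adapter; models only what the issue names."""

    def __init__(self, fatal_error_handler):
        self._running = True
        self._fatal_error_code = None
        self._fatal_error_handler = fatal_error_handler

    @property
    def is_connected(self) -> bool:
        return self._running

    def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
        # Behaviour as described in the issue: state flips, nobody is told.
        self._running = False
        self._fatal_error_code = code

    async def _notify_fatal_error(self) -> None:
        await self._fatal_error_handler(self)

    async def _on_disconnected(self) -> None:
        # Guard that short-circuits once _running is already False.
        if not self.is_connected:
            return
        self._set_fatal_error("connection_lost", "server went away", retryable=True)
        await self._notify_fatal_error()


async def demo_zombie() -> list:
    failed_platforms = []  # stands in for the gateway's _failed_platforms

    async def gateway_handler(adapter):
        failed_platforms.append(adapter._fatal_error_code)

    adapter = SketchAdapter(gateway_handler)
    # First failure path forgets to notify.
    adapter._set_fatal_error("auth_failed", "mechanism rejected", retryable=False)
    # Later disconnect bails out because is_connected is already False.
    await adapter._on_disconnected()
    return failed_platforms
```

Running `asyncio.run(demo_zombie())` in this sketch returns an empty list: the handler is never called, which is the "gateway never sees the failure" state.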


Fix / Workaround

The two adapter-side bugs were ours, and we fixed them: we dropped the failed_auth handler (the 45 s session_start timeout in connect() already covers real auth failures cleanly) and re-gated _on_disconnected on a _session_established flag instead of _running. (patch)
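A rough sketch of what gating the disconnect handler on a dedicated flag could look like. The linked patch is not shown here, so the class below is an assumption: `SketchXmppAdapter`, `_on_session_start` and `demo_regated` are illustrative names, and only `_session_established`, `_on_disconnected`, `_set_fatal_error` and `_notify_fatal_error` come from the text.

```python
import asyncio


class SketchXmppAdapter:
    """Hypothetical adapter; shows only the gating change."""

    def __init__(self, fatal_error_handler):
        self._running = False
        self._session_established = False
        self._fatal_error_code = None
        self._fatal_error_handler = fatal_error_handler

    def _on_session_start(self) -> None:
        self._session_established = True
        self._running = True  # stands in for _mark_connected()

    def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
        self._running = False
        self._fatal_error_code = code

    async def _notify_fatal_error(self) -> None:
        await self._fatal_error_handler(self)

    async def _on_disconnected(self) -> None:
        # Gate on "did we ever have a session", not on _running,
        # so an earlier _set_fatal_error cannot mask the disconnect.
        if not self._session_established:
            return
        self._session_established = False
        self._set_fatal_error("connection_lost", "server went away", retryable=True)
        await self._notify_fatal_error()


async def demo_regated() -> list:
    seen = []

    async def gateway_handler(adapter):
        seen.append(adapter._fatal_error_code)

    adapter = SketchXmppAdapter(gateway_handler)
    adapter._on_session_start()
    adapter._running = False  # simulate something poisoning _running mid-session
    await adapter._on_disconnected()
    return seen
```

In this sketch the handler still fires with `connection_lost` even though `_running` was already False when the disconnect arrived.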

PR fix notes

PR #28930: fix(gateway): auto-notify fatal errors from _set_fatal_error

Repository: NousResearch/hermes-agent
Author: LifeJiggy
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/28930

Description (problem / solution / changelog)

Summary

_set_fatal_error() flips _running=False and writes gateway_state.json, but does not call _notify_fatal_error(). If an adapter's disconnect handler short-circuits on is_connected after _set_fatal_error already set it to False, the gateway never learns about the failure — no retry, no log entry, zombie gateway.

Fix: auto-schedule _notify_fatal_error() from inside _set_fatal_error() so the two-step API collapses into a single call. Adapters cannot forget the second step.

Related Issue

Fixes #28919

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

gateway/platforms/base.py — added _schedule_fatal_notify() helper called at the end of _set_fatal_error(). Uses asyncio.get_running_loop().create_task() wrapped in try/except RuntimeError for sync init paths.

How to Test

Unit test: create a mock adapter, call _set_fatal_error(), verify _notify_fatal_error() is invoked (handler called).
Integration: simulate a plugin that skips notify — gateway should still detect the fatal error and trigger reconnect watcher.
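The unit test described above might look roughly like the following. It is a sketch under assumptions: `SketchAdapter` is a stand-in rather than the real base class, `set_fatal_error_handler` and `_notify_tasks` are hypothetical, and `_schedule_fatal_notify` is written from the one-line description in "Changes Made", not copied from the PR.

```python
import asyncio


class SketchAdapter:
    """Hypothetical stand-in for the patched base adapter."""

    def __init__(self):
        self._running = True
        self._fatal_error_code = None
        self._fatal_error_handler = None
        self._notify_tasks = set()  # strong refs so pending tasks are not collected

    def set_fatal_error_handler(self, handler) -> None:  # hypothetical setter
        self._fatal_error_handler = handler

    def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
        self._running = False
        self._fatal_error_code = code
        self._schedule_fatal_notify()

    def _schedule_fatal_notify(self) -> None:
        if self._fatal_error_handler is None:
            return
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            return  # no running loop, e.g. a sync init path
        task = loop.create_task(self._notify_fatal_error())
        self._notify_tasks.add(task)
        task.add_done_callback(self._notify_tasks.discard)

    async def _notify_fatal_error(self) -> None:
        await self._fatal_error_handler(self)


async def check_auto_notify() -> list:
    calls = []

    async def handler(adapter):
        calls.append(adapter._fatal_error_code)

    adapter = SketchAdapter()
    adapter.set_fatal_error_handler(handler)
    adapter._set_fatal_error("connection_lost", "server went away", retryable=True)
    # Wait for the scheduled notification instead of relying on loop timing.
    await asyncio.gather(*adapter._notify_tasks)
    return calls
```

Holding the task in a set is a design choice of this sketch: the event loop keeps only weak references to tasks, so a fire-and-forget `create_task` result is safer when something else references it until it finishes.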

Changed files

gateway/platforms/base.py (modified, +14/-0)


How I hit it (slixmpp XMPP adapter)

1. slixmpp fires failed_auth per SASL mechanism the server rejects (see slixmpp/features/feature_mechanisms/mechanisms.py:_handle_fail, which fires the event and then calls _send_auth() to try the next mechanism).
2. ejabberd advertises SCRAM-SHA-512, SCRAM-SHA-256, PLAIN, ...; our adapter's _on_failed_auth runs on the rejected first mech, calls _set_fatal_error('auth_failed', retryable=False), which flips _running=False and writes `gateway_state.json` to `state=fatal, error_code=auth_failed`.
3. SCRAM-SHA-512 then succeeds, session_start fires, _mark_connected() flips _running=True and writes `state=connected` ... but the next per-mechanism rejection (or, in our case, a later failed_auth arriving after gateway logged ✓ xmpp connected) re-poisons _running=False.
4. Adapter keeps serving messages (sends/receives only check self._client, not _running) for ~17 hours.
5. ejabberd restarts → slixmpp disconnected → adapter's _on_disconnected checks if self.is_connected: → sees _running=False → skips _set_fatal_error('connection_lost', retryable=True) + _notify_fatal_error().
6. Gateway logs zero error lines. _failed_platforms empty. _platform_reconnect_watcher idle. systemd thinks everything is fine. Bridge silently dead.

Concrete timeline from our run (UTC):

| Time | Event |
|---|---|
| 23:46:55.941 | gateway.log: `✓ xmpp connected` |
| 23:46:56.033 | gateway_state.json overwritten by adapter: `state=fatal, error_code=auth_failed` |
| next 17h | sends/receives flowing normally |
| 17:08:51 (next day) | ejabberd graceful shutdown; slixmpp `disconnected` fires; adapter handler bails on `is_connected=False` |
| next 2h 42m | zero log lines, zero retry attempts, until manual `systemctl restart` |

The two adapter-side bugs are ours to fix

What I'd ask upstream to change

The class invariant should make this kind of bug impossible to silently cause. Two complementary asks, either of which would have surfaced the problem:

1. Auto-notify on `_set_fatal_error`

Schedule `_notify_fatal_error()` from inside `_set_fatal_error` whenever a handler is registered, so adapters can't forget:

```python
def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
    self._running = False
    self._fatal_error_code = code
    self._fatal_error_message = message
    self._fatal_error_retryable = retryable
    self._write_runtime_status_safe(...)
    if self._fatal_error_handler is not None:
        try:
            asyncio.get_running_loop().create_task(self._notify_fatal_error())
        except RuntimeError:
            pass  # called outside loop (e.g. lock_conflict during sync init)
```

This collapses a two-step API (`set` + `notify`) into one, and means a future plugin can't leave the gateway in the dark by forgetting one half.

2. Periodic reconciliation in the gateway

Today the gateway only consults _failed_platforms for retries. Add a slow loop (every N s) that also walks self.adapters and queues any adapter where has_fatal_error is True or is_connected is False — basically "trust the adapter's own state more than its notify discipline."
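One possible shape for such a reconciliation pass, as a sketch only. `SketchGateway`, `SketchAdapter`, `reconcile_once` and `reconcile_forever` are hypothetical names; the real gateway's `_failed_platforms` structure and retry queueing are not shown in this issue, so a plain dict stands in for them.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SketchAdapter:
    """Hypothetical adapter exposing only the two state flags named above."""
    name: str
    is_connected: bool = True
    has_fatal_error: bool = False


@dataclass
class SketchGateway:
    """Hypothetical gateway; a dict stands in for the real _failed_platforms."""
    adapters: dict = field(default_factory=dict)
    _failed_platforms: dict = field(default_factory=dict)

    def reconcile_once(self) -> list:
        """Queue any adapter whose own state says it is dead; return the new names."""
        queued = []
        for name, adapter in self.adapters.items():
            if name in self._failed_platforms:
                continue  # already queued for retry
            if adapter.has_fatal_error or not adapter.is_connected:
                self._failed_platforms[name] = adapter
                queued.append(name)
        return queued

    async def reconcile_forever(self, interval_s: float = 60.0) -> None:
        # Slow loop: trust adapter state over notify discipline.
        while True:
            self.reconcile_once()
            await asyncio.sleep(interval_s)
```

Keeping the pass synchronous and side-effect-light (`reconcile_once`) makes it easy to test without a running loop; the periodic wrapper only adds the sleep.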

3. (Defense in depth, optional) systemd watchdog

`Type=notify` + `WatchdogSec=120` with periodic `sd_notify(WATCHDOG=1)` gated on "at least one required platform connected" would have killed and restarted the zombie regardless of the in-process bug. There's already a `Stale systemd unit detected` warning, which shows the gateway is aware of its unit config, so this seems within scope.
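A minimal sketch of the watchdog ping, written against the systemd notify protocol (a datagram sent to the Unix socket named in `NOTIFY_SOCKET`) rather than any existing hermes code. `sd_notify` here is a small local helper, not a library import, and `pet_watchdog_if_healthy` is a hypothetical name for the "at least one required platform connected" gate.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string to systemd's notify socket; no-op when not under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]  # leading '@' denotes an abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def pet_watchdog_if_healthy(required_connected: list) -> bool:
    """Ping the watchdog only if at least one required platform is connected."""
    if not any(required_connected):
        return False  # stay silent so systemd's watchdog timeout can fire
    return sd_notify("WATCHDOG=1")
```

The caller would invoke `pet_watchdog_if_healthy` on a timer comfortably shorter than `WatchdogSec`; when every required platform is down the pings stop and systemd restarts the unit.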

(1) is by far the simplest and has the highest leverage.

Environment

hermes-agent 0.13.0 (release v2026.5.7), Python 3.12.13
slixmpp 1.13.2
NixOS systemd unit (no Type=notify)
Plugin vendored locally; no upstream xmpp-platform in this repo

Happy to send a PR for (1) if it sounds reasonable.
