hermes - ✅(Solved) Fix Bug: session_reset + credential pool exhaustion leaves thread session in zombie state — subsequent messages silently dropped [1 pull request, 1 participant]

NousResearch/hermes-agent#28686 (fetched 2026-05-20 04:02:30)

A Telegram topic/thread permanently stops receiving messages after two events coincide:

  1. A skill that fires session_reset (e.g. /cc) arrives for a thread while that thread's agent is running.
  2. The credential pool is simultaneously exhausted (e.g. a 402 from DeepSeek drains the last slot).

The affected thread becomes a zombie: the gateway believes an agent is still running for it, so every subsequent inbound message is silently discarded as "agent busy". Other threads are unaffected. Recovery requires a full gateway restart.

Version: Hermes Agent v0.14.0
Priority: P2 — data-loss / permanent-denial-of-service for a thread
Platform: Telegram (topic/thread sessions), but the gateway code path is platform-agnostic


Fix Action

Fixed

PR fix notes

PR #28689: fix(gateway): clear stale agent slot after session_reset to prevent zombie thread

Description (problem / solution / changelog)

Closes #28686

Problem

When a session_reset skill fires while an agent turn is in-flight and the credential pool is simultaneously exhausted, the affected Telegram thread enters a permanent zombie state: every subsequent message is silently dropped as "agent busy", requiring a gateway restart to recover.

The root cause is a gap in the outer _process_message_or_command finally block: when the run-generation guard correctly blocks the inner _release_running_agent_state call (to protect a newer run), the outer else branch only clears the metadata dicts (_running_agents_ts, _busy_ack_ts) but leaves the dead agent reference in _running_agents[session_key]. The staleness-eviction path can't recover it either because _running_agents_ts was already popped.

See issue #28686 for the full step-by-step trace and log evidence.

Changes

gateway/run.py — Fix 1: outer finally in _process_message_or_command

Replace the sentinel-conditional cleanup with an unconditional _release_running_agent_state call. This is safe because the method is idempotent (pop on an absent key is harmless) and no new agent for the same session key can start while the outer frame is unwinding.

gateway/run.py — Fix 2: _handle_reset_command

Add an explicit _release_running_agent_state(session_key) call immediately after _invalidate_session_run_generation. This makes the reset path self-contained: even if the outer finally doesn't run for this session key (e.g. the reset arrives from a different coroutine context), the slot is still cleared.

Test plan

  • tests/gateway/test_pending_event_none.py — existing tests still pass (guard for the related pending-event path)
  • Manual: trigger /cc on a thread while an agent turn is in-flight with a depleted credential pool → thread continues accepting messages after the reset
  • Manual: trigger /new while an agent turn is in-flight (normal path) → behavior unchanged
  • No regressions on other threads during/after the reset

Changed files

  • gateway/run.py (modified, +17/-12)
  • scripts/release.py (modified, +1/-0)


Reproduction

  1. Start gateway with a credential pool that has exactly one entry (or all but one exhausted).
  2. Send a message to Telegram thread :2 that triggers a long-running agent turn.
  3. While the agent turn is in flight, send /cc (or any skill that calls session_reset).
  4. Ensure the credential pool entry gets exhausted (402 from upstream LLM) at the same moment.
  5. Send any subsequent message to thread :2.

Expected: The new message is processed normally (new agent turn starts).
Actual: The message is silently dropped — no log entry, no reply.


Log Evidence

16:00:52 INFO  gateway: Invalidated run generation for ...telegram:group:-1003890808219:2 → 21 (session_reset)
16:00:52 INFO  agent.credential_pool: no available entries (all exhausted or empty)

After these two lines, all subsequent messages to thread :2 produce zero log output — not even the "inbound message" line at _handle_message_with_agent is reached. Messages to other threads (:1, :3, …) continue normally.


Root Cause

The zombie is created by a race between the run-generation guard and the outer finally-block cleanup:

Step-by-step

| Step | What happens | State of _running_agents[session_key] |
|------|--------------|----------------------------------------|
| 1 | Gen N agent starts; track_agent() promotes sentinel → real agent | gen-N agent |
| 2 | session_reset fires → _invalidate_session_run_generation bumps gen N → N+1 | gen-N agent (stale) |
| 3 | Gen N's _run_agent finally: _release_running_agent_state(session_key, run_generation=N) → gen N ≠ current gen N+1 → returns False | slot NOT cleared |
| 4 | Outer _process_message_or_command finally (run.py ~7499): _running_agents.get(key) is _AGENT_PENDING_SENTINEL → False (it's the dead gen-N agent) → else branch pops _running_agents_ts and _busy_ack_ts but not _running_agents[key] | ZOMBIE: dead gen-N agent remains |
| 5 | Next message: if _quick_key in self._running_agents: → True → busy path → silently queued | message dropped |

The staleness-eviction path can't rescue it either: _stale_ts = _running_agents_ts.get(key, 0) returns 0 (popped in step 4), so the eviction condition _stale_ts and time.time() - _stale_ts > _STALE_AGENT_TIMEOUT is never true.
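
The trace above can be replayed with a toy model. This is a minimal sketch under stated assumptions, not the gateway's code: ToyGateway and its method names (start_run, invalidate_generation, release, buggy_outer_finally, is_busy) are hypothetical stand-ins, and the 600-second timeout is an assumed value. Only the dict names, the sentinel name, and the eviction condition are taken from the issue.

```python
import time

_AGENT_PENDING_SENTINEL = object()
_STALE_AGENT_TIMEOUT = 600  # assumed value in seconds; the real one is not given


class ToyGateway:
    """Toy model of the slot bookkeeping; not the real gateway class."""

    def __init__(self):
        self._running_agents = {}
        self._running_agents_ts = {}
        self._generations = {}

    def start_run(self, key):
        # Step 1: the slot now holds a "real agent" and a timestamp.
        gen = self._generations.get(key, 0)
        self._running_agents[key] = f"agent-gen-{gen}"
        self._running_agents_ts[key] = time.time()
        return gen

    def invalidate_generation(self, key):
        # Step 2: session_reset bumps the generation but leaves the slot alone.
        self._generations[key] = self._generations.get(key, 0) + 1

    def release(self, key, run_generation=None):
        # Step 3: a release carrying an outdated generation is refused.
        current = self._generations.get(key, 0)
        if run_generation is not None and run_generation != current:
            return False
        self._running_agents.pop(key, None)
        self._running_agents_ts.pop(key, None)
        return True

    def buggy_outer_finally(self, key):
        # Step 4: sentinel-conditional cleanup, mirroring the BEFORE snippet.
        if self._running_agents.get(key) is _AGENT_PENDING_SENTINEL:
            self.release(key)
        else:
            self._running_agents_ts.pop(key, None)

    def is_busy(self, key):
        # Step 5: busy check, including the staleness-eviction attempt.
        if key not in self._running_agents:
            return False
        stale_ts = self._running_agents_ts.get(key, 0)
        if stale_ts and time.time() - stale_ts > _STALE_AGENT_TIMEOUT:
            self.release(key)
            return False
        return True


gw = ToyGateway()
gen = gw.start_run("thread:2")
gw.invalidate_generation("thread:2")
inner_released = gw.release("thread:2", run_generation=gen)  # refused
gw.buggy_outer_finally("thread:2")
zombie = gw.is_busy("thread:2")  # slot occupied, timestamp gone
```

In this model the slot can never be evicted by age, because the timestamp the eviction check depends on was removed by the same cleanup that left the slot behind.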

Affected code

gateway/run.py — outer finally in _process_message_or_command (~line 7499):

# BEFORE (buggy)
finally:
    if self._running_agents.get(_quick_key) is _AGENT_PENDING_SENTINEL:
        self._release_running_agent_state(_quick_key)
    else:
        # Pops metadata dicts but NOT _running_agents[_quick_key] when
        # the slot holds a dead real agent instead of the sentinel.
        self._running_agents_ts.pop(_quick_key, None)
        if hasattr(self, "_busy_ack_ts"):
            self._busy_ack_ts.pop(_quick_key, None)

_handle_reset_command in gateway/run.py (~line 8961):

# _invalidate_session_run_generation bumps the generation, making the
# in-flight run's cleanup a no-op — but does not itself clear the slot.
self._invalidate_session_run_generation(session_key, reason="session_reset")
# No _release_running_agent_state call here → slot stays occupied.

Fix Direction

Fix 1 — Replace the sentinel-conditional finally with an unconditional release:

# AFTER (fixed)
finally:
    # Unconditional: if _run_agent already released it this is a no-op;
    # if generation-guard blocked the inner release, this clears the zombie.
    self._release_running_agent_state(_quick_key)

This is safe because _release_running_agent_state is already idempotent (pop on absent key is harmless), and no new agent for this session_key can start while the outer frame is still unwinding.
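
The idempotency argument can be illustrated in isolation. The sketch below assumes the release amounts to popping the key from each bookkeeping dict with a default; release_slot is a hypothetical stand-in, and the real _release_running_agent_state may do more.

```python
def release_slot(running_agents, running_agents_ts, busy_ack_ts, key):
    """Hypothetical stand-in for an unconditional release of one session slot."""
    # pop(key, None) never raises, so repeated calls are harmless.
    running_agents.pop(key, None)
    running_agents_ts.pop(key, None)
    busy_ack_ts.pop(key, None)


# State as left after step 3 of the trace: a dead agent still occupies the slot.
running_agents = {"thread:2": "dead gen-N agent"}
running_agents_ts = {"thread:2": 1700000000.0}
busy_ack_ts = {}

try:
    pass  # the agent turn would run (and fail) here
finally:
    # Unconditional release, regardless of what the slot holds.
    release_slot(running_agents, running_agents_ts, busy_ack_ts, "thread:2")

# A second release from another cleanup path is a no-op, not a KeyError.
release_slot(running_agents, running_agents_ts, busy_ack_ts, "thread:2")
accepts_next_message = "thread:2" not in running_agents
```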

Fix 2 — Clear the slot explicitly in _handle_reset_command after invalidating the generation:

self._invalidate_session_run_generation(session_key, reason="session_reset")
# Evict the stale agent slot so the bumped generation doesn't leave a zombie.
self._release_running_agent_state(session_key)

Both fixes together ensure the slot is always cleared by whichever path runs first.
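
The order-independence claim can be checked on a toy slot table. This is only a sketch: simulate, reset_path, and outer_finally are hypothetical names, and each path is reduced to the dict operations described in the two fixes.

```python
def simulate(order):
    """Run the two fixed cleanup paths in the given order on a toy slot table."""
    key = "thread:2"
    running_agents = {key: "gen-N agent"}
    generation = {key: 0}

    def reset_path():
        generation[key] += 1            # invalidate the run generation
        running_agents.pop(key, None)   # Fix 2: evict the stale slot

    def outer_finally():
        running_agents.pop(key, None)   # Fix 1: unconditional release

    steps = {"reset": reset_path, "finally": outer_finally}
    for name in order:
        steps[name]()
    # True would mean the next message hits the busy path.
    return key in running_agents


still_busy_reset_first = simulate(["reset", "finally"])
still_busy_finally_first = simulate(["finally", "reset"])
```

In both orderings the slot ends up empty, so the next inbound message would not take the busy path in this model.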


Not a duplicate of

This is distinct from previously filed issues about session handling: the zombie state here is caused specifically by the interaction between the generation-guard short-circuit and the outer finally's else-branch omission, not by missing reset logic or platform-level session tracking bugs.
