hermes - ✅(Solved) Fix Cached agent reuse silently disables the inactivity timeout, producing a Still working iteration 0/60 (cached) loop after user interrupt [1 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#15654 (fetched 2026-04-26 05:25:58)
Comments: 0
Participants: 1
Timeline: 7
Reactions: 0
Timeline (top): labeled ×5, cross-referenced ×1, referenced ×1

PR fix notes

PR #15807: fix(gateway): preserve inactivity clock on interrupt-recursive cached-agent turns (#15654)

Description (problem / solution / changelog)

Summary

  • Gate _last_activity_ts reset on _interrupt_depth == 0 — fresh external turns only
  • Extract the per-turn reset into a static helper _init_cached_agent_for_turn() to make it directly testable
  • 5 new targeted tests covering depth-0 reset, depth>0 preservation, and the watchdog-accumulation scenario

The bug

_run_agent unconditionally reset _last_activity_ts = time.time() on every _agent_cache hit. When a turn got stuck (e.g. after a user interrupt left the agent in a bad state) and the user sent another message, _run_agent was called recursively at _interrupt_depth=1. That re-entry hit the cache, reset the idle clock back to zero, and the inactivity watchdog's 30-min deadline was silently pushed forward again. The session would emit:

Still working... (3 min elapsed — iteration 0/60, starting new turn (cached))
Still working... (6 min elapsed — iteration 0/60, starting new turn (cached))
...

indefinitely — the timeout never fired, requiring a manual gateway restart.

The fix

Only reset _last_activity_ts when _interrupt_depth == 0 (a fresh external user message, not a recursive re-entry after an interrupt). At depth > 0, the timestamp is preserved so accumulated stuck-turn idle time keeps growing until the watchdog fires.

The depth-0 reset is still correct and needed: a session idle for 29 min has a stale timestamp, and without the reset the watchdog would fire after just 1 more minute of a legitimate new turn (#9051).

Extraction into _init_cached_agent_for_turn() is purely for testability — it lets tests call the production logic directly rather than going through the full _run_agent stack.
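The gating described above can be sketched as follows. This is a minimal illustration, not the code from PR #15807: the function signature, the `now` parameter, and the choice to keep resetting `_last_activity_desc` and `_api_call_count` at every depth are assumptions made for the example.

```python
import time
from types import SimpleNamespace


def init_cached_agent_for_turn(agent, interrupt_depth, now=None):
    """Prepare a cached agent for a new turn (hypothetical signature)."""
    now = time.time() if now is None else now
    if interrupt_depth == 0:
        # Fresh external message: discard the stale timestamp left by the
        # previous turn so the watchdog does not fire early (#9051).
        agent._last_activity_ts = now
    # At depth > 0 the timestamp is left untouched, so idle time from the
    # stuck turn keeps accumulating toward the watchdog deadline.
    # Assumption: the other two fields are still reset on every reuse.
    agent._last_activity_desc = "starting new turn (cached)"
    agent._api_call_count = 0
    return agent


# Stand-in object carrying only the three fields the helper touches.
agent = SimpleNamespace(
    _last_activity_ts=1000.0, _last_activity_desc="", _api_call_count=7
)

# Recursive re-entry after an interrupt: the idle clock is preserved.
init_cached_agent_for_turn(agent, interrupt_depth=1, now=2500.0)
idle_after_reentry = 2500.0 - agent._last_activity_ts

# Fresh external turn: the idle clock restarts.
init_cached_agent_for_turn(agent, interrupt_depth=0, now=2500.0)
idle_after_fresh_turn = 2500.0 - agent._last_activity_ts
```

Passing the clock in as `now` keeps the helper deterministic under test, which is in the spirit of the extraction the PR describes.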

Test plan

  • Before (stash gateway/run.py): all 5 TestCachedAgentInactivityReset tests → AttributeError: type object 'GatewayRunner' has no attribute '_init_cached_agent_for_turn' (confirms tests can't pass without the fix)
  • After (restore fix): 5/5 pass
  • Full tests/gateway/test_agent_cache.py: 40/40 pass
  • Broader tests/gateway/ suite: failures identical to pre-existing baseline on origin/main (f93d4624); zero new failures in touched code

Related

  • Fixes #15654
  • Companion to #9051 (the original _last_activity_ts reset — this fix preserves its depth-0 behavior)

🤖 Generated with Claude Code

Changed files

  • gateway/run.py (modified, +20/-6)
  • tests/gateway/test_agent_cache.py (modified, +129/-0)

Symptom

After a user interrupts an in-flight gateway turn (Discord, in our case), the gateway enters a loop where the periodic "Still working..." notification fires every 3 to 6 minutes with the same payload:

Still working... (3 min elapsed — iteration 0/60, starting new turn (cached))
Still working... (6 min elapsed — iteration 0/60, starting new turn (cached))
Still working... (9 min elapsed — iteration 0/60, starting new turn (cached))
...

The iteration counter never advances past 0/60. The (cached) marker stays. The existing inactivity-based timeout (HERMES_AGENT_TIMEOUT, default 1800s) never fires. Subsequent user messages spawn new turns that inherit the same stuck state instead of killing the parent. In our Discord channel this manifested as a "ghost session" that kept posting heartbeats for over 30 minutes until the gateway was manually restarted.

Root cause

The agent-cache reuse path in gateway/run.py (around line 9843, in the per-message handler that resolves a turn against _agent_cache) does this when it reuses a cached AIAgent:

agent._last_activity_ts = time.time()
agent._last_activity_desc = "starting new turn (cached)"
agent._api_call_count = 0

These three resets create a blind spot for the existing inactivity watchdog:

  1. _api_call_count = 0 makes the heartbeat read "iteration 0/60" on every cached reuse, which is why operators see no progress.
  2. _last_activity_desc = "starting new turn (cached)" is the source of the (cached) flag.
  3. _last_activity_ts = time.time() is the load-bearing one. The inactivity watchdog polls seconds_since_activity from get_activity_summary() to decide when to kill a turn. If _last_activity_ts is reset to "now" on every cached reuse, the watchdog never accumulates idle time and the 1800s deadline is perpetually pushed forward. The safety net is silently disabled.
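The blind spot in item 3 can be illustrated with a small simulation. It is a sketch under stated assumptions: `FakeAgent` is a stand-in for `AIAgent`, the real `get_activity_summary()` presumably reads the wall clock rather than taking a `now` argument, and the 600-second reuse interval is an arbitrary example value.

```python
TIMEOUT_S = 1800  # HERMES_AGENT_TIMEOUT default quoted in the issue


class FakeAgent:
    """Stand-in for AIAgent with only the field the watchdog depends on."""

    def __init__(self, now):
        self._last_activity_ts = now

    def get_activity_summary(self, now):
        # Assumption: `now` is injected here to keep the sketch deterministic.
        return {"seconds_since_activity": now - self._last_activity_ts}


def watchdog_fires(agent, now):
    summary = agent.get_activity_summary(now)
    return summary["seconds_since_activity"] >= TIMEOUT_S


def simulate(reuse_every_s, total_s, step_s=60):
    """Return the time at which the watchdog fires, or None if it never does."""
    agent = FakeAgent(now=0)
    for now in range(step_s, total_s + 1, step_s):
        if reuse_every_s and now % reuse_every_s == 0:
            # Cached reuse refreshes the clock although no work happened.
            agent._last_activity_ts = now
        if watchdog_fires(agent, now):
            return now
    return None


# A stuck turn that is never reused is killed at the deadline.
stuck_without_reuse = simulate(reuse_every_s=0, total_s=7200)

# The same stuck turn, reused every 10 minutes, is never killed.
stuck_with_reuse = simulate(reuse_every_s=600, total_s=7200)
```

In the second run the idle time never exceeds ten minutes, so the 1800s deadline stays out of reach for as long as reuses keep arriving.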

This is the direct counterpart of #9051. The fix for #9051 added an early _touch_activity() reset at the start of run_conversation() to prevent a stale timestamp from a previous turn from triggering a false-positive immediate timeout. That fix solves one direction (stale timestamp causes premature kill) but creates the opposite failure mode in the cached-reuse path: a perpetually-fresh timestamp prevents any kill at all when the new turn never makes its first API call.

This issue documents the resulting loop and proposes a turn-level watchdog that preserves the cache-reuse behavior introduced for #9051 while restoring an independent safety net.

Why interrupts make it worse

The interrupt path in _send_busy_ack (around line 1592 of gateway/run.py) calls running_agent.interrupt(event.text) and queues the new message for a successor turn. The interrupted parent agent finishes its in-flight tool call (or hangs in pre-flight setup), then the successor turn dispatches, hits the _agent_cache with a matching signature, and the three resets above run. If the successor itself stalls before its first API call, it is now indistinguishable from a healthy paused turn from the watchdog's point of view, because the cached reuse just refreshed the timestamp.

In practice users who keep typing while the agent is busy keep retriggering this cycle. Each interrupt resets the safety net.

Repro

  1. Open a Discord session bound to the gateway with a long-running task that takes more than HERMES_AGENT_NOTIFY_INTERVAL to complete (default 180s). A multi-tool research turn or a delegated subagent works.
  2. While the turn is running, send a second user message in the same channel within ~10 seconds. Send a third one a few seconds after that.
  3. Observe journalctl -fu hermes-gateway and the channel itself. Within one notify interval the heartbeat starts reading iteration 0/60, starting new turn (cached) and continues to do so every 3 minutes until the gateway is restarted.
  4. Inspect the agent's activity summary via /status (or by attaching a debugger) to confirm _last_activity_ts is being refreshed on every heartbeat tick rather than reflecting actual API or tool activity.

The bug does not require Discord specifically. Any adapter that exercises the busy-ack and cached-reuse paths reproduces it.

Proposed fix

Implemented and tested locally on a private branch. Four parts, all in or next to gateway/:

  • Turn-level watchdog (gateway/turn_watchdog.py). Two kill rules. cache_loop: iteration == 0 AND cached AND elapsed > 90s. stall: identical (iteration, cached, elapsed_bucket) tuple observed twice consecutively. The cache-loop rule owns the iter==0+cached case exclusively so the operator log attributes the kill to the right root cause instead of the generic stall reason. To avoid false positives on legitimate long Codex turns, the watchdog reads a new is_cached_turn field from get_activity_summary(). The flag is set in the cached-reuse path and cleared inside _touch_activity() the first time the turn makes any real progress, so a turn that genuinely advances past iteration 0 is never killed by this rule.
  • Interrupt-orphan cleanup. When _send_busy_ack interrupts a running agent for a user message, also evict the cached agent for that session. The successor turn rebuilds a fresh AIAgent instead of inheriting the cached signature whose _api_call_count and _last_activity_ts were just reset. This costs a prompt-cache miss on the interrupt-then-reply path, which is acceptable for the safety guarantee.
  • Heartbeat throttle. A small per-turn state object suppresses the "Still working..." Discord post when (iteration, elapsed_bucket) is unchanged from the previous heartbeat. Operators no longer see N copies of the same status when the watchdog is observing a stall window. The watchdog still evaluates every tick; only the user-facing send is suppressed.
  • Audit log. Kills are appended as JSONL records to $HERMES_STATE_DIR/turn_watchdog.log (defaults to ~/.local/state/hermes/turn_watchdog.log). Each record carries ts, reason (cache_loop / stall / interrupt_orphan), turn_id, model, iteration, max_iterations, elapsed_s, and a truncated session_key. The Discord ephemeral message is for the user; the audit log is for post-mortem.
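As one illustration of the parts listed above, the heartbeat throttle could look roughly like this. The class and method names are hypothetical, and the code on the private branch may be structured differently.

```python
class HeartbeatThrottle:
    """Per-turn state that suppresses repeated identical heartbeats."""

    def __init__(self):
        self._last_key = None

    def should_send(self, iteration, elapsed_bucket):
        """Return True only when the status differs from the previous tick."""
        key = (iteration, elapsed_bucket)
        if key == self._last_key:
            return False
        self._last_key = key
        return True


throttle = HeartbeatThrottle()
decisions = [
    throttle.should_send(0, 1),  # first heartbeat of the turn
    throttle.should_send(0, 1),  # unchanged status
    throttle.should_send(0, 2),  # elapsed bucket moved on
    throttle.should_send(1, 2),  # iteration advanced
]
```

The throttle only decides whether the user-facing message is sent. The watchdog would still be evaluated on every tick, as described in the list.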

The local change is 5 files: one new module, one new test file, one new repro script, plus surgical edits in gateway/run.py and run_agent.py. Test coverage: 20 new unit cases covering both kill rules, the throttle, the audit log (including a swallowed-failure case so a broken disk cannot crash the turn loop), the cached-detection precedence, and an end-to-end stuck-turn integration test that drives a FakeAgent mirroring the pattern in tests/gateway/test_gateway_inactivity_timeout.py. The 74 existing gateway tests in tests/gateway/test_busy_session_ack.py, test_gateway_inactivity_timeout.py, test_agent_cache.py, test_session_reset_notify.py, and test_stuck_loop.py still pass on top of the change.
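The audit log and its swallowed-failure behaviour might be sketched as below. The record fields and the default path come from the description above. The function names, the 8-character truncation of `session_key`, and the example values are assumptions.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def default_log_path():
    """Resolve $HERMES_STATE_DIR/turn_watchdog.log with the stated default."""
    state_dir = os.environ.get("HERMES_STATE_DIR")
    base = Path(state_dir) if state_dir else Path.home() / ".local" / "state" / "hermes"
    return base / "turn_watchdog.log"


def append_kill_record(path, *, reason, turn_id, model, iteration,
                       max_iterations, elapsed_s, session_key):
    """Append one JSONL record; return False instead of raising on I/O errors."""
    record = {
        "ts": time.time(),
        "reason": reason,  # cache_loop / stall / interrupt_orphan
        "turn_id": turn_id,
        "model": model,
        "iteration": iteration,
        "max_iterations": max_iterations,
        "elapsed_s": elapsed_s,
        "session_key": session_key[:8],  # truncation length is an assumption
    }
    try:
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return True
    except OSError:
        # A broken disk must not crash the turn loop, so the error is swallowed.
        return False


example = dict(reason="cache_loop", turn_id="t-1", model="example-model",
               iteration=0, max_iterations=60, elapsed_s=120,
               session_key="discord:1234567890")

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "turn_watchdog.log"
    wrote = append_kill_record(log_path, **example)
    lines = log_path.read_text(encoding="utf-8").splitlines()

    # Failure case: the parent "directory" is a regular file.
    blocker = Path(tmp) / "blocker"
    blocker.write_text("not a directory", encoding="utf-8")
    wrote_to_broken_path = append_kill_record(blocker / "turn_watchdog.log", **example)
```

Returning a boolean rather than raising lets the caller decide whether a failed write deserves a warning, without letting the log path affect the turn loop.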

The loop can be reproduced without a live gateway using a small script that simulates heartbeats against a FakeAgent whose get_activity_summary() returns the stuck-cached payload. Before the fix, the simulation emits 60 heartbeats over 30 minutes with no kill. After the fix, the watchdog kills the turn at 120s with reason cache_loop and writes one audit-log record.
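A simulation of that shape could look like the sketch below. It is not the author's script. The 30-second tick is inferred from "60 heartbeats over 30 minutes", the `elapsed_s` summary key is hypothetical, and only the cache_loop rule is modelled.

```python
TICK_S = 30              # assumption: 60 heartbeats in 30 minutes
TOTAL_S = 1800
CACHE_LOOP_AFTER_S = 90  # threshold quoted in the proposed fix


class FakeAgent:
    """Always reports the stuck-cached payload, whatever the elapsed time."""

    def __init__(self):
        self.elapsed_s = 0

    def get_activity_summary(self):
        return {
            "iteration": 0,
            "is_cached_turn": True,
            "elapsed_s": self.elapsed_s,  # hypothetical field name
        }


def cache_loop_rule(summary):
    return (
        summary["iteration"] == 0
        and summary["is_cached_turn"]
        and summary["elapsed_s"] > CACHE_LOOP_AFTER_S
    )


def run(watchdog_enabled):
    """Return (heartbeats_sent, kill_time_s, kill_reason)."""
    agent = FakeAgent()
    heartbeats = 0
    for elapsed in range(TICK_S, TOTAL_S + 1, TICK_S):
        agent.elapsed_s = elapsed
        summary = agent.get_activity_summary()
        if watchdog_enabled and cache_loop_rule(summary):
            return heartbeats, elapsed, "cache_loop"
        heartbeats += 1
    return heartbeats, None, None


before = run(watchdog_enabled=False)
after = run(watchdog_enabled=True)
```

With a 30-second tick, the first tick whose elapsed time exceeds 90s is the one at 120s, which matches the kill time reported above.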

Happy to open a PR from a fork if you think this is interesting. Local branch is ready.

Environment

  • Hermes Agent v0.11.0
  • Python 3.11.15
  • Linux x86_64, Ubuntu 24.04

Possibly adjacent to #10849 (which describes a different freeze mode where the agent stalls inside a terminal tool at iteration 24/90 and the existing inactivity timeout does fire after 1804s). The root cause and observable signature are different but both relate to gateway-loop stability.

extent analysis

TL;DR

The proposed fix involves implementing a turn-level watchdog with two kill rules, interrupt-orphan cleanup, heartbeat throttle, and audit log to prevent the gateway from entering a loop where the periodic "Still working..." notification fires every 3 to 6 minutes with the same payload.

Guidance

  • Implement the turn-level watchdog with cache_loop and stall kill rules to detect and kill stuck turns.
  • Evict the cached agent for a session when _send_busy_ack interrupts a running agent to prevent inheriting a cached signature with reset _api_call_count and _last_activity_ts.
  • Suppress the "Still working..." Discord post when (iteration, elapsed_bucket) is unchanged from the previous heartbeat to prevent duplicate status updates.
  • Append kill records to an audit log for post-mortem analysis.

Example

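# Turn-level watchdog sketch (illustrative; not the code from the private branch)
class TurnWatchdog:
    CACHE_LOOP_AFTER_S = 90  # threshold quoted in the proposed fix

    def __init__(self):
        # Previous (iteration, cached, elapsed_bucket) tuple, used by the stall rule
        self._last_seen = None

    def evaluate(self, turn):
        """Return "cache_loop" or "stall" to kill the turn, or None to let it run."""
        # Assumption: `turn` exposes iteration, cached, elapsed and elapsed_bucket
        seen = (turn.iteration, turn.cached, turn.elapsed_bucket)
        repeated = seen == self._last_seen
        self._last_seen = seen
        if turn.iteration == 0 and turn.cached:
            # The cache_loop rule owns the iteration-0 cached case exclusively
            return "cache_loop" if turn.elapsed > self.CACHE_LOOP_AFTER_S else None
        if repeated:
            # Identical tuple on two consecutive heartbeats: no observable progress
            return "stall"
        return None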

Notes

The proposed fix was developed against Hermes Agent v0.11.0 and may not apply unchanged to other versions. It should be tested thoroughly, including against the existing gateway test suites, to confirm that it does not introduce new issues.

Recommendation

Apply the proposed fix (the turn-level watchdog, interrupt-orphan cleanup, heartbeat throttle, and audit log) to stop the gateway from looping and to restore a safety net for stuck turns. Note that PR #15807 addresses the same loop with a narrower change: it resets _last_activity_ts only when _interrupt_depth == 0.
