hermes - ✅(Solved) Fix fix: gateway memory leak — _evict_cached_agent drops agents without cleanup, _session_messages never cleared [3 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#25315Fetched 2026-05-14 03:47:21
View on GitHub
Comments
0
Participants
1
Timeline
8
Reactions
0
Author
Participants
Timeline (top)
labeled ×4cross-referenced ×3unlabeled ×1

The Hermes gateway process grows from ~400 MB to 20–37 GB over 20–35 hours of uptime, eventually triggering OOM kills. Root cause analysis identified four contributing leaks, all in the agent cache and per-session dict lifecycle.

Root Cause

The Hermes gateway process grows from ~400 MB to 20–37 GB over 20–35 hours of uptime, eventually triggering OOM kills. Root cause analysis identified four contributing leaks, all in the agent cache and per-session dict lifecycle.

Fix Action

Fixed

PR fix notes

PR #25318: fix(gateway): prevent memory leak from agent cache eviction and unbounded session data

Description (problem / solution / changelog)

Summary

Fixes gateway memory leak that causes OOM after 20–35 hours of uptime. The gateway grows from ~400 MB to 20–37 GB and eventually crashes.

Four targeted fixes:

  1. _evict_cached_agent() now calls _release_evicted_agent_soft() via a daemon thread instead of silently dropping the agent. This ensures OpenAI clients, httpx transports, and SSL contexts are properly closed. Previously, every /new or /model command leaked a full agent object.

  2. _release_evicted_agent_soft() now clears _session_messages after release_clients(), freeing conversation history memory. Tool outputs (file reads, terminal output, search results) can be tens of MB per session.

  3. _sweep_idle_cached_agents() now prunes stale entries from per-session dicts (_session_model_overrides, _session_reasoning_overrides, _pending_approvals, _update_prompt_pending) for keys no longer in the agent cache.

  4. _tool_defs_cache in model_tools.py is now bounded to 8 entries — clears entirely when exceeded (repopulates on next call).

Root Cause

The primary leak: _evict_cached_agent() (called on /new, /model, etc.) did self._agent_cache.pop(session_key, None) but never called release_clients(). Meanwhile, _sweep_idle_cached_agents() and _enforce_agent_cache_cap() both properly called _release_evicted_agent_soft(). Each orphaned agent leaked an OpenAI client, httpx transport, SSL context, and full conversation history.

Contributing factors: _session_messages was never cleared even on proper soft eviction, per-session dicts grew unbounded, and _tool_defs_cache accumulated entries on every config touch.

Evidence

  • May 9: 24.9 GB peak after 21h → OOM crash
  • May 11: 20.2 GB peak after 28h → OOM crash
  • May 12: 37.3 GB peak after 35h → OOM crash
  • May 14: 32.1 GB peak after 36h → manual restart

Files Changed

  • gateway/run.py — fixes 1, 2, 3
  • model_tools.py — fix 4

Fixes #25315 Related: #18438, #19251

Changed files

  • gateway/run.py (modified, +31/-1)
  • model_tools.py (modified, +2/-0)

PR #25332: fix: stop gateway memory leak on cached agent and session expiry

Description (problem / solution / changelog)

Summary

Fixes a long-lived gateway memory leak introduced by cache/session lifecycle paths that never fully release state on agent eviction or expiry.

Changes

  • Ensure _evict_cached_agent() actually detaches and cleans the evicted agent on /new, /model, and related reset paths by invoking the existing soft-eviction cleanup path asynchronously.
  • Extend soft cleanup to clear _session_messages, releasing potentially large per-session histories.
  • Clear stale per-session dictionaries when idle session expiry finalizes a session:
    • _session_model_overrides
    • _session_reasoning_overrides
    • _pending_approvals
    • _update_prompt_pending
  • Add a bounded policy to quiet-mode tool-definition cache in model_tools.py:
    • cap at 8 entries
    • reset cache when over capacity before inserting a new key
  • Add regressions covering
    • evicted cached agent soft cleanup and message clearing,
    • session-expiry state-map cleanup,
    • tool-definition cache cap behavior.

Issue

  • Fixes #25315

Changed files

  • agent/auxiliary_client.py (modified, +37/-2)
  • cron/jobs.py (modified, +58/-12)
  • cron/scheduler.py (modified, +2/-2)
  • gateway/run.py (modified, +39/-1)
  • model_tools.py (modified, +7/-0)
  • tests/agent/test_auxiliary_client.py (modified, +35/-3)
  • tests/cron/test_cron_runtime_profile_paths.py (added, +38/-0)
  • tests/gateway/test_agent_cache.py (modified, +48/-0)
  • tests/gateway/test_session_boundary_hooks.py (modified, +64/-0)
  • tests/test_get_tool_definitions_cache_isolation.py (modified, +20/-6)

PR #25377: fix(gateway): release cached agents on eviction

Description (problem / solution / changelog)

Summary

  • release cached AIAgent clients when explicit session eviction removes cache entries
  • clear cached agents' _session_messages during soft cleanup to free large histories
  • clear expired sessions' per-session override/pending maps
  • cap quiet-mode tool definition cache to avoid unbounded config-mtime growth

Fixes #25315

Test Plan

  • python -m pytest tests/gateway/test_agent_cache.py::TestAgentCacheLifecycle::test_evict_runs_soft_cleanup_and_clears_session_messages tests/gateway/test_session_boundary_hooks.py::test_idle_expiry_clears_session_scoped_state tests/test_get_tool_definitions_cache_isolation.py::TestQuietModeCacheIsolation::test_cache_is_capped_and_clears_when_over_limit -q -o 'addopts='
  • python -m pytest tests/gateway/test_agent_cache.py tests/gateway/test_session_boundary_hooks.py tests/test_get_tool_definitions_cache_isolation.py -q -o 'addopts='
  • python -m pytest tests/gateway/test_session_model_reset.py -q -o 'addopts='
  • python -m ruff check gateway/run.py model_tools.py tests/gateway/test_agent_cache.py tests/gateway/test_session_boundary_hooks.py tests/test_get_tool_definitions_cache_isolation.py
  • git diff --check

Changed files

  • gateway/run.py (modified, +41/-1)
  • model_tools.py (modified, +3/-0)
  • tests/gateway/test_agent_cache.py (modified, +42/-0)
  • tests/gateway/test_session_boundary_hooks.py (modified, +64/-0)
  • tests/test_get_tool_definitions_cache_isolation.py (modified, +14/-0)
RAW_BUFFERClick to expand / collapse

Summary

The Hermes gateway process grows from ~400 MB to 20–37 GB over 20–35 hours of uptime, eventually triggering OOM kills. Root cause analysis identified four contributing leaks, all in the agent cache and per-session dict lifecycle.

Root Cause Analysis

Primary: _evict_cached_agent() drops agents without cleanup

In gateway/run.py, _evict_cached_agent() does self._agent_cache.pop(session_key, None) but never calls release_clients() or close(). Meanwhile, the proper eviction paths (_sweep_idle_cached_agents() and _enforce_agent_cache_cap()) do call _release_evicted_agent_soft().

Each orphaned agent leaks:

  • An OpenAI client instance
  • An httpx transport (connection pool)
  • An SSL context
  • The full _session_messages conversation history

_evict_cached_agent() is called on /new, /model, and other session-resetting commands — so every user who switches models or starts a new conversation leaks one full agent object.

Contributing: _release_evicted_agent_soft() doesn't clear _session_messages

When release_clients() is called via the soft eviction path, it closes the API client but doesn't clear _session_messages. Conversation histories that include tool outputs (file reads, terminal output, search results) can be tens of MB each and remain pinned in memory until the process exits.

Contributing: Per-session dicts never cleaned on session expiry

These dicts in gateway/run.py accumulate entries for every session but never prune stale keys:

  • _session_model_overrides
  • _session_reasoning_overrides
  • _pending_approvals
  • _update_prompt_pending

Over hours of operation with many sessions, these grow without bound.

Contributing: _tool_defs_cache in model_tools.py grows with config changes

The cache is keyed by (enabled_toolsets, disabled_toolsets, registry_generation, config_mtime). Each config.yaml touch (even a no-op save) creates a new entry that's never evicted. Under normal gateway operation with periodic config reloads, this grows indefinitely.

Evidence

DatePeak RSSUptimeOutcome
May 924.9 GB21 hOOM crash
May 1120.2 GB28 hOOM crash
May 1237.3 GB35 hOOM crash
May 1432.1 GB36 hManual restart

Proposed Fix

  1. _evict_cached_agent(): Call _release_evicted_agent_soft() on the evicted agent via a daemon thread, matching the pattern used by _enforce_agent_cache_cap().

  2. _release_evicted_agent_soft(): After release_clients(), clear _session_messages to free conversation history memory.

  3. _sweep_idle_cached_agents(): After evicting idle agents, prune stale entries from _session_model_overrides, _session_reasoning_overrides, _pending_approvals, and _update_prompt_pending.

  4. _tool_defs_cache: Cap at 8 entries; clear entirely when exceeded (it repopulates on next call).

Related Issues

  • #18438 — 8 GB in 1 hour with Discord integration
  • #19251 — residual leak under heavy scheduled workload

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING