hermes: File descriptor leak in api_server platform: ResponseStore SQLite connections not closed on retry

Bug Description

The api_server platform accumulates file descriptors (FDs) over time because the SQLite WAL connections that ResponseStore opens are never closed during platform retry cycles.

Environment

  • OS: macOS 26.4.1 (Mac Mini)
  • Installation: Homebrew
  • Hermes version: (latest via Homebrew)
  • Gateway PID: 42506, uptime ~12 hours
  • FD limit: 65,535 (raised from Mac default 256)

Reproduction

  1. Enable api_server platform via API_SERVER_ENABLED=true in ~/.hermes/.env (with no API_SERVER_KEY set)
  2. Observe FD count: lsof -p $(pgrep -f 'hermes_cli.main gateway') | grep response_store.db | wc -l
  3. After 12 hours: 122+ FDs pointing to response_store.db on a single gateway process (PID 42506)
  4. This equals ~41 complete SQLite WAL connection sets (main db + WAL + SHM = 3 FDs each)
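The arithmetic in step 4 can be reproduced with a small helper. This is only a sketch: it assumes the `lsof -p <pid>` output has been captured as text, that the file path is the last whitespace-separated field (paths containing spaces would break this), and that each WAL connection holds 3 FDs as stated above. The function names and the sample path are hypothetical.

```python
from collections import Counter

# Per the report: main db + WAL + SHM = 3 FDs per connection.
FDS_PER_WAL_CONNECTION = 3


def count_store_fds(lsof_output: str, db_name: str = "response_store.db") -> Counter:
    """Count open handles per file name for lsof lines that mention db_name."""
    counts: Counter = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # lsof prints the file name in the last column (NAME).
        if fields and db_name in fields[-1]:
            counts[fields[-1]] += 1
    return counts


def estimate_connection_sets(fd_total: int) -> int:
    """Approximate number of leaked connections for a given FD total."""
    return round(fd_total / FDS_PER_WAL_CONNECTION)


if __name__ == "__main__":
    # Hypothetical sample; real output comes from `lsof -p <gateway pid>`.
    sample = (
        "COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME\n"
        "python3 42506 me 12u REG 1,18 4096 101 /tmp/hermes/response_store.db\n"
        "python3 42506 me 13u REG 1,18 0 102 /tmp/hermes/response_store.db-wal\n"
        "python3 42506 me 14u REG 1,18 32768 103 /tmp/hermes/response_store.db-shm\n"
    )
    print(sum(count_store_fds(sample).values()), "FDs in sample")
    print(estimate_connection_sets(122), "connection sets for 122 FDs")
```

Grouping by file name also shows whether the db, WAL and SHM handles grow in step, which is what a leaked connection set would look like.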

Root Cause

Three contributing factors:

1. api_server auto-enabled without API_SERVER_KEY

In gateway/config.py line 1486:

if api_server_enabled or api_server_key:
    config.platforms[Platform.API_SERVER] = PlatformConfig()

The platform is instantiated even when only API_SERVER_ENABLED=true is set, without a valid API_SERVER_KEY. The HTTP server refuses to start (Refusing to start: API_SERVER_KEY is required) but the adapter is still loaded into the gateway.

2. Connected check always returns True

In gateway/config.py line 425:

_PLATFORM_CONNECTED_CHECKERS = {
    Platform.API_SERVER: lambda cfg: True,  # always returns True
    ...
}

The api_server is always reported as "connected" regardless of whether it is actually running. This is misleading and may prevent proper retry/recovery logic.

3. ResponseStore opened at init, never closed

In gateway/platforms/api_server.py line 706:

class APIServerAdapter(BasePlatformAdapter):
    def __init__(self, config: PlatformConfig):
        ...
        self._response_store = ResponseStore()  # line 706

ResponseStore.__init__ opens a SQLite connection with WAL mode (sqlite3.connect(..., check_same_thread=False) + apply_wal_with_fallback). This is called at adapter __init__ time, not when the HTTP server starts. The connection is never explicitly closed — no close() method is defined on ResponseStore, and APIServerAdapter has no teardown logic for the store.

On each gateway restart or platform reconnect cycle, a new ResponseStore instance may be created while old ones are not garbage-collected, leading to accumulation of SQLite WAL file handles.
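The accumulation described here can be reproduced outside Hermes with a minimal sketch. `LeakyStore` is a hypothetical stand-in for ResponseStore (it uses a plain `PRAGMA journal_mode=WAL` rather than `apply_wal_with_fallback`), and the list of retained instances simulates adapters that are still referenced after a retry. FDs are counted by listing `/proc/self/fd` or `/dev/fd`, which assumes Linux or macOS.

```python
import os
import sqlite3
import tempfile
from typing import Tuple

# /proc/self/fd on Linux, /dev/fd on macOS.
_FD_DIR = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"


def open_fd_count() -> int:
    """Number of descriptors currently open in this process."""
    return len(os.listdir(_FD_DIR))


class LeakyStore:
    """Hypothetical stand-in: opens a WAL connection and defines no close()."""

    def __init__(self, path: str):
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute("CREATE TABLE IF NOT EXISTS responses (id TEXT PRIMARY KEY)")
        self._conn.commit()


def measure_leak(cycles: int = 5) -> Tuple[int, int, int]:
    """Return FD counts (baseline, with stores retained, after closing them)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "response_store.db")
        baseline = open_fd_count()
        # Each simulated retry creates a new store; the old ones stay referenced.
        retained = [LeakyStore(path) for _ in range(cycles)]
        during = open_fd_count()
        for store in retained:
            store._conn.close()  # what an explicit close() would do
        after = open_fd_count()
    return baseline, during, after


if __name__ == "__main__":
    baseline, during, after = measure_leak()
    print(f"baseline={baseline} retained={during} closed={after}")
```

In CPython a connection is also closed when its last reference disappears, so a leak of this kind implies that something in the process still holds the old store or adapter.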

Impact

  • Gateway hits OSError: [Errno 24] Too many open files after ~17–24 hours of uptime
  • Cron jobs fail silently when the FD limit is reached (scheduler can't open files)
  • Kanban dispatcher fails (kanban_db.py fails first at line 1111)
  • Gateway becomes unresponsive and requires manual restart
  • 10+ unexpected restarts observed in one month on this setup

Proposed Fix

Fix 1: Require API_SERVER_KEY for platform to be loaded

# config.py line 1486: change OR to AND
if api_server_enabled and api_server_key:
    config.platforms[Platform.API_SERVER] = PlatformConfig()
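The effect of swapping `or` for `and` is easiest to see as a truth table. The sketch below is standalone and does not use the real config loader; `should_load_api_server` and its `require_key` flag are hypothetical names used only to compare the two conditions.

```python
from typing import Optional


def should_load_api_server(enabled: bool, key: Optional[str], require_key: bool) -> bool:
    """Compare the current condition (or) with the proposed one (and)."""
    if require_key:
        return enabled and bool(key)  # proposed: both must be set
    return enabled or bool(key)       # current: either one is enough


if __name__ == "__main__":
    cases = [(True, None), (True, "secret"), (False, "secret"), (False, None)]
    for enabled, key in cases:
        current = should_load_api_server(enabled, key, require_key=False)
        proposed = should_load_api_server(enabled, key, require_key=True)
        print(f"enabled={enabled!s:5} key={key!r:8} current={current!s:5} proposed={proposed}")
```

Note that the proposed condition also stops loading the platform when only API_SERVER_KEY is set. If that key-only setup is relied on today, a condition based on the key alone may be a better fit.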

Fix 2: Fix the connected checker to validate key presence

_PLATFORM_CONNECTED_CHECKERS = {
    Platform.API_SERVER: lambda cfg: bool(cfg.extra.get("key")) if cfg else False,
    ...
}
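A quick way to check the edge cases of this checker is to run it against a stand-in config. The `PlatformConfig` dataclass below is a hypothetical minimal version with only an `extra` dict; the real class and the place where the key is stored may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class PlatformConfig:
    """Hypothetical stand-in; assumes the key is stored under extra["key"]."""
    extra: Dict[str, Any] = field(default_factory=dict)


def api_server_connected(cfg: Optional[PlatformConfig]) -> bool:
    """Report connected only when a non-empty key is configured."""
    return bool(cfg and cfg.extra.get("key"))


if __name__ == "__main__":
    print(api_server_connected(None))
    print(api_server_connected(PlatformConfig()))
    print(api_server_connected(PlatformConfig(extra={"key": ""})))
    print(api_server_connected(PlatformConfig(extra={"key": "secret"})))
```

This still checks configuration only, not whether the HTTP server is actually listening, so it narrows the false positive rather than removing it.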

Fix 3: Add close() to ResponseStore and call it on adapter teardown

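# api_server.py: ResponseStore
def close(self):
    """Close the SQLite connection; safe to call more than once."""
    if self._conn is not None:
        self._conn.close()
        self._conn = None

# api_server.py: APIServerAdapter
def stop(self):
    # Release the store before the rest of the teardown runs.
    if self._response_store is not None:
        self._response_store.close()
        self._response_store = None
    ...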

Alternatively, make ResponseStore a process-wide singleton so repeated adapter instantiation does not create new SQLite connections.
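The singleton alternative could look roughly like the following. This is a sketch under assumptions: the `ResponseStore` here is a simplified stand-in that only opens a connection, and `get_response_store` / `close_response_store` are hypothetical helper names, not existing Hermes functions.

```python
import sqlite3
import threading
from typing import Optional


class ResponseStore:
    """Simplified stand-in for the real class in api_server.py."""

    def __init__(self, path: str = ":memory:"):
        self._conn: Optional[sqlite3.Connection] = sqlite3.connect(
            path, check_same_thread=False
        )

    def close(self) -> None:
        """Close the connection; calling this twice is harmless."""
        if self._conn is not None:
            self._conn.close()
            self._conn = None


_store: Optional[ResponseStore] = None
_store_lock = threading.Lock()


def get_response_store(path: str = ":memory:") -> ResponseStore:
    """Return the process-wide store, creating it on first use."""
    global _store
    with _store_lock:
        if _store is None:
            _store = ResponseStore(path)
        return _store


def close_response_store() -> None:
    """Close and forget the shared store, e.g. at gateway shutdown."""
    global _store
    with _store_lock:
        if _store is not None:
            _store.close()
            _store = None


if __name__ == "__main__":
    first = get_response_store()
    second = get_response_store()  # a retry would land here
    print(first is second)
    close_response_store()
```

With a shared store, an individual adapter's stop() should not close it while other adapters may still be using it; closing at process shutdown is the simpler rule.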

Verification

# Count response_store.db FDs on gateway PID
lsof -p $(pgrep -f 'hermes_cli.main gateway') 2>/dev/null | grep response_store.db | wc -l

# Should be stable at ~3 (one set of db+wal+shm) after fix
# Before fix: grows by ~3 every gateway restart or retry cycle
