hermes - Fix kanban dispatcher FD leak: SQLite connections not releasing file descriptors in WAL mode

Summary

The kanban dispatcher in the gateway opens a new SQLite connection on every tick via kanban_db.connect(), but the file descriptors are not released even after conn.close() (called in finally blocks). After ~14 hours of runtime, the gateway process accumulates ~500 open FDs for kanban.db and ~500 for kanban.db-wal, hitting the OS soft limit (1024) and causing cascading failures.

Symptoms

  • Feishu/飞书: HTTPS connection to open.feishu.cn fails with [Errno 24] Too many open files
  • Kanban dispatcher: sqlite3.OperationalError: unable to open database file at kanban_db.py:990

Root Cause

  • kanban_db.connect() (line 990) calls sqlite3.connect() on every invocation, with no connection pooling or reuse
  • The dispatcher calls connect() on every tick (~60s) via _tick_once_for_board() and _ready_nonempty()
  • Although conn.close() is called in finally blocks, SQLite WAL mode appears to keep the WAL file descriptor open even after close
  • Observed: 499 FDs for kanban.db + 498 for kanban.db-wal = 997 FDs (gateway PID 73, FD limit was 1024)
  • The .db-wal FDs correspond 1:1 with .db FDs, suggesting each WAL connection holds a file open after close
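The per-file counts above can be re-checked from inside the process. A minimal sketch, assuming a Linux /proc filesystem (as on WSL2); count_open_fds is a hypothetical helper and not part of Hermes:

```python
import os


def count_open_fds(path_suffix: str) -> int:
    """Count this process's open FDs whose target path ends with path_suffix.

    Hypothetical diagnostic helper; relies on /proc/self/fd (Linux only).
    """
    fd_dir = "/proc/self/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The FD was closed between listdir() and readlink().
            continue
        if target.endswith(path_suffix):
            count += 1
    return count
```

Logging count_open_fds("kanban.db") and count_open_fds("kanban.db-wal") once per tick would show whether the counts grow with every connect()/close() pair or only on some code paths.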

Affected Code

  • hermes_cli/kanban_db.py:961-1018: connect() creates a new connection on every call
  • gateway/run.py:4890: _tick_once_for_board() opens a connection per tick
  • gateway/run.py:4967: _ready_nonempty() opens a connection per tick
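One thing worth ruling out at these call sites: in Python's sqlite3 module, using a connection as a context manager (with conn:) only commits or rolls back the transaction; it does not close the connection. The issue reports that close() is called in finally blocks, so this may not apply here. The sketch below only shows a per-call pattern that guarantees the close; read_one is a hypothetical function, not Hermes code:

```python
import sqlite3
from contextlib import closing


def read_one(db_path: str) -> int:
    """Open, query, and reliably close a connection (illustrative only)."""
    # closing() calls conn.close() on exit. A bare `with sqlite3.connect(...)`
    # would commit or roll back but leave the connection (and its FDs) open.
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT 1").fetchone()[0]
```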

Suggested Fix

Options (from least to most invasive):

  1. Single persistent connection: cache one connection per board slug and reuse it across ticks, only reopening on error
  2. Explicit WAL checkpoint before close: call conn.execute("PRAGMA wal_checkpoint(TRUNCATE)") before conn.close(). This truncates the WAL file; whether it also releases the leaked WAL FDs still needs to be verified
  3. Investigate Python GC interaction: try del conn after conn.close(), or an explicit gc.collect(); CPython may be deferring the SQLite finalizer
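Option 1 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: BoardConnectionCache, run_tick, and the path_for_board callable are hypothetical names, and the sketch assumes the dispatcher ticks on a single thread (sqlite3 connections are bound to their creating thread by default).

```python
import sqlite3
from typing import Callable, Dict


class BoardConnectionCache:
    """Hypothetical per-board connection cache (illustrative, not Hermes code)."""

    def __init__(self, path_for_board: Callable[[str], str]) -> None:
        self._path_for_board = path_for_board  # maps a board slug to a DB path
        self._conns: Dict[str, sqlite3.Connection] = {}

    def get(self, slug: str) -> sqlite3.Connection:
        """Return the cached connection for a board, opening it on first use."""
        conn = self._conns.get(slug)
        if conn is None:
            conn = sqlite3.connect(self._path_for_board(slug))
            conn.execute("PRAGMA journal_mode=WAL")
            self._conns[slug] = conn
        return conn

    def invalidate(self, slug: str) -> None:
        """Close and forget a board's connection so the next get() reopens it."""
        conn = self._conns.pop(slug, None)
        if conn is not None:
            try:
                conn.close()
            except sqlite3.Error:
                pass  # The connection is already unusable; nothing more to do.

    def close_all(self) -> None:
        """Close every cached connection, e.g. at gateway shutdown."""
        for slug in list(self._conns):
            self.invalidate(slug)


def run_tick(cache: BoardConnectionCache, slug: str, query: str = "SELECT 1"):
    """One tick: reuse the cached connection, reopening only after an error."""
    conn = cache.get(slug)
    try:
        return conn.execute(query).fetchall()
    except sqlite3.Error:
        cache.invalidate(slug)
        raise
```

With this shape the process holds one kanban.db FD and one kanban.db-wal FD per board for its whole lifetime, so the FD count no longer depends on how many ticks have run.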

Workaround (applied on site)

Raised FD soft limit to 65536 via prlimit. This buys time but the leak will eventually hit the new limit too.
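The same adjustment can also be made from inside the process at startup. A minimal sketch using Python's standard resource module (Unix only); raise_nofile_soft_limit is a hypothetical helper, and like prlimit it only postpones exhaustion rather than fixing the leak:

```python
import resource


def raise_nofile_soft_limit(target: int = 65536) -> int:
    """Raise the soft RLIMIT_NOFILE toward target, capped at the hard limit.

    Never lowers the current limit. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # Already unlimited; nothing to raise.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```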

Environment

  • WSL2 (Ubuntu)
  • Python 3.14
  • Hermes Agent commit cc94195ea
