hermes - 💡(How to fix) Fix SQLite 'locking protocol' on NFS silently breaks /resume, /title, /history, /branch, and kanban

hermes · 2026-05-08 18:42:17


When ~/.hermes is on a network filesystem (NFS, SMB/CIFS, some FUSE mounts, WSL1), SQLite's PRAGMA journal_mode=WAL fails with sqlite3.OperationalError: locking protocol. Every component that opens state.db or kanban.db swallows this error silently, and the user is left with:

- /resume, /title, /history, and /branch all respond "Session database not available." with no explanation
- A hermes update snapshot warning: "SQLite safe copy failed for ~/.hermes/state.db: locking protocol"
- The kanban dispatcher tick crashing every 60s with the same error
- "TUI session store unavailable" warnings
- (Downstream) the known "duplicate column name: consecutive_failures" kanban migration race (#21708 / #21374) firing continuously because the migration is retried on every tick

The user has no way to know why any of this is happening. Hermes does not check for WAL compatibility and does not attempt a fallback.



Evidence

Real user debug report. Their stat -f ~/.hermes output and mount line:

File: "/home/mormio/.hermes"
Type: nfs
ID: 0  Namelen: 255
172.26.224.200:d2dfac12/home on /home type nfs
  (rw, relatime, vers=3, rsize=1048576, wsize=1048576, namelen=255,
   hard, forcerdirplus, proto=tcp, nconnect=4, timeo=600, retrans=2,
   sec=sys, mountaddr=172.26.224.200, mountvers=3, mountport=20048,
   mountproto=udp, local_lock=none, addr=172.26.224.200)

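The mount is NFSv3 over TCP with local_lock=none. This is a network filesystem, which SQLite upstream documents as incompatible with WAL:

"SQLite databases in WAL mode do not work over a network filesystem."

WAL coordinates readers and writers through a memory-mapped wal-index file (state.db-shm), which is why all processes using the database must be on the same host.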

The resulting log entries in the same user's session:

2026-05-08 13:41:11  WARNING hermes_cli.backup: SQLite safe copy failed for ~/.hermes/state.db: locking protocol
2026-05-08 13:45:05  ERROR gateway.run: kanban dispatcher: tick failed on board default
    File "hermes_cli/kanban_db.py", line 878, in connect
      conn.execute("PRAGMA journal_mode=WAL")
  sqlite3.OperationalError: locking protocol
2026-05-08 13:46:46  WARNING tui_gateway.server: TUI session store unavailable — continuing without state.db features: locking protocol
2026-05-08 13:46:59  WARNING cli: Failed to initialize SessionDB — session will NOT be indexed for search: locking protocol
2026-05-08 13:47:08  WARNING tui_gateway.server: TUI session store unavailable — continuing without state.db features: locking protocol

The kanban dispatcher retried this failed migration continuously until the user restarted the gateway.
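To check whether a given directory shows the same behaviour, a small probe can try to enable WAL on a throwaway database created inside it. This is a minimal sketch using only the Python standard library; `probe_wal` is a hypothetical helper name and is not part of Hermes.

```python
import sqlite3
import tempfile
from pathlib import Path


def probe_wal(directory: str) -> tuple[bool, str]:
    """Try to enable WAL on a throwaway database inside `directory`.

    Returns (ok, detail). `detail` is the journal mode SQLite reports,
    or the error text if the pragma raised (hypothetical helper, not Hermes code).
    """
    # The probe database lives in a temporary subdirectory so that the
    # -wal / -shm side files are cleaned up together with it.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        conn = sqlite3.connect(str(Path(tmp) / "probe.db"))
        try:
            row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
            mode = str(row[0]).lower()
            return mode == "wal", mode
        except sqlite3.OperationalError as exc:
            # On the affected mount this is expected to be "locking protocol".
            return False, str(exc)
        finally:
            conn.close()
```

Calling `probe_wal(os.path.expanduser("~/.hermes"))` on an affected machine should return `False` together with the error text, while a local disk would normally return `(True, "wal")`.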

Root cause

Two files hit PRAGMA journal_mode=WAL unconditionally with no fallback:

- hermes_state.py:201 (SessionDB.__init__) sets journal_mode=WAL. On failure, the callers of SessionDB() (cli.py:2379, gateway/run.py:1194, tui_gateway/server.py) catch the exception and set _session_db = None, but never try a different journal mode.
- hermes_cli/kanban_db.py:920 (connect()) sets journal_mode=WAL. On failure, the exception bubbles up to the kanban dispatcher tick, which retries every 60s forever.

The failure is silent downstream:

- The gateway logs at DEBUG (gateway/run.py:1196): logger.debug("SQLite session store not available: %s", e). This is invisible in errors.log.
- The CLI logs at WARNING (correct), which is visible but still generic.
- The /resume error message hard-codes "Session database not available." with no cause. There are nine such sites across cli.py and gateway/run.py:
  - cli.py:5368, 5479, 6755, 6770
  - gateway/run.py:10186, 10224, 10438, 10482, 10569
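The difference between discarding the exception and keeping it around can be illustrated with a short sketch. The names `open_store` and `unavailable_message` are hypothetical and do not exist in Hermes; the sketch only assumes that the init failure surfaces as a `sqlite3.Error`, as the logs above suggest.

```python
import sqlite3
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def open_store(factory: Callable[[], T]) -> Tuple[Optional[T], Optional[str]]:
    """Call `factory`; on a SQLite error return (None, cause) rather than a bare None."""
    try:
        return factory(), None
    except sqlite3.Error as exc:
        # Keep the error text so later user-facing messages can include it.
        return None, str(exc)


def unavailable_message(cause: Optional[str]) -> str:
    """Build the user-facing string, appending the cause when one was captured."""
    if not cause:
        return "Session database not available."
    hint = ""
    if "locking protocol" in cause:
        # Assumed hint text; the real wording is up to the maintainers.
        hint = " (state.db may be on a network filesystem)"
    return f"Session database not available: {cause}{hint}."
```

With the cause stored next to the `None` handle, each of the hard-coded sites could build its message from one helper instead of repeating a fixed string.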

Who this affects

- Users with ~/.hermes on NFS (shared university clusters, enterprise Linux, cloud dev VMs mounting team home dirs)
- Users with ~/.hermes on SMB/CIFS, some FUSE mounts, or WSL1
- Anyone whose state.db / kanban.db ends up in a container bind-mount where locking semantics differ

The failure mode presents to the user as "/resume just doesn't work" with no actionable diagnostic. Support burden: every affected user has to share logs with a maintainer to figure out what's broken.

Proposed fix

Three changes, all in one PR:

- Fall back to journal_mode=DELETE on WAL failure. DELETE is SQLite's default rollback-journal mode and predates WAL; it does not need WAL's shared-memory file, so it can work on NFS as long as file locking on the mount works. Concurrency drops (no concurrent readers during writes) but the feature works. Apply the fallback in both hermes_state.py and hermes_cli/kanban_db.py. Log a single WARNING on fallback explaining why.
- Surface the cause in /resume and related error messages. Capture the underlying OperationalError on the failing init and include it in the user-facing string. Instead of "Session database not available.", show "Session database not available: locking protocol (state.db may be on a network filesystem; see <docs>).".
- Bump the log level at gateway/run.py:1196 from DEBUG to WARNING so the failure appears in errors.log, matching the CLI path, which already does this correctly.

Deliberately out of scope for the PR

- NFS autodetection at startup via statvfs / /proc/mounts. This is fragile across Linux/macOS/WSL/Docker overlay FS. The try/except fallback approach is OS-agnostic and more robust.
- hermes doctor integration. Separate concern, separate PR.
- The duplicate column name: consecutive_failures kanban migration race (#21708 / #21374). It has a separate root cause; on affected systems it fires continuously because of this bug (WAL failure → migration retried forever). Fixing the WAL issue stops the cascade without fixing the migration itself.

Acceptance criteria

- SessionDB() succeeds on NFS via the DELETE-mode fallback, with a single WARNING logged per process.
- kanban_db.connect() succeeds on NFS via the same fallback.
- /resume on a system where SessionDB genuinely cannot open returns a message containing the underlying cause.
- New tests cover:
  - WAL pragma raising OperationalError("locking protocol") → DELETE fallback fires, DB is usable.
  - /resume error string includes the captured cause when _session_db is None.
- No regression in existing SessionDB / kanban tests.
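The first test case listed above does not need a real NFS mount: a connection subclass that rejects the WAL pragma is enough to simulate the failure. The sketch below is self-contained and uses a hypothetical `connect_with_fallback` stand-in for the code under test; the real test would call SessionDB or kanban_db.connect() instead.

```python
import sqlite3
import tempfile
from pathlib import Path


class NoWalConnection(sqlite3.Connection):
    """Test double: a normal connection that rejects the WAL pragma."""

    def execute(self, sql, *args):
        if sql.replace(" ", "").upper() == "PRAGMAJOURNAL_MODE=WAL":
            raise sqlite3.OperationalError("locking protocol")
        return super().execute(sql, *args)


def connect_with_fallback(path, factory=sqlite3.Connection):
    """Hypothetical stand-in for the code under test (not the real Hermes API)."""
    conn = sqlite3.connect(path, factory=factory)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
    except sqlite3.OperationalError:
        conn.execute("PRAGMA journal_mode=DELETE")
    return conn


def test_wal_failure_falls_back_to_delete():
    with tempfile.TemporaryDirectory() as tmp:
        conn = connect_with_fallback(
            str(Path(tmp) / "state.db"), factory=NoWalConnection
        )
        try:
            mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
            assert mode.lower() == "delete"
            # The database must still be usable after the fallback.
            conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, title TEXT)")
            conn.execute("INSERT INTO sessions VALUES (?, ?)", ("s1", "demo"))
            conn.commit()
            row = conn.execute("SELECT title FROM sessions").fetchone()
            assert row == ("demo",)
        finally:
            conn.close()
```

Injecting the failure through the connection factory keeps the test independent of the filesystem it runs on, so it behaves the same in CI and on a developer laptop.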

References

- SQLite WAL documentation: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode
- Related symptom issues (downstream of this bug on NFS):
  - #21708: kanban duplicate column name: consecutive_failures
  - #21374: race condition in _migrate_add_optional_columns
- Prior related PR (TUI degradation only, did not fix root cause): #14135
