hermes - fix: BTRFS COW + SQLite WAL incompatibility (disk I/O errors on BTRFS)

Environment

  • Hermes Agent: latest (main, 2026-05-23)
  • OS: Arch Linux
  • Python: 3.11.15
  • Filesystem: BTRFS (with Copy-on-Write enabled, compress=zstd:3, ssd)
  • SQLite: 3.53.1
  • Affected databases: state.db, kanban.db

Problem

When Hermes Agent runs on a BTRFS filesystem, SQLite databases in WAL (Write-Ahead Logging) mode can fail with disk I/O errors, because BTRFS Copy-on-Write semantics interact badly with concurrent write operations.

The issue manifests as:

  1. sqlite3.OperationalError: disk I/O error during WAL checkpoint operations
  2. Worker processes hanging on database locks
  3. Gateway crashes and stale task claims
  4. Silent database corruption risk

Why it happens

  • WAL mode relies on shared memory (-shm files) and sequential writes
  • BTRFS COW operations can modify disk blocks after WAL records them
  • Without proper handling, concurrent writers block each other or cause I/O errors
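To make the first point concrete, here is a minimal sketch that switches a throwaway database to WAL and reports which sidecar files SQLite keeps next to it while a connection is open. The helper name wal_sidecars is hypothetical and not part of Hermes; it only illustrates the extra files WAL depends on, not the BTRFS failure itself.

```python
import os
import sqlite3
import tempfile


def wal_sidecars(db_path: str) -> tuple[str, list[str]]:
    """Enable WAL on db_path and report which sidecar files exist while it is open."""
    conn = sqlite3.connect(db_path)
    try:
        # The pragma returns the journal mode that is actually in effect.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        # Check for the write-ahead log and the shared-memory index file.
        present = [s for s in ("-wal", "-shm") if os.path.exists(db_path + s)]
    finally:
        conn.close()
    return mode, present


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(wal_sidecars(os.path.join(tmp, "demo.db")))
```

Running this against a path inside the Hermes data directory (with a scratch filename, not state.db) shows whether WAL can be enabled there at all.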

Proposed Solution

Add _is_on_btrfs() detection that proactively skips WAL mode on BTRFS and uses journal_mode=DELETE instead. The check has to run before WAL is enabled: the exception-based fallback in apply_wal_with_fallback() only triggers once an error surfaces, which is too late because the data is already corrupted by that point.
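The proposal can be sketched roughly as follows. This is not the code from the patch: the signature of apply_wal_with_fallback(), the injectable is_on_btrfs parameter, and the helper fs_type_for_path() are assumptions made so the example is self-contained and testable without a BTRFS mount.

```python
import os
import sqlite3
from typing import Callable, Optional


def fs_type_for_path(abs_path: str, mountinfo_text: str) -> Optional[str]:
    """Return the filesystem type of the longest mount point containing abs_path."""
    target = os.path.normpath(abs_path)
    best_len, best_type = -1, None
    for line in mountinfo_text.splitlines():
        # "<id> <parent> <maj:min> <root> <mount point> <opts> ... - <fstype> <source> <opts>"
        left, sep, right = line.partition(" - ")
        left_fields, right_fields = left.split(), right.split()
        if not sep or len(left_fields) < 5 or not right_fields:
            continue
        mount_point = left_fields[4].replace("\\040", " ")  # spaces are octal-escaped
        prefix = mount_point.rstrip("/") + "/"
        if target == mount_point or target.startswith(prefix):
            # ">=" lets a later mount shadow an earlier one on the same mount point.
            if len(mount_point) >= best_len:
                best_len, best_type = len(mount_point), right_fields[0]
    return best_type


def _is_on_btrfs(db_path: str) -> bool:
    """Best-effort check; returns False when /proc/self/mountinfo is unavailable."""
    try:
        with open("/proc/self/mountinfo", encoding="utf-8") as fh:
            text = fh.read()
    except OSError:
        return False
    return fs_type_for_path(os.path.realpath(db_path), text) == "btrfs"


def apply_wal_with_fallback(
    conn: sqlite3.Connection,
    db_path: Optional[str] = None,
    is_on_btrfs: Callable[[str], bool] = _is_on_btrfs,  # hypothetical hook for tests
) -> str:
    """Pick a journal mode, skipping WAL up front on BTRFS. Returns the active mode."""
    wanted = "DELETE" if db_path is not None and is_on_btrfs(db_path) else "WAL"
    try:
        row = conn.execute(f"PRAGMA journal_mode={wanted}").fetchone()
    except sqlite3.OperationalError:
        # Reactive fallback, kept for filesystems the proactive check does not cover.
        row = conn.execute("PRAGMA journal_mode=DELETE").fetchone()
    return row[0].lower()
```

Matching on the longest mount point is a design choice of this sketch: it handles a BTRFS subvolume mounted below a non-BTRFS root, and the reverse.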

Changes

Three files modified:

  1. hermes_state.py — added _is_on_btrfs() function that checks /proc/self/mountinfo for BTRFS filesystems, and updated apply_wal_with_fallback() to accept db_path and proactively skip WAL on BTRFS
  2. hermes_cli/kanban_db.py — pass db_path to apply_wal_with_fallback() so BTRFS detection works for kanban database
  3. tools/terminal_tool.py — added _safe_getcwd() helper that falls back to the home directory when os.getcwd() raises FileNotFoundError (e.g. when the CWD was deleted). This fixes crashes in the cleanup thread.

Testing

Tested on Arch Linux, BTRFS (compress=zstd:3, ssd), SQLite 3.53.1:

  • 5 concurrent writers + 3 readers, 50 operations each
  • Result: 400 operations, 0 errors, 0.50s total
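The original test harness is not included in the issue. A comparable stress test could look like the sketch below, which assumes one thread and one connection per worker; the function name run_stress, the events table, and the 30-second busy timeout are illustrative choices, not details from the patch.

```python
import sqlite3
import threading
import time


def run_stress(db_path, journal_mode="DELETE", writers=5, readers=3, ops=50):
    """Run concurrent writers and readers against db_path and collect any errors."""
    if journal_mode.upper() not in {"WAL", "DELETE"}:
        raise ValueError(f"unsupported journal mode: {journal_mode}")
    setup = sqlite3.connect(db_path)
    setup.execute(f"PRAGMA journal_mode={journal_mode}")
    setup.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(id INTEGER PRIMARY KEY, worker INTEGER, seq INTEGER)"
    )
    setup.commit()
    setup.close()

    errors, done, lock = [], [0], threading.Lock()

    def worker(idx, write):
        # Each thread opens its own connection; timeout is SQLite's busy timeout.
        conn = sqlite3.connect(db_path, timeout=30)
        try:
            for seq in range(ops):
                try:
                    if write:
                        with conn:  # commits on success, rolls back on error
                            conn.execute(
                                "INSERT INTO events (worker, seq) VALUES (?, ?)",
                                (idx, seq),
                            )
                    else:
                        conn.execute("SELECT COUNT(*) FROM events").fetchall()
                    with lock:
                        done[0] += 1
                except sqlite3.Error as exc:
                    with lock:
                        errors.append(repr(exc))
        finally:
            conn.close()

    threads = [threading.Thread(target=worker, args=(i, True)) for i in range(writers)]
    threads += [threading.Thread(target=worker, args=(i, False)) for i in range(readers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {"ops": done[0], "errors": errors, "seconds": time.perf_counter() - start}
```

With the defaults this performs (5 + 3) × 50 = 400 operations, matching the count reported above. Timings will differ by machine and filesystem.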

Performance impact

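WAL mode is 30-50% faster than DELETE mode for concurrent writes. On BTRFS, the fallback to DELETE mode reduces concurrency but ensures data integrity. Users who need WAL performance can run chattr +C on their Hermes directory to disable COW for that directory. The flag only affects files created after it is set, so existing database files have to be recreated or copied into the directory afterwards.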
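For the chattr +C route, one possible migration is sketched below: mark a fresh directory as No_COW first, then copy the database into it with SQLite's backup API so the new file is created under the flag. The helper name copy_db_into_nocow_dir and the set_nocow switch are hypothetical; the sketch assumes Hermes is stopped during the copy and that chattr is on PATH.

```python
import os
import sqlite3
import subprocess


def copy_db_into_nocow_dir(src_db: str, dest_dir: str, set_nocow: bool = True) -> str:
    """Copy src_db into dest_dir, optionally marking dest_dir No_COW beforehand."""
    os.makedirs(dest_dir, exist_ok=True)
    if set_nocow:
        # Must run before the database file exists, since +C only affects new files.
        subprocess.run(["chattr", "+C", dest_dir], check=True)
    dest_db = os.path.join(dest_dir, os.path.basename(src_db))
    if os.path.exists(dest_db):
        raise FileExistsError(dest_db)
    src = sqlite3.connect(src_db)
    dst = sqlite3.connect(dest_db)
    try:
        src.backup(dst)  # writes fresh pages instead of reflinking the old file
    finally:
        dst.close()
        src.close()
    return dest_db
```

Whether the copied file carries the attribute can be checked afterwards with lsattr, which shows a C flag on No_COW files.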

Open Questions

  1. Should we expose a configuration flag (database.journal_mode: auto | wal | delete) for users to override the automatic fallback?
  2. Should we add CI tests that spin up a temporary BTRFS filesystem and verify the agent starts without SQLite errors?
  3. Are there other COW filesystems (ZFS, APFS) that exhibit similar incompatibilities?
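If the configuration flag from the first question were adopted, the resolution logic could be as small as the sketch below. The function name resolve_journal_mode and its parameters are hypothetical; only the three values auto, wal, and delete come from the question itself.

```python
def resolve_journal_mode(setting: str, on_btrfs: bool) -> str:
    """Map a database.journal_mode setting to the mode SQLite should use."""
    normalized = setting.strip().lower()
    if normalized not in ("auto", "wal", "delete"):
        raise ValueError(f"invalid journal_mode setting: {setting!r}")
    if normalized == "auto":
        # Automatic behaviour: avoid WAL only where it is known to be risky.
        return "delete" if on_btrfs else "wal"
    # An explicit value overrides detection, e.g. for directories marked chattr +C.
    return normalized
```

For example, resolve_journal_mode("auto", on_btrfs=True) yields "delete", while an explicit "wal" is passed through unchanged so users who disabled COW can opt back in.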
