hermes: Fix race condition in kanban _migrate_add_optional_columns on gateway startup


Error Message

On gateway startup, the kanban dispatcher crashes with:

sqlite3.OperationalError: duplicate column name: consecutive_failures

Root Cause

Two async tasks are created concurrently in the gateway (gateway/run.py):

  • Line 3335: asyncio.create_task(self._kanban_notifier_watcher())
  • Line 3341: asyncio.create_task(self._kanban_dispatcher_watcher())

Both watchers call _kb.connect(board=slug) → _migrate_add_optional_columns(conn) via asyncio.to_thread(), so the two calls can run on separate worker threads at the same time.
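The point that asyncio.to_thread() really puts both watchers on separate OS threads can be checked with a minimal standard-library sketch. fake_connect, the two watcher coroutines, and the board name "default" are hypothetical stand-ins, not Hermes code. The threading.Barrier only releases when both calls are in flight together, so the sketch raises BrokenBarrierError if the calls were serialized instead of concurrent.

```python
import asyncio
import threading


def fake_connect(board: str, barrier: threading.Barrier) -> int:
    """Hypothetical stand-in for _kb.connect(board=slug)."""
    # Blocks until both calls are in flight; raises BrokenBarrierError after
    # the timeout if the two calls ran one after the other instead.
    barrier.wait()
    return threading.get_ident()  # identifies the worker thread that ran this call


async def notifier_watcher(barrier: threading.Barrier) -> int:
    return await asyncio.to_thread(fake_connect, "default", barrier)


async def dispatcher_watcher(barrier: threading.Barrier) -> int:
    return await asyncio.to_thread(fake_connect, "default", barrier)


async def main() -> tuple[int, int]:
    barrier = threading.Barrier(2, timeout=5)
    # Mirrors the two asyncio.create_task() calls in gateway/run.py.
    notifier = asyncio.create_task(notifier_watcher(barrier))
    dispatcher = asyncio.create_task(dispatcher_watcher(barrier))
    return await notifier, await dispatcher


if __name__ == "__main__":
    a, b = asyncio.run(main())
    print("ran on different threads:", a != b)
```

Because the default executor is a thread pool, any plain Python state touched inside the blocking call (such as a module-level set) is shared between those threads without protection.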

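The module-level _INITIALIZED_PATHS set (kanban_db.py:~917) caches which databases have already been initialized so that re-initialization can be skipped, but the membership check and the later update are not guarded by a lock, so the check-then-act sequence is not atomic. When both threads race on the first tick: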

  1. Thread A checks needs_init = resolved not in _INITIALIZED_PATHS → True
  2. Thread B checks needs_init = resolved not in _INITIALIZED_PATHS → True (set not yet updated by A)
  3. Both threads run _migrate_add_optional_columns()
  4. Both read cols via PRAGMA table_info(tasks) — neither sees consecutive_failures yet
  5. Thread A succeeds with ALTER TABLE tasks ADD COLUMN consecutive_failures ...
  6. Thread B crashes with duplicate column name: consecutive_failures
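The six steps above can be reproduced outside Hermes with a minimal sketch. Everything in it is hypothetical: racy_connect only mimics the unguarded check-then-act pattern described in the issue, and the column type INTEGER DEFAULT 0 is an assumption because the issue elides the real definition. A threading.Barrier deliberately widens the race window by holding both threads after the PRAGMA read until both have read, so the losing thread should hit the same OperationalError.

```python
import sqlite3
import tempfile
import threading
from pathlib import Path

# Hypothetical stand-in for the module-level cache in kanban_db.py.
_INITIALIZED_PATHS: set[str] = set()


def racy_connect(db_path: str, barrier: threading.Barrier, errors: list[str]) -> None:
    """Mimics the unguarded check-then-act pattern (not the real Hermes code)."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    try:
        needs_init = db_path not in _INITIALIZED_PATHS  # steps 1-2: both threads see True
        if needs_init:  # step 3: both threads enter the migration
            cols = {row[1] for row in conn.execute("PRAGMA table_info(tasks)")}  # step 4
            barrier.wait()  # neither thread alters until both have read the old schema
            if "consecutive_failures" not in cols:
                # Column type is an assumption; the issue elides the definition.
                conn.execute(
                    "ALTER TABLE tasks ADD COLUMN consecutive_failures INTEGER DEFAULT 0"
                )
            _INITIALIZED_PATHS.add(db_path)  # cache updated only after the migration
    except sqlite3.OperationalError as exc:
        errors.append(str(exc))  # step 6: the losing thread ends up here
    finally:
        conn.close()


def run_race() -> list[str]:
    _INITIALIZED_PATHS.clear()
    with tempfile.TemporaryDirectory() as tmp:
        db_path = str(Path(tmp) / "kanban.db")
        setup = sqlite3.connect(db_path)
        setup.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
        setup.commit()
        setup.close()

        barrier = threading.Barrier(2)
        errors: list[str] = []
        threads = [
            threading.Thread(target=racy_connect, args=(db_path, barrier, errors))
            for _ in range(2)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return errors


if __name__ == "__main__":
    print(run_race())
```

One thread's ALTER TABLE succeeds (step 5); the other waits on SQLite's write lock, then fails against the already-updated schema.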

The error is caught at the outer exception handler (gateway/run.py:3889), so the gateway keeps running, but that kanban dispatcher tick is lost.

Reproduction

Start the gateway with a fresh or existing kanban.db that already has consecutive_failures in the schema (i.e., after a previous successful migration). Although the race window is narrow, it triggers reliably on startup when both watchers hit their first tick close together.

Environment

  • Hermes v0.6+ (was 323 commits behind; updated to latest main as of bbff2f634)
  • Python 3.14, SQLite 3.x
  • Linux (NixOS)

Suggested Fix

Either:

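  1. Quick fix: wrap each ALTER TABLE in _migrate_add_optional_columns in a try/except sqlite3.OperationalError that suppresses only "duplicate column name" errors; all other errors still propagate.
  2. Proper fix: use a threading.Lock around the needs_init check and the migration block in connect(), or use CREATE TABLE IF NOT EXISTS style guards. Note that SQLite's ALTER TABLE ... ADD COLUMN has no IF NOT EXISTS clause, so such a guard has to re-check PRAGMA table_info(tasks) while holding the lock.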

I can submit a PR for option 1 or 2 if desired. Thanks!
