codex - 💡(How to fix) Fix state runtime: corrupt `state_5.sqlite` (SQLite "file is not a database") wedges startup with no auto-recovery



What version of Codex CLI is running?

sha: 5ecff051962e7299c743e4ce9c1545d71b756924

What subscription do you have?

Enterprise

Which model were you using?

gpt-5.5

What platform is your computer?

Linux 5.14.21-150400.24.184-default x86_64 x86_64

What terminal emulator and version are you using (if applicable)?

mate-terminal

What issue are you seeing?

Summary

When ~/.codex/state_5.sqlite (or its companion logs_2.sqlite) becomes unreadable as a SQLite database (for example truncated, partially written, or replaced with non-SQLite content), StateRuntime::init fails to open the pool with SQLite result code 26 (SQLITE_NOTADB, "file is not a database"), and the failure surfaces as:

failed to initialize state runtime at <codex_home>: error returned from database: (code: 26) file is not a database

The user data on disk (history.jsonl, sessions/**/rollout-*.jsonl, external_agent_session_imports.json) is intact, and Codex is designed to be able to rebuild thread metadata from JSONL via the rollout backfill path. But because open_state_sqlite / open_logs_sqlite propagate the open error unconditionally, the agent cannot make forward progress on startup until the user manually moves the corrupt DB aside.

This is the "Expected behavior" already requested in #20493:

If state_5.sqlite migration/open fails, Desktop should detect the corruption, preserve the DB, rebuild from JSONL, and show a visible recovery/warning state instead of looking like the user's chats are gone.

I'd like to extend that ask to the CLI/state runtime layer as well, since the same DB and the same open path are shared.

Affected code

  • codex-rs/state/src/runtime.rs
    • StateRuntime::init -> open_state_sqlite -> SqlitePoolOptions::connect_with (returns sqlx::Error::Database with code 26)
    • StateRuntime::init -> open_logs_sqlite (same shape)
  • codex-rs/rollout/src/state_db.rs
    • init swallows the error and emits a startup warning, but the DB is then None for the rest of the process. Consumers that still expect a usable DB (e.g. embedded app-server thread state) operate in a degraded state.

PR #21481 ("Revert state DB injection and agent graph store") helpfully restored Option<StateDbHandle> through several call sites, which softens the blast radius. It does not, however, recover from corruption — the user still needs to manually quarantine the file before Codex can rebuild.

Why this happens in practice

I have seen this on Linux after an aborted/forced shutdown left state_5.sqlite truncated to ~0 bytes with stale -wal/-shm sidecars. #20493 reports the same shape on macOS Desktop after the import flow, where the file ends up as generic data rather than a SQLite header. Because the DB is rebuildable from JSONL, the failure mode is recoverable in principle — it just needs a code path that does the rebuild.

Proposed fix

At the SQLite open boundary in codex-rs/state/src/runtime.rs, detect SQLite result code 26 (SQLITE_NOTADB, "file is not a database") specifically, quarantine the file (and its -wal, -shm, -journal sidecars) by renaming each with a .corrupt-<UTC timestamp> suffix, and retry the open once so migrations recreate a fresh schema. The rollout backfill path then rebuilds thread metadata as it normally would.

Important constraints:

  • Do not auto-quarantine on migration errors, permission errors, or lock errors. Those are not corruption and should continue to surface as real failures.
  • Do not delete the corrupt file. Rename with a timestamped suffix so users can keep it for forensics or send it in if asked.
  • Apply the same recovery to both state_5.sqlite and logs_2.sqlite, since both go through symmetric helpers.
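To make the first constraint concrete, here is a minimal, std-only sketch of how an open failure could be classified before deciding to quarantine. The names `OpenFailure` and `classify_sqlite_code` are hypothetical and not part of Codex; the sketch assumes the driver hands back the SQLite result code as a decimal string (as the `db_err.code()` comparison in this issue suggests) and relies on SQLite keeping the primary result code in the low 8 bits of any extended code. Unlike the classifier proposed in this issue, it falls back to message matching only when no numeric code is available.

```rust
/// SQLite primary result code for "file is not a database" (from sqlite3.h).
const SQLITE_NOTADB: u32 = 26;

/// Hypothetical classification of an open failure.
#[derive(Debug, PartialEq, Eq)]
enum OpenFailure {
    /// Corruption-shaped: safe to quarantine and recreate.
    NotADatabase,
    /// Anything else (busy, locked, cannot open, ...): surface as a real error.
    Other,
}

/// Classifies a failure from its (optional) decimal code string and message.
fn classify_sqlite_code(code: Option<&str>, message: &str) -> OpenFailure {
    if let Some(n) = code.and_then(|c| c.trim().parse::<u32>().ok()) {
        // Extended result codes keep the primary code in the low 8 bits.
        return if n & 0xFF == SQLITE_NOTADB {
            OpenFailure::NotADatabase
        } else {
            OpenFailure::Other
        };
    }
    // No usable numeric code: fall back to the message text.
    if message.contains("file is not a database") {
        OpenFailure::NotADatabase
    } else {
        OpenFailure::Other
    }
}
```

With this shape, `classify_sqlite_code(Some("5"), "database is locked")` yields `Other`, so lock contention keeps surfacing as a failure, while code 26 is the only numeric code that triggers recovery.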

Sketch of the change in runtime.rs (only the recovery wrapper and the error classifier are shown; quarantine_sqlite_files is omitted because its body is straightforward):

async fn open_state_sqlite_recovering(
    path: &Path,
    migrator: &Migrator,
) -> anyhow::Result<SqlitePool> {
    match open_state_sqlite(path, migrator).await {
        Ok(pool) => Ok(pool),
        Err(err) if is_sqlite_not_database_error(&err) => {
            warn!(
                "state db at {} is not a valid sqlite database; quarantining and recreating it",
                path.display()
            );
            quarantine_sqlite_files(path, "state").await?;
            open_state_sqlite(path, migrator).await
        }
        Err(err) => Err(err),
    }
}

fn is_sqlite_not_database_error(err: &anyhow::Error) -> bool {
    err.chain().any(|cause| {
        let Some(sqlx_err) = cause.downcast_ref::<sqlx::Error>() else {
            return false;
        };
        match sqlx_err {
            sqlx::Error::Database(db_err) => {
                db_err.code().as_deref() == Some("26")
                    || db_err.message().contains("file is not a database")
            }
            _ => sqlx_err.to_string().contains("file is not a database"),
        }
    })
}
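The wrapper calls quarantine_sqlite_files, which this issue does not show. The following is a rough, std-only sketch of what such a helper might look like, not the actual implementation. It is synchronous and uses a different, hypothetical signature (`quarantine_sqlite_files_sync` takes the timestamp string as an argument so it can be tested deterministically); a real version would presumably use tokio::fs. The stamp format (seconds since the Unix epoch) is also an assumption, since the issue only says "UTC timestamp".

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Suffixes to move aside; the empty suffix is the main database file.
const SUFFIXES: [&str; 4] = ["", "-wal", "-shm", "-journal"];

/// Renames `db_path` and any existing sidecars to `<name>.corrupt-<stamp>`.
/// Files are renamed, never deleted. Returns the new paths of moved files.
fn quarantine_sqlite_files_sync(db_path: &Path, stamp: &str) -> io::Result<Vec<PathBuf>> {
    let mut moved = Vec::new();
    for suffix in SUFFIXES {
        // Build "<db_path><suffix>" without lossy string conversion.
        let mut src = db_path.as_os_str().to_owned();
        src.push(suffix);
        let src = PathBuf::from(src);
        if !src.exists() {
            continue; // Sidecars are optional; skip the ones that are absent.
        }
        let mut dst = src.clone().into_os_string();
        dst.push(format!(".corrupt-{stamp}"));
        let dst = PathBuf::from(dst);
        fs::rename(&src, &dst)?;
        moved.push(dst);
    }
    Ok(moved)
}

/// Hypothetical stamp format: whole seconds since the Unix epoch (UTC).
fn utc_stamp() -> String {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
        .to_string()
}
```

A caller would invoke it as `quarantine_sqlite_files_sync(path, &utc_stamp())` before retrying the open. Because the rename happens in the same directory, it stays on one filesystem, which avoids cross-device rename failures.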

A regression test that writes garbage to state_5.sqlite, calls StateRuntime::init, and then asserts that:

  • init returns Ok(_),
  • the schema is present (e.g. get_backfill_state returns Pending),
  • the original file has been preserved with a .corrupt-<UTC> suffix,

is enough to lock the behavior in place.

What steps can reproduce the bug?

# Any CODEX_HOME works; a throwaway temp directory keeps real state untouched.
CODEX_HOME=$(mktemp -d)
mkdir -p "$CODEX_HOME"
printf 'not a sqlite database' > "$CODEX_HOME/state_5.sqlite"

# Start any path that calls StateRuntime::init, e.g. `codex` (TUI).
CODEX_HOME="$CODEX_HOME" codex
# -> "failed to initialize state runtime at <CODEX_HOME>: error returned from database: (code: 26) file is not a database"

A unit test reproduces the same condition without needing the full TUI:

let codex_home = unique_temp_dir();
tokio::fs::create_dir_all(&codex_home).await?;
let state_path = state_db_path(codex_home.as_path());
tokio::fs::write(&state_path, b"not a sqlite database").await?;

// Today: fails with code 26 and the runtime cannot start.
let _ = StateRuntime::init(codex_home.clone(), "test-provider".to_string()).await?;

What is the expected behavior?

  1. On any path that calls StateRuntime::init, a corrupt SQLite state or logs DB no longer blocks startup.
  2. The corrupt file is preserved as <name>.corrupt-<UTC timestamp> next to the original, with -wal, -shm, -journal sidecars also moved aside.
  3. A fresh DB is created via the existing migrator; rollout backfill rebuilds thread metadata from JSONL.
  4. Other classes of open failure (migration, permission denied, lock contention) still surface as errors, unchanged.

Additional information

Alternatives considered

  • Front the open with a SQLite header sniff (read first 16 bytes, check "SQLite format 3\0"). Slightly cleaner error semantics, but only catches the truncated/garbage case. Corruption past the header still surfaces as code 26, so the chain-inspection path is still needed. I'd keep the chain-inspection design.
  • Keep recovery in callers (TUI, app-server, etc.). This duplicates logic and is racy across processes that share CODEX_HOME. The single chokepoint at StateRuntime::init covers every consumer, including future ones.
  • Always rebuild on any open error. Too broad. Hides real migration/permission/lock failures. Restricting to SQLite code 26 keeps the recovery scoped to corruption-shaped failures.
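For reference, the header sniff from the first alternative could look roughly like the std-only sketch below. `HeaderCheck` and `sniff_sqlite_header` are hypothetical names introduced for illustration. The sketch reads at most 16 bytes and compares them with the SQLite magic string; it treats a zero-length file separately because SQLite itself accepts an empty file as an empty database.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// The 16-byte magic string at the start of every SQLite 3 database file.
const SQLITE_MAGIC: &[u8; 16] = b"SQLite format 3\0";

/// Hypothetical result of inspecting the first bytes of a candidate DB file.
#[derive(Debug, PartialEq, Eq)]
enum HeaderCheck {
    Missing, // No file yet; the normal open path will create one.
    Empty,   // Zero bytes; SQLite treats this as an empty database.
    Valid,   // Starts with the SQLite magic string.
    Invalid, // Short or non-matching header.
}

fn sniff_sqlite_header(path: &Path) -> io::Result<HeaderCheck> {
    let mut file = match File::open(path) {
        Ok(f) => f,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(HeaderCheck::Missing),
        Err(e) => return Err(e), // Permission errors etc. stay real errors.
    };
    // Read up to 16 bytes, tolerating short reads.
    let mut buf = [0u8; 16];
    let mut filled = 0;
    while filled < buf.len() {
        let n = file.read(&mut buf[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    Ok(match filled {
        0 => HeaderCheck::Empty,
        16 if &buf == SQLITE_MAGIC => HeaderCheck::Valid,
        _ => HeaderCheck::Invalid,
    })
}
```

The sniff only proves that the first 16 bytes look right, so it can complement the chain-inspection classifier but not replace it.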
