openclaw - Sessions: log when the on-disk store changes between writer cache loads

Context

The session store at ~/.openclaw/sessions/store.json is serialized through an in-process writer queue (runExclusiveSessionStoreWrite, introduced in b4437047f4). Writers in the same process can't race each other.

There's still a class of mutation the queue cannot see: anything that touches store.json from outside the gateway process, such as a maintenance script, a manual edit, a test harness, or a second gateway. The cache-invalidation guard inside loadMutableSessionStoreForWriter already detects such a change on the next load (the cached mtimeMs / sizeBytes no longer match disk), but it silently drops the cache and continues. Operators get no signal that anything changed.

What's missing

A structured signal when the cache-invalidation guard fires. Today it's a silent reload; ideally it would be a structured warn log on the sessions/store subsystem with enough payload to triage.

Proposed change (PR incoming)

In loadMutableSessionStoreForWriter, when the in-process cache's recorded mtimeMs / sizeBytes no longer matches the live disk values, emit a single warn log via createSubsystemLogger("sessions/store") with a typed payload:

{
  schemaVersion: 1,
  type: "session_store_cache_invalidated",
  reason: "stat_mismatch" | "missing_file",
  storePath: string,
  observedMtime: number | null,
  expectedMtime: number | null,
  observedSize: number | null,
  expectedSize: number | null,
}

The writer continues with the freshly loaded snapshot — no retry, no abort, no behavior change. Just a signal.
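As a rough sketch of how the payload could be derived from the two stat snapshots: the StatSnapshot shape and the describeInvalidation helper are hypothetical names used only for illustration, and mapping "the file could not be stat'ed" to missing_file is an assumption about what that reason means, not something taken from the PR.

```typescript
type InvalidationReason = "stat_mismatch" | "missing_file";

// Payload shape as listed in the proposal.
interface SessionStoreCacheInvalidatedPayload {
  schemaVersion: 1;
  type: "session_store_cache_invalidated";
  reason: InvalidationReason;
  storePath: string;
  observedMtime: number | null;
  expectedMtime: number | null;
  observedSize: number | null;
  expectedSize: number | null;
}

// Hypothetical: the mtime and size recorded at cache time, or read from disk.
interface StatSnapshot {
  mtimeMs: number;
  sizeBytes: number;
}

// Returns null when the cache still matches disk, otherwise the warn payload.
// `observed` is null when the file could not be stat'ed (assumed: missing_file).
function describeInvalidation(
  storePath: string,
  expected: StatSnapshot,
  observed: StatSnapshot | null,
): SessionStoreCacheInvalidatedPayload | null {
  if (
    observed !== null &&
    observed.mtimeMs === expected.mtimeMs &&
    observed.sizeBytes === expected.sizeBytes
  ) {
    return null;
  }
  return {
    schemaVersion: 1,
    type: "session_store_cache_invalidated",
    reason: observed === null ? "missing_file" : "stat_mismatch",
    storePath,
    observedMtime: observed?.mtimeMs ?? null,
    expectedMtime: expected.mtimeMs,
    observedSize: observed?.sizeBytes ?? null,
    expectedSize: expected.sizeBytes,
  };
}
```

Carrying both the observed and expected values lets an operator tell a size change from an mtime-only change without reproducing the event.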

Approach

  • Additive. takeMutableSessionStoreCache keeps its public signature and return type. A new file-internal sibling tryTakeMutableSessionStoreCacheWithReason returns a discriminated WriterCacheLoad carrying observed/expected mtime/size on the invalidated path.
  • No new sink, no audit pipeline. Structured log.warn only. A future PR can promote to a typed audit event without renaming.
  • Fail-open. Logger failures are swallowed inside the emit helper — never propagate up to the writer.
  • Test seam. _setSessionStoreCacheInvalidatedEmitterForTest mirrors the existing withSessionStoreWriterForTest pattern, so tests can assert on the typed payload directly rather than parsing log strings.
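The bullets above could fit together roughly as follows. This is a sketch under assumptions: the variant names of WriterCacheLoad, the reduced InvalidationEvent shape, and the helper names are all hypothetical, and console.warn stands in for the subsystem logger.

```typescript
// Hypothetical subset of the payload; the full shape is listed in the proposal.
interface InvalidationEvent {
  type: "session_store_cache_invalidated";
  reason: "stat_mismatch" | "missing_file";
  storePath: string;
}

// Hypothetical discriminated result of a writer cache load.
type WriterCacheLoad<TStore> =
  | { kind: "hit"; store: TStore }
  | { kind: "empty" }
  | { kind: "invalidated"; event: InvalidationEvent };

type InvalidationEmitter = (event: InvalidationEvent) => void;

// Default sink: a structured warn. console.warn is a stand-in for the real logger.
const defaultEmitter: InvalidationEmitter = (event) => {
  console.warn("session store cache invalidated", event);
};

let activeEmitter: InvalidationEmitter = defaultEmitter;

// Test seam: swap the emitter, or pass null to restore the default.
function setInvalidationEmitterForTest(next: InvalidationEmitter | null): void {
  activeEmitter = next ?? defaultEmitter;
}

// Fail-open: an emitter that throws must never break the writer.
function emitInvalidation(event: InvalidationEvent): void {
  try {
    activeEmitter(event);
  } catch {
    // Swallowed on purpose; the signal is best-effort.
  }
}

// Only the invalidated variant carries an event, so narrowing on `kind` is enough.
function reportCacheLoad<TStore>(load: WriterCacheLoad<TStore>): void {
  if (load.kind === "invalidated") {
    emitInvalidation(load.event);
  }
}
```

With a swappable emitter, a test can capture the typed event directly and compare fields instead of matching log text.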

Known limitations (deliberate, documented)

  • Mid-flight drift. Mutations that land strictly between one writer's own load and its post-merge atomic rename are not detected; covering that case requires a second mtime check immediately before the write, or a cross-process file lock. Out of scope here.
  • Empty-cache bypass. The first writer after process boot has no cache to mismatch against, so external mutations pre-dating the cache are invisible.
  • Sub-second mtime resolution. On filesystems with ≥1s mtime granularity (FAT32, some SMB/NFS, certain Docker bind mounts), the guard misses an external write that lands within the same mtime tick as the cached stat when the file size is unchanged.

These are acceptable for an observability-first phase and are surfaced in the PR.
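To make the mtime-granularity limitation concrete, here is a small model of the comparison. It assumes a filesystem that rounds mtimes down to a fixed granularity; storedMtime and guardDetects are illustrative names, not functions from the codebase.

```typescript
// Round an mtime down to what a filesystem with the given granularity would store.
function storedMtime(mtimeMs: number, granularityMs: number): number {
  return Math.floor(mtimeMs / granularityMs) * granularityMs;
}

// True when an mtime/size comparison would notice the second write.
function guardDetects(
  firstWriteMs: number,
  secondWriteMs: number,
  firstSize: number,
  secondSize: number,
  granularityMs: number,
): boolean {
  const mtimeChanged =
    storedMtime(firstWriteMs, granularityMs) !== storedMtime(secondWriteMs, granularityMs);
  return mtimeChanged || firstSize !== secondSize;
}

// Two writes 650 ms apart, same size, on a filesystem with 1 s granularity: missed.
const missedOnCoarseFs = !guardDetects(5250, 5900, 512, 512, 1000);

// The same two writes with millisecond granularity: detected.
const detectedOnFineFs = guardDetects(5250, 5900, 512, 512, 1);
```

A size change is still caught at any granularity, which is why the guard compares both fields.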

Test coverage

Three regression tests in a new sibling file store-writer-concurrency.test.ts:

  1. Dual in-process merges through the queue — assert no tripwire fires (regression coverage for queue serialization).
  2. External `writeTextAtomic` between writes leaves the cache stale; the next writer's load fires the tripwire with a typed payload, and both mutations land in the merged result.
  3. ACP-bind preservation under cache eviction (`preserveExistingAcpMetadata` snapshots from the post-tripwire load, so external ACP changes survive).
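The first two scenarios can be illustrated with a toy in-memory model. This is not the real test file: FakeDisk, WriterState, commitToDisk, and writeEntry are hypothetical stand-ins for store.json, the writer cache, and the atomic write, kept deliberately small.

```typescript
type Store = Record<string, string>;

// In-memory stand-in for store.json plus the two stat fields the guard compares.
interface FakeDisk {
  content: Store;
  mtimeMs: number;
  sizeBytes: number;
}

interface WriterState {
  cache: { store: Store; mtimeMs: number; sizeBytes: number } | null;
  tripwires: string[];
}

// Replace the content and bump the stat fields, as an atomic rename would.
function commitToDisk(disk: FakeDisk, content: Store): void {
  disk.content = content;
  disk.mtimeMs += 1;
  disk.sizeBytes = JSON.stringify(content).length;
}

// One writer pass: reuse the cache if the stat matches, otherwise reload
// from disk and record a tripwire (only when a cache existed to mismatch).
function writeEntry(disk: FakeDisk, writer: WriterState, key: string, value: string): void {
  const cache = writer.cache;
  let base: Store;
  if (cache !== null && cache.mtimeMs === disk.mtimeMs && cache.sizeBytes === disk.sizeBytes) {
    base = cache.store;
  } else {
    if (cache !== null) writer.tripwires.push("stat_mismatch");
    base = disk.content;
  }
  const merged: Store = { ...base, [key]: value };
  commitToDisk(disk, merged);
  writer.cache = { store: merged, mtimeMs: disk.mtimeMs, sizeBytes: disk.sizeBytes };
}

function runScenario(): { tripwiresBeforeExternal: number; tripwires: string[]; keys: string[] } {
  const disk: FakeDisk = { content: {}, mtimeMs: 0, sizeBytes: 2 };
  const writer: WriterState = { cache: null, tripwires: [] };
  writeEntry(disk, writer, "a", "1"); // empty cache: nothing to mismatch against
  writeEntry(disk, writer, "b", "2"); // cache matches disk: no tripwire
  const tripwiresBeforeExternal = writer.tripwires.length;
  commitToDisk(disk, { ...disk.content, external: "x" }); // write from outside the writer
  writeEntry(disk, writer, "c", "3"); // stale cache: tripwire, then merge on the fresh load
  return { tripwiresBeforeExternal, tripwires: writer.tripwires, keys: Object.keys(disk.content).sort() };
}
```

In this model the external entry survives because the writer merges onto the reloaded snapshot rather than its stale cache, which mirrors the "both mutations land" assertion in scenario 2.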

Diff size: ~290 LOC total (~105 production, ~185 tests).

Why I'm opening an issue first

Wanted to confirm direction before opening the PR. This surfaces a contract rather than fixing a bug, so if the right answer is "we don't want this signal" or "we want it shaped differently," I'd rather hear it now. Happy to adjust scope, drop tests, or rework the framing.
