openclaw - ✅(Solved) Fix EmbeddedAttemptSessionTakeoverError fires on legitimate co-tenant writes to shared sessions (regression in 2026.5.17) [2 pull requests, 2 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#84071Fetched 2026-05-20 03:44:25
View on GitHub
Comments
2
Participants
3
Timeline
20
Reactions
2
Author
Timeline (top)
labeled ×7referenced ×7cross-referenced ×3commented ×2

The new fingerprint-based session-takeover fence introduced in 2026.5.17 (Agents/sessions: ... release the embedded run's coarse transcript lock before model I/O while locking persistence and cleanup separately. Fixes #13744) treats any write to the session jsonl during the releaseForPrompt() window as adversarial takeover — including writes from legitimate co-tenants on the same session (heartbeat, cron, channel ingress) that go through the installSessionEventWriteLock / installSessionExternalHookWriteLock hooks.

Once tripped, hasSessionTakeover() is sticky and every subsequent withSessionWriteLock call throws. The diagnostic surfaces as a stalled session with recovery=none; the user-facing TUI shows "gateway disconnected: closed | idle" because the WS lane stalls at model_call:started and never streams.

Error Message

[diagnostic] lane task error: lane=main durationMs=116088 error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: ...sessions/<sid>.jsonl" [diagnostic] lane task error: lane=session:agent:main:main durationMs=116091 ... [model-fallback/decision] decision=candidate_failed ... reason=unknown detail=session file changed while embedded prompt lock was released [diagnostic] stalled session: ... activeWorkKind=model_call lastProgress=model_call:started lastProgressAge=150s recovery=none

Root Cause

Once tripped, hasSessionTakeover() is sticky and every subsequent withSessionWriteLock call throws. The diagnostic surfaces as a stalled session with recovery=none; the user-facing TUI shows "gateway disconnected: closed | idle" because the WS lane stalls at model_call:started and never streams.

Fix Action

Fix / Workaround

Workarounds

  • Restart gateway to clear the stalled lane (resets controller state — works until the next co-tenant write).
  • Pin TUI / critical turns to a dedicated agent / session not shared with heartbeat / cron / channels.
  • Roll back to 2026.5.16 (last build before the fence was added).

PR fix notes

PR #84046: fix(agents): stop false-positive session-takeover on runner's own transcript appends

Description (problem / solution / changelog)

Summary

The embedded attempt session-lock fence uses a stat()-based fingerprint (dev, ino, size, mtimeNs, ctimeNs, birthtimeNs) to detect "another lane queued a new user turn against this session" while the runner releases its coarse lock for the LLM prompt window. The fingerprint is a proxy that overfires: every transcript-append the runner itself performs during the released window (appendSessionTranscriptMessageLocked writes tool calls, tool results, and assistant replies via acquireSessionWriteLock(..., allowReentrant: true)) changes the file's size/mtimeNs/ctimeNs without going through refreshSessionFileFence. Long DM turns with tool calls then trip assertSessionFileFence and throw EmbeddedAttemptSessionTakeoverError after the reply has already been delivered, surfacing a misleading "Agent failed before reply" message on top of a successful turn.

Keep the stat fingerprint as the fast path. On mismatch, only tolerate the runner's append-only assistant/tool transcript shape; treat everything else as a real takeover. The verification requires ALL of:

  • dev/ino match: no atomic replacement onto a new inode.
  • birthtimeNs matches: catches the unlink+recreate-same-inode case (common on tmpfs/ext4 with rapid recreation, where dev and ino are reused but the inode itself is fresh).
  • size grew: file is append-only between fence-set and assert.
  • stat/read consistency: both snapshotSessionFileFence and verifyAppendOnlyRunnerOwnedExtension run a post-read stat and compare every fingerprint field (size, dev, ino, mtimeNs, ctimeNs, birthtimeNs) against the pre-read fingerprint. Any mismatch fails closed: the snapshot returns prefixHashHex: null (forces the next assertion to treat the file as taken over) and the slow path rejects directly. Catches the same-size in-place rewrite race where byte length alone would not detect drift.
  • bytes [0, fenceSize) hash identical to the fenced prefix hash: no in-place rewrite of earlier transcript content.
  • bytes [fenceSize, currentSize) are canonical session entries: every appended line must satisfy isSessionEntry from transcript-file-state.ts (record shape, type, id, parentId, timestamp, plus the per-type message-shape contract via isAgentMessage), with role narrowed to assistant, toolResult, or bashExecution. Non-object parsed values, user-role entries, custom entries, branch_summary entries, and any entry missing the outer base fields remain a real takeover signal.

Anything else (replacement, shrink, in-place rewrite at any size, new user turn, compaction, a malformed appended entry, a non-object tail value, or a stat/read inconsistency window) remains a real takeover. The fence snapshot captures the prefix SHA-256 alongside the fingerprint at releaseForPrompt and refreshSessionFileFence so the slow path can verify rigorously without keeping a file handle open.

The slow-path tail validation delegates to the canonical isSessionEntry whole-entry validator (now export-visible from transcript-file-state.ts, signature widened to unknown with an internal isRecord guard so non-record JSON values fail closed). No local mirror of the persisted message contract is maintained: if the upstream FileEntry/SessionEntry contract changes, the fence's accept-set tracks it automatically.

The nextSnapshot returned from a successful slow-path verification hashes the full new buffer (not the pre-fence prefix). Otherwise, when the guarded operation throws between assertSessionFileFence and refreshSessionFileFence, the stored snapshot would describe the new size but hold the old prefix hash, and the next legitimate runner append would false-positive on prefix-hash mismatch.

Lock-held invariant

Both snapshotSessionFileFence call sites run while the session write lock is held: releaseForPrompt (snapshot at L452 before lock.release() at L454) and refreshSessionFileFence (called inside withSessionWriteLock's runWithLock wrapped by activeWriteLock.run(lock, ...)). Canonical transcript writers (appendSessionTranscriptMessageLocked, migrateLinearTranscriptToParentLinked, ensureTranscriptHeader) all acquire that same lock. The stat/read consistency guard above is belt-and-suspenders for this invariant: it defends against multi-process scenarios (a second gateway against the same session file), an external editor, or an internal bug that bypasses the lock.

Closes #83436. Refs #83615 (this PR addresses the EmbeddedAttemptSessionTakeoverError surface; the issue also tracks unrelated kimi-k2.6 schema and DNS-fetch failures).

Real behavior proof

Behavior addressed: Embedded runner false-positive EmbeddedAttemptSessionTakeoverError fires after long Telegram DM turns whose reply was already delivered, surfacing a misleading "Agent failed before reply" message on top of a successful run.

Real environment tested: OpenClaw 2026.5.19-beta.1, Linux x86_64, Node 24.13.0. Long-running Telegram DM session against claude-opus-4-7 (~3.5 min turn with multiple exec and message tool calls), reproduced reliably before this patch. Production behavior probed via runtime-patched dist/selection-BpjGe-Y0.js carrying the same fix semantics, then long DM runs (41s to 122s) exercised through the Telegram channel post-restart.

Exact steps or command run after this patch:

node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts src/agents/pi-embedded-runner/transcript-file-state.test.ts
CI=true pnpm check:changed --staged=false
openclaw gateway restart

Evidence after fix:

Pre-patch failure mode, redacted runtime log from gateway (/tmp/openclaw/openclaw-2026-05-19.log):

03:30:06.971  embedded run agent end: runId=bbd7d487 isError=false
03:30:07.008  embedded run prompt end: runId=bbd7d487 sessionId=<session-id> durationMs=217236
03:30:07.054  lane task error: lane=main durationMs=219867 error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /home/.../agents/main/sessions/<session-id>.jsonl"
03:30:07.071  Embedded agent failed before reply: session file changed while embedded prompt lock was released: /home/.../<session-id>.jsonl

Note isError=false at 03:30:06.971: the assistant turn and message tool delivery both succeeded, followed 47ms later by the takeover throw, surfacing as a misleading Telegram error on top of a delivered reply.

Post-patch behavior on the same gateway, four long DM runs after gateway restart at 03:46 EDT, every one delivering cleanly with zero takeover errors (redacted runtime log):

03:51:18  embedded run prompt end: runId=62395ffd  durationMs=122344  (no takeover)
03:52:39  embedded run prompt end: runId=67e21652  durationMs=41345   (no takeover)
04:05:05  embedded run prompt end: runId=a5994980  durationMs=101810  (no takeover)
04:07:34  embedded run prompt end: runId=a837440b  durationMs=94395   (no takeover)

The post-patch window contains only two "Embedded agent failed before reply" entries (at 04:05:06 and 04:07:35), both with cause LLM request timed out, unrelated to the takeover path. Zero SessionTakeoverError occurrences post-patch.

Observed result after fix: Existing acceptance test exercises all three runner-owned roles (assistant, toolResult, bashExecution) with their canonical persisted shapes. Rejection tests cover: another owner queueing a new user turn, a runner-owned role with malformed message contents, a runner-owned message entry missing its outer id, a non-message entry (e.g. custom/branch_summary/compaction), a tail line that parses to a non-object value (e.g. literal null), in-place rewrite (rejected via prefix hash mismatch), unlink+recreate-same-inode replacement (rejected via birthtimeNs mismatch), a same-size in-place rewrite landing during the snapshot's stat-to-read window (rejected via post-read re-stat), and the same race during the slow-path verification's read (rejected via the same guard). A fence-consistency test covers the post-throw path: after a guarded operation throws between assert and refresh, the next valid runner-owned append must not false-positive on prefix-hash mismatch. Targeted Vitest output:

Test Files  4 passed (4)
     Tests  82 passed (82)
  Duration  3.86s

check:changed --staged=false is clean across the full repo lane (lint, typecheck, import-cycles, dependency guards, build artifacts, security/quality lanes).

What was not tested: Multi-process write contention on the same sessionFile from a different gateway process (no test infrastructure for cross-process locks in this surface). The prefix-hash check is O(fenceSize) I/O per fence-set; in production sessions seen so far (~150 KB and below) this is sub-millisecond. Both snapshot producers do one extra fs.stat (sub-millisecond on local disk) to enforce the stat/read consistency guard.

Notes

  • Adds three helpers near readSessionFileFingerprint: FenceSnapshot type, snapshotSessionFileFence (used at releaseForPrompt and refreshSessionFileFence), and verifyAppendOnlyRunnerOwnedExtension (used in the slow path of assertSessionFileFence). No new module-level dependencies beyond node:crypto's createHash.
  • Replaces one piece of controller state (fenceFingerprint: SessionFileFingerprint | undefined) with fenceSnapshot: FenceSnapshot | undefined. No new long-lived resources (no held file handles).
  • Exports isSessionEntry from transcript-file-state.ts so the fence's slow path can delegate the whole-entry contract to the canonical validator. The signature is widened to unknown with an internal isRecord guard; non-record JSON values fail closed at the canonical boundary instead of needing per-callsite type-shape checks. isAgentMessage is not exported (the fence reaches it transitively through isSessionEntry).

Changed files

  • src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts (modified, +297/-2)
  • src/agents/pi-embedded-runner/run/attempt.session-lock.ts (modified, +168/-6)
  • src/agents/pi-embedded-runner/transcript-file-state.ts (modified, +9/-13)

PR #84149: fix(session-lock): allow co-tenant writes during prompt I/O window

Description (problem / solution / changelog)

Summary

EmbeddedAttemptSessionTakeoverError fires on legitimate co-tenant writes (heartbeat, cron, channel ingress) to shared sessions during the prompt I/O window.

Root Cause

attempt.session-lock.ts:284-291: releaseForPrompt() records a file stat fingerprint and releases the lock. Co-tenant writers acquire the same fs-level lock and append to the session file (changing the stat). When the controller reacquires, assertSessionFileFence() sees a fingerprint mismatch and sets the sticky takeoverDetected flag — even though the write was coordinated through the same lock.

Fix

Add a process-scoped write epoch counter to acquireSessionWriteLock() that increments on each lock acquisition. The controller records the epoch at fence activation. On reacquisition, it checks the epoch delta:

  • delta == 1 (only controller acquired): fingerprint change = external takeover → throw
  • delta > 1 (co-tenants also acquired): fingerprint change = coordinated write → update fence, continue

Cross-process writes don't advance the in-process epoch, so true external takeovers are still detected.

Tests

4 new tests: co-tenant write accepted, external write detected, mixed co-tenant + external detected, multiple co-tenant writes succeed. All 26 existing session-lock tests pass.

Fixes #84071

Changed files

  • src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts (modified, +131/-0)
  • src/agents/pi-embedded-runner/run/attempt.session-lock.ts (modified, +22/-3)
  • src/agents/session-write-lock.ts (modified, +8/-0)

Code Example

[diagnostic] lane task error: lane=main durationMs=116088
  error="EmbeddedAttemptSessionTakeoverError: session file changed while
  embedded prompt lock was released: ...sessions/<sid>.jsonl"
[diagnostic] lane task error: lane=session:agent:main:main durationMs=116091 ...
[model-fallback/decision] decision=candidate_failed ... reason=unknown
  detail=session file changed while embedded prompt lock was released
[diagnostic] stalled session: ... activeWorkKind=model_call
  lastProgress=model_call:started lastProgressAge=150s recovery=none
RAW_BUFFERClick to expand / collapse

Summary

The new fingerprint-based session-takeover fence introduced in 2026.5.17 (Agents/sessions: ... release the embedded run's coarse transcript lock before model I/O while locking persistence and cleanup separately. Fixes #13744) treats any write to the session jsonl during the releaseForPrompt() window as adversarial takeover — including writes from legitimate co-tenants on the same session (heartbeat, cron, channel ingress) that go through the installSessionEventWriteLock / installSessionExternalHookWriteLock hooks.

Once tripped, hasSessionTakeover() is sticky and every subsequent withSessionWriteLock call throws. The diagnostic surfaces as a stalled session with recovery=none; the user-facing TUI shows "gateway disconnected: closed | idle" because the WS lane stalls at model_call:started and never streams.

Environment

  • OpenClaw 2026.5.18 (50a2481), Node 24.15.0, Linux LXC (Proxmox)
  • Gateway: local, loopback only, single-user
  • Default agent main, heartbeat 30m (default), kimi-k2.6:cloud primary via ollama-iron provider (~100s typical model call)
  • Shared session agent:main:main is also used by 8+ cron jobs and a Discord channel

Reproduction

  1. Configure default agent with Heartbeat 30m (default).
  2. Run any embedded turn through a slow provider (Ollama cloud, ~100s).
  3. Within ~30 minutes, heartbeat (or any other co-tenant) writes to the same session via the registered write-lock hooks while the model I/O window is open.
  4. The next withSessionWriteLock throws EmbeddedAttemptSessionTakeoverError; model_call stalls; subsequent retries on the same controller also throw.

Observed

Journal:

[diagnostic] lane task error: lane=main durationMs=116088
  error="EmbeddedAttemptSessionTakeoverError: session file changed while
  embedded prompt lock was released: ...sessions/<sid>.jsonl"
[diagnostic] lane task error: lane=session:agent:main:main durationMs=116091 ...
[model-fallback/decision] decision=candidate_failed ... reason=unknown
  detail=session file changed while embedded prompt lock was released
[diagnostic] stalled session: ... activeWorkKind=model_call
  lastProgress=model_call:started lastProgressAge=150s recovery=none

Both lane=main and lane=session:agent:main:main error at the same instant on the same session file with near-identical durationMs (off by 2–3 ms across multiple occurrences), confirming a within-process race rather than an external-process modification. Reproduced 4× in 2 hours on agent:main:main — cadence matches heartbeat (30 min).

Expected

The fence should distinguish writes by registered co-tenants (which already synchronize via installSessionEventWriteLock / installSessionExternalHookWriteLock) from external/uncoordinated mutators. A coordinated write should either (a) participate in the fingerprint by refreshing it under the write lock, or (b) not trip the fence at all.

Alternatively, provide a recovery path so the controller can re-fingerprint and resume after a legitimate concurrent write, rather than becoming permanently stuck on recovery=none.

Code references (2026.5.18 bundle)

  • dist/plugin-sdk/src/agents/pi-embedded-runner/run/attempt.session-lock.d.ts
  • dist/selection-Cr-9-UpD.js lines ~7827 (error class), ~7884 (createEmbeddedAttemptSessionLockController), ~7911 (assertSessionFileFence), ~7919 (refreshSessionFileFence)
  • The tunables session.writeLock.{acquireTimeoutMs, staleMs, maxHoldMs} (and corresponding OPENCLAW_SESSION_WRITE_LOCK_* env vars) do not affect the fence — it is fingerprint-based, not timeout-based.

Workarounds

  • Restart gateway to clear the stalled lane (resets controller state — works until the next co-tenant write).
  • Pin TUI / critical turns to a dedicated agent / session not shared with heartbeat / cron / channels.
  • Roll back to 2026.5.16 (last build before the fence was added).

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix EmbeddedAttemptSessionTakeoverError fires on legitimate co-tenant writes to shared sessions (regression in 2026.5.17) [2 pull requests, 2 comments, 3 participants]