openclaw - 💡(How to fix) Fix Fallback chain consumed by EmbeddedAttemptSessionTakeoverError: local session-file race triggers cross-provider fallback that can't help [1 comments, 2 participants]

openclaw2026-05-19 15:27:57

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#84204•Fetched 2026-05-20 03:42:44

View on GitHub

Comments

Participants

Timeline

Reactions

Author

esqandil

Participants

clawsweeper[bot]

esqandil

Timeline (top)

labeled ×3closed ×1commented ×1

Reproduced cleanly today: 3 attempts (Anthropic Claude Opus 4.7 → OpenAI GPT-5.5 → Google Gemini 3 Flash) failed the embedded-prompt-lock fingerprint check in succession; only Gemini happened to win the race on the third attempt.

Error Message

In 2026.5.18 (and earlier observed in 2026.5.16-beta.7), a subagent turn can be killed mid-attempt by EmbeddedAttemptSessionTakeoverError ("session file changed while embedded prompt lock was released"). The model-fallback chain then walks through every configured candidate hitting the same error, because the trigger is a concurrent local writer to the session .jsonl — not the model provider — so the fallback isn't actually mitigating the failure mode it was designed for. 4. Fallback layer treats it as a candidate failure and tries the next model. Because the trigger is local-file mutation (which has nothing to do with which model is being called), every fallback hits the same error until either: (a) no writer happens to fire during the window, or (b) the chain is exhausted. | 10:15:24 | 2 | openai/gpt-5.5 | candidate_failed | same error | (1) alone would have made today's repro a non-event — same error 3×, just retry Opus 2 more times instead of fanning out to a different model.

Wastes a fallback budget on a non-provider error.

Root Cause

Fix Action

Fix / Workaround

EmbeddedAttemptSessionTakeoverError is raised when the post-LLM-call fingerprint (dev, ino, size, mtimeNs, ctimeNs) differs from the pre-call snapshot.
The lock is intentionally released across the provider call (good — long-running network IO shouldn't block other writers).
But there's a _processAgentEvent write-lock installed on the session, and hook-side locks (beforeToolCall, afterToolCall, onPayload, onResponse, compact) installed on the agent.
For subagents, two things can still slip past:
1. Parent-session writers reaching this session's .jsonl (e.g. dispatcher persisting a steer/new user message) — those don't go through _processAgentEvent or any of the installed lockable functions on the subagent's agent.
2. Plugin write paths (memory-episodic indexer flushing the just-finished turn's events on agent_end) — agent_end fires on the previous turn's completion but can still be writing when the next turn starts.

Code Example

session file changed while embedded prompt lock was released:
/home/shadeform/.openclaw/agents/coli/sessions/<id>.jsonl

RAW_BUFFERClick to expand / collapse

Summary

Repro

Setup:

OpenClaw 2026.5.18 (also observed on 2026.5.16-beta.7)
Node 22.22.2, Linux
Subagent session with the memory-episodic plugin enabled and agent_end / before_prompt_build hooks installed (this plugin actively writes to per-agent session/episodic state during a turn)
Fallback chain configured: anthropic/claude-opus-4-7, openai/gpt-5.5, google/gemini-3-flash-preview, xai/grok-4.20-0309-non-reasoning

Trigger:

Subagent is in an LLM turn.
Parent (or any other writer in this process tree) appends to the subagent's session .jsonl — e.g. a follow-up user message arrives, or a session-write event fires from a hook — between the lock being released to call the provider and being re-acquired to commit the response.
The provider call returns successfully, but readSessionFileFingerprint sees dev/ino/size/mtimeNs changed and the request is aborted with:

session file changed while embedded prompt lock was released:
/home/shadeform/.openclaw/agents/coli/sessions/<id>.jsonl

Fallback layer treats it as a candidate failure and tries the next model. Because the trigger is local-file mutation (which has nothing to do with which model is being called), every fallback hits the same error until either: (a) no writer happens to fire during the window, or (b) the chain is exhausted.

Three-attempt sample from today (UTC):

Time	Attempt	Model	Decision	Failure detail
10:14:15	1	anthropic/claude-opus-4-7	candidate_failed	`session file changed while embedded prompt lock was released: …`
10:15:24	2	openai/gpt-5.5	candidate_failed	same error
10:15:37	3	google/gemini-3-flash-preview	candidate_succeeded	—

All three attempts share the same sessionId (subagent session). Provider/model is irrelevant; the same errorHash repeats. The fallback chain's "tries next model" semantics aren't useful here — what's needed is "wait for the writer to settle and retry the same candidate."

What I think is happening

Reading dist/selection-Cr-9-UpD.js around line 7800:

EmbeddedAttemptSessionTakeoverError is raised when the post-LLM-call fingerprint (dev, ino, size, mtimeNs, ctimeNs) differs from the pre-call snapshot.
The lock is intentionally released across the provider call (good — long-running network IO shouldn't block other writers).
But there's a _processAgentEvent write-lock installed on the session, and hook-side locks (beforeToolCall, afterToolCall, onPayload, onResponse, compact) installed on the agent.
For subagents, two things can still slip past:
1. Parent-session writers reaching this session's .jsonl (e.g. dispatcher persisting a steer/new user message) — those don't go through _processAgentEvent or any of the installed lockable functions on the subagent's agent.
2. Plugin write paths (memory-episodic indexer flushing the just-finished turn's events on agent_end) — agent_end fires on the previous turn's completion but can still be writing when the next turn starts.

Either way, by the time the provider responds, the fingerprint is dirty and the attempt is voided.

Suggested fix(es)

A few options, in increasing order of invasiveness:

Treat EmbeddedAttemptSessionTakeoverError as a retryable on the SAME candidate, not as a fallback trigger. This is the smallest fix. Fallback exists to route around provider-side failures; a local fingerprint race is the opposite. Bound the retry count (e.g. 2–3) before giving up.
Hold the embedded prompt lock across appendUserMessage from a parent session into a child session. That is, route external writes to a subagent's .jsonl through the same lockable function chain as the subagent's own events. The current installSessionEventWriteLock only wraps session._processAgentEvent — cross-session writes bypass it.
Re-fingerprint AFTER the lock is re-acquired rather than checking on entry — i.e. drain pending writes that arrived during the provider call, fold them in, and commit. The "takeover" is only a takeover if some other writer is actively producing a contradictory transcript; an append from a parent (a queued user message) is mergeable, not a takeover.

(1) alone would have made today's repro a non-event — same error 3×, just retry Opus 2 more times instead of fanning out to a different model.

Impact

Wastes a fallback budget on a non-provider error.
Produces confusing telemetry — operator sees "model fallback" entries pointing at Opus → GPT-5.5 → Gemini and reasonably assumes Anthropic was the problem, when it was actually a local file race.
The "winning" model on attempt N is essentially random (whichever provider happens to respond in a quiet window).
When the chain happens to land on a different family, the persona/style of the run shifts mid-conversation. From a user's point of view, "my Opus agent suddenly became Gemini for one message" — without anything visible explaining why.

Happy to attach the full JSON event records (event=model_fallback_decision, sub-requestedModelMatched: false, errorHash: sha256:3a22ecfdbc85 shared across all three) on request.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#model save/load #optimization #mixed precision #training loop #device allocation

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix Fallback chain consumed by EmbeddedAttemptSessionTakeoverError: local session-file race triggers cross-provider fallback that can't help [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Code Example

Summary

Repro

What I think is happening

Suggested fix(es)

Impact

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix Fallback chain consumed by EmbeddedAttemptSessionTakeoverError: local session-file race triggers cross-provider fallback that can't help [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Code Example

Summary

Repro

What I think is happening

Suggested fix(es)

Impact

Still need to ship something?

RELATED_DISCOVERY

TRENDING