openclaw - 💡(How to fix) Fix Fallback chain consumed by EmbeddedAttemptSessionTakeoverError: local session-file race triggers cross-provider fallback that can't help [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#84204Fetched 2026-05-20 03:42:44
View on GitHub
Comments
1
Participants
2
Timeline
5
Reactions
1
Author
Timeline (top)
labeled ×3closed ×1commented ×1

In 2026.5.18 (and earlier observed in 2026.5.16-beta.7), a subagent turn can be killed mid-attempt by EmbeddedAttemptSessionTakeoverError ("session file changed while embedded prompt lock was released"). The model-fallback chain then walks through every configured candidate hitting the same error, because the trigger is a concurrent local writer to the session .jsonl — not the model provider — so the fallback isn't actually mitigating the failure mode it was designed for.

Reproduced cleanly today: 3 attempts (Anthropic Claude Opus 4.7 → OpenAI GPT-5.5 → Google Gemini 3 Flash) failed the embedded-prompt-lock fingerprint check in succession; only Gemini happened to win the race on the third attempt.

Error Message

In 2026.5.18 (and earlier observed in 2026.5.16-beta.7), a subagent turn can be killed mid-attempt by EmbeddedAttemptSessionTakeoverError ("session file changed while embedded prompt lock was released"). The model-fallback chain then walks through every configured candidate hitting the same error, because the trigger is a concurrent local writer to the session .jsonl — not the model provider — so the fallback isn't actually mitigating the failure mode it was designed for. 4. Fallback layer treats it as a candidate failure and tries the next model. Because the trigger is local-file mutation (which has nothing to do with which model is being called), every fallback hits the same error until either: (a) no writer happens to fire during the window, or (b) the chain is exhausted. | 10:15:24 | 2 | openai/gpt-5.5 | candidate_failed | same error | (1) alone would have made today's repro a non-event — same error 3×, just retry Opus 2 more times instead of fanning out to a different model.

  • Wastes a fallback budget on a non-provider error.

Root Cause

In 2026.5.18 (and earlier observed in 2026.5.16-beta.7), a subagent turn can be killed mid-attempt by EmbeddedAttemptSessionTakeoverError ("session file changed while embedded prompt lock was released"). The model-fallback chain then walks through every configured candidate hitting the same error, because the trigger is a concurrent local writer to the session .jsonl — not the model provider — so the fallback isn't actually mitigating the failure mode it was designed for.

Fix Action

Fix / Workaround

  • EmbeddedAttemptSessionTakeoverError is raised when the post-LLM-call fingerprint (dev, ino, size, mtimeNs, ctimeNs) differs from the pre-call snapshot.
  • The lock is intentionally released across the provider call (good — long-running network IO shouldn't block other writers).
  • But there's a _processAgentEvent write-lock installed on the session, and hook-side locks (beforeToolCall, afterToolCall, onPayload, onResponse, compact) installed on the agent.
  • For subagents, two things can still slip past:
    1. Parent-session writers reaching this session's .jsonl (e.g. dispatcher persisting a steer/new user message) — those don't go through _processAgentEvent or any of the installed lockable functions on the subagent's agent.
    2. Plugin write paths (memory-episodic indexer flushing the just-finished turn's events on agent_end) — agent_end fires on the previous turn's completion but can still be writing when the next turn starts.

Code Example

session file changed while embedded prompt lock was released:
/home/shadeform/.openclaw/agents/coli/sessions/<id>.jsonl
RAW_BUFFERClick to expand / collapse

Summary

In 2026.5.18 (and earlier observed in 2026.5.16-beta.7), a subagent turn can be killed mid-attempt by EmbeddedAttemptSessionTakeoverError ("session file changed while embedded prompt lock was released"). The model-fallback chain then walks through every configured candidate hitting the same error, because the trigger is a concurrent local writer to the session .jsonl — not the model provider — so the fallback isn't actually mitigating the failure mode it was designed for.

Reproduced cleanly today: 3 attempts (Anthropic Claude Opus 4.7 → OpenAI GPT-5.5 → Google Gemini 3 Flash) failed the embedded-prompt-lock fingerprint check in succession; only Gemini happened to win the race on the third attempt.

Repro

Setup:

  • OpenClaw 2026.5.18 (also observed on 2026.5.16-beta.7)
  • Node 22.22.2, Linux
  • Subagent session with the memory-episodic plugin enabled and agent_end / before_prompt_build hooks installed (this plugin actively writes to per-agent session/episodic state during a turn)
  • Fallback chain configured: anthropic/claude-opus-4-7, openai/gpt-5.5, google/gemini-3-flash-preview, xai/grok-4.20-0309-non-reasoning

Trigger:

  1. Subagent is in an LLM turn.
  2. Parent (or any other writer in this process tree) appends to the subagent's session .jsonl — e.g. a follow-up user message arrives, or a session-write event fires from a hook — between the lock being released to call the provider and being re-acquired to commit the response.
  3. The provider call returns successfully, but readSessionFileFingerprint sees dev/ino/size/mtimeNs changed and the request is aborted with:
session file changed while embedded prompt lock was released:
/home/shadeform/.openclaw/agents/coli/sessions/<id>.jsonl
  1. Fallback layer treats it as a candidate failure and tries the next model. Because the trigger is local-file mutation (which has nothing to do with which model is being called), every fallback hits the same error until either: (a) no writer happens to fire during the window, or (b) the chain is exhausted.

Three-attempt sample from today (UTC):

TimeAttemptModelDecisionFailure detail
10:14:151anthropic/claude-opus-4-7candidate_failedsession file changed while embedded prompt lock was released: …
10:15:242openai/gpt-5.5candidate_failedsame error
10:15:373google/gemini-3-flash-previewcandidate_succeeded

All three attempts share the same sessionId (subagent session). Provider/model is irrelevant; the same errorHash repeats. The fallback chain's "tries next model" semantics aren't useful here — what's needed is "wait for the writer to settle and retry the same candidate."

What I think is happening

Reading dist/selection-Cr-9-UpD.js around line 7800:

  • EmbeddedAttemptSessionTakeoverError is raised when the post-LLM-call fingerprint (dev, ino, size, mtimeNs, ctimeNs) differs from the pre-call snapshot.
  • The lock is intentionally released across the provider call (good — long-running network IO shouldn't block other writers).
  • But there's a _processAgentEvent write-lock installed on the session, and hook-side locks (beforeToolCall, afterToolCall, onPayload, onResponse, compact) installed on the agent.
  • For subagents, two things can still slip past:
    1. Parent-session writers reaching this session's .jsonl (e.g. dispatcher persisting a steer/new user message) — those don't go through _processAgentEvent or any of the installed lockable functions on the subagent's agent.
    2. Plugin write paths (memory-episodic indexer flushing the just-finished turn's events on agent_end) — agent_end fires on the previous turn's completion but can still be writing when the next turn starts.

Either way, by the time the provider responds, the fingerprint is dirty and the attempt is voided.

Suggested fix(es)

A few options, in increasing order of invasiveness:

  1. Treat EmbeddedAttemptSessionTakeoverError as a retryable on the SAME candidate, not as a fallback trigger. This is the smallest fix. Fallback exists to route around provider-side failures; a local fingerprint race is the opposite. Bound the retry count (e.g. 2–3) before giving up.
  2. Hold the embedded prompt lock across appendUserMessage from a parent session into a child session. That is, route external writes to a subagent's .jsonl through the same lockable function chain as the subagent's own events. The current installSessionEventWriteLock only wraps session._processAgentEvent — cross-session writes bypass it.
  3. Re-fingerprint AFTER the lock is re-acquired rather than checking on entry — i.e. drain pending writes that arrived during the provider call, fold them in, and commit. The "takeover" is only a takeover if some other writer is actively producing a contradictory transcript; an append from a parent (a queued user message) is mergeable, not a takeover.

(1) alone would have made today's repro a non-event — same error 3×, just retry Opus 2 more times instead of fanning out to a different model.

Impact

  • Wastes a fallback budget on a non-provider error.
  • Produces confusing telemetry — operator sees "model fallback" entries pointing at Opus → GPT-5.5 → Gemini and reasonably assumes Anthropic was the problem, when it was actually a local file race.
  • The "winning" model on attempt N is essentially random (whichever provider happens to respond in a quiet window).
  • When the chain happens to land on a different family, the persona/style of the run shifts mid-conversation. From a user's point of view, "my Opus agent suddenly became Gemini for one message" — without anything visible explaining why.

Happy to attach the full JSON event records (event=model_fallback_decision, sub-requestedModelMatched: false, errorHash: sha256:3a22ecfdbc85 shared across all three) on request.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix Fallback chain consumed by EmbeddedAttemptSessionTakeoverError: local session-file race triggers cross-provider fallback that can't help [1 comments, 2 participants]