openclaw - 💡(How to fix) Fix EmbeddedAttemptSessionTakeoverError cluster on 2026.5.22 — 13 events / 7 jobs / 42h

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Since 2026-05-24 we're seeing a cluster of cron runs failing with EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released. 13 distinct failure events across 7 jobs and 4 agents over a 42-hour span, on a 2026.5.22 gateway. The class survives docker restart because the failed session's .jsonl is replayed on the next scheduled run of the same cron. We initially hoped #85256 (in the 5.22 changelog) closed this lane, but the error is still firing post-5.22 and post-restart.

This issue is filed at the prompt of an internal escalation rule: "file upstream if the next-cycle audit reproduces." Today's audit reproduced it, so we're surfacing.

Error Message

{ "action": "finished", "status": "error", "error": "EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl", "diagnostics": { "summary": "session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl", "entries": [ { "source": "agent-run", "severity": "error", "message": "session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl" } ] }, "deliveryStatus": "not-requested", "sessionId": "<sid>", "sessionKey": "agent:<agent-1>:cron:<job>:run:<sid>", "durationMs": 227552 }

Root Cause

Since 2026-05-24 we're seeing a cluster of cron runs failing with EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released. 13 distinct failure events across 7 jobs and 4 agents over a 42-hour span, on a 2026.5.22 gateway. The class survives docker restart because the failed session's .jsonl is replayed on the next scheduled run of the same cron. We initially hoped #85256 (in the 5.22 changelog) closed this lane, but the error is still firing post-5.22 and post-restart.

Fix Action

Fix / Workaround

The only workaround we've validated for an adjacent error class (Invalid signature in thinking block) was to manually archive the affected .jsonl and clear its entry from sessions.json. We haven't applied that here yet — most of the affected crons are reminder/alert class, so we've been letting the next scheduled interval roll a fresh session — but that's coin-flip recovery, not a real fix.

  • Acknowledgment that this class is open / known
  • Either a mechanism explanation (so we can patch heartbeat timing / cron spacing locally as a stopgap) or an upstream fix that doesn't require manual .jsonl archiving + sessions.json surgery

Code Example

{
  "action": "finished",
  "status": "error",
  "error": "EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl",
  "diagnostics": {
    "summary": "session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl",
    "entries": [
      {
        "source": "agent-run",
        "severity": "error",
        "message": "session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl"
      }
    ]
  },
  "deliveryStatus": "not-requested",
  "sessionId": "<sid>",
  "sessionKey": "agent:<agent-1>:cron:<job>:run:<sid>",
  "durationMs": 227552
}

---

2026-05-24T11:03:47Z  agent-1  job-A (4×/day reminder)     sess=<s1>   dur=227.6s
2026-05-24T13:04:35Z  agent-1  job-A (4×/day reminder)     sess=<s2>   dur=226.1s
2026-05-24T14:04:37Z  agent-1  job-B (daily quiz)          sess=<s3>   dur=228.3s
2026-05-24T15:04:46Z  agent-1  job-A (4×/day reminder)     sess=<s4>   dur=226.2s
2026-05-24T17:06:32Z  agent-1  job-C (daily reminder)      sess=<s5>   dur=225.2s
2026-05-25T03:04:27Z  agent-4  job-D (morning brief)       sess=<s6>   dur=189.2s
2026-05-25T05:33:50Z  agent-3  job-E (recurring alert)     sess=<s7>   dur=229.5s
2026-05-25T06:05:04Z  agent-2  job-F (weekly maintenance)  sess=<s8>   dur=236.2s
2026-05-25T09:03:46Z  agent-1  job-A (4×/day reminder)     sess=<s9>   dur=226.8s
2026-05-25T11:03:44Z  agent-1  job-A (4×/day reminder)     sess=<s10>  dur=224.6s
2026-05-25T15:04:33Z  agent-1  job-A (4×/day reminder)     sess=<s11>  dur=225.9s
2026-05-25T18:05:14Z  agent-2  job-G (evening brief)       sess=<s12>  dur=226.0s
2026-05-26T05:33:50Z  agent-3  job-E (recurring alert)     sess=<s13>  dur=229.7s

---

-rw-r--r-- 1 node node 2677 May 24 19:03 /data/.openclaw/agents/<agent-1>/sessions/<sid-1>.jsonl
-rw-r--r-- 1 node node 2475 May 25 14:03 /data/.openclaw/agents/<agent-2>/sessions/<sid-2>.jsonl
RAW_BUFFERClick to expand / collapse

Summary

Since 2026-05-24 we're seeing a cluster of cron runs failing with EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released. 13 distinct failure events across 7 jobs and 4 agents over a 42-hour span, on a 2026.5.22 gateway. The class survives docker restart because the failed session's .jsonl is replayed on the next scheduled run of the same cron. We initially hoped #85256 (in the 5.22 changelog) closed this lane, but the error is still firing post-5.22 and post-restart.

This issue is filed at the prompt of an internal escalation rule: "file upstream if the next-cycle audit reproduces." Today's audit reproduced it, so we're surfacing.

Environment

  • OpenClaw 2026.5.22 (a374c3a)
  • Docker container on a VPS, Node --max-old-space-size=2048
  • Config dir /data/.openclaw/, sessions under /data/.openclaw/agents/<agent>/sessions/
  • 5 heartbeat-active agents; 1 cold agent (opt-in heartbeat). 4 of the 5 hot agents appear in this cluster.
  • All affected crons use isolated sessions (--session isolated)

Error signature

Every failed run records exactly this shape — no stack trace, only the summary message:

{
  "action": "finished",
  "status": "error",
  "error": "EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl",
  "diagnostics": {
    "summary": "session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl",
    "entries": [
      {
        "source": "agent-run",
        "severity": "error",
        "message": "session file changed while embedded prompt lock was released: /data/.openclaw/agents/<agent-1>/sessions/<sid>.jsonl"
      }
    ]
  },
  "deliveryStatus": "not-requested",
  "sessionId": "<sid>",
  "sessionKey": "agent:<agent-1>:cron:<job>:run:<sid>",
  "durationMs": 227552
}

durationMs lands consistently in the 225–236s band across 12 of 13 events, suggesting the takeover trips at a similar phase rather than at random. (One outlier at 189s.)

Cluster — 13 events, 4 agents, 7 jobs, 42-hour span

All 13 events below are verified against on-disk cron/runs/<job>.jsonl lines and the named session .jsonl exists on disk for every row.

2026-05-24T11:03:47Z  agent-1  job-A (4×/day reminder)     sess=<s1>   dur=227.6s
2026-05-24T13:04:35Z  agent-1  job-A (4×/day reminder)     sess=<s2>   dur=226.1s
2026-05-24T14:04:37Z  agent-1  job-B (daily quiz)          sess=<s3>   dur=228.3s
2026-05-24T15:04:46Z  agent-1  job-A (4×/day reminder)     sess=<s4>   dur=226.2s
2026-05-24T17:06:32Z  agent-1  job-C (daily reminder)      sess=<s5>   dur=225.2s
2026-05-25T03:04:27Z  agent-4  job-D (morning brief)       sess=<s6>   dur=189.2s
2026-05-25T05:33:50Z  agent-3  job-E (recurring alert)     sess=<s7>   dur=229.5s
2026-05-25T06:05:04Z  agent-2  job-F (weekly maintenance)  sess=<s8>   dur=236.2s
2026-05-25T09:03:46Z  agent-1  job-A (4×/day reminder)     sess=<s9>   dur=226.8s
2026-05-25T11:03:44Z  agent-1  job-A (4×/day reminder)     sess=<s10>  dur=224.6s
2026-05-25T15:04:33Z  agent-1  job-A (4×/day reminder)     sess=<s11>  dur=225.9s
2026-05-25T18:05:14Z  agent-2  job-G (evening brief)       sess=<s12>  dur=226.0s
2026-05-26T05:33:50Z  agent-3  job-E (recurring alert)     sess=<s13>  dur=229.7s

Spread:

  • Per UTC day: 5 (2026-05-24) → 7 (2026-05-25) → 1 (2026-05-26, partial-day at audit time)
  • Per agent: agent-1 ×8, agent-2 ×2, agent-3 ×2, agent-4 ×1
  • Per job (7 distinct): job-A ×6 (hot spot), job-E ×2, then job-B, job-C, job-D, job-F, job-G at ×1
  • Verified properties: 13 unique (job, session) pairs (0 duplicates); 0 retry-clusters within 300s (no two events on the same job within 5 min); 13/13 referenced session .jsonl files present on disk
  • Hot-spot detail: job-A fires 4×/day at 09/11/13/15 UTC. In this 42h window all four firing slots have at least one takeover failure (09 UTC ×1, 11 UTC ×2, 13 UTC ×1, 15 UTC ×2). Approx 12 expected firings (~2 days × 4/day within the window), 6 takeovers = ~50% per-job failure rate. Not a generalized claim — limited to this 42h window — but the per-job density is high enough that maintainers may find it useful as a reproducer focus.

(Two job IDs are unusual-looking but real: job-E's ID is a legacy semantic ID that predates the UUID convention, and job-G's ID is a UUID that coincidentally resembles a placeholder string. Both verified against cron/jobs.json.)

Restart does not clear the class

A docker restart does not resolve the issue. The failed session's .jsonl is still on disk and gets replayed on the next scheduled invocation of the same cron, producing the same error. For example, two of the affected session files are still present, node-owned, last-modified at the failure timestamp:

-rw-r--r-- 1 node node 2677 May 24 19:03 /data/.openclaw/agents/<agent-1>/sessions/<sid-1>.jsonl
-rw-r--r-- 1 node node 2475 May 25 14:03 /data/.openclaw/agents/<agent-2>/sessions/<sid-2>.jsonl

The only workaround we've validated for an adjacent error class (Invalid signature in thinking block) was to manually archive the affected .jsonl and clear its entry from sessions.json. We haven't applied that here yet — most of the affected crons are reminder/alert class, so we've been letting the next scheduled interval roll a fresh session — but that's coin-flip recovery, not a real fix.

Possibly-adjacent prior fixes (not asserted — flagged for maintainers to confirm)

These are best-guess adjacencies based on naming overlap and changelog text. We have not validated the code paths and don't claim these are the same component.

  • #85256 — listed in the 2026.5.22 changelog as fixing a "session-stuck-running" class. We hoped this would close the takeover lane, but the cluster reproduces on a freshly-restarted 5.22 gateway. Could be a different sub-case, or the takeover error path was out of scope for that PR. Worth checking whether the EmbeddedAttemptSessionTakeover code path was touched.
  • #85764 — a recent fix in the file-lock area. Our error message names the "embedded prompt lock," so the lock subsystem is at least adjacent. Our file-lock-stale event count is at baseline (~79 events / 46 sessions over 7d), so lock-stale traffic is normal — the takeover is happening on top of that baseline, but the two may share a lock-acquisition code path.

Adjacency assertions limited to "naming overlap" — maintainers will know the actual code-path relationship better than we can.

Open questions for maintainers

  1. What actually triggers "session file changed while embedded prompt lock was released"? Concurrent cron + heartbeat collision on the same session? Sub-agent spawn writing back to the parent transcript? Compaction running mid-prompt? An async writer racing the prompt-render?
  2. Why does the affected .jsonl re-poison the next run on the same cron instead of being treated as stale/dead and rotated or skipped?
  3. Should EmbeddedAttemptSessionTakeoverError carry a real stack trace? Currently the diagnostics block has only the summary string — there is no information pointing at the writer that mutated the file or the holder of the released lock. With a stack we could narrow this down ourselves.

What we'd like

  • Acknowledgment that this class is open / known
  • Either a mechanism explanation (so we can patch heartbeat timing / cron spacing locally as a stopgap) or an upstream fix that doesn't require manual .jsonl archiving + sessions.json surgery

Reproducer offer

We can share anonymized session .jsonl files from any of the 13 affected runs if useful for tracing. We will scrub before sharing: phone numbers, personal names, business names, third-party API tokens, and any database / bot IDs. Just let us know which jobs/agents (by the job-A..G / agent-1..4 / s1..s13 labels above) you'd like reproducers from.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix EmbeddedAttemptSessionTakeoverError cluster on 2026.5.22 — 13 events / 7 jobs / 42h