openclaw - 💡(How to fix) Fix EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510) [2 pull requests]

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

In failover-error.ts, this is correctly classified via isNonProviderRuntimeCoordinationError, so the model-fallback chain aborts on it by design (no retries burned on the same provider). In practice this means the racing duplicate run on the same session file usually finishes and the user still gets a reply — but ~6% of affected turns propagate as user-visible Embedded agent failed before reply: All models failed, so the reply silently drops.

  • src/agents/failover-error.tsisNonProviderRuntimeCoordinationError classifier with comment referencing internal #83510. The race is reproducible in any multi-lane agent without isolated heartbeats. Six percent of contended turns silently drop the user-facing reply — significant on any chat surface where missed replies look like the bot ignoring the user. The classification path in failover-error.ts already treats takeover as a coordination error rather than a provider failure; the fix is to prevent the race from happening rather than handle it gracefully after the fact.

Fix Action

Fixed

Code Example

const ACTIVE_EMBEDDED_PROMPTS = new Map<string, Promise<void>>();

async function acquireForPrompt(sessionFile: string, ...): Promise<SessionLock> {
  const existing = ACTIVE_EMBEDDED_PROMPTS.get(sessionFile);
  if (existing) {
    if (onContention === "wait") {
      await existing;
    } else {
      throw new EmbeddedAttemptSessionContendedError(sessionFile);
    }
  }
  let releaseSignal: () => void;
  ACTIVE_EMBEDDED_PROMPTS.set(sessionFile, new Promise(r => (releaseSignal = r)));
  // ... existing logic ...
  // on final release / completion / takeover, releaseSignal() and ACTIVE_EMBEDDED_PROMPTS.delete(sessionFile)
}
RAW_BUFFERClick to expand / collapse

Problem

EmbeddedAttemptSessionTakeoverError is fired when two embedded runs concurrently access the same session file — typically an agent's heartbeat lane racing a channel or direct lane on the same sessions/<uuid>.jsonl. The session-lock controller releases the write-lock around every provider stream call (releaseForPrompt() in attempt.session-lock.ts); during that release window, the other lane's writes mutate the file, the fence fingerprint (dev/ino/size/mtimeNs/ctimeNs) changes, and the original lane throws on reacquire.

In failover-error.ts, this is correctly classified via isNonProviderRuntimeCoordinationError, so the model-fallback chain aborts on it by design (no retries burned on the same provider). In practice this means the racing duplicate run on the same session file usually finishes and the user still gets a reply — but ~6% of affected turns propagate as user-visible Embedded agent failed before reply: All models failed, so the reply silently drops.

A pre-existing comment in attempt.session-lock.ts references internal issue #83510, so this is known but unfixed.

Reproduction

Any agent whose heartbeat is not marked isolatedSession: true and whose lanes share a session UUID:

  1. Configure an agent whose agents.list[].heartbeat has isolatedSession: false (or omits it; default behavior depends on the agent template).
  2. While the heartbeat embedded run is mid-stream (lock released for the provider call), an inbound chat event arrives on a lane that resolves to the same session file.
  3. The chat-handler lane writes user-context entries; the heartbeat lane reacquires, fence mismatches, throws EmbeddedAttemptSessionTakeoverError.

Across a 3-day window in one observed deployment, 122 occurrences were logged across 37 distinct session files. Lane histogram:

Lane classCount
synthetic main mirror60
session:agent:<id>:main:heartbeat36
session:agent:<id>:<channel-type>:channel:<id-A>20
session:agent:<id>:<channel-type>:direct:<user>3
session:agent:<id>:cron:…1
session:agent:<other-id>:<channel-type>:channel:…1
cron-nested1

The worst single session accumulated 32 hits on one channel-bound UUID; another got 8 hits at the same channel. Heartbeat overlapping channel is the dominant pattern, not user-side double-send.

A specific reproducible trace from one session UUID: the gateway's stuck-session-recovery aborted an active embedded run and immediately re-fired a new run on the same session file, then both racing copies took turns invalidating each other's fence. durationMs values up to 1,303,727 ms (~22 min) confirm long-running embedded runs are exactly the ones that get stomped.

Code reference

  • src/agents/pi-embedded-runner/run/attempt.session-lock.tsEmbeddedAttemptSessionTakeoverError class; fence comparison in assertSessionFileFence; releaseForPrompt / reacquire path that builds the fingerprint.
  • src/agents/failover-error.tsisNonProviderRuntimeCoordinationError classifier with comment referencing internal #83510.
  • src/agents/pi-embedded-runner/google-prompt-cache.ts — imports + catches EmbeddedAttemptSessionTakeoverError; second observation point.

The takeover detection itself works as designed; the bug is upstream — two lanes that should not be sharing the same session file are. The stuck-recovery path is the most reproducible contender: it kicks off a new embedded run on a session UUID that still has an active embedded prompt lock from the original run.

Proposed change

Two-part fix:

Part A — attempt.session-lock.ts: refuse to start a new embedded run on a session that already has one active.

Maintain a per-session-file registry of active embedded-prompt holders (in-process Map keyed by absolute path). When acquireForPrompt is invoked, check the registry; if a prior holder hasn't released, either wait on a promise the prior holder resolves on release, or escalate with a typed EmbeddedAttemptSessionContendedError so the caller can pick a fresh UUID.

const ACTIVE_EMBEDDED_PROMPTS = new Map<string, Promise<void>>();

async function acquireForPrompt(sessionFile: string, ...): Promise<SessionLock> {
  const existing = ACTIVE_EMBEDDED_PROMPTS.get(sessionFile);
  if (existing) {
    if (onContention === "wait") {
      await existing;
    } else {
      throw new EmbeddedAttemptSessionContendedError(sessionFile);
    }
  }
  let releaseSignal: () => void;
  ACTIVE_EMBEDDED_PROMPTS.set(sessionFile, new Promise(r => (releaseSignal = r)));
  // ... existing logic ...
  // on final release / completion / takeover, releaseSignal() and ACTIVE_EMBEDDED_PROMPTS.delete(sessionFile)
}

Part B — stuck-session-recovery: never re-fire on a session with an active embedded-prompt holder.

The abort_embedded_run recovery path should consult the same ACTIVE_EMBEDDED_PROMPTS registry before retriggering. If a holder exists, recovery should wait for natural release (timeout-bounded) or escalate to a fresh session UUID rather than racing.

Optional Part C — for agents without heartbeat.isolatedSession: true, the heartbeat lane competes with channel/direct lanes on the same UUID. Consider either making isolatedSession: true the default for new agent templates, or surfacing a warning at config-validation time so operators are aware of the contention surface.

Why

The race is reproducible in any multi-lane agent without isolated heartbeats. Six percent of contended turns silently drop the user-facing reply — significant on any chat surface where missed replies look like the bot ignoring the user. The classification path in failover-error.ts already treats takeover as a coordination error rather than a provider failure; the fix is to prevent the race from happening rather than handle it gracefully after the fact.

A coordination-only change (no behavior change for callers that don't contend) keeps the takeover-detection mechanism intact as a safety net while removing the actual cause.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510) [2 pull requests]