openclaw - 💡(How to fix) Fix pi-embedded-runner: stale sessionLastAssistant leaks prior provider's error string into later candidates in model-fallback

In pi-embedded-runner, the model-fallback loop reuses a single shared session file across every candidate provider. When the first candidate writes an assistant row with an errorMessage (e.g. OpenAI returns a real 429), and a later candidate (e.g. Anthropic, Google) times out without producing a new assistant for the current attempt, the failover path falls back to sessionLastAssistant.errorMessage from the shared session file and reports the previous provider's error string as if it came from the current candidate.

The net effect: a single real upstream error from provider A is re-surfaced as the failure cause for providers B and C, producing false-positive "all providers failed with the same error" output.

Root Cause

Misleads operators into believing all three providers are concurrently exhausted, when only one actually is.
Drives unnecessary top-up / billing action on providers that aren't out of quota.
Makes accurate triage of multi-provider gateways effectively impossible because the surfaced detail is unreliable for any candidate after the first failing provider.
Hard to spot from the operator side because the detail looks like a fully-formed quota error.

Fix Action

Workaround

Operators can read from= in [agent/embedded] embedded run failover decision log lines instead of trusting the surfaced detail / outer error string. The workaround works but is fragile — it requires log access and inner-line correlation, and any wrapper that surfaces the outer error to humans (Slack alert, dashboard) will still show the stale text.

const assistantForFailover = currentAttemptAssistant ?? sessionLastAssistant; … new FailoverError(resolveAssistantFailoverErrorMessage(params), { …, rawError: params.lastAssistant?.errorMessage?.trim(), });

Summary

The net effect: a single real upstream error from provider A is re-surfaced as the failure cause for providers B and C, producing false-positive "all providers failed with the same error" output.

Where

src/agents/pi-embedded-runner/run/attempt.ts — lastAssistant is computed by scanning messagesSnapshot for the most recent assistant regardless of provider.
Bundled (production) lines from a 2026-05-24 build:
- pi-embedded-Bcz04p2i.js:2865 (failover error construction):
```
const assistantForFailover = currentAttemptAssistant ?? sessionLastAssistant;
…
new FailoverError(resolveAssistantFailoverErrorMessage(params), {
  …,
  rawError: params.lastAssistant?.errorMessage?.trim(),
});
```
- model-fallback-DIXhOaxb.js:379 (recordFailedCandidateAttempt) stores error: described.rawError ?? described.message, so the stale rawError wins over the candidate-attributed message.

(File names with hashes are from the published build artifact; map back to the corresponding source modules.)

Reproduction / observed cascade

OpenAI candidate hits a real 429 → session file now contains an assistant row with errorMessage = "You exceeded your current quota…" (and OpenAI as provider).
runWithModelFallback advances to Anthropic and spawns a fresh runEmbeddedPiAgent against the same sessionFile/sessionId.
The Anthropic request queues / hangs / aborts at the run-level timeout — no new assistant produced this attempt.
Failover-decision construction sees currentAttemptAssistant === undefined and falls back to sessionLastAssistant — which is still the OpenAI errored row.
The resulting FailoverError carries the OpenAI quota text as rawError, attributed (by the outer model-fallback) to Anthropic.
Same again for Google.

Smoking gun in our logs

Every [agent/embedded] embedded run failover decision line for two distinct runs (3c5d7ca0-83df-418b-be48-a9327459046a and b9ad1b27-…) logs from=openai/gpt-5.5 — including the decisions that the outer model-fallback layer interprets as Anthropic and Google candidate failures. There are zero from=anthropic/… or from=google/… decision logs. Inner pi-embedded-runner never saw a non-OpenAI-attributed assistant error.

Evidence table

For one failed run (runId=3c5d7ca0-83df-418b-be48-a9327459046a, 2026-05-24 06:38–06:43 PT):

Candidate	OUTER `reason`	OUTER `detail` (logged error string)	Real upstream call status
openai/gpt-5.5	`rate_limit`	OpenAI quota text	Real 429 — `/v1/responses` returned a genuine quota error from OpenAI (proxy logs confirm)
openai/gpt-5.5 (retry, different profile)	`rate_limit`	OpenAI quota text	Real 429 — same
anthropic/claude-opus-4-7	`timeout`	OpenAI quota text	No Anthropic 429 observed. Run ran ~123s and ended on run-level timeout; upstream proxy shows queued/rate-limited entries but no terminal quota-exhausted response for opus-4-7
google/gemini-3-pro-preview	`timeout`	OpenAI quota text	Same shape — `reason=timeout`, but `detail` is again the OpenAI quota text verbatim

Note specifically: two of the three statuses are timeout, not rate_limit — real quota errors produce immediate 429s, not run-level timeouts. And the error message is verbatim OpenAI's quota string; Anthropic and Google use entirely different wording for billing/quota errors.

Impact

Misleads operators into believing all three providers are concurrently exhausted, when only one actually is.
Drives unnecessary top-up / billing action on providers that aren't out of quota.
Makes accurate triage of multi-provider gateways effectively impossible because the surfaced detail is unreliable for any candidate after the first failing provider.
Hard to spot from the operator side because the detail looks like a fully-formed quota error.

In our own incident this caused a false-positive triple-provider quota outage that was only caught by RCA inspection of inner failover-decision logs.

Suggested fix

Two complementary guards:

In pi-embedded-runner failover construction (around the assistantForFailover = currentAttemptAssistant ?? sessionLastAssistant site):
- If currentAttemptAssistant is undefined AND sessionLastAssistant?.provider !== <this candidate's provider>, do not propagate sessionLastAssistant.errorMessage as the FailoverError.rawError. Fall through to the candidate-attributed default (e.g. "LLM request timed out." / "no response from provider").
In recordFailedCandidateAttempt (the error: described.rawError ?? described.message site in model-fallback-…):
- Additionally guard: if described.provider (from rawError attribution) differs from params.candidate.provider, prefer described.message over described.rawError.

Either guard alone would have prevented the false-positive in our case; both together are belt-and-suspenders.

Workaround

Severity

Medium. Functional fallback still works (requests do route to the next provider), and the workaround exists. But the misleading attribution actively damages triage of multi-provider failures, which is exactly when accurate signals matter most.

Filed by

Kentro.io engineering. Internal RCA tracked at KEN-4598 / KEN-4603. Happy to attach more log excerpts or test against a patch if useful.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix pi-embedded-runner: stale sessionLastAssistant leaks prior provider's error string into later candidates in model-fallback

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Workaround

Code Example

Summary

Where

Reproduction / observed cascade

Smoking gun in our logs

Evidence table

Impact

Suggested fix

Workaround

Severity

Filed by

Still need to ship something?

TRENDING