openclaw - 💡(How to fix) Fix pi-embedded-runner: stale sessionLastAssistant leaks prior provider's error string into later candidates in model-fallback

Summary

In pi-embedded-runner, the model-fallback loop reuses a single shared session file across every candidate provider. When the first candidate writes an assistant row with an errorMessage (e.g. OpenAI returns a real 429), and a later candidate (e.g. Anthropic, Google) times out without producing a new assistant for the current attempt, the failover path falls back to sessionLastAssistant.errorMessage from the shared session file and reports the previous provider's error string as if it came from the current candidate.

The net effect: a single real upstream error from provider A is re-surfaced as the failure cause for providers B and C, producing false-positive "all providers failed with the same error" output.

Where

  • src/agents/pi-embedded-runner/run/attempt.ts: lastAssistant is computed by scanning messagesSnapshot for the most recent assistant regardless of provider.
  • Bundled (production) lines from a 2026-05-24 build:
    • pi-embedded-Bcz04p2i.js:2865 (failover error construction):
      const assistantForFailover = currentAttemptAssistant ?? sessionLastAssistant;
      …
      new FailoverError(resolveAssistantFailoverErrorMessage(params), {
        …,
        rawError: params.lastAssistant?.errorMessage?.trim(),
      });
    • model-fallback-DIXhOaxb.js:379 (recordFailedCandidateAttempt) stores error: described.rawError ?? described.message, so the stale rawError wins over the candidate-attributed message.

(File names with hashes are from the published build artifact; map back to the corresponding source modules.)

Reproduction / observed cascade

  1. OpenAI candidate hits a real 429 → session file now contains an assistant row with errorMessage = "You exceeded your current quota…" (and OpenAI as provider).
  2. runWithModelFallback advances to Anthropic and spawns a fresh runEmbeddedPiAgent against the same sessionFile/sessionId.
  3. The Anthropic request queues / hangs / aborts at the run-level timeout — no new assistant produced this attempt.
  4. Failover-decision construction sees currentAttemptAssistant === undefined and falls back to sessionLastAssistant — which is still the OpenAI errored row.
  5. The resulting FailoverError carries the OpenAI quota text as rawError, attributed (by the outer model-fallback) to Anthropic.
  6. Same again for Google.
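
The cascade above can be reproduced in isolation. The following is a minimal sketch, not the runner's actual code: the `AssistantRow` shape, the `attempt` helper, and the in-memory `sharedSession` array are hypothetical stand-ins for the shared session file, and only the `currentAttemptAssistant ?? sessionLastAssistant` fallback is taken from the report.

```typescript
// Hypothetical, simplified row shape; real session rows carry more fields.
interface AssistantRow {
  provider: string;
  errorMessage?: string;
}

// One session shared by every candidate, standing in for the shared session file.
const sharedSession: AssistantRow[] = [];

// Simulates one candidate attempt and returns the raw error it would surface.
function attempt(provider: string, outcome: "quota" | "timeout"): string | undefined {
  let currentAttemptAssistant: AssistantRow | undefined;
  if (outcome === "quota") {
    // A real 429 writes an errored assistant row into the session.
    currentAttemptAssistant = { provider, errorMessage: "You exceeded your current quota…" };
    sharedSession.push(currentAttemptAssistant);
  }
  // On timeout no row is written, so currentAttemptAssistant stays undefined
  // and the fallback picks up whatever row is last in the shared session.
  const sessionLastAssistant: AssistantRow | undefined =
    sharedSession[sharedSession.length - 1];
  const assistantForFailover = currentAttemptAssistant ?? sessionLastAssistant;
  return assistantForFailover?.errorMessage?.trim();
}

const results = [
  { candidate: "openai", rawError: attempt("openai", "quota") },
  { candidate: "anthropic", rawError: attempt("anthropic", "timeout") },
  { candidate: "google", rawError: attempt("google", "timeout") },
];
console.log(results);
```

In this sketch all three entries end up carrying the OpenAI quota text, even though only the first attempt produced it.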

Smoking gun in our logs

Every [agent/embedded] embedded run failover decision line for two distinct runs (3c5d7ca0-83df-418b-be48-a9327459046a and b9ad1b27-…) logs from=openai/gpt-5.5, including the decisions that the outer model-fallback layer interprets as Anthropic and Google candidate failures. There are zero from=anthropic/… or from=google/… decision logs. Inner pi-embedded-runner never saw a non-OpenAI-attributed assistant error.

Evidence table

For one failed run (runId=3c5d7ca0-83df-418b-be48-a9327459046a, 2026-05-24 06:38–06:43 PT):

| Candidate | OUTER reason | OUTER detail (logged error string) | Real upstream call status |
|---|---|---|---|
| openai/gpt-5.5 | rate_limit | OpenAI quota text | Real 429: /v1/responses returned a genuine quota error from OpenAI (proxy logs confirm) |
| openai/gpt-5.5 (retry, different profile) | rate_limit | OpenAI quota text | Real 429 (same as above) |
| anthropic/claude-opus-4-7 | timeout | OpenAI quota text | No Anthropic 429 observed. Run ran ~123s and ended on run-level timeout; upstream proxy shows queued/rate-limited entries but no terminal quota-exhausted response for opus-4-7 |
| google/gemini-3-pro-preview | timeout | OpenAI quota text | Same shape: reason=timeout, but detail is again the OpenAI quota text verbatim |

Note specifically: the Anthropic and Google rows report reason=timeout, not rate_limit. Real quota errors produce immediate 429s, not run-level timeouts. The logged detail is also OpenAI's quota string verbatim, whereas Anthropic and Google use entirely different wording for billing/quota errors.
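
The mismatch described here can be checked mechanically. Below is a rough triage sketch, assuming the outer layer's per-candidate records are available as plain objects; the `CandidateFailure` shape and the `findSuspectDetails` helper are hypothetical, and the sample data mirrors the evidence table with "OpenAI quota text" as a placeholder for the real string. It flags any candidate whose detail is verbatim identical to a detail first seen from a different provider.

```typescript
// Hypothetical record shape for what the outer model-fallback layer logs.
interface CandidateFailure {
  candidate: string; // "provider/model"
  reason: string;
  detail: string;
}

// Extracts the provider prefix from a "provider/model" string.
function providerOf(candidate: string): string {
  const slash = candidate.indexOf("/");
  return slash === -1 ? candidate : candidate.slice(0, slash);
}

// Flags failures whose detail text was first seen from another provider.
function findSuspectDetails(failures: CandidateFailure[]): CandidateFailure[] {
  const firstProviderByDetail = new Map<string, string>();
  const suspects: CandidateFailure[] = [];
  for (const failure of failures) {
    const provider = providerOf(failure.candidate);
    const firstProvider = firstProviderByDetail.get(failure.detail);
    if (firstProvider === undefined) {
      firstProviderByDetail.set(failure.detail, provider);
    } else if (firstProvider !== provider) {
      suspects.push(failure);
    }
  }
  return suspects;
}

// Sample data mirroring the evidence table; detail is a placeholder string.
const observed: CandidateFailure[] = [
  { candidate: "openai/gpt-5.5", reason: "rate_limit", detail: "OpenAI quota text" },
  { candidate: "openai/gpt-5.5", reason: "rate_limit", detail: "OpenAI quota text" },
  { candidate: "anthropic/claude-opus-4-7", reason: "timeout", detail: "OpenAI quota text" },
  { candidate: "google/gemini-3-pro-preview", reason: "timeout", detail: "OpenAI quota text" },
];
const suspects = findSuspectDetails(observed);
console.log(suspects.map((s) => s.candidate));
```

The OpenAI retry is not flagged because it shares a provider with the first occurrence; only the cross-provider repeats are treated as suspect.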

Impact

  • Misleads operators into believing all three providers are concurrently exhausted, when only one actually is.
  • Drives unnecessary top-up / billing action on providers that aren't out of quota.
  • Makes accurate triage of multi-provider gateways effectively impossible because the surfaced detail is unreliable for any candidate after the first failing provider.
  • Hard to spot from the operator side because the detail looks like a fully-formed quota error.

In our own incident this caused a false-positive triple-provider quota outage that was only caught by RCA inspection of inner failover-decision logs.

Suggested fix

Two complementary guards:

  1. In pi-embedded-runner failover construction (around the assistantForFailover = currentAttemptAssistant ?? sessionLastAssistant site):
    • If currentAttemptAssistant is undefined AND sessionLastAssistant?.provider !== <this candidate's provider>, do not propagate sessionLastAssistant.errorMessage as the FailoverError.rawError. Fall through to the candidate-attributed default (e.g. "LLM request timed out." / "no response from provider").
  2. In recordFailedCandidateAttempt (the error: described.rawError ?? described.message site in model-fallback-…):
    • Additionally guard: if described.provider (from rawError attribution) differs from params.candidate.provider, prefer described.message over described.rawError.

Either guard alone would have prevented the false-positive in our case; both together are belt-and-suspenders.
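
The two guards could look roughly like the sketch below. This is an illustration under assumptions, not the project's code: `AssistantRow`, `DescribedError`, `resolveRawError`, and `pickRecordedError` are hypothetical names and shapes, and the real call sites pass different parameters. Only the `??` fallbacks and the provider comparison follow the description above.

```typescript
// Hypothetical shapes; fields beyond those quoted in the report are assumptions.
interface AssistantRow {
  provider: string;
  errorMessage?: string;
}
interface DescribedError {
  message: string;
  rawError?: string;
  provider?: string; // provider the rawError is attributed to, if known
}

// Guard 1: reuse the session's last assistant only when it belongs to this candidate.
function resolveRawError(
  candidateProvider: string,
  currentAttemptAssistant: AssistantRow | undefined,
  sessionLastAssistant: AssistantRow | undefined,
): string | undefined {
  if (currentAttemptAssistant) return currentAttemptAssistant.errorMessage?.trim();
  if (!sessionLastAssistant || sessionLastAssistant.provider !== candidateProvider) {
    return undefined; // caller falls through to the candidate-attributed default
  }
  return sessionLastAssistant.errorMessage?.trim();
}

// Guard 2: prefer the candidate-attributed message when rawError came from elsewhere.
function pickRecordedError(described: DescribedError, candidateProvider: string): string {
  const attributedElsewhere =
    described.provider !== undefined && described.provider !== candidateProvider;
  if (attributedElsewhere) return described.message;
  return described.rawError ?? described.message;
}

const quotaText = "You exceeded your current quota…";
const staleRow: AssistantRow = { provider: "openai", errorMessage: quotaText };

const rawForAnthropic = resolveRawError("anthropic", undefined, staleRow);
const rawForOpenAi = resolveRawError("openai", undefined, staleRow);
const recorded = pickRecordedError(
  { message: "LLM request timed out.", rawError: quotaText, provider: "openai" },
  "anthropic",
);
console.log({ rawForAnthropic, rawForOpenAi, recorded });
```

With guard 1 the Anthropic attempt gets no raw error from the OpenAI row, and with guard 2 the recorded error for Anthropic is the timeout message even if a stale rawError slips through.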

Workaround

Operators can read from= in [agent/embedded] embedded run failover decision log lines instead of trusting the surfaced detail / outer error string. The workaround works but is fragile — it requires log access and inner-line correlation, and any wrapper that surfaces the outer error to humans (Slack alert, dashboard) will still show the stale text.
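
The log-side check can be scripted. The sketch below assumes decision lines are plain text containing the marker string and a whitespace-delimited `from=<provider>/<model>` token; the exact line layout is not given in this report, so the sample lines are made up for illustration and `countDecisionSources` is a hypothetical helper.

```typescript
const DECISION_MARKER = "[agent/embedded] embedded run failover decision";

// Counts failover decisions per from= value, ignoring non-decision lines.
function countDecisionSources(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    if (!line.includes(DECISION_MARKER)) continue;
    const source = /\bfrom=(\S+)/.exec(line)?.[1];
    if (source === undefined) continue;
    counts.set(source, (counts.get(source) ?? 0) + 1);
  }
  return counts;
}

// Made-up sample lines; the real log layout is an assumption.
const sampleLines = [
  "[agent/embedded] embedded run failover decision from=openai/gpt-5.5",
  "[agent/embedded] embedded run failover decision from=openai/gpt-5.5",
  "[agent/embedded] unrelated line from=anthropic/claude-opus-4-7",
];
const decisionCounts = countDecisionSources(sampleLines);
console.log(Object.fromEntries(decisionCounts));
```

If every decision maps to a single provider while the outer layer reports failures for several, the outer detail strings are likely stale.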

Severity

Medium. Functional fallback still works (requests do route to the next provider), and the workaround exists. But the misleading attribution actively damages triage of multi-provider failures, which is exactly when accurate signals matter most.

Filed by

Kentro.io engineering. Internal RCA tracked at KEN-4598 / KEN-4603. Happy to attach more log excerpts or test against a patch if useful.
