openclaw - ✅(Solved) Fix Session corruption: prefill error cascades into provider cooldown + repair makes it worse [3 pull requests, 1 comments, 2 participants]

altierac · 2026-05-04T09:46:42Z

# PR #77280: fix(auth-profiles): exclude format rejections from profile cooldown

- Repository: openclaw/openclaw
- Author: openperf
- State: open | merged: False
- Link: https://github.com/openclaw/openclaw/pull/77280

## Description (problem / solution / changelog)

### Summary

- **Problem**: A single session-specific request-shape rejection from a provider takes down every other healthy session sharing the same auth profile, and when all configured profiles for a provider share the same fault, locks out the entire provider for the configured backoff window. The reporter in #77228 saw "Provider github-copilot is in cooldown (all profiles unavailable)" persist for 42+ minutes after a single 400 (`"This model does not support assistant message prefill. The conversation must end with a user message."`) caused by one corrupted session whose transcript ended in a stream-error placeholder assistant turn. Healthy sessions on the same profile were unable to make any provider call during the entire window.
- **Root Cause**: `src/agents/pi-embedded-runner/run/auth-profile-failure-policy.ts:5-14` resolves which `FailoverReason`s should be persisted as auth-profile health signals. Today it excludes `policy === "local"` (helper-local runs) and `failoverReason === "timeout"` (transport timeouts), both annotated as "should not poison shared provider auth health". A `format`-classified failure (`src/agents/pi-embedded-helpers/errors.ts:710-727`: a 400/422 whose payload couldn't be reclassified as auth/billing/rate-limit/etc.) is *also* a non-poisonous signal: it means the provider rejected the request payload shape, which is per-session and per-transcript, not a profile-wide reliability problem. However, it is currently passed through as `failoverReason` and reaches `markAuthProfileFailure` at `src/agents/auth-profiles/usage.ts:649`. Inside `computeNextProfileUsageStats` (`usage.ts:539-642`), `format` runs through `calculateAuthProfileCooldownMs` (`usage.ts:363-372`) just like `rate_limit` / `overloaded`, producing 30s → 60s → 5min capped backoff, and crucially without the model scoping that `rate_limit` gets (`usage.ts:637`: `cooldownModel` is only set for `rate_limit`). So one bad transcript in one session repeatedly hits the same 400, the post-cooldown retry hits the same 400, the cooldown re-lengthens to its 5-min cap, and back-to-back cycles produce the 42-min provider-wide outage observed in the report. Other sessions on the same profile, with valid transcripts, are blocked the entire time.
- **Fix**: Add `failoverReason === "format"` to the existing exclusion list in `resolveAuthProfileFailureReason`. This is the single chokepoint through which `markAuthProfileFailure` learns about run-time failovers in `pi-embedded-runner/run.ts` (call sites at `:1858`, `:2005`, `:2506`, `:2615` all funnel through `resolveRunAuthProfileFailureReason` at `:872`). When a session's transcript shape is rejected, the rejection still surfaces to the user via the existing `FailoverError`, the run still logs the failure, but the auth-profile cooldown machinery is no longer triggered. The bad session continues to fail (that is a separate, per-session repair concern explicitly tracked as the other two open items in #77228), but **other sessions on the same profile keep working**, and the provider is no longer killed for everyone for the cooldown window.
- **What changed**:
  - `src/agents/pi-embedded-runner/run/auth-profile-failure-policy.ts`: extend the existing `policy === "local"` / `timeout` exclusion guard to also cover `failoverReason === "format"`. Comment expanded to document why a request-shape rejection is per-session, not profile-wide.
  - `src/agents/pi-embedded-runner/run/auth-profile-failure-policy.test.ts`: add a `format`-rejection case (with and without `policy: "shared"`), asserting the resolver returns `null` so `markAuthProfileFailure` is never called.
  - `CHANGELOG.md`: single Fixes line under Unreleased referencing the issue with non-closing `Refs` syntax.
- **What did NOT change (scope boundary)**:
  - No changes to the failure-reason classification (`src/agents/pi-embedded-helpers/errors.ts`); 400/422 schema rejections still classify as `format` and still surface to the user as a `FailoverError`.
  - No changes to `markAuthProfileFailure` / `computeNextProfileUsageStats` / `calculateAuthProfileCooldownMs`. Profile-cooldown semantics for legitimately profile-poisoning reasons (`auth`, `auth_permanent`, `billing`, `rate_limit`, `overloaded`, `model_not_found`, `unknown`, …) are untouched, so the existing behavior verified by `src/agents/auth-profiles.markauthprofilefailure.test.ts` is preserved.
  - No changes to the streaming-error placeholder (`src/agents/stream-message-shared.ts:STREAM_ERROR_FALLBACK_TEXT`) that ends up in the transcript and is the upstream root of th
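The exclusion guard described in the Fix item can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `resolveAuthProfileFailureReason` and the reason strings come from the description above, while the parameter object, the `FailurePolicy` type, and the exact union members are assumptions made so the example is self-contained.

```typescript
// Hypothetical, simplified types; the real definitions live in the openclaw repo.
type FailoverReason =
  | "auth"
  | "billing"
  | "rate_limit"
  | "overloaded"
  | "timeout"
  | "format"
  | "unknown";
type FailurePolicy = "local" | "shared";

// Returns the reason that should be persisted as an auth-profile health
// signal, or null when the failure must not affect shared profile health.
function resolveAuthProfileFailureReason(params: {
  failoverReason: FailoverReason | null;
  policy?: FailurePolicy;
}): FailoverReason | null {
  const { failoverReason, policy } = params;
  if (failoverReason === null) return null;
  // Helper-local runs never poison shared provider auth health.
  if (policy === "local") return null;
  // Transport timeouts say nothing about the profile itself.
  if (failoverReason === "timeout") return null;
  // A request-shape rejection is per-session and per-transcript.
  if (failoverReason === "format") return null;
  return failoverReason;
}
```

With a guard of this shape, a caller would skip the cooldown bookkeeping whenever the resolver returns `null`, while the error itself still propagates to the user.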



GitHub stats

openclaw/openclaw#77228 • Fetched 2026-05-05 05:50:58


Author

altierac

Participants

altierac

clawsweeper[bot]

Timeline (top)

referenced ×9 · cross-referenced ×3 · commented ×1

Error Message

A single `400 This model does not support assistant message prefill` error from the LLM provider cascades into a full provider cooldown, making the agent completely unresponsive. The session file auto-repair mechanism then corrupts the transcript further, so the user has to start a new session manually.


Root Cause

The core issue appears to be that [assistant turn failed before producing content] placeholder messages create invalid message sequences. When combined with blank/dropped user messages, the conversation violates the provider's constraint that it must end with a user message. The cooldown mechanism then amplifies a single-request format error into a prolonged outage.

Code Example

[agent/embedded] embedded run agent end: isError=true error=LLM request failed: provider rejected the request schema or tool payload.
rawError=400 This model does not support assistant message prefill. The conversation must end with a user message.

---

[model-fallback/decision] decision=candidate_failed reason=format
[model-fallback/decision] decision=skip_candidate reason=format detail=Provider github-copilot is in cooldown (all profiles unavailable)
Embedded agent failed before reply: All models failed (1): github-copilot/claude-opus-4.6: Provider github-copilot is in cooldown

---

[session-init] session file repair: rewrote 1 assistant message(s), dropped 1 blank user message(s)

---

rawError=400 messages: at least one message is required


Bug Summary

Environment

OpenClaw 4.29, Linux 6.17.0-1011-azure (x64)
Provider: github-copilot / claude-opus-4.6
Channel: WhatsApp

Steps to Reproduce

1. Agent is mid-conversation with active tool calls
2. A tool call fails, producing an [assistant turn failed before producing content] placeholder
3. A blank/empty user message follows (possibly from WhatsApp inbound during the failed turn)
4. This creates an invalid message sequence where the conversation ends with an assistant message (the failed placeholder) rather than a user message
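
The invalid sequence in these steps can be detected, and the trailing placeholder removed, before a request is sent. The sketch below is illustrative only: the `Message` shape and both helper names are hypothetical, and the placeholder string is assumed to match the text quoted in this report.

```typescript
// Hypothetical message shape used only for this illustration.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed to equal the placeholder text quoted in the issue.
const PLACEHOLDER = "[assistant turn failed before producing content]";

// Removes placeholder assistant turns from the end of the transcript.
function dropTrailingPlaceholders(messages: readonly Message[]): Message[] {
  const out = [...messages];
  while (out.length > 0) {
    const last = out[out.length - 1];
    if (
      last !== undefined &&
      last.role === "assistant" &&
      last.content.trim() === PLACEHOLDER
    ) {
      out.pop();
    } else {
      break;
    }
  }
  return out;
}

// True when the conversation ends with a non-blank user message, which is
// the constraint the provider enforced in the 400 response.
function endsWithUserMessage(messages: readonly Message[]): boolean {
  const last = messages[messages.length - 1];
  return last !== undefined && last.role === "user" && last.content.trim() !== "";
}
```

A caller could run `dropTrailingPlaceholders` first and refuse to send the request if `endsWithUserMessage` still returns `false`, so a blank inbound message never reaches the provider as the final turn.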

What Happens

Phase 1: Prefill error (10:59:27 CEST)

[agent/embedded] embedded run agent end: isError=true error=LLM request failed: provider rejected the request schema or tool payload.
rawError=400 This model does not support assistant message prefill. The conversation must end with a user message.

Phase 2: Provider cooldown cascade

The format error puts the entire provider into cooldown:

[model-fallback/decision] decision=candidate_failed reason=format
[model-fallback/decision] decision=skip_candidate reason=format detail=Provider github-copilot is in cooldown (all profiles unavailable)
Embedded agent failed before reply: All models failed (1): github-copilot/claude-opus-4.6: Provider github-copilot is in cooldown

Every subsequent user message for the next ~42 minutes hits the same cooldown wall. The agent cannot respond at all.

Phase 3: Session repair makes it worse (11:41:15)

[session-init] session file repair: rewrote 1 assistant message(s), dropped 1 blank user message(s)

After repair, the error changes to:

rawError=400 messages: at least one message is required

The transcript is now fully corrupted — both .reset and .bak files contain 935+ entries with null roles (complete structural JSONL corruption).
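
One way to keep null-role entries from propagating is to filter them while loading the JSONL file. The following is a rough sketch under stated assumptions: the flat `{ role, content }` entry shape and the `loadTranscript` helper are hypothetical and do not reflect the actual on-disk transcript schema.

```typescript
// Hypothetical flat entry shape; the real transcript schema may differ.
interface TranscriptEntry {
  role: string;
  content?: unknown;
}

// Type guard: accepts only objects that carry a non-empty string role.
function isValidEntry(value: unknown): value is TranscriptEntry {
  if (typeof value !== "object" || value === null) return false;
  const role = (value as { role?: unknown }).role;
  return typeof role === "string" && role.length > 0;
}

// Parses JSONL text, keeping valid entries and counting everything dropped
// (unparseable lines and entries with a null or missing role).
function loadTranscript(jsonl: string): { entries: TranscriptEntry[]; dropped: number } {
  const entries: TranscriptEntry[] = [];
  let dropped = 0;
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      dropped += 1;
      continue;
    }
    if (isValidEntry(parsed)) {
      entries.push(parsed);
    } else {
      dropped += 1;
    }
  }
  return { entries, dropped };
}
```

Reporting the `dropped` count alongside the kept entries lets the caller decide whether the file is still usable or should be treated as irrecoverable.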

Expected Behavior

A prefill/format error on one request should NOT put the entire provider into long-term cooldown
The cooldown should either be very short or only apply to that specific session, not block all sessions
Session file repair should not produce a worse state than what it started with
If a transcript is irrecoverable, the system should auto-create a fresh session rather than repeatedly failing

Actual Behavior

Single format error → 42+ minutes of complete agent unresponsiveness
Auto-repair corrupts the transcript further
User had to manually start a new session

Root Cause Analysis

Suggested Fixes

Short cooldown for format errors: Format errors are session-specific, not provider-wide issues. Cooldown should be seconds, not minutes, and scoped to the session.
Safer transcript repair: Validate the repaired transcript before committing. If repair produces an invalid state, fall back to creating a fresh session.
Handle [assistant turn failed] placeholders: These should be cleaned from the transcript before sending to the provider, or replaced with a valid assistant message.
Auto-recovery: If a session is stuck in repeated format errors, offer to reset it automatically rather than failing silently for 40+ minutes.
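
The "validate before committing" suggestion could look something like the sketch below. It is not the project's repair code: `RawEntry`, `isUsableTranscript`, and `repairOrReset` are hypothetical names, and the validity rule (non-empty, every role a known string) is an assumption derived from the two failure modes reported here.

```typescript
// Hypothetical entry shape; fields are unknown because they come from disk.
interface RawEntry {
  role?: unknown;
  content?: unknown;
}

type RepairResult =
  | { kind: "repaired"; entries: RawEntry[] }
  | { kind: "fresh_session"; entries: RawEntry[] };

const VALID_ROLES: readonly string[] = ["system", "user", "assistant"];

// A transcript is usable only if it is non-empty and every role is known.
function isUsableTranscript(entries: readonly RawEntry[]): boolean {
  if (entries.length === 0) return false;
  return entries.every(
    (e) => typeof e.role === "string" && VALID_ROLES.includes(e.role),
  );
}

// Runs the repair on a copy of the data and commits the result only when it
// validates; otherwise signals that a fresh session should be created.
function repairOrReset(
  original: readonly RawEntry[],
  repair: (entries: readonly RawEntry[]) => RawEntry[],
): RepairResult {
  let candidate: RawEntry[];
  try {
    candidate = repair(original);
  } catch {
    return { kind: "fresh_session", entries: [] };
  }
  if (isUsableTranscript(candidate)) {
    return { kind: "repaired", entries: candidate };
  }
  return { kind: "fresh_session", entries: [] };
}
```

Under this design the original file is never overwritten by a result that would fail with `messages: at least one message is required` or contain null roles; the caller starts a new session instead.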

Log References

Session ID: 229feaa0-2692-401c-a828-66939bf80acc
Failed run IDs: 66a98a07, 3ef84b65 (and several in between)
Corrupted files: .jsonl.reset.2026-05-04T09-42-09.905Z, .jsonl.bak-59249-*

extent analysis

TL;DR

Implement a short cooldown for format errors, scoped to the session, and improve transcript repair to prevent corruption and auto-create a fresh session when necessary.

Guidance

Validate the cooldown mechanism to ensure it only applies to the specific session that encountered the format error, rather than the entire provider.
Modify the transcript repair process to validate the repaired transcript before committing changes and fall back to creating a fresh session if the repair produces an invalid state.
Consider handling [assistant turn failed] placeholders by cleaning them from the transcript or replacing them with a valid assistant message before sending to the provider.
Implement auto-recovery for sessions stuck in repeated format errors, offering to reset the session automatically after a certain threshold.
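
The threshold-based auto-recovery idea can be sketched as a small per-session counter. Everything here is hypothetical: the `FormatErrorTracker` class, its method names, and the threshold value are illustrative assumptions and not part of OpenClaw.

```typescript
// Tracks consecutive format errors per session (hypothetical helper).
class FormatErrorTracker {
  private readonly threshold: number;
  private readonly streaks = new Map<string, number>();

  constructor(threshold: number) {
    if (!Number.isInteger(threshold) || threshold < 1) {
      throw new RangeError("threshold must be a positive integer");
    }
    this.threshold = threshold;
  }

  // Records a failed run. Returns true once the session has reached the
  // threshold of consecutive format errors and a reset should be offered.
  recordFailure(sessionId: string, reason: string): boolean {
    if (reason !== "format") {
      // A different failure breaks the streak of format errors.
      this.streaks.delete(sessionId);
      return false;
    }
    const streak = (this.streaks.get(sessionId) ?? 0) + 1;
    this.streaks.set(sessionId, streak);
    return streak >= this.threshold;
  }

  // A successful run clears the streak for that session.
  recordSuccess(sessionId: string): void {
    this.streaks.delete(sessionId);
  }
}
```

Because the counter is keyed by session ID, one stuck session reaching the threshold has no effect on other sessions that share the same provider profile.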

Example

No specific code example is provided due to the lack of explicit code in the issue, but the suggested fixes imply modifications to the cooldown mechanism, transcript repair logic, and error handling for [assistant turn failed] placeholders.

Notes

The provided analysis and suggested fixes assume that the cooldown mechanism and transcript repair process are modifiable. If these are part of a third-party library or service, alternative workarounds may be necessary.

Recommendation

Apply the suggested fixes, particularly implementing a short cooldown for format errors and improving transcript repair, as these address the root cause of the issue and can prevent similar outages in the future.



PR fix notes

PR #77280: fix(auth-profiles): exclude format rejections from profile cooldown


PR #77287: fix(replay-history): drop trailing stream-error placeholder before pr…


PR #77288: fix(session-file-repair): drop null-role message entries instead of p…
