openclaw - Session corruption: leading-assistant transcript causes infinite "messages: at least one message is required" loop [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#75235 · Fetched 2026-05-01 05:36:29
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 2
Timeline (top): cross-referenced ×3, commented ×1

Summary

A session can enter a permanently broken state where every subsequent turn fails with a 400 from the Anthropic API:

400 invalid_request_error: messages: at least one message is required

The runtime detects this as failoverReason: "format" / providerRuntimeFailureKind: "schema" and surfaces the generic GENERIC_EXTERNAL_RUN_FAILURE_TEXT ("⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.") to the user — but does not auto-reset the session, so every retry hits the same 400.

Reproduction (observed)

In my deployment (openclaw 2026.4.27), a turn was aborted between "user message persisted" and "assistant reply written" — the host that the gateway runs on was wedged by an orphan SSH child holding the shell. The session transcript ended up with assistant-role entries containing "text":"[assistant turn failed before producing content]" and no preceding user-role message.

From that point onward, every send to that session re-submitted the broken transcript. The Anthropic API rejected each request because the messages array was effectively empty (no leading user turn). The session stayed broken for ~3 hours across multiple turns until I manually reset it via /new (*.jsonl.reset.<timestamp> was written at the same moment things started working again).

Evidence from logs

Every failure carried the same errorFingerprint (sha256:5d882a6629dc) and an identical 400 body; only the runId differed. Snippet from openclaw logs:

warn agent/embedded {"event":"embedded_run_agent_end","isError":true,
  "error":"LLM request rejected: messages: at least one message is required",
  "failoverReason":"format","providerRuntimeFailureKind":"schema",
  "providerErrorType":"invalid_request_error","httpCode":"400"}

warn model-fallback/decision {"decision":"candidate_failed","reason":"format",
  "fallbackStepFinalOutcome":"chain_exhausted","fallbackConfigured":false}

error diagnostic lane task error: lane=session:agent:main:telegram:direct:<id>
  error="FailoverError: LLM request rejected: messages: at least one message is required"

Transcript snippet (agents/main/sessions/<sid>.jsonl.reset.<ts>):

{"type":"session","version":3,"id":"<sid>","timestamp":"...Z"}
{"type":"model_change",...}
{"type":"thinking_level_change",...}
{"type":"custom","customType":"model-snapshot",...}
{"type":"message","message":{"role":"assistant",
  "content":[{"type":"text","text":"[assistant turn failed before producing content]"}],
  "stopReason":"error","errorMessage":"400 ... messages: at least one message is required"}}
{"type":"thinking_level_change",...}
{"type":"message","message":{"role":"assistant",...same shape...}}

No user-role entry exists in the transcript — only the orphan assistant turns.

Expected behaviour (issue 1: auto-detect & reset)

The runtime already has auto-reset paths for two corruption modes:

  • Gemini function-call ordering (isSessionCorruption branch in agent-runner.runtime)
  • Role-ordering conflict (isRoleOrderingError branch)

Please add a third reset path for leading-assistant / no-user-message transcripts. Detection is reliable:

  • HTTP 400
  • providerErrorType === "invalid_request_error"
  • providerErrorMessagePreview starts with "messages: at least one message is required"
  • (and/or) on-disk transcript has zero user-role entries
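
The error-based checks above could be combined into a single predicate. This is a minimal sketch, not the runtime's actual code: the ProviderFailure interface and the isEmptyMessagesError name are hypothetical, and the field names are simply taken from the log excerpt and the list above.

```typescript
// Hypothetical shape of a provider failure. Field names mirror the log
// excerpt in this issue; the real runtime types may differ.
interface ProviderFailure {
  httpCode: number | string; // the log excerpt shows httpCode as the string "400"
  providerErrorType?: string;
  providerErrorMessagePreview?: string;
}

const EMPTY_MESSAGES_PREFIX = "messages: at least one message is required";

// Returns true only when all three error-based signals from the list match.
function isEmptyMessagesError(failure: ProviderFailure): boolean {
  const preview = failure.providerErrorMessagePreview ?? "";
  return (
    Number(failure.httpCode) === 400 &&
    failure.providerErrorType === "invalid_request_error" &&
    preview.indexOf(EMPTY_MESSAGES_PREFIX) === 0 // preview starts with the prefix
  );
}
```

Matching on a message prefix is brittle if the provider rewords the error, which is one reason to pair it with the on-disk transcript check rather than rely on it alone.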

When detected, the runtime should:

  1. Snapshot the bad transcript to *.jsonl.reset.<ts> (same as existing reset paths)
  2. Drop the session from the active session store
  3. Reply to the user with the same friendly message used for the role-ordering reset: "⚠️ Session history was corrupted. I've reset the conversation - please try again!"
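
The three steps above can be sketched as one function with its side effects injected. Everything here is an assumption for illustration: the ResetDeps interface, its method names, and the timestamp parameter are hypothetical stand-ins for whatever the existing reset paths already use.

```typescript
// Hypothetical service boundary; the real runtime's equivalents are unknown.
interface ResetDeps {
  // Preserve the transcript, e.g. as <sid>.jsonl.reset.<timestamp>.
  snapshotTranscript(sessionId: string, timestamp: string): void;
  // Remove the session from the active session store.
  dropSession(sessionId: string): void;
  // Send a message back to the user on the originating channel.
  reply(text: string): void;
}

// Message text quoted from the issue (the role-ordering reset message).
const CORRUPTION_RESET_TEXT =
  "⚠️ Session history was corrupted. I've reset the conversation - please try again!";

function resetCorruptedSession(
  sessionId: string,
  timestamp: string,
  deps: ResetDeps
): void {
  deps.snapshotTranscript(sessionId, timestamp); // 1. keep the evidence
  deps.dropSession(sessionId); // 2. the next send starts a fresh session
  deps.reply(CORRUPTION_RESET_TEXT); // 3. tell the user what happened
}
```

Snapshotting first means the broken transcript survives for debugging even if a later step throws.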

Without this, the user only sees the generic "Something went wrong" text and has no idea they need to /new — and on a Telegram channel they may not even know /new is an option. In my case the loop persisted across ~4 turns over hours.

Expected behaviour (issue 2: post-reset confirmation ping)

Related UX gap from the same incident. After typing /new, the user gets only a system-rendered "✅ New session started." and no further signal. After watching repeated error messages, that acknowledgement reads ambiguously — did the agent actually boot? Is it waiting? Is it broken in a different way?

In my case I waited ~19 minutes after /new before sending another message because there was no proof the agent was alive on the new session. The moment I did send a message, the agent replied instantly — so the round-trip was always working post-reset, but I had no way to know that without gambling another message.

Suggested fix: after /new (or after any auto-reset path), have the runtime emit a brief agent-side ping like "👋 Fresh session — ready when you are." so the user sees a real round-trip and knows the new session works. Same fix benefits the auto-detect path in issue 1.

Root cause guard (optional, broader fix)

The persistence layer probably should not write an orphan assistant entry in the first place. If the assistant turn errored before producing content, either:

  • Don't persist the assistant entry at all (rollback the turn), or
  • Persist a marker that triggers the corruption-recovery path on the next send.

Currently the on-disk shape (stopReason:"error", no preceding user turn) is a state the schema permits but the API rejects forever. Treating it as terminal-corrupt on read would fix this class of issue without needing the API call to fail first.
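
A read-time check along these lines might look like the following. It is a sketch under assumptions: the function names are hypothetical, and the only transcript fields it relies on are "type" and "message.role", as they appear in the snippet earlier in this issue.

```typescript
// Simplified, assumed view of one JSONL transcript line.
interface TranscriptEntry {
  type?: string;
  message?: { role?: string };
}

// Parses one line; returns null for blank or malformed lines.
function parseEntry(line: string): TranscriptEntry | null {
  try {
    const value: unknown = JSON.parse(line);
    return typeof value === "object" && value !== null
      ? (value as TranscriptEntry)
      : null;
  } catch {
    return null;
  }
}

// True when the first "message" entry in the transcript has role "assistant",
// i.e. no user turn precedes it. An empty or message-free transcript is fine.
function startsWithAssistantTurn(jsonl: string): boolean {
  for (const line of jsonl.split("\n")) {
    const entry = parseEntry(line);
    if (entry && entry.type === "message" && entry.message) {
      return entry.message.role === "assistant";
    }
  }
  return false;
}
```

Running such a check when a session is loaded would let the runtime take the recovery path before sending a request that the API is going to reject.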

Environment

  • openclaw 2026.4.27 (cbc2ba0)
  • Provider: anthropic, model: claude-opus-4-7
  • Channel: telegram (lane session:agent:main:telegram:direct:<id>)

extent analysis

TL;DR

Affected sessions could recover automatically if the runtime gained an auto-reset path for leading-assistant / no-user-message transcripts: detect the condition from the 400 error and the transcript state, then reset the session. Until then, the only way out is a manual /new.

Guidance

  • Implement a new auto-reset path in the runtime to detect and reset sessions with leading-assistant / no-user-message transcripts, based on the 400 error and transcript state.
  • Update the detection logic to check for providerErrorType === "invalid_request_error" and providerErrorMessagePreview starting with "messages: at least one message is required".
  • When detected, snapshot the bad transcript, drop the session from the active session store, and reply to the user with a friendly reset message.
  • Consider adding a post-reset confirmation ping to emit a brief agent-side message after /new or auto-reset, to confirm the new session is working.

Example

No code snippet is provided as the issue requires changes to the runtime logic and detection mechanisms.

Notes

The root cause of the issue is the persistence of an orphan assistant entry in the transcript, which can be addressed by either not persisting the entry or persisting a marker that triggers corruption recovery.

Recommendation

Implement the new auto-reset path and detection logic first; it should let a broken session recover without requiring changes to the underlying persistence layer. The root cause guard can follow as a broader fix.
