openclaw - 💡(How to fix) Fix bug: research session can get stuck until /new or /reset [1 participants]

openclaw/openclaw#62354 · Fetched 2026-04-08 03:05:34 · Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


bug: research agent can get stuck on a bad session until /new or /reset

Summary

A research agent session in a Feishu group became permanently unhealthy after an upstream provider error. While the same agent, same model, same deployment, and same group all worked normally after /new, the old session kept failing on every subsequent message.

This strongly suggests a session continuation / recovery bug in OpenClaw rather than a pure provider outage.

Impact

When a session enters this state:

  • every subsequent user message in that session fails
  • failures can persist for a long time
  • simple short messages like hello also fail
  • users are effectively trapped until they manually run /new or /reset

Environment

  • OpenClaw runtime: local install on macOS
  • Channel: Feishu group
  • Agent: research
  • Model provider: litellm
  • Model: azure-gpt-5.4
  • API style: openai-responses
  • Same deployment/path as working requests

What happened

Broken session

  • session id: 9f77c944-6355-4e0f-91a8-0fd78a60d375
  • archived transcript:
    • /Users/eggie/.openclaw/agents/research/sessions/9f77c944-6355-4e0f-91a8-0fd78a60d375.jsonl.reset.2026-04-07T06-35-45.861Z

Recovered session

  • new session id after /new: 7765088f-31fa-4ceb-97ff-641ea934815e
  • current transcript:
    • /Users/eggie/.openclaw/agents/research/sessions/7765088f-31fa-4ceb-97ff-641ea934815e.jsonl

Reproduction pattern

  1. Use an existing research group session.
  2. A provider error occurs once.
  3. Continue sending normal short messages in the same session.
  4. Every follow-up request keeps failing.
  5. Run /new or /reset.
  6. The same agent in the same group immediately works again.

Observed behavior

In the broken session, all assistant turns repeatedly had:

  • stopReason = error
  • usage.input = 0
  • usage.output = 0
  • content = []

The provider-facing error message was repeatedly:

The system is currently experiencing high demand and cannot process your request. Your request exceeds the maximum usage size allowed during peak load. For improved capacity reliability, consider switching to Provisioned Throughput.

Repeated failure timestamps in the broken session:

  • 2026-04-07T06:15:56.740Z
  • 2026-04-07T06:16:25.729Z
  • 2026-04-07T06:16:54.796Z
  • 2026-04-07T06:17:16.606Z
  • 2026-04-07T06:19:55.289Z
  • 2026-04-07T06:23:33.047Z
  • 2026-04-07T06:35:19.999Z
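To make "failures can persist for a long time" concrete, the gaps between the timestamps above can be computed directly. This is only an arithmetic sketch over the listed values; the helper names `gapsInSeconds` and `spanMs` are illustrative and not part of OpenClaw.

```javascript
// Failure timestamps copied from the broken session above.
const failureTimes = [
  "2026-04-07T06:15:56.740Z",
  "2026-04-07T06:16:25.729Z",
  "2026-04-07T06:16:54.796Z",
  "2026-04-07T06:17:16.606Z",
  "2026-04-07T06:19:55.289Z",
  "2026-04-07T06:23:33.047Z",
  "2026-04-07T06:35:19.999Z",
];

// Gap in seconds between each failure and the one before it.
function gapsInSeconds(isoTimes) {
  const ms = isoTimes.map((t) => Date.parse(t));
  return ms.slice(1).map((t, i) => (t - ms[i]) / 1000);
}

// Time from the first to the last failure, in milliseconds.
function spanMs(isoTimes) {
  return Date.parse(isoTimes[isoTimes.length - 1]) - Date.parse(isoTimes[0]);
}

console.log(gapsInSeconds(failureTimes));
console.log(spanMs(failureTimes) / 60000); // span in minutes
```

The seven failures span roughly 19 minutes, with gaps ranging from about 22 seconds to almost 12 minutes, so the failures were not confined to a single short burst.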

Messages that still failed in the bad session included very small inputs such as:

  • 你好,你在么
  • 你在么
  • hello
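The failure signature described above can be checked mechanically against a transcript. The sketch below is a rough illustration, not OpenClaw's actual transcript schema: it assumes each line of the `.jsonl` file is a JSON object, that assistant turns carry `role: "assistant"`, and that `stopReason`, `usage`, and `content` sit at the top level of the entry. The `"stop"` value in the sample is also an assumption. Adjust the field access to the real layout.

```javascript
// True when a turn matches the failure signature reported in this issue.
function isZeroUsageError(turn) {
  return (
    turn.stopReason === "error" &&
    turn.usage?.input === 0 &&
    turn.usage?.output === 0 &&
    Array.isArray(turn.content) &&
    turn.content.length === 0
  );
}

// Count how many assistant turns at the end of the transcript failed in a row.
function trailingFailureStreak(jsonlText) {
  const turns = jsonlText
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.role === "assistant"); // assumed field
  let streak = 0;
  for (let i = turns.length - 1; i >= 0 && isZeroUsageError(turns[i]); i--) {
    streak++;
  }
  return streak;
}

// Hypothetical sample: one healthy turn followed by three failing turns.
const failing = { role: "assistant", stopReason: "error", usage: { input: 0, output: 0 }, content: [] };
const sampleTranscript = [
  { role: "user", content: ["hi"] },
  { role: "assistant", stopReason: "stop", usage: { input: 12, output: 5 }, content: ["hello"] },
  { role: "user", content: ["hello"] },
  failing,
  { role: "user", content: ["hello"] },
  failing,
  { role: "user", content: ["hello"] },
  failing,
].map((entry) => JSON.stringify(entry)).join("\n");

console.log(trailingFailureStreak(sampleTranscript)); // 3
```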

Expected behavior

After a transient provider failure, the session should either:

  • recover normally on the next message, or
  • be marked unhealthy and automatically rolled to a fresh session, or
  • at minimum surface a targeted instruction that the session is unhealthy and needs reset

Users should not remain stuck in a permanently bad session.

Why this looks like an OpenClaw bug

This does not look like a simple shared-provider outage because:

  • the main agent could still answer during the same broader time window
  • the research agent in a new session worked immediately
  • the same research agent, same model, same deployment, same group only failed when continuing the old session

That strongly suggests the key variable is continuing the old session, not the deployment itself.

Additional evidence

The old broken session kept producing 0 input / 0 output usage, which suggests the request may have been rejected before normal generation/usage accounting. But even if the provider error is real, OpenClaw appears to keep reusing a poisoned or unrecoverable session state instead of breaking out of it.

Code paths worth inspecting

Relevant bundled code paths observed locally:

  • dist/pi-embedded-D6PpOsxP.js

High-interest areas:

  1. initSessionState(...)
    • reuses the prior sessionId when the session is still considered fresh
  2. /new / /reset handling
    • creates a new sessionId and archives the old transcript
  3. updateSessionStoreAfterAgentRun(...)
    • writes session run results and state back into sessions.json
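One way to picture the interaction between these code paths: if session reuse depends only on freshness, a session that keeps failing still looks reusable. The sketch below is hypothetical and does not reflect the real signature or internals of `initSessionState`. It assumes freshness is an idle-time check and that the session store entry records a `consecutiveFailures` counter; `chooseSessionId`, `updatedAtMs`, `idleLimitMs`, and `maxConsecutiveFailures` are invented names for illustration.

```javascript
// Decide whether to continue the stored session or start a new one.
// entry: { sessionId, updatedAtMs, consecutiveFailures } or undefined (hypothetical shape).
function chooseSessionId(entry, nowMs, options, createId) {
  if (entry === undefined) return createId();
  const fresh = nowMs - entry.updatedAtMs <= options.idleLimitMs;
  const healthy = (entry.consecutiveFailures ?? 0) < options.maxConsecutiveFailures;
  // Reuse only when the session is both fresh and not stuck in a failure streak.
  return fresh && healthy ? entry.sessionId : createId();
}

const reuseOptions = { idleLimitMs: 60 * 60 * 1000, maxConsecutiveFailures: 3 };
const stuckEntry = { sessionId: "old-session", updatedAtMs: 1000, consecutiveFailures: 3 };

// A fresh but repeatedly failing session is rolled over instead of reused.
console.log(chooseSessionId(stuckEntry, 2000, reuseOptions, () => "new-session"));
```

With a check like this, the counter would have to be written by whatever persists run results, which in this issue is described as `updateSessionStoreAfterAgentRun(...)`.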

Hypotheses

1. Bad session continuation state is being reused

After certain provider errors, the session may retain internal continuation state that should have been cleared.

2. Old-session payload construction becomes invalid

The old session may be reconstructing payload/history differently from a fresh session in a way that causes repeated rejection.

3. Fatal provider errors do not trigger a recovery path

Repeated 0-token + error runs may need special handling, but currently the session remains active and keeps being reused.

Suggested fixes

Product behavior

  • Detect repeated stopReason=error with usage.input=0 and usage.output=0
  • After N consecutive failures, mark the session unhealthy
  • Automatically roll to a fresh session, or explicitly prompt the user to reset
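The product behavior above could be tracked per session rather than globally. The following is a minimal sketch under stated assumptions: `SessionHealthTracker` is a hypothetical class, the default threshold of 3 is arbitrary, and the turn shape (`stopReason`, `usage.input`, `usage.output`) follows the fields quoted in this issue.

```javascript
// Tracks consecutive zero-usage error turns per session id.
class SessionHealthTracker {
  constructor(maxConsecutiveFailures = 3) {
    this.max = maxConsecutiveFailures;
    this.streaks = new Map(); // sessionId -> current failure streak
  }

  // Record one assistant turn and return "ok", "failing", or "unhealthy".
  record(sessionId, turn) {
    const failed =
      turn.stopReason === "error" &&
      turn.usage.input === 0 &&
      turn.usage.output === 0;
    if (!failed) {
      this.streaks.delete(sessionId); // any non-matching turn ends the streak
      return "ok";
    }
    const streak = (this.streaks.get(sessionId) ?? 0) + 1;
    this.streaks.set(sessionId, streak);
    return streak >= this.max ? "unhealthy" : "failing";
  }

  // Forget a session, for example after /new or /reset.
  reset(sessionId) {
    this.streaks.delete(sessionId);
  }
}

const tracker = new SessionHealthTracker(3);
const errorTurn = { stopReason: "error", usage: { input: 0, output: 0 } };
console.log(tracker.record("s1", errorTurn)); // "failing"
console.log(tracker.record("s1", errorTurn)); // "failing"
console.log(tracker.record("s1", errorTurn)); // "unhealthy"
```

When `record` returns `"unhealthy"`, the caller would either roll to a fresh session or prompt the user to reset, as listed above.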

Diagnostics

Add debug logging for session-continuation requests:

  • session id used for each run
  • message count in reconstructed history
  • payload size / prompt size
  • provider request id
  • HTTP status / retry metadata
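As an illustration of what such a log record might contain, the sketch below builds one structured entry per continuation request. The function name `buildContinuationLog`, the event name, and all input field names are assumptions made for this example; only the list of fields to capture comes from the issue.

```javascript
// Build one structured log record for a session-continuation request.
function buildContinuationLog({ sessionId, history, requestBody, providerRequestId, httpStatus, retryCount }) {
  const serialized = JSON.stringify(requestBody);
  return {
    event: "session_continuation_request", // hypothetical event name
    sessionId,
    messageCount: history.length,
    // UTF-8 byte length, so non-ASCII prompts are measured accurately.
    payloadBytes: new TextEncoder().encode(serialized).length,
    providerRequestId: providerRequestId ?? null,
    httpStatus: httpStatus ?? null,
    retryCount: retryCount ?? 0,
  };
}

const exampleLog = buildContinuationLog({
  sessionId: "example-session",
  history: [{ role: "user" }, { role: "assistant" }],
  requestBody: { input: "你好" },
  httpStatus: 429,
});
console.log(JSON.stringify(exampleLog));
```

Comparing `messageCount` and `payloadBytes` between a broken session and a fresh one would help confirm or rule out the payload-construction hypothesis.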

Recovery safety

Treat some provider errors as continuation-breaking events, not just normal turn failures.
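A classification step along these lines might look like the sketch below. It is illustrative only: the pattern list, the category names, and the function `classifyProviderError` are assumptions, and whether the "maximum usage size" error from this issue truly belongs in the continuation-breaking bucket is a judgment call that would need confirmation from real provider responses.

```javascript
// Message patterns treated as continuation-breaking (illustrative, not exhaustive).
const CONTINUATION_BREAKING_PATTERNS = [
  /exceeds the maximum usage size/i, // wording taken from the error in this issue
];

// Classify a provider error as "continuation-breaking", "transient", or "turn-failure".
function classifyProviderError(message, httpStatus) {
  if (CONTINUATION_BREAKING_PATTERNS.some((pattern) => pattern.test(message))) {
    // Retrying with the same session history is likely to hit the same rejection.
    return "continuation-breaking";
  }
  if (httpStatus === 429 || (httpStatus >= 500 && httpStatus <= 599)) {
    return "transient"; // rate limiting or server-side errors
  }
  return "turn-failure";
}

const observedMessage =
  "The system is currently experiencing high demand and cannot process your request. " +
  "Your request exceeds the maximum usage size allowed during peak load.";
console.log(classifyProviderError(observedMessage, 429)); // "continuation-breaking"
```

A continuation-breaking result would then feed the same recovery path as a session marked unhealthy, instead of being recorded as an ordinary failed turn.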

Minimal artifact set

  • broken archived transcript:
    • /Users/eggie/.openclaw/agents/research/sessions/9f77c944-6355-4e0f-91a8-0fd78a60d375.jsonl.reset.2026-04-07T06-35-45.861Z
  • recovered live transcript:
    • /Users/eggie/.openclaw/agents/research/sessions/7765088f-31fa-4ceb-97ff-641ea934815e.jsonl
  • current session store:
    • /Users/eggie/.openclaw/agents/research/sessions/sessions.json
  • local diagnostic note:
    • /Users/eggie/.openclaw/workspace/reports/research-session-bug-2026-04-07.md

Short version

A session can become permanently bad after a provider error, and OpenClaw appears to keep reusing that bad session until the user manually resets it. A fresh session immediately fixes the problem. That behavior strongly suggests a session recovery bug.

extent analysis

TL;DR

Detect and handle repeated errors in a session by marking it unhealthy and automatically rolling to a fresh session or prompting the user to reset.

Guidance

  1. Implement session health tracking: Monitor sessions for consecutive failures with stopReason=error, usage.input=0, and usage.output=0, and mark them as unhealthy after a certain threshold.
  2. Automate session recovery: When a session is marked unhealthy, automatically create a new session or prompt the user to reset the current one to prevent continued failures.
  3. Enhance diagnostics: Add logging for session continuation requests, including session ID, message count, payload size, and provider request ID, to better understand the issue.
  4. Review session state management: Inspect code paths like initSessionState and updateSessionStoreAfterAgentRun to ensure proper session state handling and potential clearance of continuation state after errors.

Example

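// Sketch: track consecutive zero-usage error turns for a single session.
// The two hooks are hypothetical; wire them to the real recovery logic.
const MAX_CONSECUTIVE_FAILURES = 3;
let consecutiveFailures = 0;

// Call once per completed assistant turn.
function recordTurnResult({ stopReason, usage }, hooks) {
  const zeroUsageError =
    stopReason === 'error' && usage.input === 0 && usage.output === 0;
  if (!zeroUsageError) {
    consecutiveFailures = 0; // any other outcome ends the streak
    return;
  }
  consecutiveFailures += 1;
  if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
    // Mark the session as unhealthy and recover
    hooks.markSessionUnhealthy();
    hooks.createNewSessionOrPromptReset();
    consecutiveFailures = 0; // start counting again for the next session
  }
}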

Notes

  • The exact implementation details may vary based on the specific requirements and constraints of the OpenClaw system.
  • It's crucial to test the recovery mechanism thoroughly to ensure it correctly handles various scenarios and doesn't introduce new issues.

Recommendation

Apply the suggested fixes, particularly focusing on detecting and handling repeated errors in sessions, to improve the robustness of the OpenClaw system against provider errors and session corruption.
