
openclaw - 💡(How to fix) Fix bug: research session can get stuck until /new or /reset [1 participants]

openclaw • 2026-04-07 07:14:31


GitHub stats

openclaw/openclaw#62354•Fetched 2026-04-08 03:05:34


Author

eggie-1988

Participants

eggie-1988

A research agent session in a Feishu group became permanently unhealthy after an upstream provider error. While the same agent, same model, same deployment, and same group all worked normally after /new, the old session kept failing on every subsequent message.



bug: research agent can get stuck on a bad session until /new or /reset

Summary

This strongly suggests a session continuation / recovery bug in OpenClaw rather than a pure provider outage.

Impact

When a session enters this state:

- every subsequent user message in that session fails
- failures can persist for a long time
- simple short messages like hello also fail
- users are effectively trapped until they manually run /new or /reset

Environment

- OpenClaw runtime: local install on macOS
- Channel: Feishu group
- Agent: research
- Model provider: litellm
- Model: azure-gpt-5.4
- API style: openai-responses
- Same deployment/path as working requests

What happened

Broken session

session id: 9f77c944-6355-4e0f-91a8-0fd78a60d375
archived transcript:
- /Users/eggie/.openclaw/agents/research/sessions/9f77c944-6355-4e0f-91a8-0fd78a60d375.jsonl.reset.2026-04-07T06-35-45.861Z

Recovered session

new session id after /new: 7765088f-31fa-4ceb-97ff-641ea934815e
current transcript:
- /Users/eggie/.openclaw/agents/research/sessions/7765088f-31fa-4ceb-97ff-641ea934815e.jsonl

Reproduction pattern

1. Use an existing research group session.
2. A provider error occurs once.
3. Continue sending normal short messages in the same session.
4. Every follow-up request keeps failing.
5. Run /new or /reset.
6. The same agent in the same group immediately works again.

Observed behavior

In the broken session, all assistant turns repeatedly had:

stopReason = error
usage.input = 0
usage.output = 0
content = []

The provider-facing error message was repeatedly:

The system is currently experiencing high demand and cannot process your request. Your request exceeds the maximum usage size allowed during peak load. For improved capacity reliability, consider switching to Provisioned Throughput.

Repeated failure timestamps in the broken session:

2026-04-07T06:15:56.740Z
2026-04-07T06:16:25.729Z
2026-04-07T06:16:54.796Z
2026-04-07T06:17:16.606Z
2026-04-07T06:19:55.289Z
2026-04-07T06:23:33.047Z
2026-04-07T06:35:19.999Z
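
To make the duration of the failure window concrete, the timestamps above can be summarized with a few lines of JavaScript. This is only an illustrative sketch: `summarizeFailures` is a hypothetical helper, not part of OpenClaw, and the input is just the list copied from this report.

```javascript
// Failure timestamps copied from the broken session above.
const failureTimestamps = [
  "2026-04-07T06:15:56.740Z",
  "2026-04-07T06:16:25.729Z",
  "2026-04-07T06:16:54.796Z",
  "2026-04-07T06:17:16.606Z",
  "2026-04-07T06:19:55.289Z",
  "2026-04-07T06:23:33.047Z",
  "2026-04-07T06:35:19.999Z",
];

// Hypothetical helper: returns the gaps (in seconds) between consecutive
// failures and the total span from the first failure to the last one.
function summarizeFailures(timestamps) {
  const millis = timestamps.map((t) => Date.parse(t));
  const gapsSeconds = [];
  for (let i = 1; i < millis.length; i++) {
    gapsSeconds.push((millis[i] - millis[i - 1]) / 1000);
  }
  const spanSeconds = (millis[millis.length - 1] - millis[0]) / 1000;
  return { count: millis.length, gapsSeconds, spanSeconds };
}

const summary = summarizeFailures(failureTimestamps);
console.log(summary.count, summary.spanSeconds, summary.gapsSeconds);
```

For these seven entries the span works out to roughly 19 minutes 23 seconds, which is consistent with the claim that the session did not recover on its own.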

Messages that still failed in the bad session included very small inputs such as:

你好，你在么 ("Hello, are you there?")
你在么 ("Are you there?")
hello
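
The failure signature described in this section (stopReason = error, zero input and output usage, empty content) could be checked mechanically against a transcript. The sketch below assumes each JSONL line is one turn shaped like `{ role, stopReason, usage: { input, output }, content }`; the `role` field and the flat layout are assumptions, since the real transcript schema is not shown in this report, and the sample data is synthetic.

```javascript
// Assumed turn shape (not confirmed against OpenClaw's real transcript schema):
// { role, stopReason, usage: { input, output }, content }
function isZeroUsageError(turn) {
  return (
    turn.stopReason === "error" &&
    turn.usage?.input === 0 &&
    turn.usage?.output === 0 &&
    Array.isArray(turn.content) &&
    turn.content.length === 0
  );
}

// Counts how many assistant turns at the END of the transcript match the signature.
function countTrailingZeroUsageErrors(jsonlText) {
  const assistantTurns = jsonlText
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.role === "assistant");

  let count = 0;
  for (let i = assistantTurns.length - 1; i >= 0; i--) {
    if (!isZeroUsageError(assistantTurns[i])) break;
    count++;
  }
  return count;
}

// Synthetic sample data for illustration only.
const sampleTranscript = [
  { role: "user", content: [{ type: "text", text: "first question" }] },
  { role: "assistant", stopReason: "stop", usage: { input: 120, output: 40 }, content: [{ type: "text", text: "answer" }] },
  { role: "user", content: [{ type: "text", text: "hello" }] },
  { role: "assistant", stopReason: "error", usage: { input: 0, output: 0 }, content: [] },
  { role: "user", content: [{ type: "text", text: "hello" }] },
  { role: "assistant", stopReason: "error", usage: { input: 0, output: 0 }, content: [] },
]
  .map((entry) => JSON.stringify(entry))
  .join("\n");

console.log(countTrailingZeroUsageErrors(sampleTranscript)); // 2
```

Counting only the trailing run matters here: a single failed turn followed by a success is an ordinary transient error, while an unbroken tail of zero-usage errors is the stuck state this report describes.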

Expected behavior

After a transient provider failure, the session should either:

- recover normally on the next message, or
- be marked unhealthy and automatically rolled to a fresh session, or
- at minimum surface a targeted instruction that the session is unhealthy and needs reset

Users should not remain stuck in a permanently bad session.

Why this looks like an OpenClaw bug

This does not look like a simple shared-provider outage because:

- the main agent could still answer during the same broader time window
- the research agent in a new session worked immediately
- the same research agent, same model, same deployment, same group only failed when continuing the old session

That strongly suggests the key variable is continuing the old session, not the deployment itself.

Additional evidence

The old broken session kept producing 0 input / 0 output usage, which suggests the request may have been rejected before normal generation/usage accounting. But even if the provider error is real, OpenClaw appears to keep reusing a poisoned or unrecoverable session state instead of breaking out of it.

Code paths worth inspecting

Relevant bundled code paths observed locally:

dist/pi-embedded-D6PpOsxP.js

High-interest areas:

- initSessionState(...): reuses the prior sessionId when the session is still considered fresh
- /new and /reset handling: creates a new sessionId and archives the old transcript
- updateSessionStoreAfterAgentRun(...): writes session run results and state back into sessions.json
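
To illustrate how a health flag could interact with the reuse decision described above, here is a minimal sketch. Everything in it is hypothetical: `resolveSession`, the `unhealthy` and `updatedAtMs` fields, and the one-hour freshness window are assumptions made for illustration and are not taken from OpenClaw's source.

```javascript
// Hypothetical sketch; none of these names come from OpenClaw's source.
const FRESHNESS_WINDOW_MS = 60 * 60 * 1000; // assumed freshness window

// `stored` is the previous session record (or undefined if there is none).
// `createId` is injected so the sketch does not depend on a specific ID library.
function resolveSession(stored, nowMs, createId) {
  const isFresh =
    stored !== undefined && nowMs - stored.updatedAtMs < FRESHNESS_WINDOW_MS;
  if (isFresh && !stored.unhealthy) {
    // Normal continuation: keep the prior sessionId.
    return { sessionId: stored.sessionId, reused: true };
  }
  // Stale, missing, or unhealthy: start over with a fresh sessionId,
  // which is what /new or /reset does manually today.
  return { sessionId: createId(), reused: false };
}

// Session IDs below are the ones quoted in this report.
const stored = {
  sessionId: "9f77c944-6355-4e0f-91a8-0fd78a60d375",
  updatedAtMs: Date.parse("2026-04-07T06:35:19.999Z"),
  unhealthy: true, // hypothetical flag set after repeated zero-usage errors
};
const now = Date.parse("2026-04-07T06:35:45.861Z");
const next = resolveSession(stored, now, () => "7765088f-31fa-4ceb-97ff-641ea934815e");
console.log(next);
```

The point of the sketch is that freshness alone is not a sufficient condition for reuse: a session can be recent and still be unusable.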

Hypotheses

1. Bad session continuation state is being reused

After certain provider errors, the session may retain internal continuation state that should have been cleared.

2. Old-session payload construction becomes invalid

The old session may be reconstructing payload/history differently from a fresh session in a way that causes repeated rejection.

3. Fatal provider errors do not trigger a recovery path

Repeated 0-token + error runs may need special handling, but currently the session remains active and keeps being reused.

Suggested fixes

Product behavior

- Detect repeated stopReason=error with usage.input=0 and usage.output=0
- After N consecutive failures, mark the session unhealthy
- Automatically roll to a fresh session, or explicitly prompt the user to reset

Diagnostics

Add debug logging for session-continuation requests:

- session id used for each run
- message count in reconstructed history
- payload size / prompt size
- provider request id
- HTTP status / retry metadata
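
One possible shape for such a log record is sketched below. The function name, field names, and event label are illustrative assumptions, not OpenClaw's logging schema, and the request ID and HTTP status in the usage example are placeholders rather than values observed in this incident.

```javascript
// Hypothetical sketch: field names are illustrative, not OpenClaw's logging schema.
function buildContinuationLog({ sessionId, messages, providerRequestId, httpStatus, retryCount }) {
  // Byte size of the serialized history, as a rough proxy for payload size.
  const payloadBytes = new TextEncoder().encode(JSON.stringify(messages)).length;
  return {
    event: "session-continuation-request",
    sessionId,
    messageCount: messages.length,
    payloadBytes,
    providerRequestId: providerRequestId ?? null,
    httpStatus: httpStatus ?? null,
    retryCount: retryCount ?? 0,
  };
}

const logEntry = buildContinuationLog({
  sessionId: "9f77c944-6355-4e0f-91a8-0fd78a60d375",
  messages: [{ role: "user", content: "hello" }],
  providerRequestId: "req-example", // placeholder value
  httpStatus: 429, // placeholder value
});
console.log(JSON.stringify(logEntry));
```

Logging message count and payload size per run would make it possible to compare a continued session against a fresh one directly, which bears on the hypothesis that the old session's reconstructed payload is what gets rejected.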

Recovery safety

Treat some provider errors as continuation-breaking events, not just normal turn failures.
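
A minimal sketch of that distinction follows. The pattern list is an assumption derived from the single provider error message quoted in this report; whether request size is really what makes the old session fail is unconfirmed, and `classifyProviderError` is a hypothetical helper.

```javascript
// Assumed patterns, derived only from the error message quoted in this report.
const CONTINUATION_BREAKING_PATTERNS = [/exceeds the maximum usage size/i];

// Returns "continuation-breaking" when resending the same history is likely to
// fail again, and "turn-failure" for errors that a plain retry might survive.
function classifyProviderError(errorMessage) {
  const breaksContinuation = CONTINUATION_BREAKING_PATTERNS.some((pattern) =>
    pattern.test(errorMessage)
  );
  return breaksContinuation ? "continuation-breaking" : "turn-failure";
}

const observedMessage =
  "The system is currently experiencing high demand and cannot process your request. " +
  "Your request exceeds the maximum usage size allowed during peak load. " +
  "For improved capacity reliability, consider switching to Provisioned Throughput.";

console.log(classifyProviderError(observedMessage)); // continuation-breaking
console.log(classifyProviderError("connection reset by peer")); // turn-failure
```

A continuation-breaking result would be the trigger for rolling the session or prompting a reset, while an ordinary turn failure would leave the session in place.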

Minimal artifact set

broken archived transcript:
- /Users/eggie/.openclaw/agents/research/sessions/9f77c944-6355-4e0f-91a8-0fd78a60d375.jsonl.reset.2026-04-07T06-35-45.861Z
recovered live transcript:
- /Users/eggie/.openclaw/agents/research/sessions/7765088f-31fa-4ceb-97ff-641ea934815e.jsonl
current session store:
- /Users/eggie/.openclaw/agents/research/sessions/sessions.json
local diagnostic note:
- /Users/eggie/.openclaw/workspace/reports/research-session-bug-2026-04-07.md

Short version

A session can become permanently bad after a provider error, and OpenClaw appears to keep reusing that bad session until the user manually resets it. A fresh session immediately fixes the problem. That behavior strongly suggests a session recovery bug.

extent analysis

TL;DR

Detect and handle repeated errors in a session by marking it unhealthy and automatically rolling to a fresh session or prompting the user to reset.

Guidance

- Implement session health tracking: Monitor sessions for consecutive failures with stopReason=error, usage.input=0, and usage.output=0, and mark them as unhealthy after a certain threshold.
- Automate session recovery: When a session is marked unhealthy, automatically create a new session or prompt the user to reset the current one to prevent continued failures.
- Enhance diagnostics: Add logging for session continuation requests, including session ID, message count, payload size, and provider request ID, to better understand the issue.
- Review session state management: Inspect code paths like initSessionState and updateSessionStoreAfterAgentRun to ensure proper session state handling and potential clearance of continuation state after errors.

Example

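// Sketch of per-session health tracking (names are hypothetical, not OpenClaw APIs)
const MAX_CONSECUTIVE_FAILURES = 3;

// Call once per completed assistant run.
// `session` holds { consecutiveFailures, unhealthy }; `run` holds { stopReason, usage }.
function recordRunResult(session, run) {
  const zeroUsageError =
    run.stopReason === 'error' && run.usage.input === 0 && run.usage.output === 0;
  if (!zeroUsageError) {
    session.consecutiveFailures = 0; // any other outcome resets the streak
    return session;
  }
  session.consecutiveFailures += 1;
  if (session.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
    // The caller should roll to a fresh session or prompt the user to reset.
    session.unhealthy = true;
  }
  return session;
}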

Notes

- The exact implementation details may vary based on the specific requirements and constraints of the OpenClaw system.
- It's crucial to test the recovery mechanism thoroughly to ensure it correctly handles various scenarios and doesn't introduce new issues.

Recommendation

Apply the suggested fixes, particularly focusing on detecting and handling repeated errors in sessions, to improve the robustness of the OpenClaw system against provider errors and session corruption.


#api #task chaining #parallel task #integration issue #index setup
