openclaw - 💡(How to fix) Fix Bug: session stuck in status:"running" after surface_error from failover chain [1 comments, 2 participants]

openclaw2026-04-29 20:38:27

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#74607•Fetched 2026-04-30 06:22:23

View on GitHub

Comments

Participants

Timeline

Reactions

Author

millerc79

Participants

millerc79

steipete

Timeline (top)

closed ×1commented ×1cross-referenced ×1

When the failover chain surfaces surface_error reason=timeout (commonly downstream of #73789, but in principle any cause), the session-state machine does not transition out of status: "running". Subsequent user messages on the session don't process — the gateway treats them as queued behind a run that will never complete. The session is wedged until manually patched in sessions.json and the gateway restarted.

This is downstream of #73789 / 1dbc250b1a / #73930. Those address the cause (empty input). This is about the lifecycle gap that lets the wedge persist independent of which upstream cause produced the surface_error.

Error Message

2026-04-29T13:02:07.998-05:00 [agent/embedded] embedded run failover decision: runId=f8ec3197-... stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4-mini profile=sha256:c21d1bebb3b0

After this entry: no further agent/embedded activity for the session, no transition to a done state, no error message delivered to the channel. Diagnostic stuck-session warnings keep firing every 30s but nothing recovers:

2026-04-29T14:03:32.509-05:00 [diagnostic] stuck session: sessionId=main sessionKey=agent:main:telegram:direct:... state=processing age=143s queueDepth=0
2026-04-29T14:04:02.513-05:00 [diagnostic] stuck session: ... age=173s
2026-04-29T14:04:32.512-05:00 [diagnostic] stuck session: ... age=203s

Root Cause

Fix Action

Workaround

Back up sessions.json, set status: "running" → "done" and abortedLastRun: false → true on the affected session key, restart the gateway. Session resumes accepting messages immediately.

Code Example

status: "running"
abortedLastRun: false
updatedAt: <hours ago, frozen at the failover timestamp>

---

2026-04-29T13:02:07.998-05:00 [agent/embedded] embedded run failover decision: runId=f8ec3197-... stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4-mini profile=sha256:c21d1bebb3b0

---

2026-04-29T14:03:32.509-05:00 [diagnostic] stuck session: sessionId=main sessionKey=agent:main:telegram:direct:... state=processing age=143s queueDepth=0
2026-04-29T14:04:02.513-05:00 [diagnostic] stuck session: ... age=173s
2026-04-29T14:04:32.512-05:00 [diagnostic] stuck session: ... age=203s

RAW_BUFFERClick to expand / collapse

Summary

Environment

OpenClaw: 2026.4.26
Provider/model: primary openai-codex/gpt-5.4, fallbacks anthropic/claude-sonnet-4-6, openai-codex/gpt-5.4-mini
Surface: Telegram direct session (also reproduces for cron-driven sessions and agent CLI)
Runtime: gateway via Node 25.9.0 on macOS 26.4.1

Repro

From a Telegram session bound to openai-codex/gpt-5.4, send a message that triggers the empty-input bug from #73789 (or any other cause that walks the failover chain to surface_error).
Observe the failover chain run and surface surface_error reason=timeout.
Send another message on the same session.

Expected

After surface_error, the session transitions to a terminal state (e.g. done with an error marker), an error is delivered to the channel, and the next message is accepted normally.

Actual

The session remains in status: "running" in sessions.json. New messages don't process. From the user side this looks like the bot typing then never responding. Gateway restarts don't recover it — the status field is persisted to disk and rehydrated on startup.

Concrete state in ~/.openclaw/agents/main/sessions/sessions.json after the wedge:

status: "running"
abortedLastRun: false
updatedAt: <hours ago, frozen at the failover timestamp>

Logs

2026-04-29T13:02:07.998-05:00 [agent/embedded] embedded run failover decision: runId=f8ec3197-... stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4-mini profile=sha256:c21d1bebb3b0

2026-04-29T14:03:32.509-05:00 [diagnostic] stuck session: sessionId=main sessionKey=agent:main:telegram:direct:... state=processing age=143s queueDepth=0
2026-04-29T14:04:02.513-05:00 [diagnostic] stuck session: ... age=173s
2026-04-29T14:04:32.512-05:00 [diagnostic] stuck session: ... age=203s

Suspected cause

The session-state machine has transitions for normal completion and explicit abort, but appears to lack a transition out of running for the surface_error outcome of the failover chain. Once the chain surfaces an error, no code path moves the session to a terminal state. The [diagnostic] stuck session warnings detect the condition but don't drive recovery.

Workaround

Back up sessions.json, set status: "running" → "done" and abortedLastRun: false → true on the affected session key, restart the gateway. Session resumes accepting messages immediately.

Relation to existing issues

#73789 — provides the most common upstream cause (empty input from bare /new)
#73930 — fail-fast guard would convert the 15–31s timeout into an immediate error, which might give the session machine a cleaner exit signal, but doesn't directly fix the missing transition described here
1dbc250b1a — same as above

This issue is meant to track the lifecycle gap independent of which upstream cause produced the surface_error.

extent analysis

TL;DR

The session-state machine lacks a transition out of the "running" state when a surface_error occurs, causing the session to become wedged until manually patched and the gateway restarted.

Guidance

Review the session-state machine transitions to identify why a surface_error outcome from the failover chain does not trigger a transition to a terminal state.
Consider adding a new transition to handle the surface_error case, moving the session to a "done" state with an error marker.
In the meantime, apply the provided workaround: manually update the affected session's status to "done" and abortedLastRun to true in sessions.json, then restart the gateway.
Investigate the diagnostic stuck-session warnings to determine if they can be modified to drive recovery instead of just detecting the condition.

Example

No code snippet is provided as the issue does not include specific code references.

Notes

The root cause of the issue is not directly related to the upstream causes (e.g., empty input) but rather the missing transition in the session-state machine. The workaround provides a temporary solution but a permanent fix requires updating the session-state machine transitions.

Recommendation

Apply the workaround until a permanent fix can be implemented, as it allows the session to resume accepting messages immediately.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#optimization #mixed precision #training loop #device allocation #model download

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix Bug: session stuck in status:"running" after surface_error from failover chain [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Workaround

Code Example

Summary

Environment

Repro

Expected

Actual

Logs

Suspected cause

Workaround

Relation to existing issues

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix Bug: session stuck in status:"running" after surface_error from failover chain [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Workaround

Code Example

Summary

Environment

Repro

Expected

Actual

Logs

Suspected cause

Workaround

Relation to existing issues

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING