openclaw - 💡(How to fix) Fix Bug: session stuck in status:"running" after surface_error from failover chain [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#74607Fetched 2026-04-30 06:22:23
View on GitHub
Comments
1
Participants
2
Timeline
3
Reactions
0
Author
Participants
Timeline (top)
closed ×1commented ×1cross-referenced ×1

When the failover chain surfaces surface_error reason=timeout (commonly downstream of #73789, but in principle any cause), the session-state machine does not transition out of status: "running". Subsequent user messages on the session don't process — the gateway treats them as queued behind a run that will never complete. The session is wedged until manually patched in sessions.json and the gateway restarted.

This is downstream of #73789 / 1dbc250b1a / #73930. Those address the cause (empty input). This is about the lifecycle gap that lets the wedge persist independent of which upstream cause produced the surface_error.

Error Message

2026-04-29T13:02:07.998-05:00 [agent/embedded] embedded run failover decision: runId=f8ec3197-... stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4-mini profile=sha256:c21d1bebb3b0

After this entry: no further agent/embedded activity for the session, no transition to a done state, no error message delivered to the channel. Diagnostic stuck-session warnings keep firing every 30s but nothing recovers:

2026-04-29T14:03:32.509-05:00 [diagnostic] stuck session: sessionId=main sessionKey=agent:main:telegram:direct:... state=processing age=143s queueDepth=0
2026-04-29T14:04:02.513-05:00 [diagnostic] stuck session: ... age=173s
2026-04-29T14:04:32.512-05:00 [diagnostic] stuck session: ... age=203s

Root Cause

When the failover chain surfaces surface_error reason=timeout (commonly downstream of #73789, but in principle any cause), the session-state machine does not transition out of status: "running". Subsequent user messages on the session don't process — the gateway treats them as queued behind a run that will never complete. The session is wedged until manually patched in sessions.json and the gateway restarted.

This is downstream of #73789 / 1dbc250b1a / #73930. Those address the cause (empty input). This is about the lifecycle gap that lets the wedge persist independent of which upstream cause produced the surface_error.

Fix Action

Workaround

Back up sessions.json, set status: "running" → "done" and abortedLastRun: false → true on the affected session key, restart the gateway. Session resumes accepting messages immediately.

Code Example

status: "running"
abortedLastRun: false
updatedAt: <hours ago, frozen at the failover timestamp>

---

2026-04-29T13:02:07.998-05:00 [agent/embedded] embedded run failover decision: runId=f8ec3197-... stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4-mini profile=sha256:c21d1bebb3b0

---

2026-04-29T14:03:32.509-05:00 [diagnostic] stuck session: sessionId=main sessionKey=agent:main:telegram:direct:... state=processing age=143s queueDepth=0
2026-04-29T14:04:02.513-05:00 [diagnostic] stuck session: ... age=173s
2026-04-29T14:04:32.512-05:00 [diagnostic] stuck session: ... age=203s
RAW_BUFFERClick to expand / collapse

Summary

When the failover chain surfaces surface_error reason=timeout (commonly downstream of #73789, but in principle any cause), the session-state machine does not transition out of status: "running". Subsequent user messages on the session don't process — the gateway treats them as queued behind a run that will never complete. The session is wedged until manually patched in sessions.json and the gateway restarted.

This is downstream of #73789 / 1dbc250b1a / #73930. Those address the cause (empty input). This is about the lifecycle gap that lets the wedge persist independent of which upstream cause produced the surface_error.

Environment

  • OpenClaw: 2026.4.26
  • Provider/model: primary openai-codex/gpt-5.4, fallbacks anthropic/claude-sonnet-4-6, openai-codex/gpt-5.4-mini
  • Surface: Telegram direct session (also reproduces for cron-driven sessions and agent CLI)
  • Runtime: gateway via Node 25.9.0 on macOS 26.4.1

Repro

  1. From a Telegram session bound to openai-codex/gpt-5.4, send a message that triggers the empty-input bug from #73789 (or any other cause that walks the failover chain to surface_error).
  2. Observe the failover chain run and surface surface_error reason=timeout.
  3. Send another message on the same session.

Expected

After surface_error, the session transitions to a terminal state (e.g. done with an error marker), an error is delivered to the channel, and the next message is accepted normally.

Actual

The session remains in status: "running" in sessions.json. New messages don't process. From the user side this looks like the bot typing then never responding. Gateway restarts don't recover it — the status field is persisted to disk and rehydrated on startup.

Concrete state in ~/.openclaw/agents/main/sessions/sessions.json after the wedge:

status: "running"
abortedLastRun: false
updatedAt: <hours ago, frozen at the failover timestamp>

Logs

2026-04-29T13:02:07.998-05:00 [agent/embedded] embedded run failover decision: runId=f8ec3197-... stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4-mini profile=sha256:c21d1bebb3b0

After this entry: no further agent/embedded activity for the session, no transition to a done state, no error message delivered to the channel. Diagnostic stuck-session warnings keep firing every 30s but nothing recovers:

2026-04-29T14:03:32.509-05:00 [diagnostic] stuck session: sessionId=main sessionKey=agent:main:telegram:direct:... state=processing age=143s queueDepth=0
2026-04-29T14:04:02.513-05:00 [diagnostic] stuck session: ... age=173s
2026-04-29T14:04:32.512-05:00 [diagnostic] stuck session: ... age=203s

Suspected cause

The session-state machine has transitions for normal completion and explicit abort, but appears to lack a transition out of running for the surface_error outcome of the failover chain. Once the chain surfaces an error, no code path moves the session to a terminal state. The [diagnostic] stuck session warnings detect the condition but don't drive recovery.

Workaround

Back up sessions.json, set status: "running" → "done" and abortedLastRun: false → true on the affected session key, restart the gateway. Session resumes accepting messages immediately.

Relation to existing issues

  • #73789 — provides the most common upstream cause (empty input from bare /new)
  • #73930 — fail-fast guard would convert the 15–31s timeout into an immediate error, which might give the session machine a cleaner exit signal, but doesn't directly fix the missing transition described here
  • 1dbc250b1a — same as above

This issue is meant to track the lifecycle gap independent of which upstream cause produced the surface_error.

extent analysis

TL;DR

The session-state machine lacks a transition out of the "running" state when a surface_error occurs, causing the session to become wedged until manually patched and the gateway restarted.

Guidance

  • Review the session-state machine transitions to identify why a surface_error outcome from the failover chain does not trigger a transition to a terminal state.
  • Consider adding a new transition to handle the surface_error case, moving the session to a "done" state with an error marker.
  • In the meantime, apply the provided workaround: manually update the affected session's status to "done" and abortedLastRun to true in sessions.json, then restart the gateway.
  • Investigate the diagnostic stuck-session warnings to determine if they can be modified to drive recovery instead of just detecting the condition.

Example

No code snippet is provided as the issue does not include specific code references.

Notes

The root cause of the issue is not directly related to the upstream causes (e.g., empty input) but rather the missing transition in the session-state machine. The workaround provides a temporary solution but a permanent fix requires updating the session-state machine transitions.

Recommendation

Apply the workaround until a permanent fix can be implemented, as it allows the session to resume accepting messages immediately.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix Bug: session stuck in status:"running" after surface_error from failover chain [1 comments, 2 participants]