Once an assistant turn has timed out and the session is already stuck in `processing`, OpenClaw should either: 1. abort the active attempt cleanly and fail over once, or 2. terminate the turn with a clear error, but it should not keep reissuing near-identical assistant requests on the same stuck turn.

openclaw - ✅(Solved) Fix [Bug]: Single turn can enter pathological retry churn with same runId, near-identical prompt size, and stuck processing state [1 pull requests, 1 participants]

openclaw • 2026-04-08 19:38:11

GitHub stats

openclaw/openclaw#63335•Fetched 2026-04-09 07:55:12

Author

prudkov

Participants

prudkov

A single assistant turn on OpenClaw 2026.4.8 entered pathological retry churn: it repeatedly reissued near-identical requests to the same model/provider before falling back to other candidates. The session stayed in processing with queueDepth=0, and no user-visible output appeared until a later fallback finally succeeded.

This does not look like duplicate queued tasks or multiple fresh runs. It appears to be one stuck run repeatedly retried.

PR fix notes

PR #1681: fix(openclaw): prevent gateway retry from spawning duplicate error messages

Repository: netease-youdao/LobsterAI
Author: btc69m979y-dotcom
State: closed | merged: True
Link: https://github.com/netease-youdao/LobsterAI/pull/1681

Description (problem / solution / changelog)

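Problem

After a provider returns an error (for example MiniMax HTTP 500 / error code 2061, "plan does not support this model"), the openclaw-runtime gateway retries with exponential backoff using the same runId. Every failed retry emits a lifecycle phase=error event.

Because phase=error arrives on stream='lifecycle' (not stream='error'), the existing exclusion condition in handleAgentEvent cannot stop these events from rebuilding the ActiveTurn. As a result, every retry runs through the full error-handling flow and error messages keep being appended to the UI (the original error plus the localized "服务端出现错误", i.e. "A server error occurred"), one new pair per retry.

Confirmed from the gateway logs:

All retries reuse the same runId (openclaw/openclaw#63335)
Each phase=error → ActiveTurn rebuilt → messages appended

Fix

Introduce terminatedRunIds: Set<string>. When dispatchAgentEvent receives a lifecycle phase=error event, it immediately records the runId in that set; handleAgentEvent checks the set before rebuilding the ActiveTurn and, on a hit, drops all subsequent events.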
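
The guard described in the fix can be sketched roughly as follows. This is an illustration, not the code from the PR: the AgentEvent shape, the RuntimeAdapterSketch class, the activeTurns set, and uiMessages are hypothetical stand-ins, and the sketch assumes a turn is torn down once its first error has been handled. Only terminatedRunIds, dispatchAgentEvent, and handleAgentEvent are names taken from the description.

```typescript
// Hypothetical event shape; the real adapter's types are not shown in the PR notes.
interface AgentEvent {
  runId: string;
  stream: "lifecycle" | "assistant" | "error";
  phase?: "start" | "error" | "end";
  message?: string;
}

class RuntimeAdapterSketch {
  // runIds that have already reported a lifecycle error.
  private readonly terminatedRunIds = new Set<string>();
  // Stand-in for the ActiveTurn bookkeeping (assumption).
  private readonly activeTurns = new Set<string>();
  // Stand-in for messages appended to the UI (assumption).
  readonly uiMessages: string[] = [];

  dispatchAgentEvent(event: AgentEvent): void {
    // Record the runId as soon as a lifecycle error is seen.
    if (event.stream === "lifecycle" && event.phase === "error") {
      this.terminatedRunIds.add(event.runId);
    }
    this.handleAgentEvent(event);
  }

  private handleAgentEvent(event: AgentEvent): void {
    if (!this.activeTurns.has(event.runId)) {
      // The turn would have to be rebuilt: refuse if the run already failed.
      if (this.terminatedRunIds.has(event.runId)) return;
      this.activeTurns.add(event.runId);
    }
    if (event.phase === "error") {
      this.uiMessages.push(`error: ${event.message ?? "unknown"}`);
      this.activeTurns.delete(event.runId); // assumed teardown after an error
    } else if (event.message !== undefined) {
      this.uiMessages.push(event.message);
    }
  }
}
```

With this shape, the first error of a run is shown once, and every later event carrying the same runId is dropped, including a late success.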

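Known risks and trade-offs

The openclaw gateway always reuses the same runId when retrying, whether the run ultimately succeeds or fails. If one of the retries succeeds (for example after a genuine transient network blip), its events are silently dropped because the runId is already in terminatedRunIds, and the user sees the task end with no output (ghost output).

Why this is acceptable:

Permanent errors (plan restrictions, authentication failures) never succeed on retry, so no data is lost
The correct upstream fix is for openclaw to carry retrying: true in recoverable lifecycle errors, so that the frontend can distinguish recoverable errors from final failures (openclaw/openclaw#64051, currently open)
Once openclaw-runtime supports the retrying semantics, this workaround should be replaced with a check of that field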
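
If the upstream retrying field lands, the replacement check might look roughly like this. It is a sketch under assumptions: retrying is only proposed in openclaw/openclaw#64051 and is not a shipped API, and LifecycleErrorEvent, isFinalFailure, and recordIfFinal are hypothetical names.

```typescript
// Hypothetical shape: `retrying` is the field proposed upstream, not a shipped API.
interface LifecycleErrorEvent {
  runId: string;
  stream: "lifecycle";
  phase: "error";
  retrying?: boolean;
}

// Treat an error as final unless the gateway explicitly says it will retry.
function isFinalFailure(event: LifecycleErrorEvent): boolean {
  return event.retrying !== true;
}

// Record the runId only for final failures; returns true when it was recorded.
function recordIfFinal(terminated: Set<string>, event: LifecycleErrorEvent): boolean {
  if (!isFinalFailure(event)) return false;
  terminated.add(event.runId);
  return true;
}
```

Treating a missing field as final keeps the current behavior against runtimes that do not send retrying, while a recoverable error no longer blocks a later successful retry.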

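Testing

Sent a message using a MiniMax account without a subscribed plan and confirmed that error messages no longer keep accumulating with each retry.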

🤖 Generated with Claude Code

Changed files

src/main/libs/agentEngine/openclawRuntimeAdapter.ts (modified, +13/-1)

Environment

OpenClaw: 2026.4.8
Surface: Telegram direct chat
Session: agent:main:main
Observed date: 2026-04-08
Timezone: Europe/Berlin

What was observed

From the provider/request view, the same turn kept making repeated calls with almost the same request size. The screenshot showed constant retries on the same model with only very small prompt-size changes between attempts.

From gateway logs, the same runId persisted across the whole incident:

runId=986fae75-fe4e-45dc-91cc-1810c89abe8b

Behavior observed:

repeated requests with nearly identical payload size
repeated attempts on the same model/provider candidate before moving on
later cross-provider fallback
session remained stuck in processing
queueDepth=0
no visible response until the last fallback succeeded

Relevant logs

2026-04-08T20:56:47.937+02:00 [agent] embedded run timeout: runId=986fae75-fe4e-45dc-91cc-1810c89abe8b sessionId=1fedc66d-f400-403a-a267-3c15de01ecfc timeoutMs=120000
2026-04-08T20:57:01.279+02:00 [agent] Profile moonshot:default timed out. Trying next account...
2026-04-08T20:57:01.286+02:00 [agent] embedded run failover decision: runId=986fae75-fe4e-45dc-91cc-1810c89abe8b stage=assistant decision=fallback_model reason=timeout provider=moonshot/kimi-k2.5 profile=sha256:a3568bc3c72e
2026-04-08T20:57:01.317+02:00 [model-fallback] model fallback decision: decision=candidate_failed requested=moonshot/kimi-k2.5 candidate=moonshot/kimi-k2.5 reason=timeout next=openrouter/qwen/qwen3.6-plus detail=LLM request timed out.
2026-04-08T20:59:11.537+02:00 [agent] embedded run timeout: runId=986fae75-fe4e-45dc-91cc-1810c89abe8b sessionId=1fedc66d-f400-403a-a267-3c15de01ecfc timeoutMs=120000
2026-04-08T20:59:12.490+02:00 [diagnostic] stuck session: sessionId=main sessionKey=agent:main:main state=processing age=121s queueDepth=0
2026-04-08T20:59:37.028+02:00 [agent] Profile openrouter:default timed out. Trying next account...
2026-04-08T20:59:37.032+02:00 [agent] embedded run failover decision: runId=986fae75-fe4e-45dc-91cc-1810c89abe8b stage=assistant decision=fallback_model reason=timeout provider=openrouter/qwen/qwen3.6-plus profile=sha256:ac092b59b472
2026-04-08T20:59:37.154+02:00 [model-fallback] model fallback decision: decision=candidate_failed requested=moonshot/kimi-k2.5 candidate=openrouter/qwen/qwen3.6-plus reason=timeout next=openrouter/qwen/qwen3.5-397b-a17b detail=LLM request timed out.
2026-04-08T20:59:59.238+02:00 [model-fallback] model fallback decision: decision=candidate_succeeded requested=moonshot/kimi-k2.5 candidate=openrouter/qwen/qwen3.5-397b-a17b reason=unknown next=none
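
To check whether several log lines belong to one run, the key=value fields in lines like the ones above can be tallied per runId. The sketch below is only a reading aid for this log format as it appears here; parseFields, summarizeRuns, and RunSummary are hypothetical names, and values containing spaces (such as detail=LLM request timed out.) are cut at the first space.

```typescript
// Extract key=value pairs from one gateway log line.
function parseFields(line: string): Map<string, string> {
  const fields = new Map<string, string>();
  const pattern = /(\w+)=(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    fields.set(match[1], match[2]);
  }
  return fields;
}

interface RunSummary {
  timeouts: number;
  failovers: number;
}

// Count timeout and failover-decision lines per runId.
function summarizeRuns(lines: string[]): Map<string, RunSummary> {
  const summary = new Map<string, RunSummary>();
  for (const line of lines) {
    const runId = parseFields(line).get("runId");
    if (runId === undefined) continue; // e.g. [model-fallback] and [diagnostic] lines
    const entry = summary.get(runId) ?? { timeouts: 0, failovers: 0 };
    if (line.includes("embedded run timeout")) entry.timeouts += 1;
    if (line.includes("embedded run failover decision")) entry.failovers += 1;
    summary.set(runId, entry);
  }
  return summary;
}
```

Applied to the excerpt above, this groups both timeouts and both failover decisions under the single runId 986fae75-fe4e-45dc-91cc-1810c89abe8b.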

Why this looks buggy

The concerning part is not ordinary model failover by itself. It is the combination of:

one persistent runId
same turn apparently being reissued multiple times
repeated near-identical request sizes on the same model/provider candidate
session marked stuck while queueDepth=0
eventual success only after extended retry churn

This burns credits and latency while making the system look like it is looping from the provider side, even though it is still one logical turn.

What had already been ruled out

This was not caused by stale TaskFlow rows continuing to run. Related task rows had already been manually cleaned from SQLite before analysis.
This did not appear to be multiple fresh user turns.
This did not appear to be a queue backlog (queueDepth=0).

Expected behavior

Once an assistant turn has timed out and the session is already stuck in processing, OpenClaw should either:

abort the active attempt cleanly and fail over once, or
terminate the turn with a clear error,

but it should not keep reissuing near-identical assistant requests on the same stuck turn.
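
One way to picture the expected behavior is a decision function that runs after a timeout and either aborts and fails over to an untried candidate or terminates the turn. This is a sketch under assumptions, not OpenClaw's implementation: Decision, decideAfterTimeout, and maxFailovers are hypothetical names, and candidates are plain strings such as moonshot/kimi-k2.5.

```typescript
// All names here are hypothetical; they do not come from OpenClaw's code.
type Decision =
  | { action: "abort_and_failover"; next: string }
  | { action: "terminate"; error: string };

// candidates: ordered fallback list; attempted: candidates already tried in this run.
function decideAfterTimeout(
  candidates: string[],
  attempted: string[],
  maxFailovers: number = 1,
): Decision {
  // The first attempt is not a failover, so it does not count against the budget.
  const failoversUsed = Math.max(0, attempted.length - 1);
  if (failoversUsed < maxFailovers) {
    for (const candidate of candidates) {
      // Never reissue the turn to a candidate that already timed out.
      if (attempted.indexOf(candidate) === -1) {
        return { action: "abort_and_failover", next: candidate };
      }
    }
  }
  return {
    action: "terminate",
    error: `turn timed out after trying: ${attempted.join(", ")}`,
  };
}
```

With the default budget of one failover, a run that has timed out on two candidates ends with a clear error instead of moving on to a third attempt.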

Suspected area

Possibly an interaction between:

embedded runner timeout handling
provider/account retry logic
model fallback logic
stuck-session handling for assistant stage
context/bootstrap/compaction re-entry during failover

Nice-to-have diagnostics

If useful, exposing per-run retry counters and clearly distinguishing:

same-candidate retry
same-provider different-profile retry
cross-provider fallback

would make this much easier to diagnose.
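
The per-run counters suggested above could be shaped roughly like this. It is a hypothetical sketch: AttemptInfo, RetryKind, classifyRetry, and RunRetryCounters are illustrative names, and a candidate is identified here by provider and profile only, which is an assumption.

```typescript
// Hypothetical types; a candidate is identified by provider + profile in this sketch.
interface AttemptInfo {
  provider: string;
  profile: string;
}

type RetryKind = "same_candidate" | "same_provider_other_profile" | "cross_provider";

// Classify the step from one attempt to the next within a run.
function classifyRetry(prev: AttemptInfo, next: AttemptInfo): RetryKind {
  if (prev.provider !== next.provider) return "cross_provider";
  if (prev.profile !== next.profile) return "same_provider_other_profile";
  return "same_candidate";
}

class RunRetryCounters {
  private readonly counts = new Map<string, Record<RetryKind, number>>();
  private readonly lastAttempt = new Map<string, AttemptInfo>();

  recordAttempt(runId: string, attempt: AttemptInfo): void {
    const prev = this.lastAttempt.get(runId);
    this.lastAttempt.set(runId, attempt);
    if (prev === undefined) return; // the first attempt of a run is not a retry
    const entry = this.get(runId);
    entry[classifyRetry(prev, attempt)] += 1;
    this.counts.set(runId, entry);
  }

  // Returns zeroed counters for runs that have not retried yet.
  get(runId: string): Record<RetryKind, number> {
    return (
      this.counts.get(runId) ?? {
        same_candidate: 0,
        same_provider_other_profile: 0,
        cross_provider: 0,
      }
    );
  }
}
```

Logging these three numbers next to the runId would show at a glance whether a run is looping on one candidate or moving through the fallback chain.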

extent analysis

TL;DR

The most likely fix involves modifying the embedded runner timeout handling and provider/account retry logic to prevent repeated reissuing of near-identical assistant requests on the same stuck turn.

Guidance

Review the embedded runner timeout handling to ensure it properly aborts the active attempt and fails over once when a timeout occurs.
Investigate the provider/account retry logic to prevent repeated retries on the same model/provider candidate.
Examine the model fallback logic to ensure it correctly handles stuck sessions and terminates the turn with a clear error when necessary.
Consider adding diagnostics, such as per-run retry counters, to better understand the retry behavior and distinguish between same-candidate retry, same-provider different-profile retry, and cross-provider fallback.

Example

No specific code snippet can be provided without more information about the implementation, but a potential solution might involve adding a retry counter and a check to prevent repeated retries on the same model/provider candidate.

Notes

The exact solution will depend on the specific implementation details of the embedded runner, provider/account retry logic, and model fallback logic. Additional logging and diagnostics may be necessary to fully understand the issue and develop an effective fix.

Recommendation

Apply a workaround to modify the embedded runner timeout handling and provider/account retry logic to prevent repeated reissuing of near-identical assistant requests on the same stuck turn, as this is likely to be a more immediate and effective solution than attempting to upgrade to a fixed version.

#logging issue #authentication issue #prompt issue #agent setup #task chaining
