openclaw - ✅(Solved) Fix [Bug]: Single turn can enter pathological retry churn with same runId, near-identical prompt size, and stuck processing state [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#63335 · Fetched 2026-04-09 07:55:12
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

PR fix notes

PR #1681: fix(openclaw): prevent gateway retry from spawning duplicate error messages

Description (problem / solution / changelog)

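Problem

When a provider returns an error (for example MiniMax HTTP 500 / error code 2061, "plan does not support this model"), the openclaw-runtime gateway retries with exponential backoff using the same runId. Each failed retry emits a lifecycle phase=error event.

Because phase=error travels on stream='lifecycle' (not stream='error'), the existing exclusion condition in handleAgentEvent cannot stop these events from rebuilding the ActiveTurn. As a result, every retry runs the full error-handling flow and the UI keeps appending error messages (the original error plus the localized "a server-side error occurred" message), one new pair per retry.

Confirmed from gateway logs:

  • All retries reuse the same runId (openclaw/openclaw#63335)
  • Each phase=error → ActiveTurn rebuilt → message appended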

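Fix

Introduce terminatedRunIds: Set<string>. dispatchAgentEvent records the runId in this set as soon as it receives a lifecycle phase=error event; handleAgentEvent checks the set before rebuilding the ActiveTurn and, on a hit, drops all subsequent events.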
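
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the AgentEvent shape, the RunEventGate class, and the delivered array are hypothetical names introduced here, and the sketch assumes the first phase=error event for a run is still delivered once before the run is treated as terminated.

```typescript
// Hypothetical event shape; the real openclaw-runtime event type is not shown in the PR notes.
interface AgentEvent {
  runId: string;
  stream: "lifecycle" | "error" | "assistant";
  phase?: "start" | "error" | "end";
  text?: string;
}

// Hypothetical wrapper that mirrors the two function names from the PR description.
class RunEventGate {
  private readonly terminatedRunIds = new Set<string>();
  // Events that made it past the guard (stands in for "rebuild ActiveTurn and update the UI").
  readonly delivered: AgentEvent[] = [];

  dispatchAgentEvent(event: AgentEvent): void {
    // Check before recording, so the first error for a run is still shown once (assumption).
    const alreadyTerminated = this.terminatedRunIds.has(event.runId);
    if (event.stream === "lifecycle" && event.phase === "error") {
      this.terminatedRunIds.add(event.runId);
    }
    if (alreadyTerminated) {
      return; // Drop every later event that reuses a terminated runId.
    }
    this.handleAgentEvent(event);
  }

  private handleAgentEvent(event: AgentEvent): void {
    this.delivered.push(event);
  }
}
```

With this gate, a second phase=error event for the same runId is ignored, and so is any later successful output on that runId, which is the ghost-output trade-off the PR acknowledges.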

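Known risks and trade-offs

The openclaw gateway always reuses the same runId when retrying, whether the run ultimately succeeds or fails. If a retry succeeds (for example after a genuine transient network blip), the runId is already in terminatedRunIds, so the successful events are silently dropped and the user sees the task end with no output (ghost output).

Why this is acceptable:

  • Permanent errors (plan limits, authentication failures) never succeed on retry, so no data is lost
  • The correct upstream fix is for openclaw to carry retrying: true on recoverable lifecycle errors, so that the frontend can distinguish recoverable errors from final failures (openclaw/openclaw#64051, currently open)
  • Once openclaw-runtime supports the retrying semantics, this workaround should be replaced with a check on that field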
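
If the upstream retrying field lands, the replacement check might look something like the sketch below. The field is only a proposal in openclaw/openclaw#64051, so the LifecycleErrorEvent shape, the ErrorDisposition type, and classifyLifecycleError are all assumptions made for illustration.

```typescript
// Hypothetical shape: `retrying` is proposed upstream and was not available when the PR was written.
interface LifecycleErrorEvent {
  runId: string;
  stream: "lifecycle";
  phase: "error";
  retrying?: boolean;
}

type ErrorDisposition = "keep-turn-open" | "finalize-turn";

// Treat an error as recoverable only when the gateway says so explicitly.
// A missing field is handled as a final failure, which matches the current workaround.
function classifyLifecycleError(event: LifecycleErrorEvent): ErrorDisposition {
  return event.retrying === true ? "keep-turn-open" : "finalize-turn";
}
```

Defaulting to "finalize-turn" when the field is absent keeps behavior unchanged against older runtimes that do not send it.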

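Related openclaw issues

  • openclaw/openclaw#64051: lifecycle contract bug in which recoverable errors and final failures are conflated, causing ghost output and the UI ending prematurely (authoritative description)
  • openclaw/openclaw#63335: all retries reuse the same runId (confirmed behavior)
  • openclaw/openclaw#62141: gateway retries a 503 indefinitely instead of falling back (same class of misclassified-retry problem)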

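Testing

Sent a message using a MiniMax account without a subscribed plan and confirmed that error messages no longer keep accumulating across retries.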

🤖 Generated with Claude Code

Changed files

  • src/main/libs/agentEngine/openclawRuntimeAdapter.ts (modified, +13/-1)

Summary

A single assistant turn on OpenClaw 2026.4.8 entered pathological retry churn, repeatedly reissuing near-identical requests to the same model/provider before falling back to other candidates. The session stayed in processing, queueDepth=0, and no user-visible output appeared until a later fallback finally succeeded.

This does not look like duplicate queued tasks or multiple fresh runs. It appears to be one stuck run repeatedly retried.

Environment

  • OpenClaw: 2026.4.8
  • Surface: Telegram direct chat
  • Session: agent:main:main
  • Observed date: 2026-04-08
  • Timezone: Europe/Berlin

What was observed

From the provider/request view, the same turn kept making repeated calls with almost the same request size. The screenshot showed constant retries on the same model with only very small prompt-size changes between attempts.

From gateway logs, the same runId persisted across the whole incident:

  • runId=986fae75-fe4e-45dc-91cc-1810c89abe8b

Behavior observed:

  • repeated requests with nearly identical payload size
  • repeated attempts on the same model/provider candidate before moving on
  • later cross-provider fallback
  • session remained stuck in processing
  • queueDepth=0
  • no visible response until the last fallback succeeded

Relevant logs

2026-04-08T20:56:47.937+02:00 [agent] embedded run timeout: runId=986fae75-fe4e-45dc-91cc-1810c89abe8b sessionId=1fedc66d-f400-403a-a267-3c15de01ecfc timeoutMs=120000
2026-04-08T20:57:01.279+02:00 [agent] Profile moonshot:default timed out. Trying next account...
2026-04-08T20:57:01.286+02:00 [agent] embedded run failover decision: runId=986fae75-fe4e-45dc-91cc-1810c89abe8b stage=assistant decision=fallback_model reason=timeout provider=moonshot/kimi-k2.5 profile=sha256:a3568bc3c72e
2026-04-08T20:57:01.317+02:00 [model-fallback] model fallback decision: decision=candidate_failed requested=moonshot/kimi-k2.5 candidate=moonshot/kimi-k2.5 reason=timeout next=openrouter/qwen/qwen3.6-plus detail=LLM request timed out.
2026-04-08T20:59:11.537+02:00 [agent] embedded run timeout: runId=986fae75-fe4e-45dc-91cc-1810c89abe8b sessionId=1fedc66d-f400-403a-a267-3c15de01ecfc timeoutMs=120000
2026-04-08T20:59:12.490+02:00 [diagnostic] stuck session: sessionId=main sessionKey=agent:main:main state=processing age=121s queueDepth=0
2026-04-08T20:59:37.028+02:00 [agent] Profile openrouter:default timed out. Trying next account...
2026-04-08T20:59:37.032+02:00 [agent] embedded run failover decision: runId=986fae75-fe4e-45dc-91cc-1810c89abe8b stage=assistant decision=fallback_model reason=timeout provider=openrouter/qwen/qwen3.6-plus profile=sha256:ac092b59b472
2026-04-08T20:59:37.154+02:00 [model-fallback] model fallback decision: decision=candidate_failed requested=moonshot/kimi-k2.5 candidate=openrouter/qwen/qwen3.6-plus reason=timeout next=openrouter/qwen/qwen3.5-397b-a17b detail=LLM request timed out.
2026-04-08T20:59:59.238+02:00 [model-fallback] model fallback decision: decision=candidate_succeeded requested=moonshot/kimi-k2.5 candidate=openrouter/qwen/qwen3.5-397b-a17b reason=unknown next=none
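
To tally what happened in an excerpt like the one above, a small key=value parser is enough. The sketch below is an illustration only: parseFields, summarizeLog, and LogSummary are hypothetical helpers, not part of OpenClaw, and the parser assumes fields are whitespace-separated, so a free-text value such as detail=LLM request timed out. is cut off at the first space.

```typescript
// Extract key=value pairs from one gateway log line (values end at the first whitespace).
function parseFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pattern = /([A-Za-z]+)=(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    fields[match[1]] = match[2];
  }
  return fields;
}

interface LogSummary {
  timeoutsByRunId: Record<string, number>;
  failedCandidates: string[];
  succeededCandidate?: string;
}

// Count embedded-run timeouts per runId and list the fallback candidates in order.
function summarizeLog(lines: string[]): LogSummary {
  const summary: LogSummary = { timeoutsByRunId: {}, failedCandidates: [] };
  for (const line of lines) {
    const fields = parseFields(line);
    if (line.includes("embedded run timeout:") && fields.runId) {
      summary.timeoutsByRunId[fields.runId] = (summary.timeoutsByRunId[fields.runId] ?? 0) + 1;
    } else if (fields.decision === "candidate_failed" && fields.candidate) {
      summary.failedCandidates.push(fields.candidate);
    } else if (fields.decision === "candidate_succeeded" && fields.candidate) {
      summary.succeededCandidate = fields.candidate;
    }
  }
  return summary;
}
```

Note that the model-fallback lines in this excerpt carry no runId, so candidates can only be attributed to a run by their position in the log.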

Why this looks buggy

The concerning part is not ordinary model failover by itself. It is the combination of:

  • one persistent runId
  • same turn apparently being reissued multiple times
  • repeated near-identical request sizes on the same model/provider candidate
  • session marked stuck while queueDepth=0
  • eventual success only after extended retry churn

This burns credits and latency while making the system look like it is looping from the provider side, even though it is still one logical turn.

What had already been ruled out

  • This was not caused by stale TaskFlow rows continuing to run. Related task rows had already been manually cleaned from SQLite before analysis.
  • This did not appear to be multiple fresh user turns.
  • This did not appear to be a queue backlog (queueDepth=0).

Expected behavior

Once an assistant turn has timed out and the session is already stuck in processing, OpenClaw should either:

  1. abort the active attempt cleanly and fail over once, or
  2. terminate the turn with a clear error,

but it should not keep reissuing near-identical assistant requests on the same stuck turn.
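
One way to picture the expected behavior is a plan that hands out each candidate at most once and then terminates the turn. This is a sketch under assumptions, not OpenClaw's fallback logic: TurnFailoverPlan and TurnAction are hypothetical names, and the caller is assumed to abort the in-flight request (for example with an AbortController) before asking for the next action.

```typescript
type TurnAction =
  | { kind: "attempt"; candidate: string }
  | { kind: "terminate"; error: string };

// Hypothetical planner: every candidate is attempted at most once per turn.
class TurnFailoverPlan {
  private readonly tried = new Set<string>();

  constructor(private readonly candidates: readonly string[]) {}

  // Call once at the start of the turn and again after each timed-out attempt.
  next(): TurnAction {
    const candidate = this.candidates.find((c) => !this.tried.has(c));
    if (candidate === undefined) {
      // Nothing left to try: end the turn with a clear error instead of looping.
      return {
        kind: "terminate",
        error: `turn failed: all ${this.tried.size} candidate(s) timed out`,
      };
    }
    this.tried.add(candidate);
    return { kind: "attempt", candidate };
  }
}
```

Because the tried set is keyed by candidate, a duplicate entry in the candidate list cannot cause the same model/provider to be reissued within one turn.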

Suspected area

Possibly an interaction between:

  • embedded runner timeout handling
  • provider/account retry logic
  • model fallback logic
  • stuck-session handling for assistant stage
  • context/bootstrap/compaction re-entry during failover

Nice-to-have diagnostics

If useful, exposing per-run retry counters and clearly distinguishing:

  • same-candidate retry
  • same-provider different-profile retry
  • cross-provider fallback

would make this much easier to diagnose.
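
The distinction requested above could be captured with a small classifier plus per-run counters. Everything in this sketch is hypothetical (AttemptTarget, classifyRetry, RetryCounters), and it adds a fourth bucket, "same-provider-different-model", as an assumption because the logs show a move from openrouter/qwen/qwen3.6-plus to openrouter/qwen/qwen3.5-397b-a17b.

```typescript
type RetryKind =
  | "same-candidate"
  | "same-provider-different-profile"
  | "same-provider-different-model" // assumption: not listed in the issue, but seen in the logs
  | "cross-provider";

interface AttemptTarget {
  provider: string;
  model: string;
  profile: string;
}

// Compare two consecutive attempts of one run and name the transition between them.
function classifyRetry(prev: AttemptTarget, next: AttemptTarget): RetryKind {
  if (prev.provider !== next.provider) return "cross-provider";
  if (prev.model !== next.model) return "same-provider-different-model";
  if (prev.profile !== next.profile) return "same-provider-different-profile";
  return "same-candidate";
}

function zeroCounts(): Record<RetryKind, number> {
  return {
    "same-candidate": 0,
    "same-provider-different-profile": 0,
    "same-provider-different-model": 0,
    "cross-provider": 0,
  };
}

// Per-run counters keyed by runId, suitable for emitting alongside existing log lines.
class RetryCounters {
  private readonly byRun = new Map<string, Record<RetryKind, number>>();

  record(runId: string, prev: AttemptTarget, next: AttemptTarget): RetryKind {
    const kind = classifyRetry(prev, next);
    let counts = this.byRun.get(runId);
    if (!counts) {
      counts = zeroCounts();
      this.byRun.set(runId, counts);
    }
    counts[kind] += 1;
    return kind;
  }

  snapshot(runId: string): Record<RetryKind, number> {
    return { ...(this.byRun.get(runId) ?? zeroCounts()) };
  }
}
```

A rising "same-candidate" count on a single runId would be the direct signal for the churn described in this issue.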

extent analysis

TL;DR

The most likely fix involves modifying the embedded runner timeout handling and provider/account retry logic to prevent repeated reissuing of near-identical assistant requests on the same stuck turn.

Guidance

  • Review the embedded runner timeout handling to ensure it properly aborts the active attempt and fails over once when a timeout occurs.
  • Investigate the provider/account retry logic to prevent repeated retries on the same model/provider candidate.
  • Examine the model fallback logic to ensure it correctly handles stuck sessions and terminates the turn with a clear error when necessary.
  • Consider adding diagnostics, such as per-run retry counters, to better understand the retry behavior and distinguish between same-candidate retry, same-provider different-profile retry, and cross-provider fallback.

Example

No specific code snippet can be provided without more information about the implementation, but a potential solution might involve adding a retry counter and a check to prevent repeated retries on the same model/provider candidate.

Notes

The exact solution will depend on the specific implementation details of the embedded runner, provider/account retry logic, and model fallback logic. Additional logging and diagnostics may be necessary to fully understand the issue and develop an effective fix.

Recommendation

Apply a workaround to modify the embedded runner timeout handling and provider/account retry logic to prevent repeated reissuing of near-identical assistant requests on the same stuck turn, as this is likely to be a more immediate and effective solution than attempting to upgrade to a fixed version.
