openclaw - 💡(How to fix) Fix [Proposal] Self-Healing Execution: Solving the tool-call infinite-loop problem [1 participants]

GitHub stats
openclaw/openclaw#60767 · Fetched 2026-04-08 02:47:27
Comments: 0
Participants: 1
Timeline: 0
Reactions: 2


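Summary

Upgrade OpenClaw from "blindly retrying after a tool failure" to a closed-loop execution mechanism: classify first, repair, then verify, then continue.

Core goal: change OpenClaw's error handling from "unlimited retries" to "bounded self-healing".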


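1. Why errors keep happening now

The current problem is not a single point of failure but three layers of problems stacked together:

  1. Tool layer: input constraints are strict, so the model easily produces validation errors
  2. Gateway layer: non-retryable errors are also handled as if they were retryable
  3. Session layer: compaction is only meant to compress the conversation, but it also wipes out the runtime memory that "this turn has already failed"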

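2. Design principles for self-healing

  1. Do not retry the original task directly; retry the "repair action" first
  2. Deterministic errors such as validation error, schema mismatch, and missing field must never be retried without limit
  3. Runtime state must be stored independently of the chat context
  4. After a repair action runs, verify it first, then decide whether to continue the original task
  5. Within the same turn, deduplicate identical tool calls after a failure, and deduplicate identical messages as well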

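3. Error classification

| Error type | Typical examples | Handling |
| --- | --- | --- |
| Auto-repairable | validation error, provider not found, gateway not started | Enter the self-heal flow |
| Retryable with limits | network timeout, 429, temporary file lock | Allow a limited number of retries; trip the circuit breaker above the threshold |
| Not auto-repairable | insufficient permissions, invalid key, destructive action | Stop immediately and return a structured error summary |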
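
The three classes above could be sketched as a small lookup-based classifier. This is only an illustration, not OpenClaw's API: the `ErrorClass` enum, the error-code strings, and `classify_error` are hypothetical names, and a real system would map its own error codes.

```python
from enum import Enum


class ErrorClass(Enum):
    SELF_HEALABLE = "self_healable"   # enter the self-heal flow
    RETRYABLE = "retryable"           # bounded retries, then circuit-break
    NON_RETRYABLE = "non_retryable"   # stop and report a structured summary


# Hypothetical error codes chosen for this example.
SELF_HEALABLE_CODES = {"validation_error", "provider_not_found", "gateway_not_started"}
RETRYABLE_CODES = {"network_timeout", "http_429", "temporary_file_lock"}


def classify_error(error_code: str) -> ErrorClass:
    """Map an error code to one of the three classes."""
    if error_code in SELF_HEALABLE_CODES:
        return ErrorClass.SELF_HEALABLE
    if error_code in RETRYABLE_CODES:
        return ErrorClass.RETRYABLE
    # Anything else (permissions, invalid key, destructive action, unknown
    # codes) is treated as non-retryable so that it can never loop.
    return ErrorClass.NON_RETRYABLE
```

Defaulting unknown codes to non-retryable is a design choice of this sketch: it fails closed, so a new error type cannot reintroduce an infinite retry loop.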

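4. Self-healing architecture: 4 core modules

  1. Error Classifier: decides retryable / self-healable / non-retryable
  2. Repair Planner: generates repair actions
  3. State Guard: maintains runtime state such as turn_id, the failed tool, and argument hashes
  4. Verifier: the original task may resume only after the repair passes verification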

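5. Recommended runtime state fields

| Field | Purpose |
| --- | --- |
| turn_id | Unique turn identifier |
| message_hash | Prevents duplicate sends |
| last_tool_name / last_tool_args_hash | Identifies the failed tool call |
| last_tool_error / error_code | Makes the error type explicit |
| retry_count / repair_attempt_count | Limits the number of retries |
| non_retryable_failure | Non-retryable flag |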
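
One way to hold these fields is a plain dataclass kept outside the chat context. The field names come from the table; the `TurnState` class, the `args_hash` helper, and `record_failure` are hypothetical, and the sketch assumes tool arguments are JSON-serializable.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Optional


def args_hash(tool_args: dict[str, Any]) -> str:
    """Stable hash of tool arguments; key order does not affect the result."""
    canonical = json.dumps(tool_args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class TurnState:
    turn_id: str
    message_hash: Optional[str] = None
    last_tool_name: Optional[str] = None
    last_tool_args_hash: Optional[str] = None
    last_tool_error: Optional[str] = None
    error_code: Optional[str] = None
    retry_count: int = 0
    repair_attempt_count: int = 0
    non_retryable_failure: bool = False

    def record_failure(self, tool_name: str, tool_args: dict[str, Any],
                       error: str, error_code: str) -> None:
        """Remember which call failed so it is not repeated blindly."""
        self.last_tool_name = tool_name
        self.last_tool_args_hash = args_hash(tool_args)
        self.last_tool_error = error
        self.error_code = error_code
```

Hashing a canonical JSON form means `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` count as the same failed call.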

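6. Core execution flow

Receive request → execute action → error → Error Classifier → judged self-repairable → generate repair plan → execute repair → verify → after verification passes, retry the original task at most once; if the same class of error recurs → trip the circuit breaker immediately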
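
The flow above can be sketched as a single function. Everything here is an assumption made for illustration: `ToolError`, `run_with_self_heal`, and the `plan_repair` / `verify` callables are hypothetical stand-ins for the Repair Planner and Verifier, not existing OpenClaw interfaces.

```python
from typing import Callable, Optional


class ToolError(Exception):
    """Hypothetical tool failure carrying an error class such as 'validation_error'."""

    def __init__(self, error_class: str, message: str = ""):
        super().__init__(message)
        self.error_class = error_class


def run_with_self_heal(
    action: Callable[[], str],
    plan_repair: Callable[[ToolError], Optional[Callable[[], None]]],
    verify: Callable[[], bool],
) -> dict:
    """Run action; on failure repair, verify, then retry the action at most once."""
    try:
        return {"status": "ok", "result": action()}
    except ToolError as first:
        repair = plan_repair(first)  # None means "not self-healable"
        if repair is None:
            return {"status": "failed", "error_class": first.error_class,
                    "reason": "no repair plan"}
        repair()                     # execute only the repair first
        if not verify():             # verification gates the retry
            return {"status": "failed", "error_class": first.error_class,
                    "reason": "repair not verified"}
        try:
            return {"status": "ok", "result": action()}  # the single allowed retry
        except ToolError as second:
            same = second.error_class == first.error_class
            return {"status": "circuit_open", "error_class": second.error_class,
                    "reason": "same error class recurred" if same else "retry failed"}
```

The function never loops: there is exactly one repair and one retry per call, so the worst case is two executions of `action` followed by a structured failure summary.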


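7. The 6 self-repair actions most worth implementing first

  1. Tool argument self-repair: after a validation error, re-read the schema and keep only the required fields
  2. Provider configuration self-repair: check whether the primary model is available and switch automatically if it is not
  3. Gateway process self-repair: graceful restart when unresponsive
  4. Session self-repair: dump the runtime state and rebuild a lightweight session
  5. Duplicate-send interception: deduplicate by message hash
  6. Compaction safety guard: never compact runtime state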
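
As an illustration of the first action (tool argument self-repair), the sketch below trims arguments to the schema's required fields. It assumes the tool schema is a JSON-Schema-style dict with a top-level `required` list; `keep_required_fields` is a hypothetical helper name.

```python
from typing import Any


def keep_required_fields(
    schema: dict[str, Any], args: dict[str, Any]
) -> tuple[dict[str, Any], list[str]]:
    """Return (args restricted to required fields, required fields still missing)."""
    required = schema.get("required", [])
    repaired = {name: args[name] for name in required if name in args}
    missing = [name for name in required if name not in args]
    return repaired, missing


schema = {
    "type": "object",
    "required": ["path", "mode"],
    "properties": {"path": {"type": "string"}, "mode": {"type": "string"}},
}
repaired, missing = keep_required_fields(schema, {"path": "/tmp/a", "verbose": True})
# repaired drops the optional "verbose" field; missing reports "mode",
# which the repair plan still has to supply before the call is retried.
```

Returning the missing fields separately lets the caller decide whether the repair is complete, rather than retrying a call that is certain to fail validation again.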

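8. Minimum stop-the-bleeding rules

  • Never retry a validation error directly
  • Never re-execute a call whose args hash has already failed
  • Never send the same message twice
  • No more than 2 repairs per turn
  • Compaction must not delete failure state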
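
Several of these rules reduce to simple per-turn bookkeeping. The sketch below is one possible shape, assuming JSON-serializable arguments; `TurnGuard` and its method names are hypothetical, and only the limit of 2 repairs per turn comes from the rules above.

```python
import hashlib
import json
from typing import Any

MAX_REPAIRS_PER_TURN = 2  # limit taken from the rule list


def _digest(value: Any) -> str:
    """Stable SHA-256 digest of a JSON-serializable value."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode("utf-8")).hexdigest()


class TurnGuard:
    """Per-turn guard; intended to live outside the compacted conversation."""

    def __init__(self) -> None:
        self.failed_calls: set[str] = set()
        self.sent_messages: set[str] = set()
        self.repair_attempts = 0

    def record_failed_call(self, tool_name: str, args: dict[str, Any]) -> None:
        self.failed_calls.add(_digest([tool_name, args]))

    def may_call(self, tool_name: str, args: dict[str, Any]) -> bool:
        """False if this exact call already failed in this turn."""
        return _digest([tool_name, args]) not in self.failed_calls

    def may_send(self, message: str) -> bool:
        """True the first time a message is seen, False for repeats."""
        digest = _digest(message)
        if digest in self.sent_messages:
            return False
        self.sent_messages.add(digest)
        return True

    def may_repair(self) -> bool:
        """Consume one repair attempt; False once the per-turn budget is spent."""
        if self.repair_attempts >= MAX_REPAIRS_PER_TURN:
            return False
        self.repair_attempts += 1
        return True
```

A caller would check `may_call` before every tool invocation and `may_send` before every outgoing message, and create a fresh `TurnGuard` at the start of each turn.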

9. English system rules (can be embedded directly in a prompt)

You are a self-healing execution agent. When a tool call fails, do not blindly retry the original action.

  1. Classify the error first.
  2. If the error is deterministic (validation error, schema mismatch, missing required field, invalid enum, bad config reference, provider not found), do not retry the original action directly.
  3. Switch to repair mode and generate a repair plan.
  4. Execute only the repair plan first.
  5. Validate the repair result before re-running the original action.
  6. Re-run the original action at most once after successful validation.
  7. If the same class of error happens again in the same turn, stop and return a structured failure summary.
  8. Never send the same message twice in the same turn.
  9. Never repeat the same failed tool call with the same argument hash in the same turn.
  10. Preserve execution state outside compacted conversation history.
  11. If repair attempts exceed 2 in the same turn, trigger circuit breaker and stop.
  12. Prefer safe fallback over repeated retries.

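10. Rollout recommendations

  • Short term: stop the bleeding; forbid unlimited retries of deterministic errors
  • Medium term: split runtime state out of the prompt into a standalone State Guard
  • Long term: a two-tier architecture with a main agent plus a repair agent

In one sentence: first decide whether it can be repaired, verify after repairing, and never let the same class of error happen a second time.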

extent analysis

TL;DR

Implement a self-healing mechanism that classifies errors, generates repair plans, and validates repairs before re-running original actions to prevent infinite retries.

Guidance

  • Identify and classify errors into retryable, self-healable, and non-retryable categories to determine the best course of action.
  • Implement a repair planner to generate repair actions for self-healable errors, and a verifier to validate the success of these repairs.
  • Preserve execution state outside of compacted conversation history to prevent loss of failure information and enable more informed retry decisions.
  • Limit the number of repair attempts and retries to prevent infinite loops and implement a circuit breaker to stop execution when repair attempts exceed a threshold.
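
For the retryable-with-limits class (timeouts, 429, temporary file locks), a bounded retry with a circuit breaker might look like the sketch below. The names `TransientError`, `CircuitOpen`, and `retry_bounded` are hypothetical, and the default of 3 retries with exponential backoff is an assumption; the issue does not specify a threshold.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeout, 429, file lock)."""


class CircuitOpen(Exception):
    """Raised when the retry budget is exhausted."""


def retry_bounded(
    action: Callable[[], T],
    max_retries: int = 3,          # assumed threshold, not from the issue
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action, retrying transient failures at most max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return action()
        except TransientError as err:
            if attempt == max_retries:
                raise CircuitOpen(f"gave up after {max_retries} retries") from err
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise ValueError("max_retries must be >= 0")
```

Only `TransientError` is caught, so deterministic failures such as validation errors propagate immediately instead of being retried.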

Example

A simple example of error classification and repair planning could involve checking the error type and generating a repair plan accordingly, such as:

if error_type == "validation_error":
    repair_plan = generate_repair_plan_for_validation_error()
elif error_type == "provider_not_found":
    repair_plan = generate_repair_plan_for_provider_not_found()
else:
    repair_plan = None  # no known repair: stop and return a structured failure

Notes

The implementation of the self-healing mechanism will require significant changes to the existing system, including the addition of new modules and the modification of existing execution flows. It is essential to carefully consider the design and implementation of each component to ensure that the system can effectively classify errors, generate repair plans, and validate repairs.

Recommendation

Implement the proposed self-healing mechanism (error classification, repair planning, and validation) to prevent infinite retries. Start with the short-term rule from the proposal, which forbids unlimited retries of deterministic errors, before moving runtime state into a standalone State Guard and adding a repair agent.
