openclaw - 💡(How to fix) Fix Two-Strike Enforcement for Sub-Agent Error Handling [1 participants]

GitHub stats
openclaw/openclaw#52703 · Fetched 2026-04-08 01:20:06
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


Problem

Sub-agent sessions may encounter repeated non-transient errors but continue running indefinitely, consuming resources and potentially causing cascading failures. There is no automated mechanism to terminate or restart misbehaving sessions.

Proposed Solution: Two-Strike Enforcement

Definition

  • Strike: A non-transient error (e.g., network timeout, API failure, uncaught exception) that is not automatically recoverable.
  • Two-Strike Rule: If a sub-agent accumulates >=2 strikes within a 10-minute window, it is automatically terminated or restarted.
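The rule above can be sketched as a sliding-window counter. This is a minimal illustration, not the project's implementation: the `StrikeTracker` class and its method names are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
from collections import deque


class StrikeTracker:
    """Counts strikes for one sub-agent inside a sliding time window (hypothetical helper)."""

    def __init__(self, max_strikes=2, window_seconds=600):
        self.max_strikes = max_strikes
        self.window_seconds = window_seconds
        self._strikes = deque()  # timestamps (seconds) of recent strikes

    def record_strike(self, now):
        """Record a strike at time `now`; return True if the rule is tripped."""
        self._strikes.append(now)
        # Drop strikes that have aged out of the window.
        while self._strikes and now - self._strikes[0] > self.window_seconds:
            self._strikes.popleft()
        return len(self._strikes) >= self.max_strikes


tracker = StrikeTracker()
first = tracker.record_strike(now=0)     # one strike: below the limit
second = tracker.record_strike(now=300)  # second strike 5 minutes later: rule tripped
```

Passing `now` explicitly keeps the counter easy to test; a real harness would likely pass `time.time()`.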

Implementation

  • Heartbeat task (every 30 minutes) scans all sub-agent sessions via subagents list.
  • For each session, check metadata error_count and timestamps of recent errors.
  • If condition met: subagents kill --sessionId <id> and optionally restart (if critical).

Configuration

{
  "two_strike": {
    "max_strikes": 2,
    "window_seconds": 600,
    "auto_restart": true
  }
}
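One way to load and validate this block is sketched below. The `TwoStrikeConfig` dataclass and `load_two_strike_config` function are hypothetical names, and the validation rules are assumptions rather than anything the issue specifies.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStrikeConfig:
    # Defaults mirror the values proposed in the issue.
    max_strikes: int = 2
    window_seconds: int = 600
    auto_restart: bool = True


def load_two_strike_config(text):
    """Parse a JSON document and validate its "two_strike" section."""
    section = json.loads(text).get("two_strike", {})
    cfg = TwoStrikeConfig(**section)  # unknown keys raise TypeError
    if cfg.max_strikes < 1:
        raise ValueError("max_strikes must be at least 1")
    if cfg.window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return cfg


raw = '{"two_strike": {"max_strikes": 2, "window_seconds": 600, "auto_restart": true}}'
config = load_two_strike_config(raw)
```

Rejecting unknown keys and out-of-range values at load time means a typo in the config fails loudly instead of silently disabling enforcement.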

Error Classification

Need a way to distinguish transient vs non-transient errors. Proposal:

  • Transient: network blips, rate limits (429), temporary unavailability.
  • Non-transient: auth failures, invalid parameters, persistent 5xx, OOM.
  • Classification function in agent harness; update error_count only for non-transient.
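A rough sketch of such a classification function follows, together with the rule that only non-transient errors update `error_count`. The `ApiError` class, its `status` attribute, and the exact status-code sets are assumptions for illustration. Detecting "persistent 5xx" would need retry history and is left out here.

```python
TRANSIENT_STATUS = {429, 503}           # rate limits, temporary unavailability
NON_TRANSIENT_STATUS = {400, 401, 403}  # invalid parameters, auth failures


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def classify(error):
    """Return "transient", "non_transient", or "unknown" for an exception."""
    status = getattr(error, "status", None)
    if status in TRANSIENT_STATUS or isinstance(error, (TimeoutError, ConnectionError)):
        return "transient"
    if status in NON_TRANSIENT_STATUS or isinstance(error, (MemoryError, PermissionError)):
        return "non_transient"
    return "unknown"


def record_error(metadata, error, now):
    """Update session metadata only when the error is non-transient."""
    kind = classify(error)
    if kind == "non_transient":
        metadata["error_count"] = metadata.get("error_count", 0) + 1
        metadata.setdefault("recent_errors", []).append(now)
    return kind
```

In this sketch unknown errors are not counted as strikes. Whether they should be is a policy choice the issue leaves open.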

Monitoring

Log each strike to audit trail; optionally notify user if critical service affected.

Request for Feedback

  • Are two strikes too few or too many? Should we have escalation (strike 1: warn, strike 2: restart)?
  • Should errors be weighted differently?
  • Should this be configurable per agent type?
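To make these questions concrete, the sketch below combines escalation, per-error weights, and per-agent-type overrides. Every name and value in it (`ERROR_WEIGHTS`, the `"critical"` agent type, the action strings) is hypothetical and only illustrates one possible answer. The length of each escalation list plays the role of `max_strikes`.

```python
# Hypothetical escalation ladders: the action taken at each weighted strike level.
DEFAULT_ESCALATION = ["warn", "restart"]
PER_AGENT_TYPE = {"critical": ["warn", "warn", "restart"]}

# Hypothetical weights: an OOM counts as two strikes, anything unlisted as one.
ERROR_WEIGHTS = {"auth_failure": 1.0, "oom": 2.0}


def next_action(agent_type, error_kinds):
    """Return the action for the weighted strikes accumulated in the window."""
    steps = PER_AGENT_TYPE.get(agent_type, DEFAULT_ESCALATION)
    score = sum(ERROR_WEIGHTS.get(kind, 1.0) for kind in error_kinds)
    if score <= 0:
        return "none"
    # Clamp the score into the ladder so extra strikes keep the final action.
    index = max(min(int(score), len(steps)) - 1, 0)
    return steps[index]
```

With these values a default agent is warned on its first auth failure and restarted on the second, while a single OOM triggers a restart immediately. A `"critical"` agent tolerates one more weighted strike before restarting.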

extent analysis

Fix Plan

To implement the Two-Strike Enforcement mechanism, follow these steps:

  • Step 1: Configure Two-Strike Settings. Update the configuration file with the desired settings:

{
  "two_strike": {
    "max_strikes": 2,
    "window_seconds": 600,
    "auto_restart": true
  }
}

  • Step 2: Implement Error Classification. Create a function to classify errors as transient or non-transient:

```python
def classify_error(error):
    # Map an HTTP-style status code to an error class.
    code = getattr(error, "code", None)
    if code in (429, 503):  # rate limit, temporary unavailability
        return "transient"
    if code in (401, 500, 502, 504):  # auth failure, server-side errors
        return "non_transient"
    return "unknown"
```

  • Step 3: Update Heartbeat Task. Modify the heartbeat task to scan sub-agent sessions and enforce the Two-Strike Rule, counting only errors that fall inside the configured window:

```python
import time

import schedule  # third-party: pip install schedule


def heartbeat():
    # subagents_list, subagents_kill, subagents_restart and config are
    # placeholders for wrappers around the subagents CLI and the loaded settings.
    settings = config["two_strike"]
    cutoff = time.time() - settings["window_seconds"]
    for session in subagents_list():
        # Assumes recent_errors holds epoch timestamps of non-transient errors.
        strikes = [t for t in session.metadata["recent_errors"] if t >= cutoff]
        if len(strikes) >= settings["max_strikes"]:
            subagents_kill(session.id)
            if settings["auto_restart"]:
                subagents_restart(session.id)


# Run every 30 minutes. Strikes older than window_seconds at scan time are
# not counted, so a shorter interval may be needed.
schedule.every(30).minutes.do(heartbeat)

while True:
    schedule.run_pending()
    time.sleep(1)
```

  • Step 4: Log Strikes and Notify Users (optional). Log each strike to the audit trail and notify users if a critical service is affected:

```python
def log_strike(session, error):
    # Log the strike to the audit trail.
    audit_trail.log(session.id, error)
    # Notify users if a critical service is affected.
    if session.service == "critical":
        notify_users(session.id, error)
```

Verification

To verify that the fix worked, monitor the sub-agent sessions and check that:

  • Sessions with >=2 non-transient errors within a 10-minute window are terminated or restarted.
  • The audit trail logs each strike correctly.
  • Users are notified if a critical service is affected.

Extra Tips

  • Consider weighting errors differently based on their severity.
  • Make the Two-Strike Enforcement mechanism configurable per agent type.
  • Continuously monitor and evaluate the effectiveness of the Two-Strike Rule to ensure it is not too lenient or too strict.
