openclaw - 💡(How to fix) Fix Two-Strike Enforcement for Sub-Agent Error Handling [1 participants]

DockeGumi · 2026-03-23T07:34:39Z

GitHub stats

openclaw/openclaw#52703•Fetched 2026-04-08 01:20:06

## Problem

Sub-agent sessions may encounter repeated non-transient errors but continue running indefinitely, consuming resources and potentially causing cascading failures. There is no automated mechanism to terminate or restart misbehaving sessions.

## Proposed Solution: Two-Strike Enforcement

### Definition

- **Strike**: A non-transient error (e.g., network timeout, API failure, uncaught exception) that is not automatically recoverable.
- **Two-Strike Rule**: If a sub-agent accumulates >=2 strikes within a 10-minute window, it is automatically terminated or restarted.
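
One way to read the Two-Strike Rule is as a sliding-window check over strike timestamps. The sketch below is an assumption about how that could look, not code from openclaw; `has_tripped` and its parameters are hypothetical names. It measures the window between the strikes themselves rather than back from the current time, so a scan that runs less often than the window still sees the violation.

```python
from typing import Iterable


def has_tripped(strike_times: Iterable[float],
                max_strikes: int = 2,
                window_seconds: float = 600) -> bool:
    """Return True if any `max_strikes` strikes fall within `window_seconds`.

    `strike_times` are epoch seconds of non-transient errors (a hypothetical
    representation; the issue only says timestamps are kept in metadata).
    """
    times = sorted(strike_times)
    # Compare each strike with the one (max_strikes - 1) positions later.
    for i in range(len(times) - max_strikes + 1):
        if times[i + max_strikes - 1] - times[i] <= window_seconds:
            return True
    return False
```

For example, strikes at 0 s and 300 s trip the rule, while strikes at 0 s and 900 s do not.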

### Implementation

- Heartbeat task (every 30 minutes) scans all sub-agent sessions via `subagents list`.
- For each session, check metadata `error_count` and timestamps of recent errors.
- If condition met: `subagents kill --sessionId <id>` and optionally restart (if critical).
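
If the heartbeat shells out to the CLI, passing the session ID as a separate argument avoids shell quoting problems. The following is a minimal sketch: only `subagents kill --sessionId <id>` comes from the proposal above. The helper names and the ID validation rule are assumptions, and treating exit status 0 as success is a guess, since the issue does not describe the CLI's behavior.

```python
import re
import subprocess
from typing import List

# Assumed ID shape, for illustration only; adjust to what the harness uses.
_SESSION_ID = re.compile(r"[A-Za-z0-9._:-]+")


def build_kill_command(session_id: str) -> List[str]:
    """Build the argv for `subagents kill --sessionId <id>` without a shell."""
    if not _SESSION_ID.fullmatch(session_id):
        raise ValueError(f"suspicious session id: {session_id!r}")
    return ["subagents", "kill", "--sessionId", session_id]


def kill_session(session_id: str) -> bool:
    """Run the kill command; True if the process exited with status 0."""
    result = subprocess.run(build_kill_command(session_id), check=False)
    return result.returncode == 0
```

Keeping command construction separate from execution makes the argument handling testable without the CLI installed.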

### Configuration

```json
{
  "two_strike": {
    "max_strikes": 2,
    "window_seconds": 600,
    "auto_restart": true
  }
}
```
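
A harness would probably want to validate this section before acting on it. The sketch below is one possible loader, under the assumption that missing keys fall back to the values shown in the proposal; `TwoStrikeConfig` and `load_two_strike_config` are hypothetical names, not part of openclaw.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStrikeConfig:
    # Defaults mirror the proposal's example values (an assumption).
    max_strikes: int = 2
    window_seconds: int = 600
    auto_restart: bool = True

    def __post_init__(self):
        if self.max_strikes < 1:
            raise ValueError("max_strikes must be at least 1")
        if self.window_seconds <= 0:
            raise ValueError("window_seconds must be positive")


def load_two_strike_config(text: str) -> TwoStrikeConfig:
    """Parse the `two_strike` section; unknown keys raise TypeError."""
    section = json.loads(text).get("two_strike", {})
    return TwoStrikeConfig(**section)
```

Rejecting unknown keys is a deliberate choice here: a typo such as `window_second` fails loudly instead of silently using the default.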

### Error Classification

Need a way to distinguish transient vs non-transient errors. Proposal:

- Transient: network blips, rate limits (429), temporary unavailability.
- Non-transient: auth failures, invalid parameters, persistent 5xx, OOM.
- Classification function in agent harness; update `error_count` only for non-transient.
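
The "persistent 5xx" category implies that a single server error should not count as a strike by itself. The sketch below illustrates one way to encode that. The status-code mapping, the `persistent_after` threshold, and the choice to treat unrecognized errors as strikes are all assumptions made for illustration, not decisions taken in the issue.

```python
from typing import Optional

TRANSIENT_STATUSES = {429}                 # rate limits
NON_TRANSIENT_STATUSES = {400, 401, 403}   # invalid parameters, auth failures


def classify_error(status: Optional[int] = None,
                   exc: Optional[BaseException] = None,
                   consecutive_5xx: int = 0,
                   persistent_after: int = 3) -> str:
    """Return "transient" or "non_transient" for one observed error."""
    if isinstance(exc, MemoryError):                       # OOM
        return "non_transient"
    if isinstance(exc, (TimeoutError, ConnectionError)):   # network blips
        return "transient"
    if status in NON_TRANSIENT_STATUSES:
        return "non_transient"
    if status in TRANSIENT_STATUSES:
        return "transient"
    if status is not None and 500 <= status <= 599:
        # A 5xx only becomes a strike once it has persisted.
        if consecutive_5xx >= persistent_after:
            return "non_transient"
        return "transient"
    # Anything unrecognized (e.g. an uncaught exception) counts as a strike.
    return "non_transient"
```

Only results equal to `"non_transient"` would increment `error_count`.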

### Monitoring

Log each strike to audit trail; optionally notify user if critical service affected.
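
An audit trail entry could be as simple as one JSON line per strike. This is a sketch using only the standard library; the logger name, the field names, and the idea of signalling critical services through the log level are assumptions, since the issue does not specify an audit format.

```python
import json
import logging
import time
from typing import Optional

audit_log = logging.getLogger("two_strike.audit")  # hypothetical logger name


def strike_record(session_id: str, error: str, strike_number: int,
                  critical: bool = False, now: Optional[float] = None) -> str:
    """Serialize one strike as a single JSON line."""
    return json.dumps({
        "event": "strike",
        "session_id": session_id,
        "error": error,
        "strike_number": strike_number,
        "critical": critical,
        "timestamp": time.time() if now is None else now,
    }, sort_keys=True)


def log_strike(session_id: str, error: str, strike_number: int,
               critical: bool = False) -> None:
    """Write the strike to the audit logger."""
    line = strike_record(session_id, error, strike_number, critical)
    # Critical services log at WARNING so a handler can notify the user.
    audit_log.log(logging.WARNING if critical else logging.INFO, line)
```

A notification channel can then be attached as an ordinary `logging` handler filtered on level, which keeps the strike logic free of delivery details.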

## Request for Feedback

- Are two strikes too few or too many? Should we have escalation (strike 1: warn, strike 2: restart)?
- Should errors be weighted differently?
- Should this be configurable per agent type?
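
To make the escalation and weighting questions concrete, here is a rough sketch of how the two could combine. The error kinds, the weights, and the action names are all hypothetical; nothing here has been agreed on in the issue.

```python
from typing import Dict, Iterable

# Hypothetical weights: a heavier error uses up the strike budget faster.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "auth_failure": 1.0,
    "invalid_params": 1.0,
    "oom": 2.0,
}


def escalation_action(error_kinds: Iterable[str],
                      weights: Dict[str, float] = DEFAULT_WEIGHTS,
                      max_strikes: float = 2) -> str:
    """Map the weighted strikes inside the current window to an action."""
    # Unlisted error kinds default to a weight of one strike.
    score = sum(weights.get(kind, 1.0) for kind in error_kinds)
    if score >= max_strikes:
        return "restart"
    if score > 0:
        return "warn"
    return "none"
```

With these weights a single OOM triggers a restart immediately, while a single auth failure only produces a warning.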

extent analysis

Fix Plan

To implement the Two-Strike Enforcement mechanism, follow these steps:

* **Step 1: Configure Two-Strike Settings**
  Update the configuration file with the desired settings:
  ```json
  {
    "two_strike": {
      "max_strikes": 2,
      "window_seconds": 600,
      "auto_restart": true
    }
  }
  ```

* **Step 2: Implement Error Classification**
  Create a function to classify errors as transient or non-transient:
  ```python
  def classify_error(error):
      """Classify an error by its HTTP status code (`error.code`)."""
      if error.code in (429, 503):  # rate limit, temporary unavailability
          return "transient"
      elif error.code in (401, 500, 502, 504):  # auth failure, server errors
          return "non_transient"
      else:
          return "unknown"
  ```

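* **Step 3: Update Heartbeat Task**
  Modify the heartbeat task to scan sub-agent sessions and enforce the Two-Strike Rule:
  ```python
  import time

  import schedule  # third-party scheduler: pip install schedule

  # `config`, `subagents_list`, `subagents_kill`, and `subagents_restart` stand in
  # for the harness's own settings and wrappers around the `subagents` CLI.

  def heartbeat():
      rule = config["two_strike"]
      for session in subagents_list():
          # Timestamps (epoch seconds) of the session's non-transient errors.
          stamps = sorted(session.metadata.get("recent_errors", []))
          n = rule["max_strikes"]
          # Tripped if any n consecutive strikes fall within the window.
          tripped = any(
              stamps[i + n - 1] - stamps[i] <= rule["window_seconds"]
              for i in range(len(stamps) - n + 1)
          )
          if tripped:
              subagents_kill(session.id)  # terminate the session
              if rule["auto_restart"]:
                  subagents_restart(session.id)

  schedule.every(30).minutes.do(heartbeat)  # run every 30 minutes

  while True:
      schedule.run_pending()
      time.sleep(1)
  ```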

* **Step 4: Log Strikes and Notify Users (optional)**
  Log each strike to the audit trail and notify users if a critical service is affected:
  ```python
  def log_strike(session, error):
      # log the strike to the audit trail
      audit_trail.log(session.id, error)
      # notify users if a critical service is affected
      if session.service == "critical":
          notify_users(session.id, error)
  ```

Verification

To verify that the fix worked, monitor the sub-agent sessions and check that:

* Sessions with >=2 non-transient errors within a 10-minute window are terminated or restarted.
* The audit trail logs each strike correctly.
* Users are notified if a critical service is affected.

Extra Tips

* Consider weighting errors differently based on their severity.
* Make the Two-Strike Enforcement mechanism configurable per agent type.
* Continuously monitor and evaluate the effectiveness of the Two-Strike Rule to ensure it is not too lenient or too strict.
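
Per-agent-type configuration could be layered on top of the global settings with a simple override merge. The sketch below assumes a hypothetical `overrides` mapping keyed by agent type; the type names in the usage note are invented for illustration.

```python
from typing import Any, Dict

# Global defaults, taken from the configuration example in the issue.
BASE_SETTINGS: Dict[str, Any] = {
    "max_strikes": 2,
    "window_seconds": 600,
    "auto_restart": True,
}


def settings_for(agent_type: str,
                 overrides: Dict[str, Dict[str, Any]],
                 base: Dict[str, Any] = BASE_SETTINGS) -> Dict[str, Any]:
    """Return the base settings with any overrides for `agent_type` applied."""
    # Later keys win, so per-type values replace the global defaults.
    return {**base, **overrides.get(agent_type, {})}
```

For instance, `settings_for("crawler", {"crawler": {"max_strikes": 5}})` would raise the limit for a made-up `crawler` type while leaving the window and restart behavior at their defaults.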
