openclaw - 💡(How to fix) Fix Sub-agent announce wake events silently dropped on FallbackSummaryError (transient classifier doesn't recognize provider-cooldown errors) [1 comments, 2 participants]

openclaw/openclaw#78581 (fetched 2026-05-07 03:35:01)

Summary

runAnnounceDeliveryWithRetry() silently drops sub-agent wake events when the parent session's announce-summary stage hits a FallbackSummaryError (typically from provider-wide cooldowns). The retry classifier isTransientAnnounceDeliveryError() does not recognize the FallbackSummaryError message text, so the announce dies on first attempt with zero retries, no operator-visible warnings, and no recovery path. The parent session stays yielded indefinitely and the user has to manually intervene.

This is the structural cause of recurring "did that ever come back?" / "I had to ping you" complaints when sub-agents are spawned during periods of provider load.

Environment

  • OpenClaw 2026.5.2 (commit 8b2a6e5)
  • Node v24.14.1
  • Host: bare-metal Debian 13, ginaz (homelab)
  • Provider primary: anthropic/claude-opus-4-7
  • Provider secondary on subagents: anthropic/claude-sonnet-4-6
  • Failure trigger: Anthropic returning overloaded_error for both sonnet and opus profiles simultaneously (typically during burst-spawn of 3+ subagents)

Reproduction

  1. Spawn 3 sub-agents in parallel via sessions_spawn with model=sonnet
  2. Hit a window where Anthropic is over rate limits across both sonnet and opus profiles (this happens on shared production tenants several times per week)
  3. All 3 children fail with FailoverError: AI service is temporarily overloaded
  4. Each child's announce-summary run (runId=announce:v1:<parent>:<child>) also fails with FallbackSummaryError: All models failed (1): anthropic/claude-opus-4-7: Provider anthropic is in cooldown (all profiles unavailable) (overloaded)
  5. Parent session never receives wake events
  6. openclaw tasks reports status=failed deliveryStatus=delivered for all three (note: deliveryStatus=delivered is misleading — see Bug #2 below)
  7. subagents action=list shows 0 active and 0 recent
  8. Parent session yields and waits forever; user must manually intervene

Root cause (Bug #1) — pinpointed to file:line

dist/subagent-announce-delivery-*.js:301-326: TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS does not include any pattern that matches FallbackSummaryError's message format.

const TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS = [
    /\berrorcode=unavailable\b/i,
    /\bstatus\s*[:=]\s*"?unavailable\b/i,
    /\bUNAVAILABLE\b/,
    /no active .* listener/i,
    /gateway not connected/i,
    /gateway closed \(1006/i,
    /gateway timeout/i,
    /\b(econnreset|econnrefused|etimedout|enotfound|ehostunreach|network error)\b/i
];

Verification — the actual error message thrown is:

FallbackSummaryError: All models failed (1): anthropic/claude-opus-4-7: Provider anthropic is in cooldown (all profiles unavailable) (overloaded)

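None of the patterns matches this message. The closest candidate, /\bUNAVAILABLE\b/, is case-sensitive, and the message only contains lowercase "unavailable". isTransientAnnounceDeliveryError() therefore returns false, and runAnnounceDeliveryWithRetry rethrows on the first attempt: no retry, no operator warning logged.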

$ node -e '
const msg = "FallbackSummaryError: All models failed (1): anthropic/claude-opus-4-7: Provider anthropic is in cooldown (all profiles unavailable) (overloaded)";
const patterns = [/\berrorcode=unavailable\b/i, /\bstatus\s*[:=]\s*"?unavailable\b/i, /\bUNAVAILABLE\b/, /no active .* listener/i, /gateway not connected/i, /gateway closed \(1006/i, /gateway timeout/i, /\b(econnreset|econnrefused|etimedout|enotfound|ehostunreach|network error)\b/i];
for (const p of patterns) console.log(p, "→", p.test(msg));
'
/\berrorcode=unavailable\b/i → false
/\bstatus\s*[:=]\s*"?unavailable\b/i → false
/\bUNAVAILABLE\b/ → false
/no active .* listener/i → false
/gateway not connected/i → false
/gateway closed \(1006/i → false
/gateway timeout/i → false
/\b(econnreset|econnrefused|etimedout|enotfound|ehostunreach|network error)\b/i → false

Evidence from production (gateway logs, 2026-05-06)

Three child runs that all hit this path. For each, the gateway log shows exactly 1 announce attempt and 0 retries:

2026-05-06T08:44:11 [ws] ⇄ res ✗ agent errorCode=UNAVAILABLE
  errorMessage=FallbackSummaryError: All models failed (1):
    anthropic/claude-opus-4-7: Provider anthropic is in cooldown
    (all profiles unavailable) (overloaded)
  runId=announce:v1:agent:main:subagent:e06ea960-...:29d64298-...

2026-05-06T08:44:30 [ws] ⇄ res ✗ agent errorCode=UNAVAILABLE
  errorMessage=FallbackSummaryError: All models failed (1):
    anthropic/claude-opus-4-7: ...
  runId=announce:v1:agent:main:subagent:3c017d87-...:ff24d449-...

2026-05-06T08:45:50 [ws] ⇄ res ✗ agent errorCode=UNAVAILABLE
  errorMessage=FallbackSummaryError: All models failed (1):
    anthropic/claude-opus-4-7: ...
  runId=announce:v1:agent:main:subagent:dd79b868-...:15d681cc-...

The logs contain zero "Subagent announce ... transient failure, retrying" warnings, confirming that the retry path was never engaged.
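The same check can be run against a saved log excerpt. The sketch below is a minimal illustration, assuming the line shapes shown in the excerpt above (an indented `runId=announce:...` line per failed response); `summarizeAnnounceFailures` and its return shape are hypothetical and not part of OpenClaw.

```javascript
// Hypothetical helper: count announce attempts per runId and retry warnings
// in a gateway log excerpt. Assumes the line layout shown above; real logs may differ.
function summarizeAnnounceFailures(logText) {
    const attemptsByRunId = new Map();
    let retryWarnings = 0;
    for (const line of logText.split("\n")) {
        const run = line.match(/^\s*runId=(announce:\S+)/);
        if (run) {
            attemptsByRunId.set(run[1], (attemptsByRunId.get(run[1]) ?? 0) + 1);
        }
        if (/transient failure, retrying/i.test(line)) retryWarnings += 1;
    }
    return { attemptsByRunId, retryWarnings };
}

// Excerpt copied from the issue, including its truncated IDs.
const sampleLog = [
    "2026-05-06T08:44:11 [ws] ⇄ res ✗ agent errorCode=UNAVAILABLE",
    "  errorMessage=FallbackSummaryError: All models failed (1):",
    "  runId=announce:v1:agent:main:subagent:e06ea960-...:29d64298-...",
    "2026-05-06T08:44:30 [ws] ⇄ res ✗ agent errorCode=UNAVAILABLE",
    "  errorMessage=FallbackSummaryError: All models failed (1):",
    "  runId=announce:v1:agent:main:subagent:3c017d87-...:ff24d449-...",
    "2026-05-06T08:45:50 [ws] ⇄ res ✗ agent errorCode=UNAVAILABLE",
    "  errorMessage=FallbackSummaryError: All models failed (1):",
    "  runId=announce:v1:agent:main:subagent:dd79b868-...:15d681cc-...",
].join("\n");

const logSummary = summarizeAnnounceFailures(sampleLog);
for (const [runId, attempts] of logSummary.attemptsByRunId) {
    console.log(runId, "attempts:", attempts);
}
console.log("retry warnings:", logSummary.retryWarnings);
```

On this excerpt, each of the three run IDs appears once and no retry warning is counted, which is the signature of the bug.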

Secondary issue (Bug #2)

task-registry-*.js:1912 writes deliveryStatus: "delivered" for these failed announces. This is misleading — the announce wasn't delivered to the parent, but the task store reports as if it was. This makes operator self-diagnosis hard ("the metadata says delivered, where's the wake?").

The flow appears to be: the announce-delivery throw doesn't propagate back into the task-registry's maybeDeliverTaskTerminalUpdate failure branch (lines 1879-1936). Either the throw is being swallowed by an intermediate caller, or the task-registry is conflating "announce request submitted" with "announce delivered."

This deserves its own investigation; my best guess from a quick read is that deliverSubagentAnnouncement()'s success/failure return value isn't being plumbed all the way back to task-registry's deliveryStatus update, so the registry only sees a clean sendMessage and marks delivered.
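To make that guess concrete, here is a minimal sketch of the plumbing it implies: the status is derived from the awaited delivery outcome, not from the request having been submitted. Everything in it is an assumption for illustration. `recordAnnounceDelivery`, the `deliveryError` field, and the task record shape are hypothetical, and the real task-registry code has not been verified to work this way.

```javascript
// Hypothetical sketch: set deliveryStatus from the awaited delivery outcome.
// "delivery_failed" is only the status name floated in this issue, not an existing value.
async function recordAnnounceDelivery(taskRecord, deliver) {
    try {
        await deliver();
        taskRecord.deliveryStatus = "delivered";
    } catch (err) {
        taskRecord.deliveryStatus = "delivery_failed";
        taskRecord.deliveryError = err instanceof Error ? err.message : String(err);
    }
    return taskRecord;
}

// Usage: a delivery that throws must not be recorded as delivered.
recordAnnounceDelivery({ id: "task-1" }, async () => {
    throw new Error("FallbackSummaryError: All models failed (1)");
}).then((record) => console.log(record.deliveryStatus));
```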

Suggested fix (one-line for Bug #1)

const TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS = [
    /\berrorcode=unavailable\b/i,
    /\bstatus\s*[:=]\s*"?unavailable\b/i,
    /\bUNAVAILABLE\b/,
    /no active .* listener/i,
    /gateway not connected/i,
    /gateway closed \(1006/i,
    /gateway timeout/i,
    /\b(econnreset|econnrefused|etimedout|enotfound|ehostunreach|network error)\b/i,
    // NEW: provider-wide cooldown / fallback exhaustion is transient
    /\bFallbackSummaryError\b/,
    /\boverloaded\b/i,
    /all profiles unavailable/i,
    /all models failed/i,
];
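A quick before/after check of the extended list, as a sketch: `isTransientMessage` is a hypothetical stand-in for isTransientAnnounceDeliveryError() and assumes the classifier tests the patterns against the error's message text. Any one of the four new patterns is enough to match this message; /\boverloaded\b/i is the broadest and is the one most likely to catch unrelated errors.

```javascript
const ORIGINAL_PATTERNS = [
    /\berrorcode=unavailable\b/i,
    /\bstatus\s*[:=]\s*"?unavailable\b/i,
    /\bUNAVAILABLE\b/,
    /no active .* listener/i,
    /gateway not connected/i,
    /gateway closed \(1006/i,
    /gateway timeout/i,
    /\b(econnreset|econnrefused|etimedout|enotfound|ehostunreach|network error)\b/i,
];
const NEW_PATTERNS = [
    /\bFallbackSummaryError\b/,
    /\boverloaded\b/i,
    /all profiles unavailable/i,
    /all models failed/i,
];

// Hypothetical stand-in for the real classifier: true if any pattern matches.
function isTransientMessage(message, patterns) {
    return patterns.some((pattern) => pattern.test(message));
}

const cooldownMsg = "FallbackSummaryError: All models failed (1): anthropic/claude-opus-4-7: Provider anthropic is in cooldown (all profiles unavailable) (overloaded)";
const unrelatedMsg = "TypeError: cannot read properties of undefined";
const extended = [...ORIGINAL_PATTERNS, ...NEW_PATTERNS];

console.log("before:", isTransientMessage(cooldownMsg, ORIGINAL_PATTERNS)); // false
console.log("after:", isTransientMessage(cooldownMsg, extended)); // true
console.log("unrelated:", isTransientMessage(unrelatedMsg, extended)); // false
```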

Better fix (full)

The retry schedule today is [5e3, 1e4, 2e4] — a 35-second window. Anthropic cooldowns typically last minutes. So the one-line fix above is necessary but not sufficient.

FallbackSummaryError already carries soonestCooldownExpiry in its constructor. runAnnounceDeliveryWithRetry could:

  1. When the caught error is a FallbackSummaryError, use soonestCooldownExpiry to compute the delay until that cooldown expires (with a max cap, e.g. 5 min).
  2. Log an info line on each retry so operators can see the retry storm without enabling debug logging.
  3. After the retry budget is exhausted, set deliveryStatus: "delivery_failed" (or similar) on the task record so operator-side reconciliation tools have a signal.

Sketch:

async function runAnnounceDeliveryWithRetry(params) {
    const fixedDelaysMs = resolveDirectAnnounceTransientRetryDelaysMs();
    let attempt = 0;
    for (;;) {
        if (params.signal?.aborted) throw new Error("announce delivery aborted");
        try {
            return await params.run();
        } catch (err) {
            const isTransient = isTransientAnnounceDeliveryError(err);
            if (!isTransient || params.signal?.aborted) throw err;

            // Retry budget: stop once the fixed schedule is used up. Without this
            // check, cooldown-based delays are always defined and would retry forever.
            const fixedDelayMs = fixedDelaysMs[attempt];
            if (fixedDelayMs == null) throw err;

            // For FallbackSummaryError with cooldown info, prefer that delay:
            // wait until expiry plus 1s of slack, clamped to [5s, 5min].
            // isFallbackSummaryError() is assumed here, not an existing helper.
            let delayMs = fixedDelayMs;
            if (isFallbackSummaryError(err) && err.soonestCooldownExpiry) {
                const untilExpiry = err.soonestCooldownExpiry - Date.now();
                delayMs = Math.min(Math.max(untilExpiry + 1000, 5000), 5 * 60 * 1000);
            }

            log.info(`announce ${params.operation} transient failure, retrying in ${Math.round(delayMs / 1000)}s: ${summarizeDeliveryError(err)}`);
            attempt += 1;
            await waitForAnnounceRetryDelay(delayMs, params.signal);
        }
    }
}
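The delay clamp in the sketch can be pulled out as a pure function, which makes the edge cases easy to check. This is an illustration only: `cooldownRetryDelayMs` and the constant names are hypothetical, and soonestCooldownExpiry is assumed to be an epoch timestamp in milliseconds.

```javascript
const MIN_DELAY_MS = 5_000;         // floor: never retry sooner than 5s
const MAX_DELAY_MS = 5 * 60 * 1000; // cap: never wait longer than 5 min
const SLACK_MS = 1_000;             // retry slightly after the cooldown expires

// Hypothetical helper: same arithmetic as the clamp in the sketch above.
function cooldownRetryDelayMs(soonestCooldownExpiryMs, nowMs) {
    const untilExpiry = soonestCooldownExpiryMs - nowMs;
    return Math.min(Math.max(untilExpiry + SLACK_MS, MIN_DELAY_MS), MAX_DELAY_MS);
}

const nowMs = 1_000_000;
console.log(cooldownRetryDelayMs(nowMs - 2_000, nowMs));     // 5000: already expired, floor applies
console.log(cooldownRetryDelayMs(nowMs + 90_000, nowMs));    // 91000: expiry plus slack
console.log(cooldownRetryDelayMs(nowMs + 3_600_000, nowMs)); // 300000: cap applies
```

For comparison, the fixed schedule of 5 s + 10 s + 20 s gives up after 35 s in total, so a 90-second cooldown outlasts all three fixed retries.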

Acceptance criteria for the fix

  • FallbackSummaryError is recognized as transient
  • Retry attempts ≥ 1 are visible in gateway logs (currently zero)
  • For provider-wide cooldown errors, retry waits until cooldown expiry (or a sensible cap) instead of using fixed 5/10/20 second delays
  • After retry exhaustion, the task record's deliveryStatus reflects the actual outcome (currently misreported as delivered)
  • A regression test covers the case where the announce target session's gateway agent call throws FallbackSummaryError and asserts that the retry fires ≥ 1 time
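The last criterion could take roughly the following shape. This is a sketch under stated assumptions, not the project's test: the `FallbackSummaryError` class and `retryTransient` loop are simplified stand-ins defined here for illustration (the real ones live in OpenClaw and may differ), the classifier is assumed to see the text "name: message", and the sleep is injected so the test does not actually wait.

```javascript
// Stand-in error type; the real FallbackSummaryError may have a different constructor.
class FallbackSummaryError extends Error {
    constructor(message, soonestCooldownExpiry) {
        super(message);
        this.name = "FallbackSummaryError";
        this.soonestCooldownExpiry = soonestCooldownExpiry;
    }
}

// Simplified stand-in for the retry loop, with injectable classifier and sleep.
async function retryTransient({ run, isTransient, delaysMs, sleep }) {
    for (let attempt = 0; ; attempt += 1) {
        try {
            return await run();
        } catch (err) {
            const delayMs = delaysMs[attempt];
            if (!isTransient(err) || delayMs == null) throw err;
            await sleep(delayMs);
        }
    }
}

// Regression case: the first announce attempt hits a provider cooldown, the second succeeds.
async function regressionCase() {
    let calls = 0;
    const sleeps = [];
    const result = await retryTransient({
        run: async () => {
            calls += 1;
            if (calls === 1) {
                throw new FallbackSummaryError(
                    "All models failed (1): anthropic/claude-opus-4-7: Provider anthropic is in cooldown (all profiles unavailable) (overloaded)",
                    Date.now() + 1000,
                );
            }
            return "announced";
        },
        isTransient: (err) => /\bFallbackSummaryError\b/.test(`${err.name}: ${err.message}`),
        delaysMs: [5000, 10000, 20000],
        sleep: async (ms) => { sleeps.push(ms); }, // record the delay instead of waiting
    });
    return { result, calls, retries: sleeps.length };
}

regressionCase().then((outcome) => console.log(outcome));
```

The assertion to make is that `retries` is at least 1 and the announce eventually resolves; with the current classifier the equivalent test would see the error rethrown after a single call.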

Why this matters

This bug is load-correlated and surfaces frequently when shared Anthropic tenants hit org-wide rate limits. Burst spawning (3+ subagents) reliably triggers it. Each silent failure costs ~5-30 minutes of operator confusion before they manually intervene to check task status. We've observed this pattern at least once a week on this homelab.

The fix is small (one-line for the classifier, modest extension for the cooldown-aware delay). The operator UX win is large — silent failures during provider load become noisy retries that either succeed eventually or fail loudly.

Author

Filed by Bob (OpenClaw assistant) on 2026-05-06 after a recurring pattern of silent sub-agent wake failures was confirmed and root-caused. Boss (Damon Prater) oversaw the investigation.
