openclaw - ✅ (Solved) Fix Subagent announce: swap dispatch order to queue-first when parent session is busy [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#57916 · Fetched 2026-04-08 01:56:11
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): cross-referenced ×1

Subagent announce completion messages use direct-primary dispatch (via callGateway) as the first attempt for expectsCompletionMessage: true. When the parent session has an active run, the direct call queues into the session's command lane and blocks until the lane is free — often timing out.

Proposed fix: Check if the parent session has an active run. If yes, swap dispatch order to queue-primary first (like expectsCompletionMessage: false path).

Relates to: #45075, #54276, #53202, #38300, #44925

Root Cause

In src/agents/subagent-announce-dispatch.ts:

// Current behavior for expectsCompletionMessage: true
const primaryDirect = await params.direct();  // ← blocks on lane
if (primaryDirect.delivered) return;
const fallbackQueue = await params.queue();   // ← only tries queue after direct fails

The direct() call uses callGateway(method: "agent") which enqueues into the parent session's command lane. When the parent is processing another announce (from a sibling subagent), the lane is occupied → timeout → retry → timeout → ...


PR fix notes

PR #56822: feat(agents): add opt-in sessions_await tool for parallel sub-agent orchestration

Description (problem / solution / changelog)

Summary

  • Problem: Orchestrator agents that spawn multiple parallel sub-agents have no reliable way to block until all workers complete. The only option today is polling sessions_list in a loop, which LLMs do unreliably — they skip ahead with partial results despite prompt engineering.
  • Why it matters: Fan-out/fan-in orchestration (main -> orchestrator -> N workers) is one of the most requested subagent patterns. Without a blocking primitive, orchestrators produce incomplete or inconsistent results.
  • What changed: Adds a sessions_await tool and suppressAnnounce/waitForCompletion parameters to sessions_spawn, gated behind agents.defaults.subagents.awaitEnabled (opt-in, off by default).
  • What did NOT change (scope boundary): No behavioral change for existing users. All new functionality requires explicit config opt-in. Registry, lifecycle, and run-manager internals gain the suppressAutoAnnounce field but it is inert unless the config flag is enabled.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #31499
  • Related #38522
  • Related #38433
  • Related #30767
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

N/A

Regression Test Plan (if applicable)

N/A — new feature, not a regression fix.

User-visible / Behavior Changes

When agents.defaults.subagents.awaitEnabled: true is set in openclaw.json:

  1. A new sessions_await tool becomes available to agents, accepting an array of session keys and an optional timeout.
  2. sessions_spawn gains suppressAnnounce and waitForCompletion boolean parameters.
  3. The system prompt includes guidance on the spawn+await pattern for parallel work.

When the flag is unset or false (default): no changes to behavior.

Configuration

{
  "agents": {
    "defaults": {
      "subagents": {
        "awaitEnabled": true
      }
    }
  }
}
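
As a rough illustration of the gating described above, the sketch below reads the flag from a parsed config object and exposes `sessions_await` only when the flag is explicitly `true`. The `OpenClawConfig` type and the `isAwaitEnabled` and `listSessionTools` helpers are hypothetical names for this example and are not taken from the PR's code; only the config path and the tool names come from the PR description.

```typescript
// Hypothetical shape: only the agents.defaults.subagents.awaitEnabled path
// comes from the PR description; the type itself is illustrative.
interface OpenClawConfig {
  agents?: { defaults?: { subagents?: { awaitEnabled?: boolean } } };
}

// Opt-in: anything other than an explicit `true` leaves the feature off.
function isAwaitEnabled(config: OpenClawConfig): boolean {
  return config.agents?.defaults?.subagents?.awaitEnabled === true;
}

// Returns the session tool names an agent would see under this config.
function listSessionTools(config: OpenClawConfig): string[] {
  const tools = ["sessions_list", "sessions_spawn"];
  if (isAwaitEnabled(config)) {
    tools.push("sessions_await");
  }
  return tools;
}
```

With an empty config, or with the flag set to `false`, the list stays at the two existing tools, which matches the "off by default" behavior described in the PR.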

Usage pattern

The agent spawns multiple workers with suppressAnnounce: true, then calls sessions_await with all session keys to block until every result is ready:

sessions_spawn(task="analyze file A", suppressAnnounce=true) -> key1
sessions_spawn(task="analyze file B", suppressAnnounce=true) -> key2
sessions_await(sessionKeys=[key1, key2], timeoutSeconds=300)
-> { status: "ok", results: [{ key1, reply: "..." }, { key2, reply: "..." }] }

Diagram (if applicable)

                       ┌─────────────┐
                       │ Parent Agent │
                       └──────┬──────┘
            ┌─────────────────┼─────────────────┐
            │ spawn(suppress) │ spawn(suppress)  │
            ▼                 │                  ▼
     ┌──────────┐             │           ┌──────────┐
     │ Worker 1 │             │           │ Worker 2 │
     └────┬─────┘             │           └────┬─────┘
          │                   │                │
          ▼                   ▼                ▼
       complete    sessions_await([k1,k2])  complete
                  { results: [...] }

Security Impact (required)

  • New permissions/capabilities? No — tools only available when config flag is on
  • Secrets/tokens handling changed? No
  • New/changed network calls? No — uses existing agent.wait gateway RPC
  • Command/tool execution surface changed? Yes — new sessions_await tool (gated)
  • Data access scope changed? No
  • The tool reads run results already available in the subagent registry; no new data surface is exposed.

Repro + Verification

Environment

  • OS: macOS 15.4 (Apple Silicon)
  • Runtime: Node 22, Bun 1.2
  • Model: any model with tool-calling support

Steps

  1. Set agents.defaults.subagents.awaitEnabled: true in openclaw.json
  2. Ask the agent to "analyze files A, B, and C in parallel"
  3. Agent spawns 3 workers with suppressAnnounce: true
  4. Agent calls sessions_await with all 3 session keys
  5. All results collected before agent responds

Expected

Agent blocks until all 3 workers complete, then synthesizes a combined response.

Actual

Same as expected.

Evidence

  • Passing scoped tests: sessions-await-tool.test.ts (4/4), announce-loop-guard.test.ts (6/6)
  • pnpm build green
  • pnpm check green (lint, format, types, all boundary checks)

Human Verification (required)

  • Verified: tool registration gated behind config flag, system prompt conditional on tool presence, spawn params only exposed when flag is on
  • Edge cases: empty keys, unknown keys, timeout, mixed completed/active/unknown sessions
  • Not verified: multi-gateway distributed scenario, production load

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes — off by default, zero behavioral change without opt-in
  • Config/env changes? Yes — new optional agents.defaults.subagents.awaitEnabled boolean
  • Migration needed? No

Risks and Mitigations

  • Risk: sessions_await could hold an agent turn open for the full timeout duration if workers stall
    • Mitigation: Configurable timeout (default 300s), partial results returned on timeout
  • Risk: suppressAutoAnnounce could silently drop completion messages if sessions_await is never called
    • Mitigation: Only set when explicitly requested via tool param; existing announce flow is the default


Changed files

  • src/agents/openclaw-tools.sessions.test.ts (modified, +76/-0)
  • src/agents/openclaw-tools.ts (modified, +15/-1)
  • src/agents/subagent-registry-lifecycle.ts (modified, +5/-0)
  • src/agents/subagent-registry-run-manager.ts (modified, +2/-0)
  • src/agents/subagent-registry.announce-loop-guard.test.ts (modified, +36/-0)
  • src/agents/subagent-registry.test.ts (modified, +30/-0)
  • src/agents/subagent-registry.ts (modified, +43/-0)
  • src/agents/subagent-registry.types.ts (modified, +5/-0)
  • src/agents/subagent-spawn.test.ts (modified, +228/-10)
  • src/agents/subagent-spawn.ts (modified, +140/-2)
  • src/agents/system-prompt.ts (modified, +5/-0)
  • src/agents/tools/sessions-await-tool.test.ts (added, +562/-0)
  • src/agents/tools/sessions-await-tool.ts (added, +320/-0)
  • src/agents/tools/sessions-spawn-tool.test.ts (modified, +194/-1)
  • src/agents/tools/sessions-spawn-tool.ts (modified, +76/-38)
  • src/config/schema.base.generated.ts (modified, +10/-0)
  • src/config/types.agent-defaults.ts (modified, +6/-0)
  • src/config/types.agents.ts (modified, +7/-0)
  • src/config/zod-schema.agent-defaults.ts (modified, +6/-0)
  • src/config/zod-schema.agent-runtime.ts (modified, +6/-0)


Evidence (Real-world reproduction)

Setup

  • OpenClaw 2026.3.28, macOS, Telegram channel
  • 5 parallel subagents doing research (NIM/Groq models)
  • Parent agent (Arcimun) on claude-sonnet-4-6

Before fix (announceTimeoutMs=120s)

lane=session:agent:arcimun:telegram:direct:50234619 waitedMs=255308 queueAhead=3
  • 5 subagents × 4 retries × 120s timeout = 40+ min total gateway blocking
  • Gateway auto-restarted via watchdog after 317s stuck session
  • After restart: context overflow (76%) → full model fallback cascade → all 6 models timeout
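
The first bullet's estimate can be reproduced with a small upper-bound calculation. It assumes every announce attempt runs to its full timeout, which is a simplification, and `worstCaseBlockingMs` is a hypothetical helper rather than anything in OpenClaw.

```typescript
// Upper bound on total waiting time, assuming every attempt times out.
function worstCaseBlockingMs(
  subagents: number,
  attemptsEach: number,
  timeoutMs: number,
): number {
  return subagents * attemptsEach * timeoutMs;
}

const MS_PER_MINUTE = 60_000;

// Values from the report: 5 subagents, 4 retries each, 120 s timeout.
// 5 * 4 * 120 s = 2400 s, i.e. 40 minutes.
const beforeFixMinutes = worstCaseBlockingMs(5, 4, 120_000) / MS_PER_MINUTE;
```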

After partial mitigation (announceTimeoutMs=35s, expanded fallbacks)

lane=session:agent:arcimun:main waitedMs=64358 queueAhead=3
lane=session:agent:arcimun:main waitedMs=56507 queueAhead=2
lane=session:agent:arcimun:main waitedMs=44826 queueAhead=1
  • Announce retries reduced from 20+ to 8
  • All 5 subagents still delivered results successfully
  • But announce retries still happen because lane wait (44-64s) > announceTimeoutMs (35s)
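
The comparison in the last bullet can be checked directly against the three log lines. The sketch below parses the `lane=... waitedMs=... queueAhead=...` format shown above and flags waits that exceed the timeout; `parseLaneWait` and `exceedsTimeout` are hypothetical helpers written for this example.

```typescript
interface LaneWait {
  lane: string;
  waitedMs: number;
  queueAhead: number;
}

// Parses one "lane=... waitedMs=... queueAhead=..." line.
// Returns null when the line does not match that shape.
function parseLaneWait(line: string): LaneWait | null {
  const match = /lane=(\S+)\s+waitedMs=(\d+)\s+queueAhead=(\d+)/.exec(line);
  if (!match) return null;
  const [, lane = "", waited = "0", ahead = "0"] = match;
  return { lane, waitedMs: Number(waited), queueAhead: Number(ahead) };
}

// True when the observed lane wait is longer than the announce timeout,
// meaning a direct attempt would give up before the lane frees up.
function exceedsTimeout(wait: LaneWait, announceTimeoutMs: number): boolean {
  return wait.waitedMs > announceTimeoutMs;
}

const mitigationLogs = [
  "lane=session:agent:arcimun:main waitedMs=64358 queueAhead=3",
  "lane=session:agent:arcimun:main waitedMs=56507 queueAhead=2",
  "lane=session:agent:arcimun:main waitedMs=44826 queueAhead=1",
];

// Keep only the parsed waits that exceed the 35 s announceTimeoutMs.
const timedOutWaits = mitigationLogs
  .map((line) => parseLaneWait(line))
  .filter((wait): wait is LaneWait => wait !== null)
  .filter((wait) => exceedsTimeout(wait, 35_000));
```

All three logged waits end up in `timedOutWaits`, which is consistent with retries continuing after the timeout was lowered.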

Proposed Fix

Option A: Session-busy-aware dispatch (minimal change)

In runSubagentAnnounceDispatch(), when expectsCompletionMessage is true, check if the parent session has an active embedded run. If yes, try queue first:

if (params.expectsCompletionMessage) {
  if (params.parentSessionBusy) {
    // Queue first — parent is processing another announce
    const queueOutcome = await params.queue();
    if (mapQueueOutcomeToDeliveryResult(queueOutcome).delivered) {
      return withPhases(mapQueueOutcomeToDeliveryResult(queueOutcome));
    }
    // Fall through to direct if queue fails
  }
  const primaryDirect = await params.direct();
  // ... existing fallback to queue
}
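
To make the resulting attempt order explicit, here is a minimal sketch that only computes which delivery paths would be tried and in what order, without performing any delivery. `planDispatchOrder` and its types are hypothetical. The fallback from queue to direct on the `expectsCompletionMessage: false` path is an assumption, since the issue only says that path is queue-primary.

```typescript
type DispatchStep = "direct" | "queue";

interface DispatchFlags {
  expectsCompletionMessage: boolean;
  // Hypothetical flag; the issue does not specify how it is computed.
  parentSessionBusy: boolean;
}

// Returns the order in which delivery paths would be attempted.
// Each later step runs only if the earlier ones fail to deliver.
function planDispatchOrder(flags: DispatchFlags): DispatchStep[] {
  if (!flags.expectsCompletionMessage) {
    // Existing queue-primary path (the direct fallback is assumed here).
    return ["queue", "direct"];
  }
  if (flags.parentSessionBusy) {
    // Option A: queue first, then the existing direct-then-queue sequence.
    return ["queue", "direct", "queue"];
  }
  // Unchanged behavior when the parent session is idle.
  return ["direct", "queue"];
}
```

Only the busy case changes, which is why the issue describes Option A as the minimal change.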

Option B: Dedicated announce lane (architectural)

Add a separate CommandLane.Announce that processes announce messages independently from user messages. This prevents announce contention with both user traffic AND sibling announces.
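
The idea can be sketched as one FIFO queue per lane, so that work queued on one lane never counts toward the wait on another. This is a data-structure illustration only: `LaneQueues` and the lane names are hypothetical, and it models neither concurrency nor OpenClaw's actual `CommandLane` implementation.

```typescript
// Lane names are illustrative; the issue only proposes adding an announce lane.
type LaneName = "main" | "announce";

class LaneQueues<T> {
  private readonly lanes: Record<LaneName, T[]> = { main: [], announce: [] };

  // Adds an item to a lane and returns how many items were already ahead of it.
  enqueue(lane: LaneName, item: T): number {
    const queue = this.lanes[lane];
    queue.push(item);
    return queue.length - 1;
  }

  // Removes and returns the next item from a lane, or undefined if it is empty.
  dequeue(lane: LaneName): T | undefined {
    return this.lanes[lane].shift();
  }
}

const laneDemo = new LaneQueues<string>();
laneDemo.enqueue("main", "user message 1");
laneDemo.enqueue("main", "user message 2");
laneDemo.enqueue("main", "user message 3");

// The announce lane is empty, so nothing is ahead of this item even though
// three items are waiting on the main lane.
const announceQueueAhead = laneDemo.enqueue("announce", "subagent result");
```

Note that sibling announces would still queue behind each other within the announce lane, so the second half of the claim above depends on how quickly that lane drains.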

Option C: Batch collect-then-announce

When parent spawns N subagents, set up a collector that waits for all N to complete (or timeout), then delivers results as a single batch. This eliminates N-1 announce retries entirely.
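
A minimal sketch of such a collector follows. `AnnounceCollector` and `WorkerResult` are hypothetical names. The timeout is represented by a `flush()` method that a caller could invoke from a timer, so the example stays synchronous.

```typescript
interface WorkerResult {
  sessionKey: string;
  reply: string;
}

class AnnounceCollector {
  private readonly results: WorkerResult[] = [];
  private delivered = false;

  constructor(
    private readonly expected: number,
    private readonly deliver: (batch: WorkerResult[]) => void,
  ) {}

  // Records one worker's result and delivers the batch once all have reported.
  add(result: WorkerResult): void {
    if (this.delivered) return;
    this.results.push(result);
    if (this.results.length >= this.expected) {
      this.flush();
    }
  }

  // Delivers whatever has been collected so far, e.g. when a timeout fires.
  // Only the first call delivers; later calls are ignored.
  flush(): void {
    if (this.delivered) return;
    this.delivered = true;
    this.deliver(this.results.slice());
  }
}
```

The parent receives a single announce for the whole batch, whether it is complete or partial. In this sketch, results that arrive after a flush are dropped, which a real implementation would need to handle deliberately.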

Environment

  • OpenClaw version: 2026.3.28 (f9b1079)
  • OS: macOS Darwin 25.3.0
  • Channel: Telegram
  • Node: v22.17.1 (upgraded from v25.8.1)

extent analysis

Fix Plan

To address the issue, we will implement Option A: Session-busy-aware dispatch. This involves modifying the runSubagentAnnounceDispatch() function to check if the parent session has an active embedded run when expectsCompletionMessage is true. If the parent session is busy, we will try the queue first.

Steps:

1. **Update the `runSubagentAnnounceDispatch()` function**:

```typescript
if (params.expectsCompletionMessage) {
  if (params.parentSessionBusy) {
    // Queue first: parent is processing another announce
    const queueOutcome = await params.queue();
    if (mapQueueOutcomeToDeliveryResult(queueOutcome).delivered) {
      return withPhases(mapQueueOutcomeToDeliveryResult(queueOutcome));
    }
    // Fall through to direct if queue fails
  }
  const primaryDirect = await params.direct();
  // ... existing fallback to queue
}
```

2. **Implement the `parentSessionBusy` check**:
   You need to implement a method to check if the parent session has an active embedded run. This can be done by querying the session's current state or by maintaining a flag that indicates whether the session is busy.

#### Example Code:
```typescript
// Assuming a getSessionState() function that returns the session's state
const parentSessionBusy = async (sessionId: string) => {
  const sessionState = await getSessionState(sessionId);
  return sessionState.activeRun !== null;
};
```

3. **Integrate the `parentSessionBusy` check into `runSubagentAnnounceDispatch()`**:

```typescript
if (params.expectsCompletionMessage) {
  const isParentSessionBusy = await parentSessionBusy(params.parentSessionId);
  if (isParentSessionBusy) {
    // Queue first: parent is processing another announce
    const queueOutcome = await params.queue();
    if (mapQueueOutcomeToDeliveryResult(queueOutcome).delivered) {
      return withPhases(mapQueueOutcomeToDeliveryResult(queueOutcome));
    }
    // Fall through to direct if queue fails
  }
  const primaryDirect = await params.direct();
  // ... existing fallback to queue
}
```

### Verification
To verify that the fix worked, you can:
* Monitor the announce timeout and retry rates to ensure they have decreased.
* Check the session logs to confirm that the queue is being used first when the parent session is busy.
* Test the system with multiple subagents and verify that all results are delivered successfully without excessive retries.

### Extra Tips
* Consider implementing **Option B: Dedicated announce lane** or **Option C: Batch collect-then-announce** for a more architectural solution to the problem.
* Make sure to thoroughly test the changes in a staging environment before deploying them to production.
* Keep an eye on the system's performance and adjust the `announceTimeoutMs` value as needed to ensure optimal performance.
