openclaw - ✅(Solved) Fix Subagent announce: swap dispatch order to queue-first when parent session is busy [1 pull requests, 1 participants]

arcimun · 2026-03-30T20:05:35Z


openclaw · 2026-03-30 20:05:35


openclaw/openclaw#57916 · Fetched 2026-04-08 01:56:11


Author: arcimun
Participants: arcimun
Timeline (top): cross-referenced ×1

Subagent announce completion messages use direct-primary dispatch (via callGateway) as the first attempt for expectsCompletionMessage: true. When the parent session has an active run, the direct call queues into the session's command lane and blocks until the lane is free — often timing out.

Proposed fix: Check if the parent session has an active run. If yes, swap dispatch order to queue-primary first (like expectsCompletionMessage: false path).

Relates to: #45075, #54276, #53202, #38300, #44925

Root Cause

In src/agents/subagent-announce-dispatch.ts:

// Current behavior for expectsCompletionMessage: true
const primaryDirect = await params.direct();  // ← blocks on lane
if (primaryDirect.delivered) return;
const fallbackQueue = await params.queue();   // ← only tries queue after direct fails

The direct() call uses callGateway(method: "agent") which enqueues into the parent session's command lane. When the parent is processing another announce (from a sibling subagent), the lane is occupied → timeout → retry → timeout → ...
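To make the contention concrete, here is a minimal sketch of a strictly serial lane. It is a toy model, not OpenClaw's lane implementation: `LaneItem`, `waitTimes`, `timedOut`, and the 20-second durations are all hypothetical. It illustrates that each announce waits for the sum of the run times ahead of it, so later siblings can exceed a fixed announce timeout even when every individual run is short.

```typescript
// Toy model of a strictly serial command lane (hypothetical; not OpenClaw code).
interface LaneItem {
  label: string;
  durationMs: number; // assumed time the parent spends processing this item
}

// Wait time of each item = total duration of everything queued ahead of it.
function waitTimes(items: LaneItem[]): number[] {
  const waits: number[] = [];
  let elapsed = 0;
  for (const item of items) {
    waits.push(elapsed);
    elapsed += item.durationMs;
  }
  return waits;
}

// Labels of items whose lane wait exceeds the caller's timeout.
function timedOut(items: LaneItem[], timeoutMs: number): string[] {
  const late: string[] = [];
  let elapsed = 0;
  for (const item of items) {
    if (elapsed > timeoutMs) late.push(item.label);
    elapsed += item.durationMs;
  }
  return late;
}

const siblings: LaneItem[] = [
  { label: "announce-1", durationMs: 20000 },
  { label: "announce-2", durationMs: 20000 },
  { label: "announce-3", durationMs: 20000 },
  { label: "announce-4", durationMs: 20000 },
];

console.log(waitTimes(siblings)); // [0, 20000, 40000, 60000]
console.log(timedOut(siblings, 35000)); // ["announce-3", "announce-4"]
```

In this model a timed-out announce that retries simply rejoins the same lane, which is consistent with the timeout and retry loop described above.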


PR fix notes

PR #56822: feat(agents): add opt-in sessions_await tool for parallel sub-agent orchestration

Repository: openclaw/openclaw
Author: tonga54
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/56822

Description (problem / solution / changelog)

Summary

Problem: Orchestrator agents that spawn multiple parallel sub-agents have no reliable way to block until all workers complete. The only option today is polling sessions_list in a loop, which LLMs do unreliably — they skip ahead with partial results despite prompt engineering.
Why it matters: Fan-out/fan-in orchestration (main -> orchestrator -> N workers) is one of the most requested subagent patterns. Without a blocking primitive, orchestrators produce incomplete or inconsistent results.
What changed: Adds a sessions_await tool and suppressAnnounce/waitForCompletion parameters to sessions_spawn, gated behind agents.defaults.subagents.awaitEnabled (opt-in, off by default).
What did NOT change (scope boundary): No behavioral change for existing users. All new functionality requires explicit config opt-in. Registry, lifecycle, and run-manager internals gain the suppressAutoAnnounce field but it is inert unless the config flag is enabled.

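Change Type (select all)

Checked: Feature
Unchecked: Bug fix, Refactor required for the fix, Docs, Security hardening, Chore/infra

Scope (select all touched areas)

Checked: Gateway / orchestration, Skills / tool execution, API / contracts
Unchecked: Auth / tokens, Memory / storage, Integrations, UI / DX, CI/CD / infra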

Linked Issue/PR

Closes #31499
Related #38522
Related #38433
Related #30767
[ ] This PR fixes a bug or regression (left unchecked)

Root Cause / Regression History (if applicable)

N/A

Regression Test Plan (if applicable)

N/A — new feature, not a regression fix.

User-visible / Behavior Changes

When agents.defaults.subagents.awaitEnabled: true is set in openclaw.json:

1. A new sessions_await tool becomes available to agents, accepting an array of session keys and an optional timeout.
2. sessions_spawn gains suppressAnnounce and waitForCompletion boolean parameters.
3. The system prompt includes guidance on the spawn+await pattern for parallel work.

When the flag is unset or false (default): no changes to behavior.

Configuration

{
  "agents": {
    "defaults": {
      "subagents": {
        "awaitEnabled": true
      }
    }
  }
}
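As an illustration of the opt-in gating, the sketch below reads the flag with a default of off. The `PartialConfig` type and `isAwaitEnabled` helper are hypothetical and model only the config path shown above; the PR's actual schema and lookup code may differ.

```typescript
// Hypothetical, partial shape of openclaw.json; only the path from the PR is modeled.
interface PartialConfig {
  agents?: {
    defaults?: {
      subagents?: {
        awaitEnabled?: boolean;
      };
    };
  };
}

// Treat anything other than an explicit `true` as disabled (opt-in, off by default).
function isAwaitEnabled(config: PartialConfig): boolean {
  return config.agents?.defaults?.subagents?.awaitEnabled === true;
}

console.log(isAwaitEnabled({})); // false
console.log(
  isAwaitEnabled({ agents: { defaults: { subagents: { awaitEnabled: true } } } }),
); // true
```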

Usage pattern

The agent spawns multiple workers with suppressAnnounce: true, then calls sessions_await with all session keys to block until every result is ready:

sessions_spawn(task="analyze file A", suppressAnnounce=true) -> key1
sessions_spawn(task="analyze file B", suppressAnnounce=true) -> key2
sessions_await(sessionKeys=[key1, key2], timeoutSeconds=300)
-> { status: "ok", results: [{ key1, reply: "..." }, { key2, reply: "..." }] }

Diagram (if applicable)

                       ┌─────────────┐
                       │ Parent Agent │
                       └──────┬──────┘
                              │
            ┌─────────────────┼─────────────────┐
            │ spawn(suppress) │ spawn(suppress)  │
            ▼                 │                  ▼
     ┌──────────┐             │           ┌──────────┐
     │ Worker 1 │             │           │ Worker 2 │
     └────┬─────┘             │           └────┬─────┘
          │                   │                │
          ▼                   ▼                ▼
       complete    sessions_await([k1,k2])  complete
                         │
                         ▼
                  { results: [...] }

Security Impact (required)

New permissions/capabilities? No — tools only available when config flag is on
Secrets/tokens handling changed? No
New/changed network calls? No — uses existing agent.wait gateway RPC
Command/tool execution surface changed? Yes — new sessions_await tool (gated)
Data access scope changed? No
The tool reads run results already available in the subagent registry; no new data surface is exposed.

Repro + Verification

Environment

OS: macOS 15.4 (Apple Silicon)
Runtime: Node 22, Bun 1.2
Model: any model with tool-calling support

Steps

1. Set agents.defaults.subagents.awaitEnabled: true in openclaw.json
2. Ask the agent to "analyze files A, B, and C in parallel"
3. Agent spawns 3 workers with suppressAnnounce: true
4. Agent calls sessions_await with all 3 session keys
5. All results collected before agent responds

Expected

Agent blocks until all 3 workers complete, then synthesizes a combined response.

Actual

Same as expected.

Evidence

Passing scoped tests: sessions-await-tool.test.ts (4/4), announce-loop-guard.test.ts (6/6)
pnpm build green
pnpm check green (lint, format, types, all boundary checks)

Human Verification (required)

Verified: tool registration gated behind config flag, system prompt conditional on tool presence, spawn params only exposed when flag is on
Edge cases: empty keys, unknown keys, timeout, mixed completed/active/unknown sessions
Not verified: multi-gateway distributed scenario, production load

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes — off by default, zero behavioral change without opt-in
Config/env changes? Yes — new optional agents.defaults.subagents.awaitEnabled boolean
Migration needed? No

Risks and Mitigations

Risk: sessions_await could hold an agent turn open for the full timeout duration if workers stall
- Mitigation: Configurable timeout (default 300s), partial results returned on timeout
Risk: suppressAutoAnnounce could silently drop completion messages if sessions_await is never called
- Mitigation: Only set when explicitly requested via tool param; existing announce flow is the default


Changed files

src/agents/openclaw-tools.sessions.test.ts (modified, +76/-0)
src/agents/openclaw-tools.ts (modified, +15/-1)
src/agents/subagent-registry-lifecycle.ts (modified, +5/-0)
src/agents/subagent-registry-run-manager.ts (modified, +2/-0)
src/agents/subagent-registry.announce-loop-guard.test.ts (modified, +36/-0)
src/agents/subagent-registry.test.ts (modified, +30/-0)
src/agents/subagent-registry.ts (modified, +43/-0)
src/agents/subagent-registry.types.ts (modified, +5/-0)
src/agents/subagent-spawn.test.ts (modified, +228/-10)
src/agents/subagent-spawn.ts (modified, +140/-2)
src/agents/system-prompt.ts (modified, +5/-0)
src/agents/tools/sessions-await-tool.test.ts (added, +562/-0)
src/agents/tools/sessions-await-tool.ts (added, +320/-0)
src/agents/tools/sessions-spawn-tool.test.ts (modified, +194/-1)
src/agents/tools/sessions-spawn-tool.ts (modified, +76/-38)
src/config/schema.base.generated.ts (modified, +10/-0)
src/config/types.agent-defaults.ts (modified, +6/-0)
src/config/types.agents.ts (modified, +7/-0)
src/config/zod-schema.agent-defaults.ts (modified, +6/-0)
src/config/zod-schema.agent-runtime.ts (modified, +6/-0)


Summary

Proposed fix: Check if the parent session has an active run. If yes, swap dispatch order to queue-primary first (like expectsCompletionMessage: false path).

Relates to: #45075, #54276, #53202, #38300, #44925

Evidence (Real-world reproduction)

Setup

OpenClaw 2026.3.28, macOS, Telegram channel
5 parallel subagents doing research (NIM/Groq models)
Parent agent (Arcimun) on claude-sonnet-4-6

Before fix (announceTimeoutMs=120s)

lane=session:agent:arcimun:telegram:direct:50234619 waitedMs=255308 queueAhead=3

5 subagents × 4 retries × 120s timeout = 40+ min total gateway blocking
Gateway auto-restarted via watchdog after 317s stuck session
After restart: context overflow (76%) → full model fallback cascade → all 6 models timeout
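The 40-minute figure follows from simple multiplication. The sketch below spells it out under the assumption that every announce attempt runs to its full timeout and that attempts are serialized; `worstCaseBlockingMs` is a hypothetical helper, not part of OpenClaw.

```typescript
// Worst-case blocking if every attempt runs to its full timeout and attempts
// do not overlap (assumption; real attempts may finish early or overlap).
function worstCaseBlockingMs(
  subagents: number,
  attemptsPerAnnounce: number,
  announceTimeoutMs: number,
): number {
  return subagents * attemptsPerAnnounce * announceTimeoutMs;
}

const totalMs = worstCaseBlockingMs(5, 4, 120000);
console.log(totalMs); // 2400000
console.log(totalMs / 60000); // 40 (minutes)
```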

After partial mitigation (announceTimeoutMs=35s, expanded fallbacks)

lane=session:agent:arcimun:main waitedMs=64358 queueAhead=3
lane=session:agent:arcimun:main waitedMs=56507 queueAhead=2
lane=session:agent:arcimun:main waitedMs=44826 queueAhead=1

Announce retries reduced from 20+ to 8
All 5 subagents still delivered results successfully
But announce retries still happen because lane wait (44-64s) > announceTimeoutMs (35s)
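To check the last point against the log lines above, a small parser can compare each `waitedMs` value with the configured timeout. This is a sketch: `parseLaneWait` and `LaneWait` are hypothetical names, and the regular expression assumes the log fields appear exactly in the order shown in this issue.

```typescript
interface LaneWait {
  lane: string;
  waitedMs: number;
  queueAhead: number;
}

// Parse one "lane=... waitedMs=... queueAhead=..." line; returns null if it does not match.
function parseLaneWait(line: string): LaneWait | null {
  const match = /lane=(\S+)\s+waitedMs=(\d+)\s+queueAhead=(\d+)/.exec(line);
  if (!match) return null;
  const [, lane = "", waited = "0", ahead = "0"] = match;
  return { lane, waitedMs: Number(waited), queueAhead: Number(ahead) };
}

const logLines = [
  "lane=session:agent:arcimun:main waitedMs=64358 queueAhead=3",
  "lane=session:agent:arcimun:main waitedMs=56507 queueAhead=2",
  "lane=session:agent:arcimun:main waitedMs=44826 queueAhead=1",
];

const announceTimeoutMs = 35000;
const overTimeout = logLines
  .map(parseLaneWait)
  .filter((entry): entry is LaneWait => entry !== null)
  .filter((entry) => entry.waitedMs > announceTimeoutMs);

console.log(overTimeout.length); // 3: every logged wait exceeds the 35s timeout
```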


Proposed Fix

Option A: Session-busy-aware dispatch (minimal change)

In runSubagentAnnounceDispatch(), when expectsCompletionMessage is true, check if the parent session has an active embedded run. If yes, try queue first:

if (params.expectsCompletionMessage) {
  if (params.parentSessionBusy) {
    // Queue first — parent is processing another announce
    const queueOutcome = await params.queue();
    if (mapQueueOutcomeToDeliveryResult(queueOutcome).delivered) {
      return withPhases(mapQueueOutcomeToDeliveryResult(queueOutcome));
    }
    // Fall through to direct if queue fails
  }
  const primaryDirect = await params.direct();
  // ... existing fallback to queue
}
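The snippet above depends on OpenClaw internals, so the ordering logic can be easier to follow in isolation. The following is a self-contained, synchronous sketch of the same idea; `Channel`, `planAttempts`, and `dispatch` are hypothetical names, and the real `direct()` and `queue()` calls are asynchronous.

```typescript
type Channel = "direct" | "queue";

interface DeliveryResult {
  delivered: boolean;
  path: Channel | "none";
}

// Attempt order for the expectsCompletionMessage: true path, per Option A:
// a busy parent gets queue first, then the existing direct-then-queue sequence.
function planAttempts(parentSessionBusy: boolean): Channel[] {
  return parentSessionBusy ? ["queue", "direct", "queue"] : ["direct", "queue"];
}

// Try each channel in order and stop at the first one that delivers.
function dispatch(
  parentSessionBusy: boolean,
  send: Record<Channel, () => boolean>,
): DeliveryResult {
  for (const channel of planAttempts(parentSessionBusy)) {
    if (send[channel]()) return { delivered: true, path: channel };
  }
  return { delivered: false, path: "none" };
}

// Busy parent: the queue delivers, so the blocking direct call is never made.
const calls: Channel[] = [];
const result = dispatch(true, {
  direct: () => {
    calls.push("direct");
    return false;
  },
  queue: () => {
    calls.push("queue");
    return true;
  },
});
console.log(result.path, calls); // queue ["queue"]
```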

Option B: Dedicated announce lane (architectural)

Add a separate CommandLane.Announce that processes announce messages independently from user messages. This prevents announce contention with both user traffic AND sibling announces.
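A toy model of the isolation this option aims for is sketched below. `LaneSet` and the lane names are hypothetical; only the idea of a separate announce lane comes from the proposal. The sketch demonstrates isolation from user traffic only, since the issue does not specify how the proposed lane would order sibling announces.

```typescript
// Toy lane registry: each named lane has its own FIFO queue (hypothetical model).
class LaneSet {
  private readonly queues = new Map<string, string[]>();

  // Enqueue an item and return how many items were already ahead of it in that lane.
  enqueue(lane: string, itemId: string): number {
    const queue = this.queues.get(lane) ?? [];
    const queueAhead = queue.length;
    queue.push(itemId);
    this.queues.set(lane, queue);
    return queueAhead;
  }
}

// Shared lane: the announce lands behind pending user messages.
const shared = new LaneSet();
shared.enqueue("main", "user-msg-1");
shared.enqueue("main", "user-msg-2");
console.log(shared.enqueue("main", "announce-1")); // 2

// Dedicated lane: the same announce has nothing from user traffic ahead of it.
const dedicated = new LaneSet();
dedicated.enqueue("main", "user-msg-1");
dedicated.enqueue("main", "user-msg-2");
console.log(dedicated.enqueue("announce", "announce-1")); // 0
```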

Option C: Batch collect-then-announce

When parent spawns N subagents, set up a collector that waits for all N to complete (or timeout), then delivers results as a single batch. This eliminates N-1 announce retries entirely.
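One possible shape for such a collector is sketched below. `AnnounceCollector`, `WorkerResult`, and the `deliver` callback are hypothetical names for illustration, not existing OpenClaw APIs. The timeout is represented by an explicit `flush()` call; wiring it to a timer is left out.

```typescript
interface WorkerResult {
  sessionKey: string;
  reply: string;
}

// Collects completions for a fixed set of session keys and delivers them once.
class AnnounceCollector {
  private readonly expected: string[];
  private readonly deliver: (batch: WorkerResult[]) => void;
  private readonly replies = new Map<string, string>();
  private flushed = false;

  constructor(expectedKeys: string[], deliver: (batch: WorkerResult[]) => void) {
    this.expected = Array.from(new Set(expectedKeys));
    this.deliver = deliver;
  }

  // Record one completion; deliver the batch when every expected key has reported.
  complete(sessionKey: string, reply: string): void {
    if (this.flushed || !this.expected.includes(sessionKey)) return;
    this.replies.set(sessionKey, reply);
    if (this.replies.size === this.expected.length) this.flush();
  }

  // Deliver whatever has arrived so far (call this on timeout). Runs at most once.
  flush(): void {
    if (this.flushed) return;
    this.flushed = true;
    const batch: WorkerResult[] = [];
    for (const sessionKey of this.expected) {
      const reply = this.replies.get(sessionKey);
      if (reply !== undefined) batch.push({ sessionKey, reply });
    }
    this.deliver(batch);
  }
}

const batches: WorkerResult[][] = [];
const collector = new AnnounceCollector(["k1", "k2", "k3"], (batch) => {
  batches.push(batch);
});
collector.complete("k1", "A done");
collector.complete("k2", "B done");
console.log(batches.length); // 0: still waiting for k3
collector.complete("k3", "C done");
console.log(batches.length); // 1: a single batch with three results
```

With this shape the parent receives one delivery per fan-out instead of one per worker, which is the property Option C relies on.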

Environment

OpenClaw version: 2026.3.28 (f9b1079)
OS: macOS Darwin 25.3.0
Channel: Telegram
Node: v22.17.1 (upgraded from v25.8.1)

extent analysis

Fix Plan

To address the issue, we will implement Option A: Session-busy-aware dispatch. This involves modifying the runSubagentAnnounceDispatch() function to check if the parent session has an active embedded run when expectsCompletionMessage is true. If the parent session is busy, we will try the queue first.

Steps:

1. **Update `runSubagentAnnounceDispatch()`**:

```typescript
if (params.expectsCompletionMessage) {
  if (params.parentSessionBusy) {
    // Queue first: parent is processing another announce
    const queueOutcome = await params.queue();
    if (mapQueueOutcomeToDeliveryResult(queueOutcome).delivered) {
      return withPhases(mapQueueOutcomeToDeliveryResult(queueOutcome));
    }
    // Fall through to direct if queue fails
  }
  const primaryDirect = await params.direct();
  // ... existing fallback to queue
}
```

2. **Implement `parentSessionBusy` check**:
   Implement a check for whether the parent session has an active embedded run, either by querying the session's current state or by maintaining a flag that marks the session as busy.

#### Example Code:
```typescript
// Assuming a getSessionState() function that returns the session's state
const parentSessionBusy = async (sessionId: string) => {
  const sessionState = await getSessionState(sessionId);
  return sessionState.activeRun !== null;
};
```

3. **Integrate the `parentSessionBusy` check into `runSubagentAnnounceDispatch()`**:

```typescript
if (params.expectsCompletionMessage) {
  const isParentSessionBusy = await parentSessionBusy(params.parentSessionId);
  if (isParentSessionBusy) {
    // Queue first: parent is processing another announce
    const queueOutcome = await params.queue();
    if (mapQueueOutcomeToDeliveryResult(queueOutcome).delivered) {
      return withPhases(mapQueueOutcomeToDeliveryResult(queueOutcome));
    }
    // Fall through to direct if queue fails
  }
  const primaryDirect = await params.direct();
  // ... existing fallback to queue
}
```


### Verification
To verify that the fix worked, you can:
* Monitor the announce timeout and retry rates to ensure they have decreased.
* Check the session logs to confirm that the queue is being used first when the parent session is busy.
* Test the system with multiple subagents and verify that all results are delivered successfully without excessive retries.

### Extra Tips
* Consider implementing **Option B: Dedicated announce lane** or **Option C: Batch collect-then-announce** for a more architectural solution to the problem.
* Make sure to thoroughly test the changes in a staging environment before deploying them to production.
* Keep an eye on the system's performance and adjust the `announceTimeoutMs` value as needed to ensure optimal performance.
