openclaw - ✅(Solved) Fix Bug: Dreaming narrative spawns unbounded concurrent subagent sessions across workspaces [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#73198, fetched 2026-04-29 06:22:17
Comments: 0
Participants: 1
Timeline: 4
Reactions: 0
Timeline (top): referenced ×2, closed ×1, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #73287: fix(memory-core): cap concurrent detached dream-narrative subagents (#73198)

Description (problem / solution / changelog)

Summary

Fixes #73198. The cron-driven path through light / rem / deep dreaming detaches narrative generation per workspace via queueMicrotask + fire-and-forget, so a 10-workspace cron sweep would fan out into 30+ simultaneous narrative subagent runs. Each one holds the session write-lock while it runs, which the issue reproduced as 30 s+ lock-hold warnings (releasing lock held for 30396ms (max=15000ms)) and cascading narrative timeouts.

This PR adds a single shared FIFO queue with a cap of DETACHED_NARRATIVE_CONCURRENCY = 3 inside extensions/memory-core/src/dreaming-narrative.ts, exposed as runDetachedDreamNarrative. The three existing detached-narrative call sites (deep in dreaming.ts; light + REM in dreaming-phases.ts) now go through that helper instead of an open queueMicrotask. The synchronous (non-detached) branches are unchanged — those callers explicitly want a serialized await today.

The cap is intentionally module-local (no new config surface): the limit applies across all phases and workspaces in a single cron sweep, which is exactly the failure pattern in the report. If the maintainers want it configurable later, hoisting the constant is a one-line follow-up.
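
The PR's actual limiter is not reproduced on this page, so the following is only a minimal sketch of how a module-local FIFO cap of this kind could look. The names DETACHED_NARRATIVE_CONCURRENCY, acquireDetachedNarrativeSlot, releaseDetachedNarrativeSlot and runDetachedDreamNarrative come from the PR description; the function bodies, and the choice to pass the work as a `generate` callback, are assumptions made for illustration.

```typescript
// Hypothetical sketch: identifiers mirror the PR description, bodies are assumed.
const DETACHED_NARRATIVE_CONCURRENCY = 3;

let activeSlots = 0;
const waiters: Array<() => void> = [];

function acquireDetachedNarrativeSlot(): Promise<void> {
  if (activeSlots < DETACHED_NARRATIVE_CONCURRENCY) {
    activeSlots += 1;
    return Promise.resolve();
  }
  // At capacity: park the caller until a running task hands over its slot.
  return new Promise<void>((resolve) => {
    waiters.push(resolve);
  });
}

function releaseDetachedNarrativeSlot(): void {
  const next = waiters.shift();
  if (next) {
    // Hand the slot straight to the oldest waiter (FIFO); activeSlots is unchanged.
    next();
  } else {
    activeSlots -= 1;
  }
}

// Fire-and-forget entry point: returns immediately and never rejects.
// The `generate` callback is an assumed signature, not the PR's real one.
function runDetachedDreamNarrative(generate: () => Promise<void>): void {
  void (async () => {
    await acquireDetachedNarrativeSlot();
    try {
      await generate();
    } catch {
      // Swallowed on purpose: per the PR notes, logging happens inside the generator.
    } finally {
      releaseDetachedNarrativeSlot();
    }
  })();
}
```

Because the slot is handed directly to the oldest waiter on release, queued narratives start in the order they were submitted, and a failing narrative still frees its slot through the `finally` branch.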

Changes

File | What
extensions/memory-core/src/dreaming-narrative.ts | new: runDetachedDreamNarrative + module-local FIFO concurrency limiter (acquireDetachedNarrativeSlot / releaseDetachedNarrativeSlot). Errors from the underlying generation are intentionally swallowed: logging already happens inside generateAndAppendDreamNarrative, and surfacing here would only produce unhandled rejections from a fire-and-forget cron path.
extensions/memory-core/src/dreaming-narrative.test.ts | two new tests: (1) firing 5 detached narratives only lets 3 reach subagent.run before the cap holds; (2) a rejected subagent.run does not produce an unhandledRejection.
extensions/memory-core/src/dreaming.ts | replace the deep-phase queueMicrotask(() => void generateAndAppendDreamNarrative(...).catch(...)) block with runDetachedDreamNarrative(...).
extensions/memory-core/src/dreaming-phases.ts | same swap for the light and REM detached call sites.
 4 files changed, 182 insertions(+), 32 deletions(-)

Tests

  • pnpm test extensions/memory-core/src/dreaming-narrative.test.ts — 44 pass (2 new)
  • pnpm test extensions/memory-core/src/dreaming-phases.test.ts — 30 pass
  • pnpm test extensions/memory-core/src/dreaming-command.test.ts — 8 pass
  • pnpm exec oxfmt --check — clean
  • pnpm exec oxlint — 0 warnings, 0 errors
  • pnpm tsgo -p tsconfig.core.json — clean

Context

Closes #73198.

The reporter suggested either p-limit or a hand-rolled semaphore; this follows the hand-rolled FIFO pattern already used at extensions/openshell/src/mirror.ts:13 (createConcurrencyLimiter) so the repo doesn't grow a new dependency for a single use site.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/memory-core/src/dreaming-narrative.test.ts (modified, +115/-0)
  • extensions/memory-core/src/dreaming-narrative.ts (modified, +50/-0)
  • extensions/memory-core/src/dreaming-phases.ts (modified, +21/-21)
  • extensions/memory-core/src/dreaming.test.ts (modified, +8/-4)
  • extensions/memory-core/src/dreaming.ts (modified, +13/-11)
  • test/vitest/vitest.unit-fast-paths.mjs (modified, +0/-1)

Issue body

Problem

The dreaming narrative system spawns unbounded concurrent subagent sessions when processing multiple workspaces, causing resource exhaustion, session lock contention, and cascading timeouts.

Root Cause

In dreaming.ts, when trigger === "cron", detachNarratives is set to true. This causes narrative generation to be fired via queueMicrotask without awaiting the result:

// dreaming.ts: runLightDreaming (same pattern in runRemDreaming, runDeepDreaming)
if (params.detachNarratives) {
    queueMicrotask(() => {
        generateAndAppendDreamNarrative({
            subagent: params.subagent,
            workspaceDir: params.workspaceDir,
            data,
            // ... (remaining arguments elided in the issue)
        }).catch(() => void 0);  // errors silently swallowed
    });
}

The workspace loop itself is sequential (for...of with await), but because each workspace's narrative generation is detached and not awaited, ALL narrative subagent sessions from ALL workspaces pile up simultaneously.
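
To make the pile-up concrete, here is a small self-contained simulation. It does not use any OpenClaw code: `simulateCronSweep`, `fakeNarrative` and the timer-based delays are hypothetical stand-ins, and the awaited phase work is reduced to a resolved promise. It only illustrates that a sequential loop gives no back-pressure once the slow part is detached.

```typescript
// Illustrative only: timers stand in for real subagent runs.
const PHASES_PER_WORKSPACE = 3; // light, rem, deep

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function simulateCronSweep(
  workspaceCount: number,
  narrativeMs = 50,
): Promise<number> {
  let inFlight = 0;
  let peak = 0;
  const detached: Promise<void>[] = [];

  // Stand-in for generateAndAppendDreamNarrative: slow work that tracks concurrency.
  const fakeNarrative = async (): Promise<void> => {
    inFlight += 1;
    peak = Math.max(peak, inFlight);
    await sleep(narrativeMs);
    inFlight -= 1;
  };

  for (let workspace = 0; workspace < workspaceCount; workspace++) {
    for (let phase = 0; phase < PHASES_PER_WORKSPACE; phase++) {
      // The phase itself is awaited, so the loop is sequential...
      await Promise.resolve();
      // ...but its narrative is detached, so the loop never waits for it.
      detached.push(
        new Promise<void>((resolve) => {
          queueMicrotask(() => {
            fakeNarrative().then(
              () => resolve(),
              () => resolve(), // mirror the swallowed errors of the original
            );
          });
        }),
      );
    }
  }

  // Collected only so the simulation can report the peak once everything settles.
  await Promise.all(detached);
  return peak;
}
```

In this simulation every narrative is started before any of them finishes, so the reported peak equals 3 × workspaceCount. Real phases take time to run, so the real overlap depends on how long each narrative lasts relative to the phase work.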

Impact

With N workspaces and 3 phases (light/rem/deep), a single sweep spawns up to 3N concurrent subagent sessions, and nothing caps that number as N grows.

Observed with 11 workspaces:

Metric | Value
Subagent sessions spawned | ~33 simultaneously
Narrative timeout (NARRATIVE_TIMEOUT_MS) | 15s
Sessions holding sessions.json.lock | Up to 30,438ms (limit: 15,000ms)
Narrative phase timeouts | Nearly all phases timed out
Node.js file descriptor GC warnings | Resource leak observed
Memory embedding sync | Timed out after 120s

Log evidence

03:01:09 [session-write-lock] releasing lock held for 30396ms (max=15000ms)
03:01:09 narrative generation ended with status=timeout for rem phase
03:01:09 narrative generation ended with status=timeout for deep phase
03:01:09 narrative generation ended with status=timeout for light phase
03:01:09 narrative generation ended with status=timeout for rem phase
... (repeats for each workspace)
03:09:31 [memory] sync failed: memory embeddings batch timed out after 120s

Suggested Fix

Add a concurrency limiter (semaphore/p-limit) to bound the number of concurrent narrative subagent sessions. For example:

import pLimit from 'p-limit';
const NARRATIVE_CONCURRENCY_LIMIT = 3; // or configurable
const limit = pLimit(NARRATIVE_CONCURRENCY_LIMIT);

// Instead of fire-and-forget:
if (params.detachNarratives) {
    limit(() => generateAndAppendDreamNarrative({ ... }))
        .catch(() => void 0);
}

Additional improvements:

  • Increase NARRATIVE_TIMEOUT_MS from 15s to 30-60s for multi-workspace scenarios
  • Log detached narrative failures instead of swallowing them with .catch(() => void 0)
  • Consider processing workspaces in batches rather than firing all at once
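
The logging and batching suggestions above can be sketched together. This is a hypothetical helper, not code from OpenClaw: `runInBatches` and its `logFailure` parameter are names invented for illustration, and the default logger simply writes to `console.warn`.

```typescript
// Hypothetical helper: run tasks in fixed-size batches and log failures
// instead of swallowing them. Returns the number of failed tasks.
async function runInBatches<T>(
  items: readonly T[],
  batchSize: number,
  task: (item: T) => Promise<void>,
  logFailure: (item: T, error: unknown) => void = (item, error) =>
    console.warn("detached narrative failed", item, error),
): Promise<number> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  let failures = 0;
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // allSettled waits for the whole batch, so one failure cannot abort its siblings.
    const results = await Promise.allSettled(
      batch.map(async (item) => task(item)),
    );
    results.forEach((result, index) => {
      if (result.status === "rejected") {
        failures += 1;
        logFailure(batch[index] as T, result.reason);
      }
    });
  }
  return failures;
}
```

One trade-off to keep in mind: a batch only finishes when its slowest task does, so a single slow narrative delays the next batch. A slot-based limiter such as p-limit starts the next task as soon as any slot frees up.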

Environment

  • OpenClaw version: 2026.4.25 (aa36ee6)
  • OS: Windows 10 (x64)
  • Number of agents/workspaces: 11
  • Dreaming config: enabled, default cron schedule (0 3 * * *)

Additional Context

This issue caused a cascading failure: the resource storm from 33 concurrent subagent sessions led to session lock contention (sessions.json.lock held 2x the limit), which contributed to main session context loss after a gateway restart. The dreaming system should be resilient to multi-workspace loads without exhausting system resources.

extent analysis

TL;DR

Implement a concurrency limiter to bound the number of concurrent narrative subagent sessions and prevent resource exhaustion.

Guidance

  • Introduce a concurrency limit using a library like p-limit to restrict the number of simultaneous narrative subagent sessions.
  • Consider increasing the NARRATIVE_TIMEOUT_MS to 30-60s for multi-workspace scenarios to reduce timeout errors.
  • Log detached narrative failures instead of silently swallowing them to improve error visibility.
  • Evaluate processing workspaces in batches to further mitigate resource contention.

Example

import pLimit from 'p-limit';
const NARRATIVE_CONCURRENCY_LIMIT = 3;
const limit = pLimit(NARRATIVE_CONCURRENCY_LIMIT);

if (params.detachNarratives) {
    limit(() => generateAndAppendDreamNarrative({ ... }))
        .catch((error) => console.error('Detached narrative failed:', error));
}

Notes

The suggested fix focuses on introducing a concurrency limiter, but additional improvements like increasing timeouts and logging failures can help improve the overall resilience of the system.

Recommendation

Apply the workaround by introducing a concurrency limiter, as it directly addresses the root cause of the issue and prevents resource exhaustion.
