openclaw: [Bug] OOM crash when running large parallel subagent tasks (no concurrency cap or memory pressure circuit breaker)

openclaw/openclaw#62321 · Fetched 2026-04-08 03:06:02
Bug Report: OOM System Crash Due to Unbounded Parallel Subagents

Environment

  • OpenClaw version: 2026.4.5 (3e72c03)
  • OS: Ubuntu 24.04, Linux 6.8.0-90-generic (x64)
  • RAM: 8GB (no swap)
  • Node: v22.22.0

Summary

Running a complex subagent task that internally spawns multiple parallel sub-skills (image generation ×4 + HTML rendering + publish script) caused the system to run out of memory and crash with a hard reboot.

Root Cause (observed via atop)

At the time of crash (2026-04-07 ~11:50 CST), 8+ openclaw-skill subagent processes were simultaneously resident in memory, each carrying a full LLM context load:

PID 488697 — openclaw-skill — 723 MB
PID 488314 — openclaw-skill — 355 MB
PID 488363 — openclaw-skill — 320 MB
PID 488160 — openclaw-skill — 255 MB
PID 488678 — openclaw-skill — 173 MB
PID 488141 — openclaw-skill — 146 MB
PID 488295 — openclaw-skill — 133 MB
PID 488224 — openclaw-skill — 113 MB

Combined with openclaw-gateway (588 MB) and AliYunDunMonitor (757 MB), total RSS exceeded 8 GB.

PSI metrics just before crash:

memsome 61%  memfull 51%
iosome  93%  iofull 71%
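
The memsome/memfull figures above are pressure stall information (PSI), which Linux exposes in /proc/pressure/memory. The following is a minimal sketch of reading that file before deciding whether to spawn. The function names, the 20%/5% limits, and the sample values are assumptions made for illustration and are not part of OpenClaw.

```python
from pathlib import Path


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse the contents of a /proc/pressure/<resource> file.

    Each line has the form:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def memory_under_pressure(psi: dict, some_limit: float = 20.0,
                          full_limit: float = 5.0) -> bool:
    """Hypothetical policy: trip when a 10-second average exceeds its limit."""
    some = psi.get("some", {}).get("avg10", 0.0)
    full = psi.get("full", {}).get("avg10", 0.0)
    return some > some_limit or full > full_limit


def read_memory_psi(path: str = "/proc/pressure/memory") -> dict:
    """Read live PSI data; requires a kernel with PSI enabled."""
    return parse_psi(Path(path).read_text())


# Illustrative sample only; the values are not taken from the atop logs.
sample = (
    "some avg10=61.00 avg60=40.12 avg300=12.50 total=123456\n"
    "full avg10=51.00 avg60=30.02 avg300=9.75 total=98765\n"
)
print(memory_under_pressure(parse_psi(sample)))
```

A spawner could call read_memory_psi() right before each spawn and hold the subagent back while memory_under_pressure() returns True.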

Important Note: Small Subagents Exit Cleanly

After the crash, we tested 3 lightweight parallel subagents (single python script each, ~30s runtime) and confirmed all processes exited cleanly with zero residual processes. The issue is not a process leak per se, but rather the absence of any concurrency cap or memory pressure circuit breaker when many large subagents run simultaneously.

Expected Behavior

  1. A configurable max concurrent subagents limit (e.g. agents.defaults.maxConcurrentSubagents: 4)
  2. Memory pressure awareness: delay or refuse new subagent spawning when system RSS exceeds a threshold (e.g. 70% of total RAM)
  3. Graceful degradation: queue pending subagents instead of spawning them all at once
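
Items 1 and 3 can be sketched together with a semaphore: tasks beyond the cap wait until a slot frees up instead of starting immediately. This is a minimal illustration, not OpenClaw's implementation. SubagentPool, fake_subagent, and the cap of 4 are hypothetical, and the real runtime is Node-based, so this only shows the shape of the idea.

```python
import asyncio


class SubagentPool:
    """Caps concurrent subagents; excess tasks wait on the semaphore."""

    def __init__(self, max_concurrent: int = 4):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.running = 0
        self.peak = 0  # highest concurrency observed, for verification

    async def run(self, task):
        async with self._sem:  # blocks here while the cap is reached
            self.running += 1
            self.peak = max(self.peak, self.running)
            try:
                return await task()
            finally:
                self.running -= 1


async def fake_subagent():
    """Stand-in for a real subagent; it only sleeps briefly."""
    await asyncio.sleep(0.01)
    return "done"


async def main(n_tasks: int = 10, cap: int = 4):
    pool = SubagentPool(max_concurrent=cap)
    results = await asyncio.gather(
        *(pool.run(fake_subagent) for _ in range(n_tasks))
    )
    return pool.peak, results


peak, results = asyncio.run(main())
print(f"peak concurrency: {peak}, completed: {len(results)}")
```

All ten tasks complete, but no more than four are in flight at once, which bounds peak memory at roughly the cap times the per-subagent footprint.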

Actual Behavior

  • No upper bound on concurrent subagent processes
  • No memory pressure check before spawning new subagents
  • System OOM-crashed with hard reboot (no graceful recovery or warning)

Steps to Reproduce

  1. Spawn a subagent task that internally triggers 4+ parallel skill executions (e.g. batch image generation inside a publish workflow)
  2. On a machine with <= 8GB RAM and no swap, observe RSS usage in htop/atop
  3. System will OOM-crash before any graceful handling occurs

Suggested Fix

  • Add maxConcurrentSubagents config option with a sane default (e.g. 3–4)
  • Add a memory pressure circuit breaker: check /proc/meminfo or PSI before spawning
  • Queue excess subagents and run them sequentially when concurrency cap is reached
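
The /proc/meminfo check can be sketched as follows, assuming the breaker compares the used fraction (1 - MemAvailable / MemTotal) against the 70% threshold mentioned under Expected Behavior. The function names and the sample values are hypothetical.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def used_fraction(info: dict[str, int]) -> float:
    """Fraction of RAM that is not available for new allocations."""
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def may_spawn(info: dict[str, int], threshold: float = 0.7) -> bool:
    """Circuit breaker: allow a spawn only while usage is at or below threshold."""
    return used_fraction(info) <= threshold


def read_meminfo(path: str = "/proc/meminfo") -> dict[str, int]:
    """Read live values on Linux."""
    with open(path) as fh:
        return parse_meminfo(fh.read())


# Illustrative sample for an 8 GB machine with 20% of RAM still available.
SAMPLE = "MemTotal:        8000000 kB\nMemAvailable:    1600000 kB\n"
print(may_spawn(parse_meminfo(SAMPLE)))
```

MemAvailable is used rather than MemFree because it accounts for reclaimable page cache, so it is a closer estimate of how much memory a new subagent could actually obtain.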

Additional Context

Reproduced with the mp-weixin-ops skill (4x image generation + markdown-to-html + wechat publish) running inside a subagent. atop binary logs available if needed.

extent analysis

TL;DR

Implement a configurable concurrency limit and memory pressure awareness to prevent unbounded parallel subagent spawning and subsequent system crashes.

Guidance

  • Introduce a maxConcurrentSubagents configuration option to cap the number of simultaneous subagent processes.
  • Develop a memory pressure circuit breaker that checks system-wide memory usage before spawning new subagents, delaying or refusing new spawns when a threshold (e.g., 70% of total RAM) is exceeded.
  • Implement a queuing mechanism for excess subagents, starting each queued subagent only when a running one exits and frees a slot under the concurrency cap.
  • Consider integrating PSI metrics monitoring to enhance memory pressure detection and response.

Example

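import psutil  # third-party: pip install psutil

MAX_CONCURRENT_SUBAGENTS = 4  # assumed cap; the issue suggests 3-4
active_subagents = 0  # the caller must decrement this when a subagent exits

def check_memory_pressure(threshold=0.7):
    """Return True if the fraction of system memory in use exceeds the threshold."""
    mem_usage = psutil.virtual_memory().percent / 100
    return mem_usage > threshold

def spawn_subagent():
    """Spawn a subagent only if the concurrency cap and memory check allow it.

    Returns True if a spawn was started, False if the caller should queue and retry.
    """
    global active_subagents
    if active_subagents >= MAX_CONCURRENT_SUBAGENTS:
        print("Concurrency cap reached; queueing subagent.")
        return False
    if check_memory_pressure():
        print("Memory pressure too high; delaying subagent spawn.")
        return False
    active_subagents += 1
    print("Spawning new subagent...")  # placeholder for the real spawn call
    return True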

Notes

The proposed solution assumes that the openclaw-skill subagent processes can be managed and limited through configuration and programming changes. Additional considerations may be necessary for handling edge cases, such as subagent process failures or priority scheduling.

Recommendation

Apply a workaround by introducing a temporary concurrency limit and memory pressure check, using the example code as a starting point. This will help prevent system crashes while a more comprehensive solution is developed and tested.
