openclaw: [Bug] OOM crash when running large parallel subagent tasks (no concurrency cap or memory pressure circuit breaker)

openclaw · 2026-04-07 05:46:54

GitHub stats

openclaw/openclaw#62321 • Fetched 2026-04-08 03:06:02

Author

JerryliDe

Participants

JerryliDe

Running a complex subagent task that internally spawns multiple parallel sub-skills (image generation ×4 + HTML rendering + publish script) caused the system to run out of memory and crash with a hard reboot.

Bug Report: OOM System Crash Due to Unbounded Parallel Subagents

Environment

OpenClaw version: 2026.4.5 (3e72c03)
OS: Ubuntu 24.04, Linux 6.8.0-90-generic (x64)
RAM: 8GB (no swap)
Node: v22.22.0

Summary

Root Cause (observed via atop)

At the time of crash (2026-04-07 ~11:50 CST), 8+ openclaw-skill subagent processes were simultaneously resident in memory, each carrying a full LLM context load:

PID 488697 — openclaw-skill — 723 MB
PID 488314 — openclaw-skill — 355 MB
PID 488363 — openclaw-skill — 320 MB
PID 488160 — openclaw-skill — 255 MB
PID 488678 — openclaw-skill — 173 MB
PID 488141 — openclaw-skill — 146 MB
PID 488295 — openclaw-skill — 133 MB
PID 488224 — openclaw-skill — 113 MB

Combined with openclaw-gateway (588 MB) and AliYunDunMonitor (757 MB), total RSS exceeded 8 GB.
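
For reference, the figures listed above can be totaled with a few lines of Python. The values are copied from the atop listing and nothing else is assumed. The eight listed subagents come to about 2.2 GB and all listed processes to about 3.5 GB, so the rest of the 8 GB would presumably have to come from processes and kernel memory not shown in this excerpt.

```python
# RSS figures in MB, copied from the atop listing above.
skill_rss_mb = [723, 355, 320, 255, 173, 146, 133, 113]
other_rss_mb = {"openclaw-gateway": 588, "AliYunDunMonitor": 757}

skill_total = sum(skill_rss_mb)
listed_total = skill_total + sum(other_rss_mb.values())

print(f"openclaw-skill processes: {skill_total} MB")   # 2218 MB
print(f"all listed processes:     {listed_total} MB")  # 3563 MB
```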

PSI metrics just before crash:

memsome 61%  memfull 51%
iosome  93%  iofull 71%
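
The memsome/memfull figures appear to correspond to Linux pressure stall information (PSI), which kernels built with PSI support expose in /proc/pressure/memory. Below is a minimal sketch of reading that file. The function names and the sample text are hypothetical and not part of OpenClaw. The sample values are made up to resemble the readings above.

```python
from pathlib import Path

def parse_psi(text):
    """Parse PSI text into {"some": {"avg10": ..., ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result

def read_memory_psi(path="/proc/pressure/memory"):
    """Return parsed memory PSI, or None if the kernel does not expose it."""
    p = Path(path)
    if not p.exists():
        return None
    return parse_psi(p.read_text())

# Made-up sample in the kernel's PSI line format, for illustration only.
sample = (
    "some avg10=61.00 avg60=40.12 avg300=12.50 total=123456789\n"
    "full avg10=51.00 avg60=30.02 avg300=9.75 total=98765432\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"], psi["full"]["avg10"])
```

A spawner could treat a sustained high "full" average as a signal to stop admitting new subagents. The right threshold would need to be tuned on real workloads.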

Important Note: Small Subagents Exit Cleanly

After the crash, we tested 3 lightweight parallel subagents (a single Python script each, ~30s runtime) and confirmed that all processes exited cleanly, leaving zero residual processes. The issue is therefore not a process leak per se, but the absence of any concurrency cap or memory pressure circuit breaker when many large subagents run simultaneously.

Expected Behavior

A configurable max concurrent subagents limit (e.g. agents.defaults.maxConcurrentSubagents: 4)
Memory pressure awareness: delay or refuse new subagent spawning when system RSS exceeds a threshold (e.g. 70% of total RAM)
Graceful degradation: queue pending subagents instead of spawning them all at once
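
The cap-and-queue behavior described above can be sketched with a semaphore: tasks beyond the limit wait for a free slot instead of starting immediately. This is an illustration in Python, not OpenClaw's actual spawning code. MAX_CONCURRENT_SUBAGENTS, run_subagent, and run_all are hypothetical names, and the sleep stands in for real subagent work.

```python
import asyncio

MAX_CONCURRENT_SUBAGENTS = 4  # hypothetical value for the proposed option

async def run_subagent(task_id, gate, stats):
    """Run one simulated subagent while holding a slot from the gate."""
    async with gate:  # waits here (queued) when all slots are taken
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        try:
            await asyncio.sleep(0.01)  # stand-in for the real subagent work
            return task_id
        finally:
            stats["running"] -= 1  # the count is corrected even if the work fails

async def run_all(n_tasks, limit=MAX_CONCURRENT_SUBAGENTS):
    """Submit n_tasks at once; at most `limit` of them run concurrently."""
    gate = asyncio.Semaphore(limit)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(run_subagent(i, gate, stats) for i in range(n_tasks))
    )
    return results, stats["peak"]

results, peak = asyncio.run(run_all(10))
print(f"completed {len(results)} tasks, peak concurrency {peak}")
```

All ten tasks are submitted at once, but the recorded peak never exceeds the limit. The remaining tasks are effectively queued on the semaphore.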

Actual Behavior

No upper bound on concurrent subagent processes
No memory pressure check before spawning new subagents
System OOM-crashed with hard reboot (no graceful recovery or warning)

Steps to Reproduce

Spawn a subagent task that internally triggers 4+ parallel skill executions (e.g. batch image generation inside a publish workflow)
On a machine with <= 8GB RAM and no swap, observe RSS usage in htop/atop
System will OOM-crash before any graceful handling occurs
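
To log RSS while reproducing, rather than watching htop/atop by hand, a small Linux-only sketch can read /proc directly. It assumes the kernel-reported process name (comm) matches the "openclaw-skill" name shown by atop, which may not hold on every setup. The helper names are hypothetical.

```python
import os

def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None

def rss_by_name(name, proc_root="/proc"):
    """Map PID -> RSS in MB for processes whose comm equals `name` (Linux only)."""
    usage = {}
    if not os.path.isdir(proc_root):
        return usage
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() != name:
                    continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                rss_kb = parse_vmrss_kb(f.read())
        except OSError:
            continue  # process exited or is not readable
        if rss_kb is not None:
            usage[int(entry)] = rss_kb / 1024
    return usage

usage = rss_by_name("openclaw-skill")  # name as shown in the atop output (assumed)
print(f"{len(usage)} processes, {sum(usage.values()):.0f} MB total RSS")
```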

Suggested Fix

Add maxConcurrentSubagents config option with a sane default (e.g. 3–4)
Add a memory pressure circuit breaker: check /proc/meminfo or PSI before spawning
Queue excess subagents and run them sequentially when concurrency cap is reached
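
A memory pressure check based on /proc/meminfo could look roughly like the sketch below. It treats 1 - MemAvailable/MemTotal as the "used" fraction and compares it against the 70% threshold suggested earlier. The function names are hypothetical and the sample snapshot is made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> numeric value."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])  # most fields are in kB
    return info

def memory_used_fraction(info):
    """Fraction of RAM not available for new allocations."""
    return 1.0 - info["MemAvailable"] / info["MemTotal"]

def may_spawn(info, threshold=0.70):
    """Circuit breaker: allow a spawn only while usage is at or below threshold."""
    return memory_used_fraction(info) <= threshold

# Made-up snapshot of a loaded machine, for illustration only.
sample = (
    "MemTotal:        8000000 kB\n"
    "MemFree:          200000 kB\n"
    "MemAvailable:    1600000 kB\n"
)
info = parse_meminfo(sample)
print(f"used fraction {memory_used_fraction(info):.2f}, may spawn: {may_spawn(info)}")
```

On a live system the text would come from reading /proc/meminfo just before each spawn. MemAvailable is generally a better basis than MemFree because it accounts for reclaimable cache.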

Additional Context

Reproduced with the mp-weixin-ops skill (4x image generation + markdown-to-html + wechat publish) running inside a subagent. atop binary logs available if needed.

extent analysis

TL;DR

Implement a configurable concurrency limit and memory pressure awareness to prevent unbounded parallel subagent spawning and subsequent system crashes.

Guidance

Introduce a maxConcurrentSubagents configuration option to cap the number of simultaneous subagent processes.
Develop a memory pressure circuit breaker that checks system RSS usage before spawning new subagents, delaying or refusing new spawns when a threshold (e.g., 70% of total RAM) is exceeded.
Implement a queuing mechanism for excess subagents, running them sequentially when the concurrency cap is reached.
Consider integrating PSI metrics monitoring to enhance memory pressure detection and response.

Example

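import psutil

MAX_CONCURRENT_SUBAGENTS = 4  # hypothetical cap, mirroring the proposed config option
active_subagents = 0  # number of subagents currently running

def check_memory_pressure(threshold=0.7):
    """Return True if the fraction of system memory in use exceeds the threshold."""
    mem_usage = psutil.virtual_memory().percent / 100
    return mem_usage > threshold

def spawn_subagent():
    """Spawn a subagent only if the concurrency cap and memory check allow it.

    Returns True if a subagent was spawned, False if the spawn was deferred.
    """
    global active_subagents
    if active_subagents >= MAX_CONCURRENT_SUBAGENTS:
        print("Concurrency cap reached; deferring subagent spawn.")
        return False
    if check_memory_pressure():
        print("Memory pressure too high; deferring subagent spawn.")
        return False
    active_subagents += 1  # the caller must decrement this when the subagent exits
    print("Spawning new subagent...")
    return True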

Notes

The proposed solution assumes that the openclaw-skill subagent processes can be managed and limited through configuration and programming changes. Additional considerations may be necessary for handling edge cases, such as subagent process failures or priority scheduling.

Recommendation

Apply a workaround by introducing a temporary concurrency limit and memory pressure check, using the example code as a starting point. This will help prevent system crashes while a more comprehensive solution is developed and tested.
