openclaw: Embedded LLM run objects retain memory after timeout/abort, causing multi-GB RSS growth in single-process Gateway [1 comment, 2 participants]

cjboy007 · 2026-04-29T07:18:51Z

GitHub stats

openclaw/openclaw#74182 • Fetched 2026-04-30 06:27:42

Author: cjboy007

Participants: cjboy007, clawsweeper[bot]

Timeline (top): cross-referenced ×2, commented ×1

Root Cause

The root cause appears to be incomplete cleanup of embedded LLM run objects after timeout. When an embedded run times out, the log shows timedOut:true, aborted:true, suggesting the abort signal was sent. However, the memory allocated for the run is not released.

After running the OpenClaw Gateway for ~2 hours with embedded LLM runs, the process RSS grew from a normal baseline to 2.85GB, causing event loop saturation (100% CPU, 5s+ RPC response times).

Evidence

1. 24 embedded run failover decisions in a single day's logs
2. Multiple `embedded run timeout` entries with `timeoutMs=600000` (10 minutes)
3. `Unhandled promise rejection: Agent listener invoked outside active listener context`, which suggests abandoned promise chains retaining references
4. Subagent announce completion retries (4×), each holding a 120s timeout with closures/promises/timers alive
5. RSS growth pattern consistent with retained run state (transcripts, tool results, stream buffers, abort controllers, event listeners, lane task closures, delivery promises)

Run Object Retention Chain (suspected)

run registry
  → run object
    → transcript / conversation history
    → tool call results
    → stream chunks / buffers
    → provider request state (per-failover attempt)
    → abort controller
    → event listeners (agent listener context)
    → lane task closure
    → completion delivery promise chain

The failover mechanism multiplies this: a single user request that fails across 4 model attempts retains 4x the state.

Impact

- Gateway RSS: 2.85GB after 2h14min of operation
- Event loop saturation: 100% single-core CPU
- RPC timeouts: `sessions.list` taking 5s+ (normally <1s)
- Cascade: timeouts → retries → more retained state → worse saturation → dead letter queue

Environment

- OpenClaw: 2026.4.5
- Node.js: v25.6.1
- OS: macOS 25.3.0 (arm64)
- Config: maxConcurrent=25 (now reduced to 8), embeddedRunTimeout=600000ms

Reproduction

Run the Gateway with multiple agents that trigger embedded LLM runs. Under provider/network instability, some runs will time out. After several hours, observe unbounded RSS growth.

Expected Behavior

When an embedded run times out or is aborted:

1. All references to the run object should be removed from registries, lane maps, and queues
2. Event listeners should be deregistered
3. Stream buffers should be released
4. Promise chains should be resolved/rejected (not abandoned)
5. The V8 garbage collector should be able to reclaim all associated memory
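
The expected behavior above can be sketched as a single `try`/`finally` wrapper. This is a minimal TypeScript sketch, not OpenClaw's implementation: `runRegistry`, `RunState`, `RunTimeoutError`, and `runWithCleanup` are hypothetical names, and a real run object holds far more state than shown here.

```typescript
// Hypothetical shapes for illustration; none of these names come from the OpenClaw codebase.
interface RunState {
  runId: string;
  controller: AbortController;
  transcript: string[];
}

const runRegistry = new Map<string, RunState>();

class RunTimeoutError extends Error {
  constructor(runId: string, timeoutMs: number) {
    super(`embedded run ${runId} timed out after ${timeoutMs}ms`);
    this.name = "RunTimeoutError";
  }
}

async function runWithCleanup<T>(
  runId: string,
  timeoutMs: number,
  work: (state: RunState) => Promise<T>,
): Promise<T> {
  const state: RunState = { runId, controller: new AbortController(), transcript: [] };
  runRegistry.set(runId, state);

  // Settle the caller's promise on abort instead of leaving the chain abandoned.
  let rejectAborted!: (err: Error) => void;
  const aborted = new Promise<never>((_, reject) => {
    rejectAborted = reject;
  });
  const onAbort = () => rejectAborted(new RunTimeoutError(runId, timeoutMs));
  state.controller.signal.addEventListener("abort", onAbort);

  const timer = setTimeout(() => state.controller.abort(), timeoutMs);

  try {
    // `work` receives the state so it can pass `state.controller.signal` to the provider request.
    return await Promise.race([work(state), aborted]);
  } finally {
    // Runs on success, error, and timeout alike.
    clearTimeout(timer);
    state.controller.signal.removeEventListener("abort", onAbort);
    state.transcript.length = 0; // release buffered content
    runRegistry.delete(runId); // drop the registry's strong reference
  }
}
```

Placing the release steps in `finally` rather than in the success path means the timeout, abort, and error paths cannot skip them.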

Suggested Investigation

1. Check `src/agents/pi-embedded-runner/` for timeout cleanup paths and verify that `run registry` entries are deleted
2. Check `abort-cutoff.runtime` for proper listener deregistration on abort
3. Check `failover-policy.ts`: does each failover attempt create new objects that outlive the run?
4. Check `session-reset-service`: does resetting a session clean up embedded run state?
5. Add `process.memoryUsage()` logging on run timeout to track heap growth per timeout event
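
The last investigation step could look roughly like the following. It is a sketch under assumptions: `snapshotMemory`, `logMemoryOnTimeout`, and the log line format are hypothetical, and only `process.memoryUsage()` itself is a real Node.js API. Logging the delta since the previous timeout makes it easier to see whether each timeout event leaves memory behind.

```typescript
// Hypothetical logging helper; names and log format are illustrative only.
interface MemorySnapshot {
  rssMb: number;
  heapUsedMb: number;
  externalMb: number;
  arrayBuffersMb: number;
}

// Convert bytes to megabytes, rounded to one decimal place.
const toMb = (bytes: number): number => Math.round((bytes / 1024 / 1024) * 10) / 10;

function snapshotMemory(): MemorySnapshot {
  const m = process.memoryUsage();
  return {
    rssMb: toMb(m.rss),
    heapUsedMb: toMb(m.heapUsed),
    externalMb: toMb(m.external),
    arrayBuffersMb: toMb(m.arrayBuffers),
  };
}

// Snapshot taken at the previous timeout, used to report growth per event.
let previous: MemorySnapshot | undefined;

function logMemoryOnTimeout(
  runId: string,
  log: (line: string) => void = console.log,
): MemorySnapshot {
  const current = snapshotMemory();
  const rssDelta = previous ? (current.rssMb - previous.rssMb).toFixed(1) : "n/a";
  const heapDelta = previous ? (current.heapUsedMb - previous.heapUsedMb).toFixed(1) : "n/a";
  log(
    `embedded run timeout runId=${runId} rssMb=${current.rssMb} ` +
      `heapUsedMb=${current.heapUsedMb} externalMb=${current.externalMb} ` +
      `arrayBuffersMb=${current.arrayBuffersMb} ` +
      `rssDeltaMb=${rssDelta} heapDeltaMb=${heapDelta}`,
  );
  previous = current;
  return current;
}
```

Comparing `heapUsed` against `external` and `arrayBuffers` helps separate retained JavaScript objects (transcripts, closures) from retained binary stream buffers.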

Possible Fixes

1. Run registry cleanup on timeout: ensure `delete runRegistry[runId]` happens in the timeout handler, not just in the success path
2. Event listener deregistration: remove agent listener context references on abort
3. Lane task closure release: clear lane references when tasks time out
4. Failover state consolidation: don't retain full state for each failed attempt; only keep error metadata
5. Graceful degradation: when memory exceeds a threshold, pause accepting new embedded runs until GC runs
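
Fixes 4 and 5 could be sketched as below. Everything here is an assumption for illustration: `AttemptState`, `AttemptSummary`, `summarizeFailedAttempt`, and `canAcceptNewRun` are hypothetical names, and the actual per-attempt state in OpenClaw is not described in this issue.

```typescript
// Hypothetical per-attempt state; the real structure is not shown in the issue.
interface AttemptState {
  model: string;
  transcript: string[];
  streamChunks: Uint8Array[];
  error?: Error;
}

// Small, fixed-size record kept after a failed attempt.
interface AttemptSummary {
  model: string;
  errorName: string;
  errorMessage: string;
  bytesDropped: number;
}

// Fix 4: keep only error metadata and release the attempt's heavy state.
function summarizeFailedAttempt(attempt: AttemptState): AttemptSummary {
  const bytesDropped = attempt.streamChunks.reduce((n, chunk) => n + chunk.byteLength, 0);
  const summary: AttemptSummary = {
    model: attempt.model,
    errorName: attempt.error?.name ?? "UnknownError",
    errorMessage: attempt.error?.message ?? "",
    bytesDropped,
  };
  attempt.transcript.length = 0;
  attempt.streamChunks.length = 0;
  delete attempt.error; // an Error object can keep its stack and cause alive
  return summary;
}

// Fix 5: admission check before starting a new embedded run.
// `rssBytes` defaults to the live process RSS and can be injected for testing.
function canAcceptNewRun(
  limitBytes: number,
  rssBytes: number = process.memoryUsage().rss,
): boolean {
  return rssBytes < limitBytes;
}
```

With this shape, a request that fails across 4 model attempts retains four small summaries instead of four full copies of the run state. Note that RSS may not fall immediately after a garbage collection, so a gate keyed on RSS alone can stay closed longer than expected; `heapUsed` may be a useful second signal.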

extent analysis

TL;DR

Implementing a proper cleanup mechanism for embedded LLM run objects after timeout, including removing references from registries, deregistering event listeners, and releasing stream buffers, should mitigate the memory growth issue.

Guidance

Investigate the src/agents/pi-embedded-runner/ directory to ensure that run registry entries are deleted upon timeout, focusing on the timeout cleanup paths.
Verify that abort-cutoff.runtime properly deregisters listeners on abort to prevent memory leaks.
Review failover-policy.ts to determine if each failover attempt creates new objects that outlive the run, potentially causing retained state.
Consider adding process.memoryUsage() logging on run timeout to track heap growth per timeout event and understand the memory allocation pattern.

Example

The issue does not include a concrete code snippet. The key point is that the run registry entry must be removed in the timeout handler as well as in the success path, for example with delete runRegistry[runId].

Notes

The provided information suggests that the issue is related to the incomplete cleanup of embedded LLM run objects. However, without direct access to the code or more detailed logs, it's challenging to provide a definitive fix. The suggested investigations should help pinpoint the exact cause and appropriate solution.

Recommendation

Implement a cleanup path for embedded LLM run objects that runs on timeout and abort, removing all references to the run and deregistering its event listeners. This is recommended because it directly addresses the suspected root cause of the memory growth.
