openclaw - 💡 (How to fix) Embedded LLM run objects retain memory after timeout/abort, causing multi-GB RSS growth in single-process Gateway [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#74182 (fetched 2026-04-30 06:27:42)
Comments: 1
Participants: 2
Timeline: 3
Reactions: 0
Timeline (top): cross-referenced ×2, commented ×1


After running the OpenClaw Gateway for ~2 hours with embedded LLM runs, the process RSS grew from its normal baseline to 2.85 GB, saturating the event loop (100% CPU, RPC responses taking 5 s or longer).

The root cause appears to be incomplete cleanup of embedded LLM run objects after timeout. When an embedded run times out, the log shows timedOut:true, aborted:true, suggesting the abort signal was sent. However, the memory allocated for the run is not released.

Evidence

  1. 24 embedded run failover decisions in a single day's logs
  2. Multiple embedded run timeout entries with timeoutMs=600000 (10 minutes)
  3. Unhandled promise rejection: Agent listener invoked outside active listener context — suggests abandoned promise chains retaining references
  4. Subagent announce completion retries (4×) each holding 120s timeouts with closures/promises/timers alive
  5. RSS growth pattern consistent with retained run state (transcripts, tool results, stream buffers, abort controllers, event listeners, lane task closures, delivery promises)
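
Evidence item 4 points at timers and closures that stay alive after the work they guard has finished. A minimal sketch of a deadline helper that always clears its timer is shown below. The name withTimeout and its signature are hypothetical and are not taken from the OpenClaw codebase; only AbortController, setTimeout, clearTimeout and Promise.race are standard APIs.

```typescript
// Hypothetical helper (not OpenClaw code): race a task against a deadline
// and always clear the timer, so it cannot keep the task's closures alive.
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();

  // Capture the reject function so the timer callback can fail the deadline.
  let rejectDeadline: (reason: Error) => void = () => {};
  const deadline = new Promise<never>((_, reject) => {
    rejectDeadline = reject;
  });

  const timer = setTimeout(() => {
    controller.abort(); // tell the task to stop its in-flight work
    rejectDeadline(new Error(`timed out after ${timeoutMs}ms`));
  }, timeoutMs);

  try {
    // Promise.race subscribes to both promises, so a late rejection from
    // the task is still handled and does not become an unhandled rejection.
    return await Promise.race([task(controller.signal), deadline]);
  } finally {
    // Runs on success, failure, and timeout alike.
    clearTimeout(timer);
  }
}
```

With this shape, a retry loop that calls withTimeout four times leaves no pending 120 s timers behind once each attempt has settled, provided the task itself honors the abort signal.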

Run Object Retention Chain (suspected)

run registry
  → run object
    → transcript / conversation history
    → tool call results
    → stream chunks / buffers
    → provider request state (per-failover attempt)
    → abort controller
    → event listeners (agent listener context)
    → lane task closure
    → completion delivery promise chain

The failover mechanism multiplies this: a single user request that fails across 4 model attempts retains 4x the state.
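
One way to avoid the 4x multiplication is to reduce each failed attempt to a small error record and drop its heavy fields immediately. The sketch below illustrates the idea only; AttemptState, AttemptFailure, summarizeFailure and releaseAttempt are hypothetical names, and the real per-attempt state in OpenClaw is not shown in the issue.

```typescript
// Hypothetical shapes for illustration; OpenClaw's real types are not
// shown in the issue.
interface AttemptState {
  model: string;
  requestBody: string;    // potentially large
  streamChunks: string[]; // potentially large
  error?: Error;
}

// Small, fixed-size record kept for diagnostics after an attempt fails.
interface AttemptFailure {
  model: string;
  message: string;
  failedAt: number;
}

function summarizeFailure(attempt: AttemptState, now: number): AttemptFailure {
  return {
    model: attempt.model,
    message: attempt.error?.message ?? "unknown error",
    failedAt: now,
  };
}

// Record the failure, then drop every heavy field so the attempt object
// no longer pins buffers even if something still references it.
function releaseAttempt(
  attempt: AttemptState,
  history: AttemptFailure[],
  now: number = Date.now(),
): void {
  history.push(summarizeFailure(attempt, now));
  attempt.requestBody = "";
  attempt.streamChunks.length = 0;
  delete attempt.error;
}
```

After four failed attempts the history holds four small records instead of four full copies of request and stream state.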

Impact

  • Gateway RSS: 2.85GB after 2h14min of operation
  • Event loop saturation: 100% single-core CPU
  • RPC timeouts: sessions.list taking 5s+ (normally <1s)
  • Cascade: timeouts → retries → more retained state → worse saturation → dead letter queue

Environment

  • OpenClaw: 2026.4.5
  • Node.js: v25.6.1
  • OS: macOS 25.3.0 (arm64)
  • Config: maxConcurrent=25 (now reduced to 8), embeddedRunTimeout=600000ms

Reproduction

Run the Gateway with multiple agents that trigger embedded LLM runs. Under provider/network instability, some runs will time out. After several hours, observe unbounded RSS growth.

Expected Behavior

When an embedded run times out or is aborted:

  1. All references to the run object should be removed from registries, lane maps, and queues
  2. Event listeners should be deregistered
  3. Stream buffers should be released
  4. Promise chains should be resolved/rejected (not abandoned)
  5. The V8 garbage collector should be able to reclaim all associated memory
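
The first four expectations can be sketched as a single idempotent dispose function that every exit path (success, timeout, abort) calls. Everything below is an assumption made for illustration: the EmbeddedRun shape mirrors the suspected retention chain rather than OpenClaw's actual types, and the registry is modeled as a Map even though the issue's wording (delete runRegistry[runId]) suggests a plain object.

```typescript
// All names are hypothetical; they mirror the suspected retention chain.
type Listener = (event: string) => void;

interface EmbeddedRun {
  id: string;
  transcript: string[];
  streamChunks: string[];
  abortController: AbortController;
  listeners: Set<Listener>;
  completion: Promise<string>;
  settle: (error: Error) => void; // rejects the completion promise
}

const runRegistry = new Map<string, EmbeddedRun>();

function startRun(id: string): EmbeddedRun {
  let settle: (error: Error) => void = () => {};
  const completion = new Promise<string>((_, reject) => {
    settle = reject;
  });
  // Attach a handler so a rejection during cleanup is never "unhandled".
  completion.catch(() => {});

  const run: EmbeddedRun = {
    id,
    transcript: [],
    streamChunks: [],
    abortController: new AbortController(),
    listeners: new Set<Listener>(),
    completion,
    settle,
  };
  runRegistry.set(id, run);
  return run;
}

// Safe to call from the timeout handler, the abort handler, and the
// success path; the second and later calls are no-ops.
function disposeRun(id: string, reason: Error): boolean {
  const run = runRegistry.get(id);
  if (!run) return false;

  runRegistry.delete(id);      // 1. drop the registry reference
  run.abortController.abort(); //    stop in-flight provider requests
  run.listeners.clear();       // 2. deregister listeners
  run.streamChunks.length = 0; // 3. release stream buffers
  run.transcript.length = 0;
  run.settle(reason);          // 4. settle the promise, do not abandon it
  return true;
}
```

Once nothing else references the run, expectation 5 follows: the garbage collector is free to reclaim it. Calling disposeRun from a finally block is one way to make sure the timeout path cannot skip it.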

Suggested Investigation

  1. Check src/agents/pi-embedded-runner/ for timeout cleanup paths — verify run registry entries are deleted
  2. Check abort-cutoff.runtime for proper listener deregistration on abort
  3. Check failover-policy.ts — does each failover attempt create new objects that outlive the run?
  4. Check session-reset-service — does resetting a session clean up embedded run state?
  5. Add process.memoryUsage() logging on run timeout to track heap growth per timeout event
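
For investigation step 5, a small logging helper could look like the sketch below. process.memoryUsage() is the standard Node.js API and returns byte counts; the helper names, the log format, and the idea of taking a snapshot when the run starts are assumptions made for illustration.

```typescript
// Subset of the fields returned by process.memoryUsage(), all in bytes.
interface MemorySnapshot {
  rss: number;
  heapUsed: number;
  external: number;
}

// Hypothetical formatter: one log line per timed-out run, in megabytes.
function formatTimeoutMemoryLog(
  runId: string,
  before: MemorySnapshot,
  after: MemorySnapshot,
): string {
  const mb = (bytes: number): string => (bytes / 1024 / 1024).toFixed(1);
  return (
    `run=${runId} timedOut=true rssMB=${mb(after.rss)} ` +
    `heapUsedMB=${mb(after.heapUsed)} ` +
    `heapDeltaMB=${mb(after.heapUsed - before.heapUsed)}`
  );
}

// Call from the timeout handler, passing the snapshot taken at run start.
// The reader is injectable so the helper can be tested without Node.
function logMemoryOnTimeout(
  runId: string,
  before: MemorySnapshot,
  read: () => MemorySnapshot = () => process.memoryUsage(),
): string {
  const line = formatTimeoutMemoryLog(runId, before, read());
  console.log(line);
  return line;
}
```

A heapDeltaMB value that stays positive across many timeouts, and never drops back after garbage collection, would support the retention hypothesis. Note that heapUsed fluctuates with GC timing, so single readings are noisy and trends matter more than individual values.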

Possible Fixes

  1. Run registry cleanup on timeout: ensure delete runRegistry[runId] happens in the timeout handler, not just in the success path
  2. Event listener deregistration: remove agent listener context references on abort
  3. Lane task closure release: clear lane references when tasks time out
  4. Failover state consolidation: don't retain full state for each failed attempt; only keep error metadata
  5. Graceful degradation: when memory exceeds a threshold, pause accepting new embedded runs until GC runs
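
Fix 5 can be sketched as an admission gate that refuses new embedded runs while heap usage is above a limit. This is only an illustration: createRunGate, its options, and the threshold are hypothetical, and the gate does not trigger garbage collection itself; it simply stops admitting work until usage falls back below the limit.

```typescript
// Hypothetical admission gate; none of these names come from OpenClaw.
interface RunGate {
  tryAcquire(): boolean; // true if a new embedded run may start
  release(): void;       // call when a run finishes, fails, or times out
  active(): number;
}

function createRunGate(opts: {
  maxConcurrent: number;
  maxHeapBytes: number;
  readHeapUsed: () => number; // e.g. () => process.memoryUsage().heapUsed
}): RunGate {
  let active = 0;
  return {
    tryAcquire(): boolean {
      if (active >= opts.maxConcurrent) return false;
      // Refuse new work while the heap is above the configured limit.
      if (opts.readHeapUsed() >= opts.maxHeapBytes) return false;
      active += 1;
      return true;
    },
    release(): void {
      if (active > 0) active -= 1;
    },
    active: (): number => active,
  };
}
```

In the reported setup this might be configured with maxConcurrent: 8 (the reduced value from the Environment section) and a heap limit chosen well below the point where the event loop saturated. A gate like this only limits the damage; the retention itself still has to be fixed.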

extent analysis

TL;DR

Implementing a proper cleanup mechanism for embedded LLM run objects after timeout, including removing references from registries, deregistering event listeners, and releasing stream buffers, should mitigate the memory growth issue.

Guidance

  • Investigate the src/agents/pi-embedded-runner/ directory to ensure that run registry entries are deleted upon timeout, focusing on the timeout cleanup paths.
  • Verify that abort-cutoff.runtime properly deregisters listeners on abort to prevent memory leaks.
  • Review failover-policy.ts to determine if each failover attempt creates new objects that outlive the run, potentially causing retained state.
  • Consider adding process.memoryUsage() logging on run timeout to track heap growth per timeout event and understand the memory allocation pattern.

Example

The issue does not include the relevant source, so no project-specific snippet can be given. The key point is that the timeout handler, not just the success path, must remove the run from the registry, for example with delete runRegistry[runId].

Notes

The provided information suggests that the issue is related to the incomplete cleanup of embedded LLM run objects. However, without direct access to the code or more detailed logs, it's challenging to provide a definitive fix. The suggested investigations should help pinpoint the exact cause and appropriate solution.

Recommendation

Implement a cleanup path for embedded LLM run objects that runs on timeout and abort as well as on success, removing all references to the run and deregistering its event listeners. This is recommended because it directly addresses the suspected root cause.
