openclaw - 💡(How to fix) Fix Gateway sessionStore maintenance synchronously blocks event loop for 30-60s, causes GC starvation and OOM [1 comments, 2 participants]

GitHub stats
openclaw/openclaw#72826 • Fetched 2026-04-28 06:31:52
Comments: 1 · Participants: 2 · Timeline: 3 · Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1

Error Message

The OpenClaw gateway shows monotonic heap growth and crashes with the V8 fatal error "Reached heap limit Allocation failed - JavaScript heap out of memory" roughly every 90 minutes on a single-host install running ~50 cron jobs.


Code Example

Runtime_CreateDataProperty
    → DescriptorArray::CopyUpTo
      → Map::CopyAddDescriptors
        → JSObject::DefineOwnPropertyIgnoreAttributes

Symptoms

The OpenClaw gateway shows monotonic heap growth and crashes with the V8 fatal error "Reached heap limit Allocation failed - JavaScript heap out of memory" roughly every 90 minutes on a single-host install running ~50 cron jobs.

  • Heap RSS grows linearly at ~11 MB/min until OOM
  • V8 stacks at the moment of OOM:
    Runtime_CreateDataProperty
      → DescriptorArray::CopyUpTo
        → Map::CopyAddDescriptors
          → JSObject::DefineOwnPropertyIgnoreAttributes
    i.e. the GC is starving while object descriptor maps grow without compaction.
  • Event-loop blocks of 30–60 seconds are observed every ~2 minutes during heavy cron periods.

Trigger

session.maintenance.maxEntries is hit on <agent>/sessions/sessions.json. When the store is at the cap, every new session ingestion triggers a synchronous maintenance pass:

  1. JSON.parse of the (now ~1.3–1.5 MB) sessions.json
  2. structuredClone of all entries
  3. Sort by updatedAt
  4. Trim to cap
  5. Synchronous write back to disk

This whole pipeline runs on the main event loop. With per-run-isolated cron sessions, a single agent (e.g. haiku-utility) can produce 1.5–2k new session entries per day, hitting the cap continuously and firing this pass every couple of minutes.
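
The five steps above could be sketched roughly as follows. This is an illustrative reconstruction, not OpenClaw's actual code: the function name maintainSessionStoreSync, the store shape (a JSON object keyed by session key, each entry carrying a numeric updatedAt), and the keep-newest trim policy are all assumptions.

```javascript
const fs = require("node:fs");

// Hypothetical reconstruction of a synchronous maintenance pass.
// Every step runs to completion on the main thread, so nothing else
// (timers, sockets, other ingestions) is serviced until it returns.
function maintainSessionStoreSync(storePath, maxEntries) {
  const raw = fs.readFileSync(storePath, "utf8");           // blocking read
  const store = JSON.parse(raw);                            // 1. parse the whole file
  const entries = structuredClone(Object.entries(store));   // 2. deep-copy all entries
  entries.sort((a, b) => b[1].updatedAt - a[1].updatedAt);  // 3. newest first
  const kept = entries.slice(0, maxEntries);                // 4. trim to cap
  fs.writeFileSync(storePath, JSON.stringify(Object.fromEntries(kept))); // 5. blocking write
  return kept.length;
}
```

If the real path looks anything like this, each call costs time proportional to the full store, no matter how few entries actually changed.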

Frequency observed

On one host:

  • 55 cap-hits in 112 minutes (one cap-hit ≈ one full sync maintenance pass)
  • Lock durations (per-pass event-loop block, measured from process.hrtime around the maintenance call): 36 s, 58 s, 60 s, …
  • 5,176 <sessionId>.jsonl files in ~/.openclaw/agents/haiku-utility/sessions/ accumulated over 3 days, while sessions.json was capped at 50 entries throughout.

When the loop stalls for that long, GC cycles stall too. Heap is unable to compact descriptor arrays for newly-instantiated session/handler objects, and the process eventually OOMs. Restarting the gateway reclaims memory; the cycle then repeats.
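
For anyone wanting to take the same kind of measurement, here is a minimal sketch of timing a synchronous call with process.hrtime. The timeSync helper is hypothetical; in the report, the wrapped call was the gateway's maintenance pass, which is not shown here.

```javascript
// Hypothetical helper: time a synchronous call with the monotonic clock.
function timeSync(fn) {
  const start = process.hrtime.bigint();   // nanoseconds, monotonic
  const result = fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { result, elapsedMs };
}

// Usage: wrap any call suspected of blocking the loop.
const { result, elapsedMs } = timeSync(() => JSON.parse('{"ok":true}'));
console.log(`call took ${elapsedMs.toFixed(3)} ms`, result);
```

Because the wrapped call is synchronous, the elapsed time equals the time the event loop was unavailable to everything else.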

Root cause (summarized)

session.maintenance is a synchronous, in-loop, full-file rewrite hot path that fires on every cap-hit. With high cron throughput producing fresh sessions, that path is hit far more often than it was apparently designed for.

Suggested fixes (any one would unblock; combination is ideal)

  1. Chunk maintenance via setImmediate / async iteration. Yield the loop between parse/sort/trim/write phases so other work (channel ingress, scheduling, GC) can interleave.
  2. Move the maintenance pass to a worker thread. The store on disk is the canonical state; the worker can read, compute the new trimmed set, and atomically swap the file via rename(2).
  3. Async I/O with backpressure. Replace the sync read/write with fsPromises and a single inflight-promise per agent so concurrent ingressions queue rather than each kicking off another full pass.
  4. Coalesce cap-hits. When the store is over cap, schedule one maintenance pass on a debounced timer (e.g. 5 s) rather than running per-ingestion. Multiple ingestions inside the debounce window share the same pass.
  5. Bound the data structure. A streaming/append-only log + periodic compaction would avoid full-file rewrites altogether.
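
To make fixes 3 and 4 concrete, here is a hedged sketch combining a debounced trigger, a single in-flight pass per store, async I/O, and an atomic swap via rename. None of this is OpenClaw code: createMaintenanceScheduler, the store shape (object keyed by session key with a numeric updatedAt), and the ".tmp" suffix are assumptions for illustration.

```javascript
const fsp = require("node:fs/promises");

// Hypothetical scheduler: at most one pass in flight per store, and all
// cap-hits inside the debounce window share a single pass.
function createMaintenanceScheduler(storePath, maxEntries, debounceMs = 5000) {
  let timer = null;     // pending debounce timer
  let inflight = null;  // promise of the pass currently running
  let dirty = false;    // a cap-hit arrived while a pass was running

  async function runPass() {
    const store = JSON.parse(await fsp.readFile(storePath, "utf8"));
    const kept = Object.entries(store)
      .sort((a, b) => b[1].updatedAt - a[1].updatedAt) // newest first
      .slice(0, maxEntries);
    const tmp = `${storePath}.tmp`;
    await fsp.writeFile(tmp, JSON.stringify(Object.fromEntries(kept)));
    await fsp.rename(tmp, storePath); // atomic replace on POSIX filesystems
  }

  function schedule() {
    if (inflight) { dirty = true; return; } // queue behind the running pass
    if (timer) return;                      // a pass is already scheduled
    timer = setTimeout(() => {
      timer = null;
      inflight = runPass()
        .catch((err) => console.error("maintenance failed:", err))
        .finally(() => {
          inflight = null;
          if (dirty) { dirty = false; schedule(); }
        });
    }, debounceMs);
  }

  return { schedule };
}
```

Two caveats with this sketch: JSON.parse and the sort still run on the main thread, so only the I/O stops blocking; and any code that writes sessions.json during ingestion would need to go through the same queue, otherwise a pass could overwrite entries added while it was running.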

Workarounds applied locally

  • Persistent session keys for cron runs. sessionTarget: "session:cron-<slug>" instead of sessionTarget: "isolated" for 49 of 50 enabled non-one-shot agentTurn crons. This collapses ~1.7 k sessions/day per heavy agent down to ~1 session per cron, dramatically reducing cap-hit frequency.
    • Note: sessionTarget: "isolated" always forces a new sessionId per run regardless of sessionKey (forceNew: input.job.sessionTarget === "isolated" in server.impl-*.js). Folks who set sessionKey expecting reuse are silently getting per-run sessions.
  • Drop session.maintenance.maxEntries from 50 to 25. Smaller working set means each maintenance pass is faster and the file size stays smaller.
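
The sessionTarget behavior noted above can be modeled in a few lines. Only the forceNew expression comes from the issue text; resolveSessionId, the Map of known sessions, and the way the lookup key is derived are hypothetical and exist only to show why sessionKey has no effect under "isolated".

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical model of the behavior described above; only the
// `forceNew` expression is taken from the issue text.
function resolveSessionId(job, existingSessions) {
  const forceNew = job.sessionTarget === "isolated";
  const key = job.sessionKey ?? job.sessionTarget; // assumed lookup key
  if (!forceNew && existingSessions.has(key)) {
    return existingSessions.get(key);   // reuse the persistent session
  }
  const sessionId = randomUUID();       // fresh session for this run
  if (!forceNew) existingSessions.set(key, sessionId);
  return sessionId;
}

const known = new Map();
const isolatedJob = { sessionTarget: "isolated", sessionKey: "nightly" };
const persistentJob = { sessionTarget: "session:cron-nightly" };
console.log(resolveSessionId(isolatedJob, known) === resolveSessionId(isolatedJob, known));     // false
console.log(resolveSessionId(persistentJob, known) === resolveSessionId(persistentJob, known)); // true
```

Under this model, every isolated run adds a new entry to the store, while a persistent target keeps updating the same one.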

Reproduction

Easiest minimal repro: schedule a kind: agentTurn cron with sessionTarget: "isolated" running every minute on one agent. Within a day the agent's sessions.json will hit the cap and you can observe the sync maintenance pass via --prof or by attaching clinic flame.
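
As a lighter-weight alternative to a profiler, stalls can also be watched from inside a Node.js process with perf_hooks.monitorEventLoopDelay. The sketch below is generic Node.js, not something the gateway is known to expose; reportMaxDelay and the 10-second reporting window are arbitrary choices.

```javascript
const { monitorEventLoopDelay } = require("node:perf_hooks");

// Samples event-loop delay; histogram values are in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Hypothetical reporter: return the worst stall seen since the last call.
function reportMaxDelay(h) {
  const maxMs = h.max / 1e6;
  h.reset();
  return maxMs;
}

const reporter = setInterval(() => {
  console.log(`max event-loop delay: ${reportMaxDelay(histogram).toFixed(1)} ms`);
}, 10_000);
reporter.unref(); // do not keep the process alive just for reporting
```

A maintenance pass that blocks for tens of seconds should show up as a maximum delay of the same order in the window where it ran.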

Impact

Single-host installs with even moderate cron usage (≤ 50 jobs, mix of intervals from minutes to days) appear to hit this consistently. The combination of "sync I/O on hot loop" + "cap-hit every couple of minutes" is enough to pin GC and cause OOM on a 4 GB host within ~90 minutes.

Happy to share heap snapshots / cron config / sessions.json sample if it helps.

extent analysis

TL;DR

Implement one of the suggested fixes, such as chunking maintenance via setImmediate or moving the maintenance pass to a worker thread, to prevent synchronous maintenance passes from blocking the event loop and causing heap growth.

Guidance

  • Identify the most suitable fix from the suggested options, considering the specific use case and performance requirements.
  • Implement the chosen fix, ensuring that it is properly tested and validated to prevent regressions.
  • Monitor the system's performance and heap growth after applying the fix to verify its effectiveness.
  • Consider combining multiple fixes for optimal results, as suggested in the issue.

Example

The issue does not include the gateway's maintenance source, so no project-specific patch can be given here.
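
As a generic illustration only (not OpenClaw's code), yielding to the event loop between phases with setImmediate might look like the sketch below. The names yieldToLoop and trimEntriesInPhases, and the store shape with a numeric updatedAt per entry, are assumptions.

```javascript
// Resolve on a later turn of the event loop so queued I/O callbacks
// and timers get a chance to run between phases.
const yieldToLoop = () => new Promise((resolve) => setImmediate(resolve));

// Hypothetical phased trim: the same steps as a sync pass, with a
// yield after parse, after sort, and after trim.
async function trimEntriesInPhases(rawJson, maxEntries) {
  const store = JSON.parse(rawJson);
  await yieldToLoop();

  const entries = Object.entries(store);
  entries.sort((a, b) => b[1].updatedAt - a[1].updatedAt); // newest first
  await yieldToLoop();

  const kept = Object.fromEntries(entries.slice(0, maxEntries));
  await yieldToLoop();

  return JSON.stringify(kept);
}
```

Each phase is still synchronous, so this bounds a stall to the longest single phase rather than removing it. If one phase (for example the parse) dominates, a worker thread or a smaller data structure is likely the better lever.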

Notes

The issue highlights the importance of asynchronous I/O and event loop management in preventing heap growth and crashes. The suggested fixes aim to address the root cause of the problem, but may require additional testing and validation to ensure their effectiveness.

Recommendation

Apply workaround: Implement one of the suggested fixes, such as chunking maintenance via setImmediate or moving the maintenance pass to a worker thread, to prevent synchronous maintenance passes from blocking the event loop and causing heap growth. This is recommended because it directly addresses the identified root cause of the issue.
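
For the worker-thread option, a hedged sketch using node:worker_threads follows. It assumes the on-disk file is the canonical state and swaps it atomically via rename; trimInWorker, the inline worker source, and the ".tmp" suffix are hypothetical, and a real implementation would still need to coordinate with whatever else writes the store.

```javascript
const { Worker } = require("node:worker_threads");

// Worker body as a string so the example is self-contained; a real
// project would more likely ship this as a separate file.
const workerSource = `
const { parentPort, workerData } = require("node:worker_threads");
const fs = require("node:fs");
const { storePath, maxEntries } = workerData;
// Sync calls are fine here: they block the worker, not the main loop.
const store = JSON.parse(fs.readFileSync(storePath, "utf8"));
const kept = Object.entries(store)
  .sort((a, b) => b[1].updatedAt - a[1].updatedAt)
  .slice(0, maxEntries);
const tmp = storePath + ".tmp";
fs.writeFileSync(tmp, JSON.stringify(Object.fromEntries(kept)));
fs.renameSync(tmp, storePath); // atomic replace on POSIX filesystems
parentPort.postMessage(kept.length);
`;

// Hypothetical entry point: resolves with the number of entries kept.
function trimInWorker(storePath, maxEntries) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { storePath, maxEntries },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
    worker.once("exit", (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}
```

The main thread only pays for spawning the worker and receiving one small message, so parse, sort, and write no longer compete with request handling.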
