openclaw: Fix cron isolated runs stuck in "running" causing unbounded session accumulation and gateway OOM crash loop

GitHub stats: openclaw/openclaw#61633, fetched 2026-04-08 02:56:35
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


Description

On openclaw 2026.3.28 (f9b1079), cron isolated runs can get permanently stuck in `status=running` and are never marked terminal. Existing safeguards (session reaper, `capEntryCount`) silently skip these entries. Meanwhile the cron scheduler continues spawning new runs on every tick, growing `sessions.json` without bound until it reaches hundreds of megabytes. When the gateway (re)starts, it loads the entire file into RAM, hits the V8 heap limit, crashes, restarts, and repeats — an infinite OOM crash loop with no self-recovery path.

This differs from #56121 (status never updated) and #57058 (in-memory lock) in that the failure mode is a fully unbootable gateway, not just a stuck job.

Environment

  • openclaw 2026.3.28 (f9b1079)
  • Node.js v24.13.0 (fnm)
  • macOS Darwin 25.3.0 (Apple Silicon)
  • 5 cron jobs, all isolated mode
  • Channels: telegram, discord, slack (2 accounts), brave search plugin

Steps to Reproduce

  1. Configure one or more cron jobs in isolated mode.
  2. Trigger a condition that prevents the underlying model from completing a run (e.g., model refusal/rejection, network failure, timeout before any terminal state is written).
  3. Observe the cron scheduler spawning a new isolated run on each tick — no guard against an existing stuck run.
  4. Allow this to run for hours/days.
  5. Restart the gateway.

Current Behavior

sessions.json grew to 185 MB containing 6,539 sessions (6,524 of which are cron:run entries).

Distribution from our instance (5 cron jobs):

| Cron job | Runs accumulated | Status breakdown | Observed interval |
| --- | --- | --- | --- |
| morning-sleep-report | 2,075 in 25 min | 2,016 "running", 59 unknown | ~20–25 s (rapid retry loop) |
| git-backup-night | 2,236 over ~2 days | 2,085 "running", 151 "failed" | avg 3,540 s |
| SSL price check | 2,211 over ~2 days | all unknown status | avg 1,788 s |

morning-sleep-report accumulated 2,075 runs in 25 minutes, with an observed interval of roughly 20–25 seconds between runs, which points to a rapid retry loop when a run never reaches a terminal state.
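
A breakdown like the one above can be reproduced with a short script. This is a minimal sketch that assumes sessions.json is a single JSON object keyed by session key, that cron run keys contain the substring `cron:run`, and that each entry may carry a `status` field. These shapes are inferred from this report, not taken from openclaw documentation, and the sample keys are made up.

```python
from collections import Counter


def summarize_sessions(sessions: dict) -> dict:
    """Count sessions by status, split into cron-run entries and everything else."""
    cron = Counter()
    other = Counter()
    for key, entry in sessions.items():
        # Entries without a status field are reported as "unknown".
        status = entry.get("status", "unknown") if isinstance(entry, dict) else "unknown"
        # Assumption: cron isolated runs have "cron:run" somewhere in their key.
        bucket = cron if "cron:run" in key else other
        bucket[status] += 1
    return {"cron_run": dict(cron), "other": dict(other)}


# Hypothetical sample data; real keys and fields may differ.
sample = {
    "agent:main:cron:run:a": {"status": "running", "updatedAt": 1},
    "agent:main:cron:run:b": {"updatedAt": 2},
    "agent:main:main": {"status": "completed", "updatedAt": 3},
}
summary = summarize_sessions(sample)
```

To run it against a real file, load the file with `json.load` and pass the resulting dict to `summarize_sessions`.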

Gateway crash output on every start attempt:

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

Setting NODE_OPTIONS="--max-old-space-size=4096" had no effect because the gateway spawns a child Node process that does not inherit the env var.

Expected Behavior

  • A cron job with an existing "running" isolated session should not spawn additional runs until the current one reaches a terminal state (or is force-expired).
  • The session reaper should enforce a maximum lifetime for "running" sessions (e.g., force-expire after 1 h) instead of skipping them.
  • capEntryCount should apply to all sessions regardless of status.
  • The gateway should not load sessions.json entirely into RAM at startup (see #51097).

Root Cause Analysis

The failure is a chain of three independent gaps:

Gap 1: Sessions never reach a terminal state. As documented in #56121, the cron execution path does not update the session entry when a run completes or fails. A likely trigger is model rejection/refusal: the run is spawned, the LLM declines the task or errors before writing any output, and no terminal state is ever written back, so the session stays in "running" indefinitely.

Gap 2: Safeguards ignore "running" sessions. The 24 h session reaper TTL targets completed sessions only, so "running" sessions are never reaped. capEntryCount (500-entry cap) appears to exempt them in the same way. Both safeguards were introduced for the scenario in #12289, not for permanently stuck runs.
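
To make this gap concrete, the sketch below contrasts a reaper that only considers non-running sessions with one that also enforces a maximum lifetime on "running" entries. The function names, the `status` and `updatedAt` (epoch milliseconds) fields, the "completed" status value and the 1 h limit are illustrative assumptions, not openclaw's actual reaper code.

```python
HOUR_MS = 3600 * 1000


def reap_completed_only(sessions: dict, now_ms: int, ttl_ms: int = 24 * HOUR_MS) -> dict:
    """Behaviour described in Gap 2: "running" sessions are never expired."""
    return {
        key: s for key, s in sessions.items()
        if s.get("status") == "running" or now_ms - s.get("updatedAt", 0) <= ttl_ms
    }


def reap_with_running_cap(sessions: dict, now_ms: int,
                          ttl_ms: int = 24 * HOUR_MS,
                          max_running_ms: int = HOUR_MS) -> dict:
    """Variant that also drops "running" sessions not updated within max_running_ms."""
    kept = {}
    for key, s in sessions.items():
        age_ms = now_ms - s.get("updatedAt", 0)
        # Running sessions get the shorter limit; everything else uses the normal TTL.
        limit = max_running_ms if s.get("status") == "running" else ttl_ms
        if age_ms <= limit:
            kept[key] = s
    return kept
```

This sketch deletes expired entries outright. A real fix might instead mark them with a terminal status first, so that the normal TTL handles removal.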

Gap 3: Scheduler spawns new runs unconditionally. The cron scheduler does not check whether a "running" isolated session already exists for a job before creating a new one. Combined with gaps 1 and 2, every tick adds another permanent entry.
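
A guard of the kind this gap calls for might look like the sketch below. It is not the scheduler's real code: the `jobId` field, the session key format and both function names are hypothetical and serve only to show the check-before-spawn idea.

```python
def should_spawn(job_id: str, sessions: dict) -> bool:
    """Return False when the job already has a "running" isolated session."""
    return not any(
        s.get("jobId") == job_id and s.get("status") == "running"
        for s in sessions.values()
    )


def on_tick(job_id: str, sessions: dict, now_ms: int) -> bool:
    """Create at most one concurrent run per job; return True if a run was created."""
    if not should_spawn(job_id, sessions):
        return False
    # Hypothetical key format, used here only to keep entries unique.
    sessions[f"cron:run:{job_id}:{now_ms}"] = {
        "jobId": job_id,
        "status": "running",
        "updatedAt": now_ms,
    }
    return True
```

On its own this guard would block a job forever if its run never terminates, so it only makes sense together with a maximum lifetime for "running" sessions.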

OOM crash loop. When the gateway restarts, it loads sessions.json entirely into RAM. Per #51097, this causes roughly 180× memory amplification, so a 185 MB file would need about 33 GB of heap (185 MB × 180), well above the V8 default limit of ~2.3 GB. The gateway crashes before it can service any request, restarts automatically, and crashes again.
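
The arithmetic behind these figures is easy to check. The snippet below takes the ~180× factor and the ~2.3 GB limit quoted in this report as given, uses decimal units (1 GB = 1000 MB), and uses function names that are purely illustrative.

```python
AMPLIFICATION = 180    # approximate factor cited from #51097
HEAP_LIMIT_MB = 2300   # ~2.3 GB default heap limit quoted in this report


def required_heap_mb(file_mb: float, amplification: float = AMPLIFICATION) -> float:
    """Heap needed to load a sessions.json of file_mb megabytes."""
    return file_mb * amplification


def max_loadable_file_mb(heap_limit_mb: float = HEAP_LIMIT_MB,
                         amplification: float = AMPLIFICATION) -> float:
    """Largest sessions.json that fits under the given heap limit."""
    return heap_limit_mb / amplification


needed_gb = required_heap_mb(185) / 1000           # heap for the 185 MB file
largest_default_mb = max_loadable_file_mb()        # with the ~2.3 GB limit
largest_4096_mb = max_loadable_file_mb(4096)       # with --max-old-space-size=4096
```

Under these assumptions the 185 MB file needs about 33.3 GB, the largest loadable file is about 12.8 MB at the default limit, and raising the limit to 4096 MB only moves that to about 22.8 MB.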

Workaround Applied

Manually filtered sessions.json with a Python script, removing all sessions with updatedAt within the last 36 hours (6,523 entries removed, 16 kept). File shrank from 185 MB to 0.4 MB. Gateway started normally.

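import json
import time

# Load the whole session store (one JSON object keyed by session key).
with open("sessions.json") as f:
    data = json.load(f)

# Keep only sessions last updated more than 36 hours ago.
# updatedAt is divided by 1000 because it is treated as epoch milliseconds;
# entries without updatedAt default to 0 and are therefore kept.
cutoff = time.time() - 36 * 3600
kept = {k: v for k, v in data.items()
        if v.get("updatedAt", 0) / 1000 < cutoff}

# Overwrite the file in place with the filtered sessions.
with open("sessions.json", "w") as f:
    json.dump(kept, f)

print(f"Kept {len(kept)} / {len(data)} sessions")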
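
The script above removes every recently updated session, including healthy ones. A narrower variant is sketched below. It assumes cron run keys contain `cron:run`, that stuck entries carry `status` equal to "running", and that `updatedAt` is in epoch milliseconds. It also writes a `.bak` copy and replaces the file atomically. The helper names are hypothetical.

```python
import json
import os
import shutil
import time


def drop_stuck_cron_runs(sessions: dict, now_s: float, max_age_s: float = 3600) -> dict:
    """Remove cron-run entries that are still "running" and older than max_age_s."""
    def is_stuck(key: str, entry: dict) -> bool:
        age_s = now_s - entry.get("updatedAt", 0) / 1000  # updatedAt assumed to be epoch ms
        return "cron:run" in key and entry.get("status") == "running" and age_s > max_age_s

    return {k: v for k, v in sessions.items() if not is_stuck(k, v)}


def clean_file(path: str, max_age_s: float = 3600) -> tuple:
    """Back up path, rewrite it without stuck cron runs, and return (kept, total)."""
    shutil.copyfile(path, path + ".bak")
    with open(path) as f:
        data = json.load(f)
    kept = drop_stuck_cron_runs(data, time.time(), max_age_s)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(kept, f)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem
    return len(kept), len(data)
```

Entries with no status at all, such as the "unknown" runs in the table above, are not matched by this filter and would need a different criterion.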

Proposed Fixes

  1. Session reaper: cap maximum lifetime for "running" sessions. Force-expire any session with status=running whose updatedAt is older than a configurable threshold (e.g., 1 h).

  2. Cron scheduler: guard against existing running sessions. Before spawning a new isolated run, check whether a session with status=running already exists for that job. If one does, either skip this tick or abort the stuck run before creating a new one.

  3. capEntryCount: apply to all statuses. The current eviction policy should not exempt "running" sessions from the cap.

  4. Lazy / streaming load of sessions.json at startup. Addressed separately in #51097, but critical to prevent a large-but-not-massive file from triggering OOM during startup.
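
As an illustration of fix 3, a status-agnostic cap could be as simple as keeping the most recently updated entries. This is a sketch only: `cap_entry_count` here is a stand-in written for this example, not the existing capEntryCount implementation, and it assumes each entry has a numeric `updatedAt`.

```python
def cap_entry_count(sessions: dict, max_entries: int = 500) -> dict:
    """Keep the max_entries most recently updated sessions, whatever their status."""
    if len(sessions) <= max_entries:
        return dict(sessions)
    newest_first = sorted(
        sessions.items(),
        key=lambda item: item[1].get("updatedAt", 0),
        reverse=True,
    )
    return dict(newest_first[:max_entries])
```

Evicting by recency alone can drop a run that is still active, which is why pairing the cap with a bounded lifetime for "running" sessions seems safer than relying on either one by itself.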

Related Issues

  • #56121 — Cron sessions stuck in status=running after completion (root state-machine bug)
  • #57058 — Stuck cron jobs locked in memory, no cancel/reset API
  • #59056 — runningAtMs zombie state regression
  • #51097 — Gateway loads full sessions.json into RAM (open)
  • #17820 — Map memory leak
  • #58802 — fileEntries growth
  • #12289 / #15225 — Original session accumulation (safeguards introduced but insufficient for stuck runs)
  • #14434 / #13900 — Related session lifecycle issues

extent analysis

TL;DR

To recover a gateway caught in the OOM crash loop caused by stuck cron runs, manually filter sessions.json as a workaround. To keep it from recurring, implement the proposed fixes that address the root causes.

Guidance

  1. Manually filter sessions.json: Use a script like the provided Python example to remove stuck sessions and prevent the gateway from crashing due to excessive memory usage.
  2. Implement a session reaper for "running" sessions: Force-expire sessions with status=running and updatedAt older than a configurable threshold to prevent them from accumulating indefinitely.
  3. Modify the cron scheduler to check for existing "running" sessions: Before spawning a new isolated run, check whether a session with status=running already exists for that job; if so, skip the new run or abort the stuck one first.
  4. Apply capEntryCount to all session statuses: Ensure that the eviction policy does not exempt "running" sessions from the cap to prevent excessive session accumulation.


Notes

The proposed fixes address the root causes, but values such as the 1 h force-expiry threshold are examples and should be tuned per deployment. The lazy/streaming load of sessions.json at startup, tracked in #51097, is also critical to preventing OOM crashes.

Recommendation

Apply the proposed fixes, starting with implementing a session reaper for "running" sessions and modifying the cron scheduler to check for existing "running" sessions, as these address the primary root causes of the issue.
