openclaw - ✅ (Solved) Fix: Background PTY exec runs can survive restart/session loss and become untracked orphan process trees [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#65983 · Fetched 2026-04-14 05:39:27
Comments: 0 · Participants: 1 · Timeline events: 1 (cross-referenced ×1) · Reactions: 0

Background exec runs launched with pty: true appear to rely on in-memory-only ownership. If the gateway restarts or otherwise loses run state, the PTY-backed worker tree can remain alive while OpenClaw no longer knows about it. In practice this looks like orphaned Codex/OMX/MCP helper trees continuing to consume memory even when process list reports no running sessions.

Fix Action

Fixed

PR fix notes

PR #66001: fix(process): reconcile orphaned bash exec runs

Description (problem / solution / changelog)

Summary

  • add a real reconcileOrphans(...) contract to the process supervisor
  • reconcile active managed runs against bash session ownership
  • invoke reconciliation before serving process-tool requests and after registering new exec sessions
  • add focused tests for orphan cancellation and owned-run preservation
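
A minimal sketch of what such a reconciliation contract might look like. Only the name reconcileOrphans comes from the PR notes; MiniSupervisor, ManagedRun, ReconcileResult, and the isSessionOwned callback are hypothetical and do not mirror the actual types in src/process/supervisor/types.ts.

```typescript
// Hypothetical shapes for illustration; not the real OpenClaw types.
interface ManagedRun {
  runId: string;
  sessionId: string;
  cancelled: boolean;
}

interface ReconcileResult {
  cancelled: string[]; // run ids cancelled because no session owns them
  kept: string[]; // run ids still owned by a tracked session
}

class MiniSupervisor {
  private runs = new Map<string, ManagedRun>();

  register(runId: string, sessionId: string): void {
    this.runs.set(runId, { runId, sessionId, cancelled: false });
  }

  // Cancel every active run whose session the caller no longer tracks.
  reconcileOrphans(isSessionOwned: (sessionId: string) => boolean): ReconcileResult {
    const result: ReconcileResult = { cancelled: [], kept: [] };
    this.runs.forEach((run) => {
      if (isSessionOwned(run.sessionId)) {
        result.kept.push(run.runId);
      } else {
        run.cancelled = true; // a real supervisor would also signal the process tree
        result.cancelled.push(run.runId);
      }
    });
    // Drop cancelled runs so a later pass does not report them again.
    result.cancelled.forEach((id) => {
      this.runs.delete(id);
    });
    return result;
  }

  activeRunIds(): string[] {
    return Array.from(this.runs.keys());
  }
}

const supervisor = new MiniSupervisor();
supervisor.register("run-1", "session-a");
supervisor.register("run-2", "session-b");

// Only session-a is still tracked, so run-2 is treated as an orphan.
const trackedSessions = new Set<string>(["session-a"]);
const outcome = supervisor.reconcileOrphans((id) => trackedSessions.has(id));
console.log(outcome);
```

The two result lists correspond to the two test cases named in the summary: orphan cancellation and owned-run preservation.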

Problem

OpenClaw already tracks bash exec/process sessions separately from the supervisor's managed runs. When that session ownership drifts or disappears inside the live process, the supervisor could keep a run alive while the process tool no longer considers the session tracked. Before this patch, reconcileOrphans() was a deliberate no-op, so those runs were never cleaned up.

This does not implement durable persistence or true cold-start orphan reaping. It is the smaller, defensible fix for the in-process/session-loss half of the problem.
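
The two call sites named in the summary can be sketched as follows. This is an illustration under assumptions: registerExecSession and handleProcessList are hypothetical function names, and the reconcileOrphans stub only counts invocations instead of cancelling anything.

```typescript
// Hypothetical stand-ins for illustration; not the real OpenClaw modules.
let reconcileCalls = 0;

function reconcileOrphans(): void {
  reconcileCalls += 1; // a real implementation would cancel unowned runs here
}

const execSessions: string[] = [];

// Call site 1: reconcile right after a new exec session is registered.
function registerExecSession(sessionId: string): void {
  execSessions.push(sessionId);
  reconcileOrphans();
}

// Call site 2: reconcile before serving a process-tool request.
function handleProcessList(): string[] {
  reconcileOrphans();
  return execSessions.slice();
}

registerExecSession("session-a");
const listedSessions = handleProcessList();
console.log(listedSessions, reconcileCalls);
```

Running reconciliation at both points means drift is caught whenever the set of sessions changes and again before state is reported, all within the live process.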

Closes #65983

Tests

  • pnpm test src/process/supervisor/supervisor.test.ts src/agents/bash-tools.process.supervisor.test.ts

Changed files

  • src/agents/bash-process-registry.ts (modified, +4/-0)
  • src/agents/bash-tools.exec-runtime.ts (modified, +4/-0)
  • src/agents/bash-tools.process.supervisor.test.ts (modified, +24/-0)
  • src/agents/bash-tools.process.ts (modified, +4/-0)
  • src/process/supervisor/index.ts (modified, +2/-0)
  • src/process/supervisor/supervisor.test.ts (modified, +50/-0)
  • src/process/supervisor/supervisor.ts (modified, +21/-4)
  • src/process/supervisor/types.ts (modified, +10/-1)
Original issue text

Observed behavior

On a live macOS install, I observed repeated helper-process families rooted under a long-lived Codex worker, including:

  • repeated npm exec mcp-remote https://mcp.firecrawl.dev/...
  • repeated npm exec mcp-remote https://mcp.exa.ai/mcp
  • repeated npm exec mcp-remote https://api.ref.tools/mcp
  • repeated OMX sidecars such as:
    • oh-my-codex/dist/mcp/code-intel-server.js
    • trace-server.js
    • state-server.js
    • memory-server.js
    • team-server.js

At one point the repeated remote MCP fingerprints showed 12 copies each for Firecrawl / Exa / Ref, all under a Codex-rooted worker tree, while OpenClaw runtime state did not show a corresponding tracked run:

  • ~/.openclaw/subagents/runs.json was empty ({"version":2,"runs":{}})
  • process(action="list") returned No running or recent sessions.

That combination strongly suggests a live worker tree can outlive tracked ownership.
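
A rough sketch of how such duplicate counts could be gathered from a process listing. The countFingerprints helper is hypothetical, and the sample lines are fabricated input for illustration, not output captured from the affected machine. On macOS the lines might come from something like `ps -axo pid,ppid,command`.

```typescript
// Count how many process command lines contain each fingerprint substring.
function countFingerprints(psLines: string[], fingerprints: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const fingerprint of fingerprints) {
    counts[fingerprint] = psLines.filter((line) => line.indexOf(fingerprint) !== -1).length;
  }
  return counts;
}

// Fabricated sample input (pid, ppid, command); not real captured output.
const samplePsLines: string[] = [
  "501 400 npm exec mcp-remote https://mcp.exa.ai/mcp",
  "502 400 npm exec mcp-remote https://mcp.exa.ai/mcp",
  "503 400 npm exec mcp-remote https://api.ref.tools/mcp",
  "504 400 node oh-my-codex/dist/mcp/code-intel-server.js",
];

const helperCounts = countFingerprints(samplePsLines, [
  "mcp.exa.ai",
  "api.ref.tools",
  "code-intel-server.js",
  "mcp.firecrawl.dev",
]);
console.log(helperCounts);
```

A count that keeps growing for the same fingerprint under one parent, while the process tool reports nothing, is the signature described above.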

Why this looks like an OpenClaw lifecycle bug

The supervisor implementation currently documents orphan reconciliation as a deliberate no-op:

  • src/process/supervisor/supervisor.ts
    • reconcileOrphans(): Promise<void>
    • comment: Deliberate no-op: this supervisor uses in-memory ownership only. Active runs are not recovered after process restart in the current model.

Relevant launch-path differences:

Generic exec PTY path

  • src/agents/bash-tools.exec-runtime.ts
    • runExecProcess(...)
    • when usePty is true, this spawns with mode: "pty"
  • src/agents/bash-tools.exec.ts
    • background runs can end up with effectiveTimeout = null when backgroundTimeoutBypass applies
  • I do not see the same explicit scope-replacement semantics here that the CLI-runner path uses

CLI-runner path

  • src/agents/cli-runner/execute.ts
    • uses mode: "child"
    • passes scopeKey
    • passes replaceExistingScope: Boolean(useResume && scopeKey)
    • passes noOutputTimeoutMs

Process visibility path

  • src/agents/bash-tools.process.ts
    • process list is driven from in-memory session/registry state
  • src/agents/bash-process-registry.ts
    • running/finished sessions are also in-memory maps

So if the gateway restarts, reloads, or otherwise loses that state, OpenClaw can truthfully report no running sessions while the actual PTY child tree is still alive.
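
The effect can be reproduced in miniature. In this sketch the InMemoryRegistry class, the livePids set (a stand-in for the OS process table), and PID 4242 are all made up for illustration; only the "No running or recent sessions." message is taken from the issue.

```typescript
// Stand-in for the OS process table, which outlives the gateway.
const livePids = new Set<number>();

class InMemoryRegistry {
  private running = new Map<string, number>(); // sessionId -> pid

  start(sessionId: string, pid: number): void {
    this.running.set(sessionId, pid);
    livePids.add(pid);
  }

  list(): string {
    return this.running.size === 0
      ? "No running or recent sessions."
      : Array.from(this.running.keys()).join(", ");
  }
}

let sessionRegistry = new InMemoryRegistry();
sessionRegistry.start("session-a", 4242);
const beforeRestart = sessionRegistry.list();

// Simulated gateway restart: in-memory state is gone, the child process is not.
sessionRegistry = new InMemoryRegistry();
const afterRestart = sessionRegistry.list();
const workerStillAlive = livePids.has(4242);

console.log({ beforeRestart, afterRestart, workerStillAlive });
```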

Why this matters

This can produce a bad failure mode for long-lived Codex/OMX workers launched through generic background PTY exec:

  1. worker is launched and tracked only in memory
  2. gateway restarts or loses run state
  3. worker tree survives
  4. OpenClaw forgets it
  5. Codex/OMX inside that still-living worker can reopen/reconnect MCP helpers over time
  6. helper processes accumulate and total memory climbs

Even if Codex/OMX has its own cleanup bug, OpenClaw currently appears to make that failure mode much worse by allowing PTY-backed background workers to become unowned.

Proposed fix direction

Minimum viable fix

  1. Persist active background exec metadata for PTY-backed runs
    • run id
    • pid
    • scope key
    • session id
    • backend id
    • started at
  2. Implement real reconcileOrphans() on startup
  3. On startup, kill or re-register stale PTY-backed runs from a previous gateway epoch
  4. Remove persisted records on normal finalize
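
A sketch of steps 1 and 4, plus the stale-run lookup that steps 2 and 3 would need. The record fields follow the list above, but the field names and types, the gatewayEpoch marker, the TextStore seam, and the RunLedger class are all assumptions made for illustration. The store here is in memory; a real one would write to disk.

```typescript
// Field list follows the issue's proposal; names and types are assumptions.
interface PersistedRun {
  runId: string;
  pid: number;
  scopeKey: string | null;
  sessionId: string;
  backendId: string;
  startedAt: number; // epoch milliseconds
  gatewayEpoch: string; // hypothetical id of the gateway instance that launched the run
}

// Hypothetical storage seam so the sketch stays independent of any file API.
interface TextStore {
  read(): string | null;
  write(text: string): void;
}

class RunLedger {
  private store: TextStore;

  constructor(store: TextStore) {
    this.store = store;
  }

  load(): PersistedRun[] {
    const text = this.store.read();
    return text === null ? [] : (JSON.parse(text) as PersistedRun[]);
  }

  // Step 1: persist metadata when a PTY-backed run starts.
  record(run: PersistedRun): void {
    const runs = this.load().filter((r) => r.runId !== run.runId);
    runs.push(run);
    this.store.write(JSON.stringify(runs));
  }

  // Step 4: remove the record on normal finalize.
  finalize(runId: string): void {
    this.store.write(JSON.stringify(this.load().filter((r) => r.runId !== runId)));
  }

  // Input for steps 2-3: runs left behind by a previous gateway epoch.
  staleRuns(currentEpoch: string): PersistedRun[] {
    return this.load().filter((r) => r.gatewayEpoch !== currentEpoch);
  }
}

let savedText: string | null = null;
const memoryStore: TextStore = {
  read: () => savedText,
  write: (text) => {
    savedText = text;
  },
};

const base = { scopeKey: null, sessionId: "s1", backendId: "pty", startedAt: 0, gatewayEpoch: "epoch-1" };
const ledger = new RunLedger(memoryStore);
ledger.record({ ...base, runId: "run-a", pid: 101 });
ledger.record({ ...base, runId: "run-b", pid: 102 });
ledger.finalize("run-b");

// After a restart, a new ledger over the same store sees what survived.
const staleAfterRestart = new RunLedger(memoryStore).staleRuns("epoch-2");
console.log(staleAfterRestart.map((r) => r.runId));
```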

Additional hardening

  1. Add explicit scope replacement for long-lived PTY background runs where a stable scope exists
  2. Add a default noOutputTimeoutMs / watchdog for background PTY runs instead of letting them drift forever when no explicit timeout is set
  3. Add lifecycle logging around:
    • PTY spawn
    • scope replacement / cancellation
    • finalize
    • orphan sweep results
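
The watchdog in item 2 could be as simple as tracking the time of the last output. The sketch below is deterministic on purpose: the hypothetical NoOutputWatchdog class takes timestamps as arguments instead of arming a timer, and the 30-second value is an arbitrary example, not a proposed default.

```typescript
class NoOutputWatchdog {
  private timeoutMs: number;
  private lastOutputAt: number;

  constructor(timeoutMs: number, startedAt: number) {
    this.timeoutMs = timeoutMs;
    this.lastOutputAt = startedAt;
  }

  // Call whenever the PTY produces output.
  noteOutput(now: number): void {
    this.lastOutputAt = now;
  }

  // True once the run has been silent for at least timeoutMs.
  isExpired(now: number): boolean {
    return now - this.lastOutputAt >= this.timeoutMs;
  }
}

const watchdog = new NoOutputWatchdog(30000, 0);
watchdog.noteOutput(10000);
const expiredEarly = watchdog.isExpired(35000); // 25 s of silence
const expiredLate = watchdog.isExpired(40000); // 30 s of silence
console.log({ expiredEarly, expiredLate });
```

A real implementation would check expiry on a timer and cancel the run, logging the event alongside the other lifecycle points listed above.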

Files / code points inspected

  • src/process/supervisor/supervisor.ts
  • src/process/supervisor/adapters/pty.ts
  • src/process/supervisor/adapters/child.ts
  • src/process/supervisor/registry.ts
  • src/agents/bash-tools.exec-runtime.ts
  • src/agents/bash-tools.exec.ts
  • src/agents/bash-tools.process.ts
  • src/agents/cli-runner/execute.ts
  • src/agents/bash-process-registry.ts

Important note

I am not claiming the entire leak is only OpenClaw. It is possible Codex/OMX is also failing to clean up helper bundles inside a still-running worker. But OpenClaw currently appears to have a real ownership / cleanup gap for background PTY exec runs that can leave those workers alive but untracked after restart or session-state loss.

If helpful, I can open a follow-up PR once there is agreement on whether the preferred behavior is:

  • kill stale PTY exec orphans on startup, or
  • attempt to reattach / re-register them.
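
The two options differ only in what happens to an orphan that is still alive, which a sketch makes concrete. Everything here is hypothetical: the OrphanPolicy type, the sweepOrphans function, and the injected SweepDeps. On Node, the liveness probe could plausibly wrap process.kill(pid, 0), which tests for existence without sending a signal.

```typescript
type OrphanPolicy = "kill" | "reattach";

interface OrphanRecord {
  runId: string;
  pid: number;
}

interface SweepDeps {
  isAlive(pid: number): boolean;
  kill(pid: number): void;
  reattach(record: OrphanRecord): void;
}

interface SweepReport {
  pruned: string[]; // process already gone; only the record is dropped
  killed: string[];
  reattached: string[];
}

function sweepOrphans(records: OrphanRecord[], policy: OrphanPolicy, deps: SweepDeps): SweepReport {
  const report: SweepReport = { pruned: [], killed: [], reattached: [] };
  for (const record of records) {
    if (!deps.isAlive(record.pid)) {
      report.pruned.push(record.runId);
    } else if (policy === "kill") {
      deps.kill(record.pid); // a real sweep would need to end the whole process tree
      report.killed.push(record.runId);
    } else {
      deps.reattach(record);
      report.reattached.push(record.runId);
    }
  }
  return report;
}

// Fake dependencies: only pid 101 is "alive".
const fakeDeps: SweepDeps = {
  isAlive: (pid) => pid === 101,
  kill: () => {},
  reattach: () => {},
};
const orphanRecords: OrphanRecord[] = [
  { runId: "r1", pid: 101 },
  { runId: "r2", pid: 202 },
];

const killReport = sweepOrphans(orphanRecords, "kill", fakeDeps);
const reattachReport = sweepOrphans(orphanRecords, "reattach", fakeDeps);
console.log(killReport, reattachReport);
```

Either policy still has to prune records whose process has already exited, so that part of the sweep is common to both choices.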

Extent analysis

TL;DR

To stop orphaned Codex/OMX/MCP helper trees from consuming memory after a gateway restart, persist active background exec metadata for PTY-backed runs and implement a real reconcileOrphans() that kills or re-registers stale runs on startup.

Guidance

  • Persist active background exec metadata for PTY-backed runs, including run id, pid, scope key, session id, backend id, and started at timestamp.
  • Implement a real reconcileOrphans() function in src/process/supervisor/supervisor.ts to kill or re-register stale PTY-backed runs from a previous gateway epoch on startup.
  • Add explicit scope replacement for long-lived PTY background runs where a stable scope exists to prevent accumulation of helper processes.
  • Consider adding a default noOutputTimeoutMs / watchdog for background PTY runs to prevent them from running indefinitely without output.

Example

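// src/process/supervisor/supervisor.ts
// Sketch only: loadPersistedRuns, isStaleRun, killRun, and reRegisterRun are
// hypothetical helpers, not existing OpenClaw APIs.
async function reconcileOrphans(): Promise<void> {
  // Load persisted metadata for PTY-backed runs
  const persistedRuns = await loadPersistedRuns();

  for (const run of persistedRuns) {
    if (isStaleRun(run)) {
      // Run belongs to a previous gateway epoch: terminate it
      await killRun(run.pid);
    } else {
      // Run is still valid: restore it to in-memory tracking
      reRegisterRun(run);
    }
  }
}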

Notes

The issue proposes persisting active background exec metadata and implementing a real reconcileOrphans() function. PR #66001 implements only the in-process reconciliation; by its own description it does not add durable persistence or cold-start orphan reaping. Codex/OMX may also have its own cleanup bug, so OpenClaw's fix should be coordinated with any changes needed there.

Recommendation

Apply the proposed minimum viable fix by persisting active background exec metadata and implementing a real reconcileOrphans() function to kill or re-register stale PTY-backed runs on startup. This will help prevent the accumulation of helper processes and reduce memory consumption.
