openclaw - ✅(Solved) Fix claude-live-session: bundled MCP tempDir cleaned up while persistent CLI subprocess still uses it (race) — plus misleading billing-error template [1 pull requests, 1 participants]

openclaw · 2026-04-28 04:02:27


openclaw/openclaw#73244•Fetched 2026-04-29 06:21:54


Author

Demented2984

Participants

Demented2984

Timeline (top)

closed ×1, cross-referenced ×1

When the model-fallback runner exhausts all configured candidates with a mix of reason: "timeout" / reason: "auth" / etc., the user-visible delivery surface displays:

⚠️ API provider returned a billing error — your API key has run out of credits or has an insufficient balance. Check your provider's billing dashboard and top up or switch to a different API key.

This is wrong for OAuth subscriptions (subscriptionType: "max", no API key in play) and was wrong in the case I hit (Bug 1 — the failures were tempDir races, not quota or credit). It sent me chasing a billing problem that didn't exist.

Error Message

$ claude -p "say OK" --model sonnet --setting-sources user \
    --mcp-config /tmp/openclaw-cli-mcp-XXXXXX/mcp.json \
    --plugin-dir /tmp/openclaw/openclaw-claude-skills-YYYYYY/openclaw-skills \
    --strict-mcp-config
Error: Invalid MCP configuration:
MCP config file not found: /tmp/openclaw-cli-mcp-XXXXXX/mcp.json

Root Cause

prepareModeSpecificBundleMcpConfig (bundled into prepare.runtime-*.js) returns a cleanup that does fs.rm(tempDir, { recursive: true, force: true }). That cleanup is hung off context.preparedBackend.cleanup and invoked in runCliAgent (cli-runner-*.js):

} finally {
    await context.preparedBackend.cleanup?.();
}

For one-shot supervisor.spawn / managedRun.wait() runs this is correct (subprocess has exited). For the shouldUseClaudeLiveSession(context) branch in executePreparedCliRun (execute.runtime-*.js), the subprocess is intentionally persistent — so the cleanup must be deferred to the live-session lifecycle, not the per-turn one.

The same file already has the right pattern for the skills plugin: claudeSkillsPluginCleanupOwned = true transfers ownership to the live-session cleanup chain. The MCP config tempDir needs the same treatment.
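The failure mode described above can be reproduced with the file system alone. The following is a minimal sketch, not openclaw code: prepareTempMcpConfig and respawnWouldSucceed are hypothetical stand-ins for the bundle-MCP preparation step and for a later respawn that re-reads the path held in argv, and the empty JSON object is placeholder config content.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical stand-in for the bundle-MCP preparation step: creates a temp
// directory holding mcp.json and returns a cleanup that removes the directory.
function prepareTempMcpConfig(): { configPath: string; cleanup: () => void } {
  const tempDir = mkdtempSync(join(tmpdir(), "sketch-cli-mcp-"));
  const configPath = join(tempDir, "mcp.json");
  writeFileSync(configPath, "{}"); // placeholder content, not a real MCP config
  return {
    configPath,
    cleanup: () => rmSync(tempDir, { recursive: true, force: true }),
  };
}

// Hypothetical stand-in for a respawn: a process started with the same argv
// can only load its config if the file still exists on disk.
function respawnWouldSucceed(argv: string[]): boolean {
  const configPath = argv[argv.indexOf("--mcp-config") + 1];
  return existsSync(configPath);
}

const prepared = prepareTempMcpConfig();
const liveArgv = ["claude", "-p", "--mcp-config", prepared.configPath];

const beforeCleanup = respawnWouldSucceed(liveArgv); // config still on disk
prepared.cleanup(); // what the per-turn finally does after every turn
const afterCleanup = respawnWouldSucceed(liveArgv); // argv now holds a stale path

console.log({ beforeCleanup, afterCleanup });
```

The argv array never changes, which is the point: the persistent process keeps the original path for its whole lifetime, so the directory has to live at least as long as the process does.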

Fix Action

Fix / Workaround

Patch verified locally against the bundled dist/execute.runtime-*.js (file still parses, node --check passes). Happy to send a PR if useful.

PR fix notes

PR #73351: fix(cli-runner): transfer bundle-MCP cleanup to live session lifecycle (#73244)

Repository: openclaw/openclaw
Author: edwin-rivera-dev
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/73351

Description (problem / solution / changelog)

What

In Claude live-session mode (claude-live-session), the spawned claude -p --input-format stream-json subprocess is intentionally persistent: it is stored in the liveSessions Map and reused across many turns. Its argv permanently contains --mcp-config <tempDir>/mcp.json.

The bundle-MCP tempDir cleanup was attached to context.preparedBackend.cleanup and fired from src/agents/cli-runner.ts:375 (the outer per-turn finally), so the tempDir was deleted while the live subprocess was still alive. Any later cause for a respawn (no-output-timeout, overall-timeout, closeLiveSession("restart") from a fingerprint mismatch, or an MCP loopback child re-resolution) then read the now-missing path. The failure surfaced as a 180s no-output-timeout cycling through every model in the fallback chain, with the gateway eventually mapping it to a misleading billing-error template.

The bug is reported in detail in #73244 (Bug 1), with strace evidence and a verified minimal patch.

Closes #73244 (Bug 1). Bug 2 in the same issue (the misleading billing-error template) is a separate fix that I am not addressing here.

Why this fix

The same file already uses an ownership-transfer pattern for the skills plugin: claudeSkillsPluginCleanupOwned = true hands the cleanup off to the live-session lifecycle via the cleanup parameter to runClaudeLiveSessionTurn. The bundle-MCP tempDir needs the same treatment.

This PR:

Saves context.preparedBackend.cleanup into a local ownedPreparedBackendCleanup before invoking runClaudeLiveSessionTurn.
Sets context.preparedBackend.cleanup = undefined so the outer runCliAgent finally becomes a no-op for this turn.
Wraps the live-session cleanup callback to call both claudeSkillsPlugin.cleanup (existing behavior) and the saved ownedPreparedBackendCleanup.

runClaudeLiveSessionTurn already guards with cleanupDone, so the wrapped cleanup is idempotent. For session-reuse turns (cleanupTurnArtifacts && session), the wrapped cleanup runs immediately — the freshly-prepared per-turn tempDir was never used by the persistent subprocess (which holds the first turn's tempDir), so deleting it is correct. For fresh-session turns, the cleanup is stored on the live session and only fires when the session actually closes (timeout, fingerprint mismatch restart, or supervisor cancel).
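The two properties this paragraph relies on, run-at-most-once and "second cleanup runs even if the first throws", can be sketched in isolation. This is an illustration under assumptions, not the code in execute.ts: once and chainCleanups are hypothetical helpers, and the real cleanupDone guard lives inside runClaudeLiveSessionTurn.

```typescript
type Cleanup = () => Promise<void>;

// Hypothetical guard in the spirit of cleanupDone: the wrapped cleanup runs
// at most once, and later calls resolve without doing anything.
function once(cleanup: Cleanup): Cleanup {
  let done = false;
  return async () => {
    if (done) return;
    done = true;
    await cleanup();
  };
}

// Runs `first`, then `second` even if `first` throws, mirroring the
// try/finally shape of the wrapped live-session cleanup.
function chainCleanups(first: Cleanup, second?: Cleanup): Cleanup {
  return async () => {
    try {
      await first();
    } finally {
      await second?.();
    }
  };
}

const calls: string[] = [];
const skillsCleanup: Cleanup = async () => {
  calls.push("skills");
};
const mcpTempDirCleanup: Cleanup = async () => {
  calls.push("mcp-tempdir");
};

const sessionCleanup = once(chainCleanups(skillsCleanup, mcpTempDirCleanup));

// Two close paths reaching the same session (say, a timeout and a restart):
// both calls are safe, but the underlying work happens only once.
const settled = Promise.all([sessionCleanup(), sessionCleanup()]).then(() => calls);
settled.then((order) => console.log(order));
```

Setting the flag before awaiting the cleanup, rather than after, is what keeps two overlapping callers from both entering the body.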

The one-shot non-live branch (else after shouldUseClaudeLiveSession) is untouched, so non-live runs continue to clean up immediately in the outer runCliAgent finally as before.

Tests

Added transfers preparedBackend.cleanup ownership to the Claude live session lifecycle (#73244) in src/agents/cli-runner.spawn.test.ts alongside the existing live-session tests. The test:

Builds a PreparedCliRunContext with liveSession: "claude-stdio" and a preparedBackend.cleanup mock.
Runs executePreparedCliRun once (creating the live session).
Asserts that context.preparedBackend.cleanup is undefined afterwards (ownership transferred).
Asserts that the cleanup mock has not been called yet (deferred to the live-session lifecycle, not deleted out from under the persistent subprocess).
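The two assertions can be illustrated without the project's test harness. The sketch below is an assumption-laden stand-in, not the actual test in cli-runner.spawn.test.ts: PreparedContextSketch and takeBackendCleanup are hypothetical names, and a plain counter replaces the mock.

```typescript
type Cleanup = () => Promise<void>;

// Hypothetical, reduced shape of the prepared context.
interface PreparedContextSketch {
  preparedBackend: { cleanup: Cleanup | undefined };
}

// Hypothetical helper: takes the cleanup off the context so the per-turn
// finally sees undefined, and returns it for the live session to own.
function takeBackendCleanup(context: PreparedContextSketch): Cleanup | undefined {
  const owned = context.preparedBackend.cleanup;
  context.preparedBackend.cleanup = undefined;
  return owned;
}

let cleanupCalls = 0;
const context: PreparedContextSketch = {
  preparedBackend: {
    cleanup: async () => {
      cleanupCalls += 1;
    },
  },
};

// Live-session branch: ownership moves to the session before the turn runs.
const ownedBySession = takeBackendCleanup(context);

// Per-turn finally: an optional call on undefined does nothing.
void context.preparedBackend.cleanup?.();
const callsAfterTurn = cleanupCalls;

// Session close: the owned cleanup runs only now.
void ownedBySession?.();
const callsAfterClose = cleanupCalls;

console.log({ callsAfterTurn, callsAfterClose });
```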

Existing live-session tests in the same file (reuses a Claude live session process across turns, etc.) continue to pass — they do not set preparedBackend.cleanup, so the new branch is a no-op for them.

Notes

Single-area diff: src/agents/cli-runner/execute.ts, its existing test file, and a one-line CHANGELOG entry under ## Unreleased ### Fixes with Thanks @edwin-rivera-dev.
No Plugin SDK or other-extension files touched.
The patch follows the exact shape suggested by the issue reporter, who verified locally that the bundled dist/execute.runtime-*.js parses with node --check and that the failure mode reproduces consistently.
AI-assisted (Claude). Reviewed locally; please flag any oversight, especially around the live-session cleanup chain (closeLiveSession, cleanupDone idempotence, supervisor cancel paths).

Validation

I have not run the full pnpm test/pnpm check:changed lanes locally for environment reasons, so I am relying on CI for the canonical proof. I have:

Mirrored the ownership-transfer pattern from the existing skills-plugin cleanup, so the change stays consistent with surrounding code.
Confirmed the new test reuses the same supervisor/stdin mocking pattern the existing live-session tests already validate.
Confirmed the changelog entry is single-line, anchored to a contributor, and references the issue.

Happy to address any Greptile/Codex review feedback or extend coverage to the closeLiveSession/restart path if reviewers want it.

Changed files

CHANGELOG.md (modified, +1/-0)
src/agents/cli-runner.spawn.test.ts (modified, +54/-0)
src/agents/cli-runner/execute.ts (modified, +9/-1)
src/video-generation/provider-registry.test.ts (modified, +4/-4)
test/scripts/npm-telegram-live.test.ts (modified, +0/-4)
test/vitest-scoped-config.test.ts (modified, +1/-6)



Bug 1 — Bundled MCP tempDir cleaned up while persistent CLI subprocess still references it

Summary

In claude-live-session mode, the spawned claude -p --input-format stream-json subprocess persists across many turns (stored in the liveSessions Map and reused). Its argv permanently contains --mcp-config /tmp/openclaw-cli-mcp-XXXXXX/mcp.json. However, the bundled-MCP tempDir cleanup runs in the outer runCliAgent finally block — i.e. after every turn returns — which deletes the tempDir while the subprocess is still alive.

The subprocess loads the config only at startup, so the dangling reference is harmless until something causes a re-read or a respawn:

supervisor respawn after no-output-timeout / overall-timeout
live-session fingerprint mismatch on a later turn → closeLiveSession("restart") → fresh spawn re-using stale args
MCP loopback child re-resolution

The user-visible failure manifests as a 180s no-output-timeout (CLI produced no output for 180s and was terminated), cycled through every model in the fallback chain, until the gateway's catch-all maps to a misleading "billing error" template (see Bug 2 below).

Affected version

[email protected]. The relevant code is in the bundled dist/ output; the original source is in src/agents/cli-runner/ per the .d.ts files (bundle-mcp.d.ts, execute.d.ts).

Repro

Run any agent that uses the claude-cli backend with a bundled MCP config (the default path).
Observe the spawned claude subprocess: ps -ef | grep 'claude.*--mcp-config /tmp/openclaw-cli-mcp-'

After the first turn returns, observe that the tempDir referenced by --mcp-config no longer exists on disk:

$ ls -d /tmp/openclaw-cli-mcp-XXXXXX/  # the one in the live process's argv
ls: cannot access '/tmp/openclaw-cli-mcp-XXXXXX/': No such file or directory

Confirm the running subprocess still points at the now-missing path.
To force the failure surface: cause a respawn (e.g. trigger no-output-timeout). The new spawn uses the stale path → Error: MCP config file not found: /tmp/openclaw-cli-mcp-XXXXXX/mcp.json → FailoverError(reason: "timeout") → fallback model.

Reproduced manually:

$ claude -p "say OK" --model sonnet --setting-sources user \
    --mcp-config /tmp/openclaw-cli-mcp-XXXXXX/mcp.json \
    --plugin-dir /tmp/openclaw/openclaw-claude-skills-YYYYYY/openclaw-skills \
    --strict-mcp-config
Error: Invalid MCP configuration:
MCP config file not found: /tmp/openclaw-cli-mcp-XXXXXX/mcp.json

Root cause

} finally {
    await context.preparedBackend.cleanup?.();
}

Proposed fix (minimal, single-hunk)

--- a/src/agents/cli-runner/execute.ts
+++ b/src/agents/cli-runner/execute.ts
@@ -1205,6 +1205,11 @@ async function executePreparedCliRun(context, cliSessionIdToUse) {
                if (shouldUseClaudeLiveSession(context)) {
                    if (!hasJsonlOutput) throw new Error("Claude live session requires JSONL streaming parser");
                    claudeSkillsPluginCleanupOwned = true;
+                   // Live session keeps the spawned CLI subprocess alive across turns; that
+                   // subprocess still holds --mcp-config <tempDir>/mcp.json in its argv. Defer
+                   // the bundle-MCP tempDir cleanup to the live-session lifecycle so the outer
+                   // runCliAgent finally doesn't rm it out from under a still-running process.
+                   const ownedPreparedBackendCleanup = context.preparedBackend.cleanup;
+                   context.preparedBackend.cleanup = void 0;
                    const liveResult = await runClaudeLiveSessionTurn({
                        context,
                        args,
@@ -1224,7 +1229,12 @@ async function executePreparedCliRun(context, cliSessionIdToUse) {
                                }
                            });
                        },
-                       cleanup: claudeSkillsPlugin.cleanup
+                       cleanup: async () => {
+                           try {
+                               await claudeSkillsPlugin.cleanup();
+                           } finally {
+                               await ownedPreparedBackendCleanup?.();
+                           }
+                       }
                    });

Why it's safe:

runClaudeLiveSessionTurn's existing cleanupDone guard makes the wrapped cleanup idempotent.
For session-reuse turns (cleanupTurnArtifacts && session), the wrapped cleanup runs immediately — the freshly-prepared tempDir for that turn was never used by the persistent subprocess, so deleting it is correct.
For first-turn / new-session creation, the cleanup is stored on the live session and only fires when the session closes.
Outer cli-runner finally still calls context.preparedBackend.cleanup?.() but it's now void 0 — optional chaining makes it a no-op.
One-shot mode (the else branch following the shouldUseClaudeLiveSession block) is untouched.
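The fresh-session and session-reuse cases in the list above can be sketched with a small in-memory model. Everything here is illustrative: the string-keyed liveSessions Map, runTurn, closeSession and the tempDir labels are hypothetical, and the cleanups are synchronous only to keep the sketch short.

```typescript
type Cleanup = () => void; // synchronous for brevity; the real cleanups are async

interface LiveSessionSketch {
  argv: string[];
  cleanup: Cleanup;
}

// Hypothetical registry keyed by a session identifier.
const liveSessions = new Map<string, LiveSessionSketch>();

// A fresh session keeps its cleanup until close, while a reuse turn discards
// its own unused per-turn artifacts right away.
function runTurn(key: string, argv: string[], cleanup: Cleanup): "created" | "reused" {
  if (liveSessions.has(key)) {
    cleanup(); // this turn's tempDir was never handed to the live process
    return "reused";
  }
  liveSessions.set(key, { argv, cleanup });
  return "created";
}

function closeSession(key: string): void {
  const session = liveSessions.get(key);
  if (!session) return;
  liveSessions.delete(key);
  session.cleanup(); // only now is the first turn's tempDir removed
}

const removed: string[] = [];
const first = runTurn("main", ["--mcp-config", "tempDir-turn1/mcp.json"], () => {
  removed.push("tempDir-turn1");
});
const second = runTurn("main", ["--mcp-config", "tempDir-turn2/mcp.json"], () => {
  removed.push("tempDir-turn2");
});
const removedWhileLive = removed.slice();

closeSession("main");
console.log({ first, second, removedWhileLive, removed });
```

While the session is live, only the second turn's directory has been removed; the directory named in the live process's argv goes away at close.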

Patch verified locally against the bundled dist/execute.runtime-*.js (file still parses, node --check passes). Happy to send a PR if useful.

Bug 2 — Misleading "billing error" template when fallback chain exhausts

Summary

When the model-fallback runner exhausts all configured candidates with a mix of reason: "timeout" / reason: "auth" / etc., the user-visible delivery surface displays:

⚠️ API provider returned a billing error — your API key has run out of credits or has an insufficient balance. Check your provider's billing dashboard and top up or switch to a different API key.

Reproducible signal

The gateway log (/tmp/openclaw/openclaw-YYYY-MM-DD.log) holds the truth — model_fallback_decision events show the real reason/status/errorPreview:

"reason": "timeout", "status": 408, "errorPreview": "CLI produced no output for 180s and was terminated."
"reason": "auth",    "status": 401, ...

But by the time the message reaches the chat surface it's collapsed to the generic billing template.

Suggestion

Either:

Aggregate the per-candidate reasons and pick a more accurate human-facing message (e.g. "All N models timed out / auth-failed — see gateway log"), or
At minimum, suppress the billing-specific wording when the active auth profile is OAuth / subscriptionType: max (no credit-card concept applies).

The current behavior is debugging-hostile — the surfaced error actively contradicts what the underlying logs say.
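The first suggestion could look roughly like the sketch below. It is not the gateway's code: CandidateFailure, summarizeFallbackFailure, the "billing" reason value and the message wording are all assumptions made for illustration, and the model names are placeholders.

```typescript
// Hypothetical record of one failed candidate in the fallback chain.
interface CandidateFailure {
  model: string;
  reason: string; // e.g. "timeout", "auth"; "billing" is an assumed value
  status?: number;
}

// Hypothetical summarizer: uses billing wording only when every candidate
// failed for a billing reason; otherwise it reports the real mix of reasons.
function summarizeFallbackFailure(failures: CandidateFailure[]): string {
  if (failures.length === 0) return "No model candidates were attempted.";

  const counts = new Map<string, number>();
  for (const failure of failures) {
    counts.set(failure.reason, (counts.get(failure.reason) ?? 0) + 1);
  }

  if (counts.size === 1 && counts.has("billing")) {
    return "API provider returned a billing error.";
  }

  const breakdown = Array.from(counts.entries())
    .map(([reason, count]) => `${count} ${reason}`)
    .join(", ");
  return `All ${failures.length} models failed (${breakdown}). See the gateway log for details.`;
}

const message = summarizeFallbackFailure([
  { model: "model-a", reason: "timeout", status: 408 },
  { model: "model-b", reason: "timeout", status: 408 },
  { model: "model-c", reason: "auth", status: 401 },
]);
console.log(message);
```

For the timeout and auth mix from the log excerpt, this produces a message that names both reasons and never mentions billing.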

Environment

[email protected] (bundled CLI on Linux/WSL2)
Auth: Claude Max OAuth (subscriptionType: max, rateLimitTier: default_claude_max_20x)
Backend: claude-cli, primary model claude-opus-4-7 with full 4.6/4.5 fallback chain
Surfaced via control-ui session attached to agent:main:main

Happy to provide trajectories or run targeted repros.

extent analysis

TL;DR

To fix the bug, defer the cleanup of the MCP config tempDir to the live-session lifecycle instead of the per-turn lifecycle.

Guidance

Identify the prepareModeSpecificBundleMcpConfig function and its associated cleanup logic.
Modify the executePreparedCliRun function to defer the cleanup of the MCP config tempDir when using Claude live sessions.
Verify that the tempDir is not deleted prematurely: while the session is still in the liveSessions Map, the path in the subprocess's argv should still exist on disk.
Test the fix by repeating the repro steps and confirming that the tempDir survives across turns and is removed only when the live session closes.

Example

The proposed fix provides a code snippet that demonstrates how to defer the cleanup of the MCP config tempDir:

const ownedPreparedBackendCleanup = context.preparedBackend.cleanup;
context.preparedBackend.cleanup = void 0;
// ...
cleanup: async () => {
  try {
    await claudeSkillsPlugin.cleanup();
  } finally {
    await ownedPreparedBackendCleanup?.();
  }
}

This code transfers the ownership of the cleanup to the live-session lifecycle, ensuring that the tempDir is not deleted prematurely.

Notes

The fix assumes that the runClaudeLiveSessionTurn function's existing cleanupDone guard makes the wrapped cleanup idempotent. Additionally, the fix only applies to Claude live sessions and does not affect one-shot mode.

Recommendation

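Apply the proposed fix by deferring the cleanup of the MCP config tempDir to the live-session lifecycle; it was merged as PR #73351. Local verification covered only that the patched dist/execute.runtime-*.js bundle still parses (node --check passes), and the PR author relied on CI rather than running the full test lanes locally. Bug 2, the misleading billing-error template, is not addressed by this change.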
