openclaw - ✅(Solved) Fix [Bug] exec-approval-followup tasks stuck in 'running' forever, blocking channel reload and saturating event loop [2 pull requests, 1 comment, 2 participants]

DuanXiaoWen · 2026-05-02T16:53:30Z


GitHub stats

Issue: openclaw/openclaw#76162 (fetched 2026-05-03 04:41:36)
Author: DuanXiaoWen
Participants: clawsweeper[bot], DuanXiaoWen
Timeline (top): referenced ×5, cross-referenced ×4, commented ×1, unsubscribed ×1

When exec approval follow-up tasks fail (e.g. network error during LLM call), the task runs remain in running status indefinitely. They are not reconciled by the task maintenance sweeper because hasBackingSession always returns true for cli runtime tasks whose childSessionKey points to a persistent session (e.g. agent:main:main).

This is the same bug pattern as #75307 (cron tasks stuck in running), but the fix for that issue only covered the cron code path. The cli runtime path in hasBackingSession still has the defect.


Root Cause

In src/tasks/task-registry.maintenance.ts, the hasBackingSession function:

function hasBackingSession(task) {
  if (task.runtime === "cron") { ... } // Fixed in #75307
  if (task.runtime === "cli" && hasActiveCliRun(task)) return true;
  const childSessionKey = task.childSessionKey?.trim();
  if (!childSessionKey) return true;
  // ...
  if (task.runtime === "subagent" || task.runtime === "cli") {
    // For cli: the result depends on findTaskSessionEntry, which finds an
    // entry whenever the session key exists in the store.
    // agent:main:main is a persistent session, so it always exists.
    return Boolean(entry);
  }
}

For exec-approval-followup tasks:

runtime = cli
childSessionKey = agent:main:main (the main session)
hasActiveCliRun returns false (the embedded run has ended)
findTaskSessionEntry returns true (main session always exists in store)
hasBackingSession → true
shouldMarkLost → false
Task is never reconciled
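
The chain above can be reproduced in isolation. The following is a minimal sketch, not the project's code: the `TaskRecord` shape, the `runActive` flag, and the `sessionStore` set are hypothetical stand-ins for the real task record, `hasActiveCliRun`, and `findTaskSessionEntry`, and the cron branch is omitted.

```typescript
// Hypothetical stand-ins for the real task and session-store types.
type Runtime = "cron" | "cli" | "subagent";

interface TaskRecord {
  runtime: Runtime;
  childSessionKey?: string;
  runActive: boolean; // what hasActiveCliRun() would observe (assumption)
}

// Hypothetical session store: a persistent session never leaves it.
const sessionStore = new Set<string>(["agent:main:main"]);

function hasActiveCliRun(task: TaskRecord): boolean {
  return task.runActive;
}

// Pre-fix shape of the check, following the snippet in the issue.
function hasBackingSessionBuggy(task: TaskRecord): boolean {
  if (task.runtime === "cli" && hasActiveCliRun(task)) return true;
  const childSessionKey = task.childSessionKey?.trim();
  if (!childSessionKey) return true;
  // A finished cli run falls through and is judged by session existence alone.
  return sessionStore.has(childSessionKey);
}

const followup: TaskRecord = {
  runtime: "cli",
  childSessionKey: "agent:main:main",
  runActive: false, // the embedded run has already ended
};

const stillBacked = hasBackingSessionBuggy(followup);
console.log(stillBacked); // true, so the sweeper never marks the task lost
```

Because the session key is the only remaining signal once the run ends, any cli task attached to a long-lived session behaves the same way.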

Fix Action

Fixed

Fixed by PR: fix(tasks): mark exec-approval-followup cli tasks lost when run ends (#76162) (https://github.com/openclaw/openclaw/pull/76199)
Fixed by PR: fix(status): guard resolveSessionModelRef against non-string model fields (#76206) (https://github.com/openclaw/openclaw/pull/76216)

PR fix notes

PR #76199: fix(tasks): mark exec-approval-followup cli tasks lost when run ends (#76162)

Repository: openclaw/openclaw
Author: hclsys
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/76199

Description (problem / solution / changelog)

Problem

exec-approval-followup tasks (and any cli task that uses childSessionKey="agent:main:main") get stuck in running forever after the embedded agent run ends.

Root cause: hasBackingSession() in task-registry.maintenance.ts had:

if (task.runtime === "cli" && hasActiveCliRun(task)) {
  return true;
}
// falls through to session-existence check when hasActiveCliRun is false

When the run ends, hasActiveCliRun returns false, so it falls through to the session-existence check. exec-approval-followup tasks use childSessionKey="agent:main:main" — the persistent main session, which always exists. So hasBackingSession returns true indefinitely and the task is never marked lost.

Fix

Return hasActiveCliRun(task) immediately for all cli tasks — same pattern already used for cron tasks:

if (task.runtime === "cli") {
  // CLI task liveness is determined solely by whether the embedded agent run
  // is still active. Falling through to session-existence checks is wrong:
  // exec-approval-followup tasks use childSessionKey="agent:main:main" which
  // is a persistent session — it always exists, so the session-existence path
  // would never mark the task lost (#76162). Same pattern as cron above.
  return hasActiveCliRun(task);
}

Also removes the now-dead resolveSessionChatType() helper and sessionChatTypesByKey from the lookup context.

Tests

Updated existing test to remove dead caching assertion (the removed resolveSessionChatType was the only caller)
Added regression test: exec-approval-followup cli task with childSessionKey="agent:main:main" and a live session store entry gets marked lost when run ends
302/302 tests pass in src/tasks/
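
The regression scenario in the list above can be approximated with a small self-contained check. This is a sketch under stated assumptions: `activeCliRuns`, `sessionStore`, `sweep`, and the `TaskRecord` fields are hypothetical names, and the real sweeper's grace period and other branches are left out.

```typescript
type Runtime = "cron" | "cli" | "subagent";

interface TaskRecord {
  taskId: string;
  runtime: Runtime;
  childSessionKey?: string;
  status: "running" | "lost";
}

// Hypothetical registry of live embedded runs, keyed by task id.
const activeCliRuns = new Set<string>();
// Hypothetical session store with a live entry for the persistent main session.
const sessionStore = new Set<string>(["agent:main:main"]);

function hasActiveCliRun(task: TaskRecord): boolean {
  return activeCliRuns.has(task.taskId);
}

// Post-fix shape: cli liveness depends only on the embedded run.
function hasBackingSession(task: TaskRecord): boolean {
  if (task.runtime === "cli") return hasActiveCliRun(task);
  const key = task.childSessionKey?.trim();
  if (!key) return true;
  return sessionStore.has(key);
}

// Simplified sweep: marks unbacked running tasks lost (no grace period here).
function sweep(tasks: TaskRecord[]): void {
  for (const task of tasks) {
    if (task.status === "running" && !hasBackingSession(task)) {
      task.status = "lost";
    }
  }
}

const task: TaskRecord = {
  taskId: "followup-1",
  runtime: "cli",
  childSessionKey: "agent:main:main",
  status: "running",
};

activeCliRuns.add(task.taskId);
sweep([task]);
const statusWhileRunActive = task.status; // still "running"

activeCliRuns.delete(task.taskId); // the embedded run ends
sweep([task]);
const statusAfterRunEnded = task.status; // "lost", despite the live session entry
```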

Fixes #76162

Changed files

CHANGELOG.md (modified, +1/-0)
docs/automation/tasks.md (modified, +5/-6)
src/commands/tasks.test.ts (modified, +1/-1)
src/tasks/task-registry.maintenance.issue-60299.test.ts (modified, +25/-4)
src/tasks/task-registry.maintenance.ts (modified, +10/-34)

PR #76216: fix(status): guard resolveSessionModelRef against non-string model fields (#76206)

Repository: openclaw/openclaw
Author: hclsys
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/76216

Description (problem / solution / changelog)

Problem

openclaw status crashes with TypeError: runtimeModel?.trim is not a function when any session entry in ~/.openclaw/agents/<agent>/sessions/sessions.json has a non-string value for model, modelProvider, providerOverride, or modelOverride.

Root cause: readSessionStoreReadOnly parses session JSON with z.record(z.string(), z.unknown()) — no field normalization. The four model fields reach resolvePersistedSelectedModelRef typed as string | undefined but holding arbitrary JSON values. The internal .trim() calls crash on objects or numbers.

The loadSessionStore path does normalize via normalizeSessionRuntimeModelFields, but openclaw status uses readSessionStoreReadOnly for its read-only scan.

Fix

Wrap the four fields with normalizeOptionalString() at the resolveSessionModelRef call site in status.summary.runtime.ts. normalizeOptionalString accepts unknown and returns string | undefined, so non-string values are discarded and the resolver falls back to the configured default.
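
To illustrate why the wrapper helps, here is a minimal sketch. The body of `normalizeOptionalString` below is a hypothetical re-implementation (the PR notes only state that it accepts `unknown` and returns `string | undefined`); treating blank strings as `undefined` is an assumption, and `unsafeTrim` and the sample entry are invented for the demonstration.

```typescript
// Hypothetical re-implementation: only the signature is taken from the PR notes.
function normalizeOptionalString(value: unknown): string | undefined {
  if (typeof value !== "string") return undefined;
  const trimmed = value.trim();
  return trimmed.length > 0 ? trimmed : undefined; // assumption: blank means "unset"
}

// A session entry as an unvalidated read would see it: arbitrary JSON values.
const entry: Record<string, unknown> = {
  model: { id: "example-model" }, // object instead of string
  providerOverride: 42, // number instead of string
  modelOverride: "  example-override  ",
};

// Mirrors a field typed as string | undefined that really holds arbitrary JSON.
function unsafeTrim(value: unknown): string | undefined {
  return (value as string | undefined)?.trim();
}

let crashed = false;
try {
  unsafeTrim(entry.model); // optional chaining does not guard a missing method
} catch (err) {
  crashed = err instanceof TypeError;
}

const safeModel = normalizeOptionalString(entry.model); // undefined
const safeProvider = normalizeOptionalString(entry.providerOverride); // undefined
const safeOverride = normalizeOptionalString(entry.modelOverride); // "example-override"
```

Normalizing at the call site keeps the read-only scan cheap while ensuring the resolver only ever sees strings or `undefined`.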

Tests

Two regression tests added to status.summary.runtime.test.ts:

object model field → does not throw, falls back to configured default
object modelOverride + number providerOverride → does not throw

Fixes #76206

Changed files

CHANGELOG.md (modified, +1/-0)
src/commands/status.summary.runtime.test.ts (modified, +17/-0)
src/commands/status.summary.runtime.ts (modified, +4/-4)


Summary

This is the same bug pattern as #75307 (cron tasks stuck in running), but the fix for that issue only covered the cron code path. The cli runtime path in hasBackingSession still has the defect.

Repro

Trigger exec approvals that generate exec-approval-followup tasks (e.g. batch approve exec commands)
Let the follow-up embedded runs fail (network error, LLM timeout, etc.)
After the runs fail, check tasks/runs.json — tasks remain in running status
Restart the gateway — tasks are reloaded from disk, still running
Channel reload is permanently deferred; the gateway logs "channel reload still deferred after Xms with 60 task run(s) active"
The event loop saturates, all API calls time out, and launchd kills and restarts the gateway in a loop

Observed in production

OpenClaw 2026.4.29 (a448042)
65 exec-approval-followup tasks stuck in running since 2026-03-30 (33 days)
2 cron email-watcher tasks also stuck
Channel reload deferred for 33,911,115ms (~9.4 hours)
Gateway forked 202 times by launchd due to health check failures
All API calls (Telegram, Feishu, QQ Bot) timing out due to event loop blockage

Root cause

In src/tasks/task-registry.maintenance.ts, the hasBackingSession function:

function hasBackingSession(task) {
  if (task.runtime === "cron") { ... } // Fixed in #75307
  if (task.runtime === "cli" && hasActiveCliRun(task)) return true;
  const childSessionKey = task.childSessionKey?.trim();
  if (!childSessionKey) return true;
  // ...
  if (task.runtime === "subagent" || task.runtime === "cli") {
    // For cli: the result depends on findTaskSessionEntry, which finds an
    // entry whenever the session key exists in the store.
    // agent:main:main is a persistent session, so it always exists.
    return Boolean(entry);
  }
}

For exec-approval-followup tasks:

runtime = cli
childSessionKey = agent:main:main (the main session)
hasActiveCliRun returns false (the embedded run has ended)
findTaskSessionEntry returns true (main session always exists in store)
hasBackingSession → true
shouldMarkLost → false
Task is never reconciled

Expected behavior

After the exec-approval-followup embedded run completes (success or failure), the task run should be marked as succeeded or failed within a reasonable time (e.g. the existing 5-minute grace period). It should not remain running forever just because the parent session still exists.
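
One way to picture the expected behavior is a grace-period check. The sketch below is illustrative only: `shouldMarkLostSketch`, the `lastEventAt` field, and the way the grace period is applied are assumptions, not the project's actual `shouldMarkLost` logic.

```typescript
// The 5-minute grace period mentioned in the issue.
const LOST_GRACE_MS = 5 * 60 * 1000;

interface RunningTask {
  status: "running" | "succeeded" | "failed" | "lost";
  lastEventAt: number; // hypothetical: epoch ms of the last observed activity
}

// A task is reconciled only when it is running, unbacked, and quiet for the grace period.
function shouldMarkLostSketch(
  task: RunningTask,
  backed: boolean,
  now: number,
  graceMs: number = LOST_GRACE_MS,
): boolean {
  if (task.status !== "running") return false;
  if (backed) return false;
  return now - task.lastEventAt >= graceMs;
}

const now = 10 * 60 * 1000;
const quietTask: RunningTask = { status: "running", lastEventAt: 0 };
const recentTask: RunningTask = { status: "running", lastEventAt: now - 60 * 1000 };

const quietIsLost = shouldMarkLostSketch(quietTask, false, now); // true
const recentIsLost = shouldMarkLostSketch(recentTask, false, now); // false
const backedIsLost = shouldMarkLostSketch(quietTask, true, now); // false
```

The third case is the one that matters for this bug: as long as the backing check wrongly reports `true`, no grace period ever applies.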

Suggested fix

The cli runtime path in hasBackingSession should not consider a task as backed merely because its childSessionKey session exists. A session existing does not mean the specific run is still active. Possible approaches:

After hasActiveCliRun returns false for a cli task, return false immediately instead of falling through to the session existence check
Or: track the specific sessionId (not just sessionKey) when creating the follow-up task, and check if that specific session instance is still active
Or: add a max age for running tasks, after which they are automatically marked lost regardless of session state
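
The third approach (a maximum age for running tasks) could look roughly like this. It is a sketch under assumptions: the `startedAt` field, the `exceedsMaxRunningAge` helper, and the 24-hour ceiling are all hypothetical, and the merged fix described in the PR notes took the first approach instead.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
// Hypothetical ceiling; the right value would depend on the longest legitimate run.
const MAX_RUNNING_AGE_MS = DAY_MS;

interface TaskRun {
  status: string;
  startedAt: number; // hypothetical: epoch ms when the run started
}

// Safety net: a running task older than the ceiling is lost regardless of session state.
function exceedsMaxRunningAge(
  task: TaskRun,
  now: number,
  maxAgeMs: number = MAX_RUNNING_AGE_MS,
): boolean {
  return task.status === "running" && now - task.startedAt > maxAgeMs;
}

// Dates from the production report: stuck since 2026-03-30, observed 2026-05-02.
const startedAt = Date.UTC(2026, 2, 30); // months are zero-based, so 2 is March
const observedAt = Date.UTC(2026, 4, 2); // 4 is May
const ageDays = (observedAt - startedAt) / DAY_MS; // 33

const zombie: TaskRun = { status: "running", startedAt };
const zombieExpired = exceedsMaxRunningAge(zombie, observedAt); // true
```

A ceiling like this does not replace a correct liveness check, but it bounds the damage if another runtime path has the same defect.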

Impact

67 zombie tasks blocking the entire task queue
Channel reload permanently deferred
Event loop saturated (P99 delay > 40s)
All messaging channels (Telegram, Feishu, QQ Bot) non-functional
Gateway crash loop (202 restarts)
Requires manual intervention to clean up tasks/runs.json

Related

#75307: same bug for cron runtime (fixed in v2026.4.29)
#59349: exec follow-up leaking into new session after /new
#72143: exec follow-up fallback retry/prefix issues

extent analysis

TL;DR

The most likely fix is to modify the hasBackingSession function to correctly handle cli runtime tasks by not considering a task as backed merely because its childSessionKey session exists.

Guidance

Review the hasBackingSession function in src/tasks/task-registry.maintenance.ts to understand the current logic and identify the need for a fix.
Consider implementing one of the suggested approaches:
- Return false immediately after hasActiveCliRun returns false for a cli task.
- Track the specific sessionId when creating the follow-up task and check if that session instance is still active.
- Add a max age for running tasks, after which they are automatically marked lost regardless of session state.
Verify the fix by triggering exec approvals, letting the follow-up embedded runs fail, and checking if the tasks are correctly marked as succeeded or failed within a reasonable time.

Example

function hasBackingSession(task) {
  if (task.runtime === "cli" && !hasActiveCliRun(task)) {
    return false; // the embedded run has ended, so the task is no longer backed
  }
  // ... rest of the function remains the same
}

Notes

The suggested fix aims to address the issue with cli runtime tasks, but it may not cover all possible scenarios. Additional testing and verification are necessary to ensure the fix works as expected.

Recommendation

Modify the hasBackingSession function so that cli runtime tasks are judged by whether their run is still active rather than by session existence. This is the most direct way to address the issue and prevent tasks from remaining in a running state indefinitely.
