openclaw - How to fix zombie running tasks: CLI runtime tasks that terminate are never auto-transitioned to failed/timed_out

GitHub stats
openclaw/openclaw#73121 · Fetched 2026-04-28 06:27:20
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Problem

Tasks on the cli runtime (subagent, exec-approval) whose underlying process has terminated stay in the running state indefinitely. Their task records are never auto-transitioned to failed or timed_out, so the task list accumulates "zombie" entries.

Reproduction

On 2026-04-28, openclaw tasks list --status running returned 13 tasks still in running state:

Created      Description
2026-04-02   subagent task (26 days old, process long dead)
2026-04-21   subagent task (7 days old)
2026-04-27   multiple exec-approval CLI tasks

The processes behind these tasks had all terminated, but their state in the task registry was never updated. Manual cancellation via openclaw tasks cancel was also unreliable: the CLI only provided 8-character UUID prefixes, and cancellation with them failed silently.

Root Cause

  1. No zombie detection: The cli runtime does not have a mechanism to periodically check whether the underlying process for a running task is still alive.
  2. No process exit hook: When a CLI task process exits (especially abnormally, e.g. killed, OOM, gateway restart), there is no exit hook to update the task state.
  3. Cancel UX gap: openclaw tasks cancel accepts short UUID prefixes but may fail silently when the prefix is ambiguous or the task is already dead.

Impact

  • Task list shows inaccurate state (dead tasks appear as running)
  • Task count grows unbounded over time
  • Heartbeat/cleanup agents waste cycles trying to cancel tasks that cannot be cancelled
  • Related to memory bloat from unbounded task record accumulation (#73114)

Suggested Fixes

  1. Process liveness check: Periodically scan cli runtime tasks in running state and check if the process is still alive (via PID or file descriptor). Auto-transition to failed if the process is gone.
  2. Exit hook: Register a process exit handler for CLI tasks that updates task state on termination.
  3. Gateway restart recovery: On gateway startup, scan all cli tasks in running state and mark them failed or unknown if their processes are no longer running.
  4. Cancel UX improvement: Make openclaw tasks cancel accept full UUIDs and provide clear error feedback when cancellation fails.
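
The exit hook (fix 2) could be sketched as below, assuming tasks are launched with Node's child_process.spawn. The names runTask, updateTaskState, and taskStates, and the 'succeeded' state, are placeholders for illustration and are not OpenClaw APIs. Treating any signal-terminated process as failed is also an assumption.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical in-memory registry standing in for the real task store.
const taskStates = new Map();

function updateTaskState(taskId, state, detail = {}) {
  taskStates.set(taskId, { state, ...detail });
}

// Spawn a task process and record a terminal state when it ends.
// Resolves with the record that was written to the registry.
function runTask(taskId, command, args = []) {
  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: 'ignore' });
    let settled = false;
    updateTaskState(taskId, 'running', { pid: child.pid });

    const finish = (state, detail) => {
      if (settled) return; // 'error' and 'exit' may both fire; record once
      settled = true;
      updateTaskState(taskId, state, { pid: child.pid, ...detail });
      resolve(taskStates.get(taskId));
    };

    // 'error' fires when the process could not be spawned at all.
    child.on('error', (err) => finish('failed', { reason: err.message }));

    // 'exit' receives (code, signal); signal is non-null when the process was killed.
    child.on('exit', (code, signal) => {
      if (signal !== null) finish('failed', { signal });
      else if (code === 0) finish('succeeded', { exitCode: 0 });
      else finish('failed', { exitCode: code });
    });
  });
}
```

A hook like this only runs while the parent process is alive. If the gateway itself restarts, the handler is lost, which is why the liveness check (fix 1) and startup recovery (fix 3) are still needed alongside it.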

Environment

  • OpenClaw: 2026.4.24
  • Node.js: v24.12.0
  • OS: macOS 26.3.1 (arm64)
  • 13 zombie running tasks found, oldest 26 days

extent analysis

TL;DR

Implement a periodic process liveness check to auto-transition stuck tasks to failed state.

Guidance

  • Introduce a mechanism to periodically scan cli runtime tasks in running state and check if the underlying process is still alive.
  • Register a process exit handler for CLI tasks to update task state on termination.
  • Consider implementing a gateway restart recovery mechanism to scan and update task states on startup.
  • Improve the openclaw tasks cancel command to accept full UUIDs and provide clear error feedback.
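
For the cancel improvement, one possible shape is a resolver that accepts either a full UUID or a prefix and reports exactly why a lookup failed instead of failing silently. This is a sketch only: resolveTaskId, TaskResolveError, and the error codes are hypothetical names, and how openclaw tasks cancel actually looks up tasks is not described in the issue.

```javascript
// Hypothetical error type so the CLI can print a specific message and exit non-zero.
class TaskResolveError extends Error {
  constructor(code, message) {
    super(message);
    this.name = 'TaskResolveError';
    this.code = code;
  }
}

// Resolve a full UUID or a prefix against the known task IDs.
// Returns the matching ID in lower case, or throws a TaskResolveError.
function resolveTaskId(input, taskIds) {
  const needle = input.trim().toLowerCase();
  if (needle === '') {
    throw new TaskResolveError('EMPTY', 'No task ID given.');
  }
  const ids = taskIds.map((id) => id.toLowerCase());

  // An exact match wins, so a full UUID is never reported as ambiguous.
  if (ids.includes(needle)) return needle;

  const matches = ids.filter((id) => id.startsWith(needle));
  if (matches.length === 1) return matches[0];
  if (matches.length === 0) {
    throw new TaskResolveError('NOT_FOUND', `No task matches "${input}".`);
  }
  throw new TaskResolveError(
    'AMBIGUOUS',
    `"${input}" matches ${matches.length} tasks: ${matches.join(', ')}. ` +
      'Use a longer prefix or the full UUID.'
  );
}
```

The caller would catch TaskResolveError, print its message, and exit with a non-zero status, so a failed cancellation is visible to both users and cleanup agents.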

Example

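// Pseudocode: sweep tasks still marked running and fail those whose process is gone.
// getRunningTasks, isProcessAlive, and updateTaskState are assumed helpers, not OpenClaw APIs.
for (const task of getRunningTasks()) {
  // task.pid is assumed to have been recorded when the task was spawned
  if (!isProcessAlive(task.pid)) {
    updateTaskState(task.id, 'failed');
  }
}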

Notes

These suggestions are derived from the issue description and leave implementation details open. The getRunningTasks, isProcessAlive, and updateTaskState functions are assumed to be implemented separately.
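
One way isProcessAlive might be implemented in Node.js is with process.kill and signal 0, which sends no signal and only tests whether the PID exists. This is a minimal sketch, assuming each task record stores the PID of its process; the issue does not say whether OpenClaw records one.

```javascript
// Returns true when a process with this PID currently exists.
// Signal 0 delivers nothing; the call only performs the existence and permission checks.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH (or anything else): treat the process as gone.
    return err.code === 'EPERM';
  }
}
```

A PID can be reused by an unrelated process after the original exits, so for tasks that are days or weeks old a bare PID check may report a dead task as alive. Storing the process start time next to the PID and comparing both would narrow that gap.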

Recommendation

Apply workaround: implement a periodic process liveness check that auto-transitions stuck tasks to failed. This corrects the reported state and stops dead tasks from accumulating in the task list.
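
The periodic check could be scheduled roughly as follows, with one sweep run immediately so the same code also covers recovery after a gateway restart. Everything here is illustrative: findZombieTasks, startZombieSweeper, the store object with listTasks and markFailed, the status field, and the 60-second default are assumptions, not OpenClaw APIs or settings.

```javascript
// Pure sweep: returns the IDs of running tasks whose process is gone.
// `isAlive` is injected so the liveness check can be swapped or tested.
function findZombieTasks(tasks, isAlive) {
  return tasks
    .filter((task) => task.status === 'running' && !isAlive(task.pid))
    .map((task) => task.id);
}

// Run one sweep right away (startup recovery), then repeat on an interval.
// `store` is a hypothetical registry exposing listTasks() and markFailed(id, reason).
function startZombieSweeper(store, isAlive, intervalMs = 60_000) {
  const sweep = () => {
    const zombies = findZombieTasks(store.listTasks(), isAlive);
    for (const id of zombies) store.markFailed(id, 'process not found');
    return zombies;
  };
  sweep();
  const timer = setInterval(sweep, intervalMs);
  timer.unref(); // the sweeper alone should not keep the process alive
  return timer; // pass to clearInterval to stop sweeping
}
```

Keeping findZombieTasks free of side effects makes the decision logic easy to test without spawning processes, while startZombieSweeper holds the timing and the registry writes.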
