openclaw: How to fix [Bug]: Embedded run timeout leaves zombie handle blocking heartbeat delivery [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#52231 · Fetched 2026-04-08 01:13:56
Comments: 1 · Participants: 2 · Timeline: 1 (commented ×1) · Reactions: 0

Error Message

log.warn(`force-cleared zombie run: sessionId=${params.sessionId}`);


Bug Summary

When an embedded run times out but the underlying provider promise never settles (e.g., dead HTTP connection, hung stream), the run handle stays in ACTIVE_EMBEDDED_RUNS permanently. This silently kills all subsequent heartbeat deliveries for the session.

Steps to Reproduce

  1. Configure a heartbeat sharing a session with the main agent (default behavior)
  2. Trigger a condition where the provider connection hangs (e.g., misconfigured endpoint, network interruption)
  3. Wait for the embedded run timeout to fire
  4. Observe that ACTIVE_EMBEDDED_RUNS still contains the zombie handle after timeout
  5. All subsequent heartbeat ticks are silently dropped

Root Cause Analysis

In src/agents/pi-embedded-runner/run/attempt.ts, clearActiveEmbeddedRun is placed in a finally block:

} finally {
    clearTimeout(abortTimer);
    clearActiveEmbeddedRun(params.sessionId, queueHandle, params.sessionKey);
}

However, this finally block runs only once the awaited abortable(activeSession.prompt(...)) promise settles, that is, resolves or rejects. If the abort signal fires but the provider stream/promise never settles, the finally block never runs and the handle is never cleared.

The abort timer fires and calls abortRun(true), but if the underlying HTTP client does not honor the abort (or the connection is in a state where it cannot be interrupted), the promise hangs indefinitely.
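
The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the openclaw code: `attempt`, `withDeadline`, `events`, and `neverSettles` are hypothetical names introduced only for illustration. It shows that a `finally` block attached to an `await` on a never-settling promise does not run, and that racing the same promise against a deadline lets the awaiting code resume so cleanup can happen.

```typescript
// Records what happened, so the behaviour can be observed from outside.
const events: string[] = [];

// Stand-in for a hung provider stream: a promise that never resolves or rejects.
const neverSettles = new Promise<void>(() => {});

// Mirrors the shape of the try/finally in the issue: cleanup lives in `finally`.
async function attempt(label: string, prompt: Promise<void>): Promise<void> {
  try {
    await prompt;
  } finally {
    events.push(`cleanup:${label}`);
  }
}

// Rejects after `ms` unless `work` settles first, so the caller always resumes.
// This is one possible mitigation, not the fix proposed in the issue.
function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`deadline of ${ms}ms exceeded`)), ms);
  });
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Usage:
//   void attempt("hung", neverSettles);
//     -> "cleanup:hung" is never recorded, however long you wait.
//   attempt("raced", withDeadline(neverSettles, 50)).catch(() => {});
//     -> "cleanup:raced" is recorded once the deadline rejects.
```

Note that the race only unblocks the caller; the hung promise itself stays pending, so any resources it holds would still need separate handling.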

In src/auto-reply/reply/queue-policy.ts:

export function resolveActiveRunQueueAction(params) {
  if (!params.isActive) return "run-now";
  if (params.isHeartbeat) return "drop";  // heartbeat silently killed!
  // ...
}

Since the zombie handle keeps isEmbeddedPiRunActive(sessionId) returning true, every heartbeat tick hits "drop" and exits without any log message.
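
To make the interaction concrete, here is a simplified model of the drop path. It is a sketch under assumptions: only the two branches quoted above come from the issue, while `resolveAction`, `heartbeatTick`, the `"other"` placeholder, and the log line are hypothetical. It also illustrates one way the drop could be made visible by recording a message, which the issue notes is currently missing.

```typescript
// Hypothetical model; only the "run-now" and "drop" branches are taken from the issue.
type QueueAction = "run-now" | "drop" | "other";

interface QueueParams {
  isActive: boolean;
  isHeartbeat: boolean;
}

function resolveAction(params: QueueParams): QueueAction {
  if (!params.isActive) return "run-now";
  if (params.isHeartbeat) return "drop";
  return "other"; // placeholder for the branches elided in the issue
}

// Stand-in for ACTIVE_EMBEDDED_RUNS: session id -> run handle.
const activeRuns = new Map<string, object>();

// One heartbeat tick. Unlike the behaviour described in the issue,
// this sketch records a message whenever a heartbeat is dropped.
function heartbeatTick(sessionId: string, log: string[]): QueueAction {
  const action = resolveAction({
    isActive: activeRuns.has(sessionId),
    isHeartbeat: true,
  });
  if (action === "drop") {
    log.push(`heartbeat dropped: sessionId=${sessionId} reason=active-run`);
  }
  return action;
}
```

With a stale handle left in the map, every call to `heartbeatTick` for that session returns `"drop"`; removing the handle makes the next tick return `"run-now"`.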

Observed Behavior

From gateway logs:

2026-03-21T13:08:05 [agent/embedded] embedded run timeout: runId=620fd6bf sessionId=50d88d7d timeoutMs=600000
2026-03-21T13:08:13 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000

The zombie run persisted for 10+ hours. During that period, no heartbeat was delivered.

Suggested Fix

After the abort timer fires and a grace period elapses (e.g., 30-60s), forcibly remove the handle:

// In the abort timer callback, after abortRun(true):
setTimeout(() => {
  if (ACTIVE_EMBEDDED_RUNS.get(params.sessionId) === queueHandle) {
    ACTIVE_EMBEDDED_RUNS.delete(params.sessionId);
    notifyEmbeddedRunEnded(params.sessionId);
    log.warn(`force-cleared zombie run: sessionId=${params.sessionId}`);
  }
}, ZOMBIE_CLEANUP_GRACE_MS); // e.g., 30_000
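
The snippet above depends on surrounding state in attempt.ts. As a self-contained sketch of the same idea, the version below takes the map, handle, and callbacks as parameters. The function name `scheduleZombieCleanup`, its signature, and the injectable `schedule` argument are assumptions made for illustration and testing; they are not part of the openclaw codebase.

```typescript
// A scheduler with the same basic shape as setTimeout, injectable for tests.
type Schedule = (fn: () => void, ms: number) => unknown;

// Schedules a delayed check that removes `handle` from `runs` if it is
// still the registered handle for `sessionId` once `graceMs` has elapsed.
function scheduleZombieCleanup<H>(
  runs: Map<string, H>,
  sessionId: string,
  handle: H,
  graceMs: number,
  onCleared: (sessionId: string) => void,
  schedule: Schedule = (fn, ms) => setTimeout(fn, ms),
): void {
  schedule(() => {
    // Identity check: if the run already cleaned up, or a newer run has
    // registered under the same session, leave the map untouched.
    if (runs.get(sessionId) !== handle) return;
    runs.delete(sessionId);
    onCleared(sessionId);
  }, graceMs);
}

// Usage (names on the right-hand side are placeholders for the real ones):
//   scheduleZombieCleanup(activeRuns, sessionId, queueHandle, 30_000, (id) => {
//     console.warn(`force-cleared zombie run: sessionId=${id}`);
//   });
```

The identity comparison is what keeps the delayed callback safe: a run that settled normally during the grace period, or a fresh run that reused the session, is not affected.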

Alternatively, waitForActiveEmbeddedRuns could forcibly clear runs that have exceeded their timeout.
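
A rough sketch of that alternative follows. It assumes each tracked run records the time at which its timeout expires; the `TrackedRun` shape, the `deadlineAt` field, and the `sweepExpiredRuns` function are hypothetical and do not reflect how openclaw actually stores run handles.

```typescript
// Hypothetical handle shape: the issue does not say what a run handle contains.
interface TrackedRun {
  deadlineAt: number; // epoch milliseconds at which the run's timeout fires
}

// Removes every run whose timeout plus grace period has passed and
// returns the session ids that were cleared, so the caller can log them.
function sweepExpiredRuns(
  runs: Map<string, TrackedRun>,
  now: number,
  graceMs: number,
): string[] {
  const cleared: string[] = [];
  // Deleting entries while iterating a Map with forEach is safe in JavaScript.
  runs.forEach((run, sessionId) => {
    if (now >= run.deadlineAt + graceMs) {
      runs.delete(sessionId);
      cleared.push(sessionId);
    }
  });
  return cleared;
}

// Usage inside a wait loop (sketch):
//   for (const id of sweepExpiredRuns(runs, Date.now(), 30_000)) {
//     console.warn(`force-cleared zombie run: sessionId=${id}`);
//   }
```

Compared with a per-run timer, a sweep needs no extra timers but only clears zombies when something calls it, so the two approaches trade promptness for simplicity.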

Impact

  • Heartbeat failure causes session to appear dead
  • No auto-recovery without gateway restart
  • Silent failure - very difficult to diagnose

Labels

bug, regression, embedded-run, heartbeat

extent analysis

Fix Plan

To stop zombie runs from blocking heartbeats, forcibly remove the run handle once a grace period has elapsed after the abort timer fires. The steps are:

  • Introduce a new constant ZOMBIE_CLEANUP_GRACE_MS, set to 30_000 ms (30 seconds), to define the grace period.
  • In the abort timer callback, schedule a delayed check that removes the handle if it is still registered for the session:

      // In the abort timer callback, after abortRun(true):
      setTimeout(() => {
        // Only clear the handle this run registered; a newer run may have replaced it.
        if (ACTIVE_EMBEDDED_RUNS.get(params.sessionId) === queueHandle) {
          ACTIVE_EMBEDDED_RUNS.delete(params.sessionId);
          notifyEmbeddedRunEnded(params.sessionId);
          log.warn(`force-cleared zombie run: sessionId=${params.sessionId}`);
        }
      }, ZOMBIE_CLEANUP_GRACE_MS);

  • Alternatively, consider modifying waitForActiveEmbeddedRuns to forcibly clear runs that have exceeded their timeout.

Verification

To verify the fix, follow these steps:

  1. Configure a heartbeat sharing a session with the main agent.
  2. Trigger a condition where the provider connection hangs (e.g., misconfigured endpoint, network interruption).
  3. Wait for the embedded run timeout to fire, then for the grace period to elapse.
  4. Check the gateway logs for a "force-cleared zombie run" message.
  5. Verify that subsequent heartbeat ticks are delivered successfully.

Extra Tips

  • Monitor the gateway logs for "force-cleared zombie run" messages to detect and diagnose potential issues.
  • Consider implementing additional logging or monitoring to detect and alert on zombie runs.
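  • Review the ZOMBIE_CLEANUP_GRACE_MS value and adjust as needed. The grace period starts only after the run timeout has already fired, so it should be long enough for a well-behaved abort to propagate and short enough that heartbeats are not blocked for longer than necessary.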
