openclaw - 💡(How to fix) Fix Zombie embedded run blocks heartbeats permanently after provider timeout [1 comments, 2 participants]

openclaw/openclaw#52224 (fetched 2026-04-08 01:13:59)

Summary

When an embedded run times out but the underlying provider promise never settles (e.g., dead connection, hung HTTP stream), the run handle stays in ACTIVE_EMBEDDED_RUNS permanently. Since resolveActiveRunQueueAction returns "drop" for heartbeats when isActive === true, all subsequent heartbeat ticks are silently discarded until the gateway is restarted.

Reproduction

  1. Configure a heartbeat that shares a session with the main agent (default behavior, no isolatedSession)
  2. Trigger a condition where the provider connection hangs (e.g., a 404 from a misconfigured Azure endpoint, or a network interruption)
  3. Wait for the embedded run timeout (default 600s) to fire
  4. Observe that ACTIVE_EMBEDDED_RUNS still contains the zombie handle after timeout
  5. All subsequent heartbeat ticks are silently dropped

Root Cause

In src/agents/pi-embedded-runner/run/attempt.ts, clearActiveEmbeddedRun is correctly placed in a finally block:

} finally {
    clearTimeout(abortTimer);
    // ...
    clearActiveEmbeddedRun(params.sessionId, queueHandle, params.sessionKey);
}

However, this finally only executes when the await abortable(activeSession.prompt(...)) promise resolves or rejects. If the abort signal fires but the provider stream/promise never settles (hung connection, broken pipe without TCP reset), the finally block never runs.

The abort timer at the outer scope fires and calls abortRun(true), which signals the AbortController. But if the underlying HTTP client doesn't honor the abort (or the connection is in a state where it can't be interrupted), the promise hangs indefinitely.
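
The failure mode above can be reproduced in isolation. The following is a minimal sketch, not OpenClaw code: withDeadline, hungProviderCall, and runAttempt are hypothetical names, and nothing here describes how the real abortable helper is implemented. It shows that a finally block only runs once the awaited promise settles, and that racing the work against a timer is one way to guarantee settlement.

```typescript
// Hypothetical helper (not part of OpenClaw): rejects once `ms` elapses,
// even if `work` itself never resolves or rejects.
function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`deadline of ${ms}ms exceeded`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Stand-in for a provider call on a dead connection: it never settles.
const hungProviderCall = new Promise<string>(() => {});

// Awaiting hungProviderCall directly would leave `finally` unreachable.
// Awaiting the deadline-wrapped promise lets `finally` run after 50 ms.
async function runAttempt(cleanup: () => void): Promise<string> {
  try {
    return await withDeadline(hungProviderCall, 50);
  } finally {
    cleanup();
  }
}
```

The deadline only unblocks the awaiting code. The hung call itself is not cancelled and may keep its socket open, so a wrapper like this complements, rather than replaces, an abort signal that the HTTP client honors.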

Meanwhile, in src/auto-reply/reply/queue-policy.ts:

export function resolveActiveRunQueueAction(params) {
  if (!params.isActive) return "run-now";
  if (params.isHeartbeat) return "drop";  // <-- heartbeat silently killed
  // ...
}

Since the zombie handle keeps isEmbeddedPiRunActive(sessionId) returning true, every heartbeat tick hits "drop" and exits without any log message.
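
A small model makes the interaction concrete. This is a sketch under assumptions: activeRuns stands in for ACTIVE_EMBEDDED_RUNS, tickHeartbeat is a hypothetical caller, and the "other" return value is a placeholder for the branches that the issue elides with "// ...".

```typescript
type HeartbeatAction = "run-now" | "drop" | "other";

// Hypothetical stand-in for ACTIVE_EMBEDDED_RUNS, keyed by session id.
const activeRuns = new Map<string, { runId: string }>();

function isRunActive(sessionId: string): boolean {
  return activeRuns.has(sessionId);
}

// Mirrors the two branches quoted in the issue; "other" is a placeholder
// for the elided remainder of the real function.
function resolveAction(params: { isActive: boolean; isHeartbeat: boolean }): HeartbeatAction {
  if (!params.isActive) return "run-now";
  if (params.isHeartbeat) return "drop";
  return "other";
}

function tickHeartbeat(sessionId: string): HeartbeatAction {
  return resolveAction({ isActive: isRunActive(sessionId), isHeartbeat: true });
}

// The run timed out, but its handle was never removed.
activeRuns.set("50d88d7d", { runId: "620fd6bf" });
const whileZombie: HeartbeatAction[] = [1, 2, 3].map(() => tickHeartbeat("50d88d7d"));

// Only removing the handle lets heartbeats through again.
activeRuns.delete("50d88d7d");
const afterClear: HeartbeatAction = tickHeartbeat("50d88d7d");
```

Every tick in whileZombie is dropped no matter how much time passes, because the decision depends only on whether the map still holds an entry for the session.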

Observed Behavior

From gateway logs (v2026.3.13-1):

2026-03-21T13:08:05 [agent/embedded] embedded run timeout: runId=620fd6bf sessionId=50d88d7d timeoutMs=600000
2026-03-21T13:08:13 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000
2026-03-21T17:34:03 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000
2026-03-21T19:35:41 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000
2026-03-21T23:08:57 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000

The zombie run persisted for 10+ hours until the machine was shut down. During that entire period, no heartbeat was delivered.
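
The "10+ hours" figure can be checked against the first and last log lines quoted above. The sketch below assumes only that both timestamps share one time zone; it parses them as UTC purely to make the subtraction deterministic. The result is a lower bound, since the handle remained in place after the last logged diagnostic.

```typescript
const firstLine =
  "2026-03-21T13:08:05 [agent/embedded] embedded run timeout: runId=620fd6bf sessionId=50d88d7d timeoutMs=600000";
const lastLine =
  "2026-03-21T23:08:57 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000";

// The first 19 characters are the "YYYY-MM-DDTHH:MM:SS" timestamp.
function timestampMs(line: string): number {
  return Date.parse(`${line.slice(0, 19)}Z`);
}

function formatDuration(totalSeconds: number): string {
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return `${hours}h ${minutes}m ${seconds}s`;
}

const zombieSeconds = (timestampMs(lastLine) - timestampMs(firstLine)) / 1000;
const zombieDuration = formatDuration(zombieSeconds); // "10h 0m 52s"
```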

Suggested Fix

After the abort timer fires and a grace period elapses (e.g., 30-60s), forcibly remove the handle from ACTIVE_EMBEDDED_RUNS regardless of whether the promise settled:

// In the abort timer callback, after abortRun(true):
setTimeout(() => {
  if (ACTIVE_EMBEDDED_RUNS.get(params.sessionId) === queueHandle) {
    ACTIVE_EMBEDDED_RUNS.delete(params.sessionId);
    notifyEmbeddedRunEnded(params.sessionId);
    log.warn(`force-cleared zombie run: sessionId=${params.sessionId} runId=${params.runId}`);
  }
}, ZOMBIE_CLEANUP_GRACE_MS); // e.g., 30_000

Alternatively, waitForActiveEmbeddedRuns (used during restarts) could forcibly clear runs that have exceeded their timeout, rather than just logging and returning { drained: false }.
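
A sweep of this kind might look like the sketch below. It rests on an assumption the issue does not confirm: that each tracked run records when it started and what its timeout is. TrackedRun, sweepOverdueRuns, and their fields are hypothetical names, not OpenClaw's API.

```typescript
// Hypothetical shape; the real run handle may not carry these fields.
interface TrackedRun {
  runId: string;
  startedAtMs: number;
  timeoutMs: number;
}

// Removes every run that is past its timeout plus a grace period and
// returns the session ids that were cleared.
function sweepOverdueRuns(
  runs: Map<string, TrackedRun>,
  nowMs: number,
  graceMs: number,
): string[] {
  const cleared: string[] = [];
  runs.forEach((run, sessionId) => {
    const overdueAtMs = run.startedAtMs + run.timeoutMs + graceMs;
    if (nowMs >= overdueAtMs) {
      runs.delete(sessionId); // deleting during Map.forEach is safe
      cleared.push(sessionId);
    }
  });
  return cleared;
}

const trackedRuns = new Map<string, TrackedRun>([
  ["50d88d7d", { runId: "620fd6bf", startedAtMs: 0, timeoutMs: 600_000 }],
  ["healthy-session", { runId: "healthy-run", startedAtMs: 900_000, timeoutMs: 600_000 }],
]);

// At t = 1,000,000 ms only the first run is past 600 s + 30 s of grace.
const clearedSessions = sweepOverdueRuns(trackedRuns, 1_000_000, 30_000);
```

Passing the current time in as a parameter keeps the sweep deterministic and easy to test. A sweep that runs only during restarts would still leave heartbeats blocked until the next restart, which is why the timer-based fix above addresses the reported symptom more directly.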

Environment

  • OpenClaw v2026.3.13-1
  • Provider: azure-openai-responses (Azure OpenAI Responses API)
  • macOS, Node v22.22.0

Related

  • #51735 (azure-openai-responses missing from MODEL_APIS enum — the misconfiguration that triggered the initial 404s)
  • The "drop" path in resolveActiveRunQueueAction produces no log output, making this failure mode completely silent and very hard to diagnose

extent analysis

Fix Plan

To stop zombie runs from persisting in ACTIVE_EMBEDDED_RUNS after a timeout, forcibly remove the handle once a grace period has elapsed:

  1. Introduce a new constant: Define ZOMBIE_CLEANUP_GRACE_MS, in milliseconds, to control the grace period, e.g. 30_000 (30 seconds) or 60_000 (60 seconds).
  2. Modify the abort timer callback: After the call to abortRun(true), schedule a setTimeout that fires once the grace period has elapsed and checks whether the same handle is still registered in ACTIVE_EMBEDDED_RUNS.
  3. Forcibly remove the handle: If it is, delete it from ACTIVE_EMBEDDED_RUNS, call notifyEmbeddedRunEnded, and log a warning.

Example code:

const ZOMBIE_CLEANUP_GRACE_MS = 30_000; // 30 seconds

// In the abort timer callback, after abortRun(true):
setTimeout(() => {
  if (ACTIVE_EMBEDDED_RUNS.get(params.sessionId) === queueHandle) {
    ACTIVE_EMBEDDED_RUNS.delete(params.sessionId);
    notifyEmbeddedRunEnded(params.sessionId);
    log.warn(`force-cleared zombie run: sessionId=${params.sessionId} runId=${params.runId}`);
  }
}, ZOMBIE_CLEANUP_GRACE_MS);
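
The snippet above depends on variables from the surrounding function, so here is a self-contained sketch of the same idea that can be run and tested on its own. RunHandle, Schedule, scheduleZombieCleanup, and the onCleared callback are hypothetical names introduced for illustration. The scheduler is injectable so that the grace period can be simulated without waiting.

```typescript
interface RunHandle {
  runId: string;
}

// Injectable scheduler; defaults to a real timer.
type Schedule = (callback: () => void, delayMs: number) => unknown;

const ZOMBIE_CLEANUP_GRACE_MS = 30_000; // same example value as above

function scheduleZombieCleanup(
  registry: Map<string, RunHandle>,
  sessionId: string,
  handle: RunHandle,
  onCleared: (sessionId: string) => void,
  schedule: Schedule = (callback, delayMs) => setTimeout(callback, delayMs),
  graceMs: number = ZOMBIE_CLEANUP_GRACE_MS,
): void {
  schedule(() => {
    // Identity check: leave the entry alone if the run already ended
    // or a newer run now owns this session.
    if (registry.get(sessionId) !== handle) return;
    registry.delete(sessionId);
    onCleared(sessionId);
  }, graceMs);
}

// Usage with a manual scheduler so the example finishes instantly.
const pendingCallbacks: Array<() => void> = [];
const manualSchedule: Schedule = (callback) => {
  pendingCallbacks.push(callback);
};

const registry = new Map<string, RunHandle>();
const zombieHandle: RunHandle = { runId: "620fd6bf" };
registry.set("50d88d7d", zombieHandle);

const clearedLog: string[] = [];
scheduleZombieCleanup(registry, "50d88d7d", zombieHandle, (id) => clearedLog.push(id), manualSchedule);
pendingCallbacks.forEach((fire) => fire()); // simulate the grace period elapsing
```

The identity comparison matters in both directions. If the hung promise settles after the forced cleanup, the original finally block still calls clearActiveEmbeddedRun, and by then a newer run may own the session. That call receives queueHandle, which suggests a similar check, but the issue does not show its implementation.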

Alternatively, modify waitForActiveEmbeddedRuns (used during restarts) to forcibly clear runs that have exceeded their timeout, rather than only logging and returning { drained: false }.

Verification

To verify that the fix worked:

  1. Trigger a zombie run: Reproduce the conditions that lead to a zombie run (e.g., a hung connection or a misconfigured Azure endpoint).
  2. Wait for the timeout: Wait for the embedded run timeout to fire (default 600s).
  3. Check the logs: Confirm that the "force-cleared zombie run" warning appears once the grace period (30-60s) has elapsed after the timeout, which indicates that the handle was removed from ACTIVE_EMBEDDED_RUNS.
  4. Verify heartbeat delivery: Check that subsequent heartbeat ticks are delivered successfully.

Extra Tips

  • Consider adding logging to the resolveActiveRunQueueAction function to make the failure mode less silent and easier to diagnose.
  • Review the azure-openai-responses provider configuration to ensure it is correct and functioning as expected.
  • Test the case where the hung promise settles after the forced cleanup, to confirm that the late finally block does not remove a newer run's handle for the same session.
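
The first tip could be sketched as follows. This is an illustration only: QueueDecision, DropLogger, QueueParams, and resolveWithDropLog are hypothetical names, the sessionId parameter is an assumed addition, and "other" again stands in for the branches the issue elides.

```typescript
type QueueDecision = "run-now" | "drop" | "other";

interface DropLogger {
  warn(message: string): void;
}

interface QueueParams {
  sessionId: string; // assumed extra parameter so the log line is useful
  isActive: boolean;
  isHeartbeat: boolean;
}

function resolveWithDropLog(params: QueueParams, log: DropLogger): QueueDecision {
  if (!params.isActive) return "run-now";
  if (params.isHeartbeat) {
    // Make the previously silent path visible.
    log.warn(`heartbeat dropped: run still active for sessionId=${params.sessionId}`);
    return "drop";
  }
  return "other"; // placeholder for the elided branches
}

// Usage with an in-memory logger.
const dropMessages: string[] = [];
const memoryLog: DropLogger = {
  warn: (message) => {
    dropMessages.push(message);
  },
};

const decision = resolveWithDropLog(
  { sessionId: "50d88d7d", isActive: true, isHeartbeat: true },
  memoryLog,
);
```

Because heartbeats recur, a real implementation would probably log at a lower level or rate-limit the message, so that a long-lived but legitimate run does not flood the log.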
