openclaw - ✅(Solved) Fix [Bug]: abortable(activeSession.prompt()) creates zombie Agent loop when signal is pre-aborted [2 pull requests, 1 comment, 2 participants]

GitHub stats

Issue: openclaw/openclaw#74859
Fetched: 2026-05-01 05:40:42

  • Comments: 1
  • Participants: 2
  • Timeline events: 7
  • Reactions: 2
  • Timeline (top): cross-referenced ×3, referenced ×3, commented ×1

When params.abortSignal is already aborted before activeSession.prompt() is called (e.g. rapid consecutive messages with messages.queue.mode: "interrupt"), abortable() immediately rejects but the prompt() async chain has already started. The floating Promise creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts, causing the Agent to loop indefinitely calling the LLM after the attempt has exited. Observed: 2617 LLM calls over 103 minutes from a single zombie run.

Error Message

No error is surfaced to the user. The only error strings involved come from the reproduction test: the mock stream throws new Error("Request was aborted.") with name AbortError, and the run is started with AbortSignal.abort(new Error("second message arrived")).

The Agent continues calling the LLM indefinitely after the attempt has returned. Each iteration uses ~90k input tokens and ~35 output tokens; stopReason is always toolUse, tools always throw AbortError (caught as an error result), and the model retries the same tool call. The loop never terminates unless the process restarts.

  • All tool results are "Aborted" (error result)

Why the inner loop never stops: Agent._runLoop() in pi-agent-core only exits on stopReason === "error" | "aborted". The zombie's signal is never aborted. Tools throw AbortError (from the outer runAbortController.signal), but this is caught as an error tool result, so the model retries indefinitely.

Root Cause

Root cause: await abortable(activeSession.prompt(effectivePrompt)) in attempt.ts (introduced in 016693a1f). JavaScript evaluates activeSession.prompt() first (starting the async chain), then abortable() races it. When the signal is pre-aborted, abortable() rejects immediately but the floating Promise from prompt() creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts.
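The evaluation order described above can be reproduced in isolation. The sketch below is a minimal illustration, not the runner's code: `abortable` here is a hypothetical stand-in that takes the signal explicitly, and `prompt` is a hypothetical function whose only side effect is incrementing a counter.

```typescript
// Hypothetical stand-in for the runner's abortable() helper: rejects as soon
// as the signal is aborted, otherwise settles with the wrapped promise.
function abortable<T>(signal: AbortSignal, work: Promise<T>): Promise<T> {
  if (signal.aborted) {
    return Promise.reject(new Error("aborted"));
  }
  return new Promise<T>((resolve, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")), { once: true });
    work.then(resolve, reject);
  });
}

let promptCalls = 0;

// Hypothetical stand-in for activeSession.prompt(): its side effect happens
// at call time, before any wrapper has a chance to inspect the signal.
function prompt(): Promise<string> {
  promptCalls += 1;
  return new Promise<string>((resolve) => setTimeout(() => resolve("done"), 10));
}

async function demo(): Promise<{ rejected: boolean; promptCalls: number }> {
  const signal = AbortSignal.abort(); // already aborted before the call
  let rejected = false;
  try {
    // prompt() is evaluated first, as an argument; abortable() only receives its result.
    await abortable(signal, prompt());
  } catch {
    rejected = true;
  }
  return { rejected, promptCalls };
}
```

Calling `demo()` resolves with `rejected: true` and `promptCalls: 1`: the wrapper rejects, yet the work it was meant to guard has already started.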

Fix Action

Fixed

PR fix notes

PR #74979: fix(attempt): prevent zombie Agent loop when abort arrives before prompt()

Description (problem / solution / changelog)

Summary

  • Problem: When params.abortSignal is already aborted before activeSession.prompt() is called, JS evaluates prompt() first (starting the async chain), then abortable() rejects immediately. The floating Promise creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts — the Agent loops indefinitely after the attempt exits. Observed in production: 2617 LLM calls over 103 minutes from a single zombie run.
  • Why it matters: Silent unbounded LLM cost; no user-visible symptom. Triggered deterministically by queue.mode: interrupt + rapid consecutive messages, and probabilistically by timeout-compaction retries.
  • What changed: (1) Pre-prompt guard in attempt.ts — check runAbortController.signal.aborted before calling activeSession.prompt() and throw early, so no floating Promise is created. (2) Defensive agent.abort() + clearAllQueues() in the finally block as a backstop for aborts that arrive mid-prompt.
  • What did NOT change: No behavior change for normal (non-aborted) runs. No config added. maxLlmCallsPerRun hardening deferred to a follow-up per ClawSweeper's triage recommendation.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration

Linked Issue/PR

  • Closes #74859
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: await abortable(activeSession.prompt(...)) — JS evaluates the argument activeSession.prompt() before abortable() runs its abort check. When the signal is pre-aborted, a floating Promise escapes into Agent._runLoop() with a fresh abortController that is never signaled.
  • Missing detection / guardrail: No pre-call abort state check before prompt(); finally block did not terminate an escaped Agent.
  • Contributing context: void activeSession.abort() in abortRun() is fire-and-forget and only aborts agent.abortController — which is undefined until _runLoop() starts. If abort arrives before that, the call is a no-op.
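The contributing context above (an abort that arrives before the loop creates its controller is a no-op) can be sketched as follows. `MiniAgent` is a hypothetical miniature, not the pi-agent-core `Agent`; the iteration cap exists only so the demo terminates, whereas the loop described in the issue has no such cap.

```typescript
// Hypothetical miniature of an agent whose controller only exists during a run.
class MiniAgent {
  private abortController: AbortController | undefined;
  iterations = 0;

  abort(): void {
    // Optional chaining makes this a no-op when no run is active.
    this.abortController?.abort();
  }

  async run(maxIterations: number): Promise<void> {
    this.abortController = new AbortController(); // fresh controller per run
    const { signal } = this.abortController;
    try {
      while (!signal.aborted && this.iterations < maxIterations) {
        this.iterations += 1;
        await new Promise<void>((resolve) => setTimeout(resolve, 1));
      }
    } finally {
      this.abortController = undefined;
    }
  }
}

async function abortTooEarly(): Promise<number> {
  const agent = new MiniAgent();
  agent.abort(); // arrives before run(): there is nothing to abort
  await agent.run(5); // the loop runs to its cap, unaffected by the earlier abort
  return agent.iterations;
}

async function abortDuringRun(): Promise<number> {
  const agent = new MiniAgent();
  const running = agent.run(1000);
  agent.abort(); // the controller exists now, so the loop stops
  await running;
  return agent.iterations;
}
```

In this sketch `abortTooEarly()` completes all 5 iterations while `abortDuringRun()` stops after the first, which mirrors why the fire-and-forget abort cannot be relied on when it races the start of the loop.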

Regression Test Plan (if applicable)

  • Coverage level: [x] Unit test
  • Target file: src/agents/pi-embedded-runner/run/attempt.abort-before-prompt.test.ts
  • Scenarios locked in:
    1. Pre-aborted signal → prompt() is skipped → zero LLM calls after attempt exits
    2. Normal (non-aborted) signal → prompt() executes normally, not affected by the guard
  • Why this is the smallest reliable guardrail: Uses a real Agent + mock streamFn with a call counter. No network, no gateway stack. Deterministic — the signal is already aborted at call time, eliminating any race.

User-visible / Behavior Changes

None. Aborted runs already returned immediately; this prevents hidden background activity that was invisible to users anyway.

Diagram (if applicable)

Before:
  abort arrives → abortRun() → void activeSession.abort() [no-op, abortController=undefined]
                → activeSession.prompt() starts _runLoop, creates fresh abortController
                → abortable() rejects → attempt exits
                → _runLoop loops forever (fresh controller never aborted)

After:
  abort arrives → abortRun() → runAbortController.signal.aborted = true
                → pre-prompt guard: signal.aborted → throw AbortError (prompt() never called)
                → attempt exits cleanly
                → finally: agent.abort() + clearAllQueues() [backstop]

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Linux
  • Runtime: Node 22
  • Model/provider: Any (bug is model-agnostic)
  • Relevant config: messages.queue.mode: interrupt

Steps

  1. Configure messages.queue.mode: interrupt
  2. Send a message, then send a second message less than 1 s later
  3. Observe the first run's attempt exits with durationMs < 50ms

Expected

  • No LLM calls after the attempt exits

Actual (before fix)

  • LLM calls continue for minutes/hours after the attempt exits; embedded run done never appears for the zombie

Evidence

  • Failing test before + passing after: attempt.abort-before-prompt.test.ts — 2/2 green on patched code; the pre-aborted path produces zero LLM calls after attempt exits.

Human Verification (required)

  • Verified scenarios: unit tests pass (attempt.abort-before-prompt.test.ts — 2/2 green); pre-aborted signal path skips prompt() confirmed by code review.
  • Edge cases checked: agent.abort() called on a completed run where abortController is undefined — optional chain makes it a no-op.
  • What I did not verify: live end-to-end with a real LLM provider and interrupt mode (will follow up in the issue).

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: agent.abort() in finally could theoretically interrupt a normally completing run if _runLoop hasn't cleared abortController yet.
    • Mitigation: Agent.abort() calls this.abortController?.abort() — when _runLoop completes normally it sets abortController = undefined, making the call a documented no-op. Covered by the "normal completion" test case.

Changed files

  • src/agents/pi-embedded-runner/run/attempt.abort-before-prompt.test.ts (added, +306/-0)
  • src/agents/pi-embedded-runner/run/attempt.spawn-workspace.test-support.ts (modified, +4/-0)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +28/-0)

PR #75012: fix(agents): prevent zombie Agent loop when abort signal is pre-aborted (#74859)

Description (problem / solution / changelog)

Summary

Prevents the embedded runner from starting a floating Agent loop when the abort signal is already aborted before activeSession.prompt() is called.

Closes #74859

Root Cause

When params.abortSignal is pre-aborted (e.g. rapid consecutive messages with messages.queue.mode: "interrupt"):

  1. onAbort() fires → abortRun() → runAbortController.abort() + void activeSession.abort()
  2. Code still reaches await abortable(activeSession.prompt(prompt))
  3. JavaScript evaluates activeSession.prompt() first (starts Agent _runLoop with a fresh internal abortController)
  4. Then abortable() wraps/rejects immediately
  5. The Agent loop runs with nobody to abort its internal controller → zombie

Observed: 2617 LLM calls over 103 minutes from a single zombie run.

Fix

Two changes in src/agents/pi-embedded-runner/run/attempt.ts:

1. Pre-prompt abort guard

Before every activeSession.prompt() call, check if the run is already aborted. If so, throw makeAbortError() instead of starting a prompt that would create an unabortable Agent loop.

if (aborted || runAbortController.signal.aborted) {
  throw makeAbortError(runAbortController.signal);
}

2. Finally-block safety net

Add void activeSession.abort() in the finally block when the run was aborted or timed out. This catches rare race windows where a prompt was started just before the pre-abort guard fires.

if (aborted || timedOut) {
  void activeSession.abort();
}

Uses void (not await) to avoid blocking the cleanup path.
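Put together, the two changes might look like the sketch below. This is an illustration under assumptions, not the code in attempt.ts: `SessionLike`, `runAttempt`, and this `makeAbortError` are hypothetical, and the finally condition checks the signal directly because the sketch has no `timedOut` flag.

```typescript
// Hypothetical session shape; only the two methods named in the PR text are modeled.
interface SessionLike {
  prompt(text: string): Promise<void>;
  abort(): Promise<void>;
}

// Hypothetical helper mirroring the call shape makeAbortError(signal).
function makeAbortError(signal: AbortSignal): Error {
  const reason: unknown = signal.reason;
  const err = new Error(reason instanceof Error ? reason.message : "Request was aborted.");
  err.name = "AbortError";
  return err;
}

async function runAttempt(
  session: SessionLike,
  runAbortController: AbortController,
  text: string,
): Promise<"completed" | "aborted"> {
  let aborted = false;
  try {
    // 1. Pre-prompt guard: never call prompt() on an already-aborted run.
    if (runAbortController.signal.aborted) {
      throw makeAbortError(runAbortController.signal);
    }
    await session.prompt(text);
    return "completed";
  } catch (err) {
    if (err instanceof Error && err.name === "AbortError") {
      aborted = true;
      return "aborted";
    }
    throw err;
  } finally {
    // 2. Safety net: fire-and-forget so the cleanup path is not blocked.
    if (aborted || runAbortController.signal.aborted) {
      void session.abort();
    }
  }
}
```

With a controller that is aborted before `runAttempt` is called, `prompt()` is never invoked and `abort()` is called once from the finally block; with a fresh controller, the happy path is unchanged.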

Testing

All 127 existing tests in attempt.test.ts pass:

 Test Files  1 passed (1)
      Tests  127 passed (127)

Risk Assessment

Low risk. Both changes are additive guards:

  • The pre-prompt guard only fires when the run is already aborted — no change to the happy path
  • The finally-block activeSession.abort() is best-effort fire-and-forget, same as the existing call in abortRun()
  • The catch block at line 2768 correctly handles the thrown AbortError (existing isRunnerAbortError check)

Changed files

  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +11/-0)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

When params.abortSignal is already aborted before activeSession.prompt() is called (e.g. rapid consecutive messages with messages.queue.mode: "interrupt"), abortable() immediately rejects but the prompt() async chain has already started. The floating Promise creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts, causing the Agent to loop indefinitely calling the LLM after the attempt has exited. Observed: 2617 LLM calls over 103 minutes from a single zombie run.

Steps to reproduce

  1. Configure messages.queue.mode: "interrupt" in openclaw.json
  2. Send a message to the agent
  3. Within 1 second, send a second message (interrupt mode aborts the first run)
  4. Observe that the first run's Agent continues calling the LLM in the background after the attempt has exited

Alternatively, run the reproduction test below which uses a pre-aborted AbortSignal to simulate the same condition deterministically.

<details> <summary>agent-zombie-loop.test.ts (click to expand)</summary>
import { Agent, type AgentMessage } from "@mariozechner/pi-agent-core";
import type { Api, Message, Model } from "@mariozechner/pi-ai";
import { beforeEach, describe, expect, it } from "vitest";
import {
  createDefaultEmbeddedSession,
  getHoisted,
  resetEmbeddedAttemptHarness,
  testModel,
} from "./attempt.spawn-workspace.test-support.js";

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
const mockModel = testModel as unknown as Model<Api>;

const mockTool = {
  name: "mock_tool",
  label: "Mock Tool",
  description: "mock",
  parameters: { type: "object" as const, properties: {} },
  execute: async () => ({ content: [{ type: "text" as const, text: "Aborted" }], details: {} }),
};

function createToolUseStreamFn(tracker: { count: number }) {
  return async (_model: unknown, _context: unknown, options?: { signal?: AbortSignal }) => {
    tracker.count += 1;
    await sleep(5);
    if (options?.signal?.aborted) {
      const err = new Error("Request was aborted.");
      err.name = "AbortError";
      throw err;
    }
    const message = {
      role: "assistant" as const,
      content: [
        { type: "toolCall" as const, id: `call_${tracker.count}`, name: "mock_tool", arguments: {} },
      ],
      usage: { input: 70, output: 51, cacheRead: 0, cacheWrite: 0, totalTokens: 121, cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 } },
      stopReason: "toolUse" as const,
      timestamp: Date.now(),
    };
    return {
      [Symbol.asyncIterator]() {
        let done = false;
        return { async next() { if (!done) { done = true; return { done: false, value: { type: "done", message } }; } return { done: true, value: undefined }; } };
      },
      async result() { return message; },
    } as never;
  };
}

const hoisted = getHoisted();

describe("Agent zombie loop (upstream bug)", () => {
  beforeEach(() => { resetEmbeddedAttemptHarness(); });

  it("bug: abort-before-prompt produces floating Promise, Agent loops after attempt exits", { timeout: 10_000 }, async () => {
    const tracker = { count: 0 };
    const agent = new Agent({
      initialState: { systemPrompt: "test", model: mockModel, tools: [mockTool] },
      streamFn: createToolUseStreamFn(tracker),
      convertToLlm: (msgs: AgentMessage[]): Message[] =>
        msgs.filter((m) => ["user", "assistant", "toolResult"].includes(m.role)) as Message[],
    });

    hoisted.createAgentSessionMock.mockResolvedValue({
      session: createDefaultEmbeddedSession({
        prompt: async (_session, prompt) => {
          agent.prompt(prompt).catch(() => {});
          await sleep(50);
        },
      }),
    });

    const abortSignal = AbortSignal.abort(new Error("second message arrived"));
    const { runEmbeddedAttempt } = await import("./attempt.js");

    await runEmbeddedAttempt({
      sessionId: "zombie-test", sessionKey: "agent:main:main",
      sessionFile: "/tmp/zombie-test.jsonl", workspaceDir: "/tmp", agentDir: "/tmp",
      config: {}, prompt: "first message", timeoutMs: 5_000, runId: "zombie-run",
      provider: "openai", modelId: "gpt-test", model: mockModel,
      authStorage: { getApiKey: async () => undefined } as never,
      modelRegistry: {} as never, thinkLevel: "off",
      senderIsOwner: true, disableMessageTool: true, abortSignal,
    });

    const countAtExit = tracker.count;
    await sleep(500);
    const countAfterWait = tracker.count;

    console.log(`LLM calls at exit=${countAtExit}, after 500ms=${countAfterWait}, delta=${countAfterWait - countAtExit}`);
    expect(countAfterWait).toBeGreaterThan(countAtExit);

    agent.abort();
    agent.clearAllQueues?.();
    await agent.waitForIdle();
  });
});
</details>

Expected behavior

When a run is aborted (via interrupt mode, timeout, or RPC), the Agent should stop all LLM calls promptly. No floating Promises should outlive the attempt lifecycle.

Actual behavior

The Agent continues calling the LLM indefinitely after the attempt has returned. Each iteration: ~90k input tokens + ~35 output tokens, stopReason always toolUse, tools always throw AbortError (caught as error result), model retries the same tool call. Loop never terminates unless the process restarts.
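For a sense of scale, the per-iteration figures above can be multiplied out against the 2617 calls over 103 minutes quoted in the summary. This is back-of-the-envelope arithmetic that assumes the approximate per-call token counts applied to that particular run; the variable names are illustrative only.

```typescript
// Figures quoted in the issue; the per-call token counts are approximate.
const llmCalls = 2617;
const durationMinutes = 103;
const inputTokensPerCall = 90_000; // "~90k input tokens"
const outputTokensPerCall = 35; // "~35 output tokens"

const totalInputTokens = llmCalls * inputTokensPerCall; // 235,530,000
const totalOutputTokens = llmCalls * outputTokensPerCall; // 91,595
const callsPerMinute = llmCalls / durationMinutes; // about 25.4

console.log({ totalInputTokens, totalOutputTokens, callsPerMinute: callsPerMinute.toFixed(1) });
```

Under those assumptions a single zombie run consumed roughly 235 million input tokens at about 25 calls per minute, almost all of it on the input side.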

OpenClaw version

All releases since v2026.1.20 (bug introduced in commit 016693a1f on 2026-01-18)

Operating system

Linux (also reproducible on macOS)

Install method

pnpm dev / npm global

Model

Any model (bug is model-agnostic; the loop is in the Agent runtime, not the LLM)

Provider / routing chain

Any provider (bug is provider-agnostic)

Additional provider/model setup details

NOT_ENOUGH_INFO

Logs, screenshots, and evidence

Production observations across 3 independent cases:

| Case | Trigger | Duration | LLM calls |
|------|---------|----------|-----------|
| 1 | timeout-compaction retry | 76 min | ~2130 |
| 2 | timeout-compaction retry | 2+ hours | ~952 (log truncated) |
| 3 | user rapid messages (652ms apart) | 103 min | 2617 |

Log signature of a zombie run:

  • embedded run prompt end durationMs=<very small, e.g. 22-26ms> (abortable() rejected immediately)
  • Continued model.usage stopReason=toolUse lines after run cleanup for the same runId
  • All tool results are "Aborted" (error result)
  • embedded run done never appears

Impact and severity

  • Affected: Any user with messages.queue.mode: "interrupt" who sends rapid consecutive messages
  • Severity: High (silent resource drain, potential large API cost)
  • Frequency: Near-deterministic with interrupt mode + rapid messages; lower probability via timeout-compaction
  • Consequence: Unbounded LLM API cost, server resource exhaustion, no user-visible indication of the problem

Additional information

Root cause: await abortable(activeSession.prompt(effectivePrompt)) in attempt.ts (introduced in 016693a1f). JavaScript evaluates activeSession.prompt() first (starting the async chain), then abortable() races it. When the signal is pre-aborted, abortable() rejects immediately but the floating Promise from prompt() creates a new Agent._runLoop() with a fresh abortController that nobody ever aborts.

Why the inner loop never stops: Agent._runLoop() in pi-agent-core only exits on stopReason === "error" | "aborted". The zombie's signal is never aborted. Tools throw AbortError (from the outer runAbortController.signal), but this is caught as an error tool result — the model retries indefinitely.

Why the circuit breaker doesn't fire: Tool wrapper order is abort-check (outer) → loop-detection (inner). The abort throw short-circuits before the loop detector ever runs.
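The two explanations above can be sketched together. Everything here is hypothetical (`LoopDetector`, `withAbortCheck`, `withLoopDetection`, and `callAsAgent` are illustrative names, not the project's wrappers); the sketch only assumes what the text states: the abort check is the outer wrapper, and a thrown tool error is turned into an error result instead of ending the loop.

```typescript
type Tool = (args: string) => Promise<string>;

// Hypothetical loop detector: trips after `limit` identical consecutive calls.
class LoopDetector {
  seen = 0;
  private last: string | undefined;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  record(args: string): void {
    this.seen = args === this.last ? this.seen + 1 : 1;
    this.last = args;
    if (this.seen >= this.limit) throw new Error("loop detected");
  }
}

function withLoopDetection(detector: LoopDetector, tool: Tool): Tool {
  return async (args) => {
    detector.record(args);
    return tool(args);
  };
}

function withAbortCheck(signal: AbortSignal, tool: Tool): Tool {
  return async (args) => {
    if (signal.aborted) {
      const err = new Error("Aborted");
      err.name = "AbortError";
      throw err; // thrown before the inner wrapper is ever reached
    }
    return tool(args);
  };
}

// A thrown tool error becomes an error result, so the caller keeps looping.
async function callAsAgent(tool: Tool, args: string): Promise<string> {
  try {
    return await tool(args);
  } catch (err) {
    return `error: ${(err as Error).message}`;
  }
}

async function simulate(turns: number): Promise<{ results: string[]; detectorSaw: number }> {
  const detector = new LoopDetector(3);
  const outerSignal = AbortSignal.abort(); // the outer run signal is already aborted
  // Order from the issue: abort-check (outer) wraps loop-detection (inner).
  const wrapped = withAbortCheck(outerSignal, withLoopDetection(detector, async () => "ok"));
  const results: string[] = [];
  for (let i = 0; i < turns; i += 1) {
    results.push(await callAsAgent(wrapped, "same-args"));
  }
  return { results, detectorSaw: detector.seen };
}
```

Running `simulate(5)` yields five `"error: Aborted"` results while the detector records zero calls, so a detector with a limit of 3 never trips even though the same call repeats five times.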

Proposed 3-layer fix:

  1. Pre-prompt guard: check aborted state before calling activeSession.prompt() — eliminates the floating Promise at source
  2. finally block: call agent.abort() + agent.clearAllQueues() during attempt cleanup — terminates any escaped Agent
  3. Per-run LLM call hard cap: shared counter across attempts, configurable via agents.defaults.maxLlmCallsPerRun — ultimate safety net independent of abort signal propagation
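The third layer is only proposed in the issue and deferred by PR #74979, so the following is a speculative sketch of how a per-run cap could work, not an existing API. `LlmCallBudget`, `withBudget`, and `runUntilCapped` are hypothetical names; the `max` value stands in for whatever agents.defaults.maxLlmCallsPerRun would supply.

```typescript
// Hypothetical budget shared by every attempt that belongs to one run.
class LlmCallBudget {
  private used = 0;
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  consume(): void {
    if (this.used >= this.max) {
      throw new Error(`LLM call cap of ${this.max} reached for this run`);
    }
    this.used += 1;
  }
}

type StreamFn = (prompt: string) => Promise<string>;

// Wraps the provider call so the budget is charged before each request.
function withBudget(budget: LlmCallBudget, streamFn: StreamFn): StreamFn {
  return async (prompt) => {
    budget.consume();
    return streamFn(prompt);
  };
}

async function runUntilCapped(max: number): Promise<number> {
  let providerCalls = 0;
  const capped = withBudget(new LlmCallBudget(max), async () => {
    providerCalls += 1;
    return "toolUse"; // the zombie pattern: every reply asks for another tool call
  });
  try {
    // Stand-in for a loop that would otherwise never end.
    for (;;) {
      await capped("retry the same tool call");
    }
  } catch {
    // The cap error has to end the loop rather than be fed back as a tool result.
  }
  return providerCalls;
}
```

`runUntilCapped(3)` reaches the provider exactly three times. The design point is that the cap does not depend on any abort signal reaching the loop, which is why the issue describes it as a safety net independent of abort propagation.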

extent analysis

TL;DR

To fix the issue, implement a pre-prompt guard to check the aborted state before calling activeSession.prompt(), ensuring that no floating Promise is created when the signal is pre-aborted.

Guidance

  • Check the aborted state of the signal before calling activeSession.prompt() to prevent the creation of a floating Promise.
  • Implement a finally block to call agent.abort() and agent.clearAllQueues() during attempt cleanup to terminate any escaped Agent.
  • Consider introducing a per-run LLM call hard cap, configurable via agents.defaults.maxLlmCallsPerRun, as an ultimate safety net independent of abort signal propagation.
  • Review the tool wrapper order. The abort check currently wraps loop detection (abort-check outer, loop-detection inner), so the abort throw returns before the loop detector runs and the circuit breaker never sees the repeated calls.

Example

// Guard first: calling prompt() is what starts the Agent loop.
if (runAbortController.signal.aborted) {
  throw makeAbortError(runAbortController.signal);
}
await abortable(activeSession.prompt(effectivePrompt));

Notes

The PR notes on this page cover the first two layers of the proposed fix (the pre-prompt guard and the finally-block cleanup). PR #74979 defers the maxLlmCallsPerRun cap to a follow-up.

Recommendation

Apply the proposed 3-layer fix, starting with the pre-prompt guard, to ensure that the Agent stops all LLM calls promptly when a run is aborted.
