openclaw - ✅(Solved) Fix Design: tool abort signal handling and framework-level execution interruption [3 pull requests, 2 comments, 2 participants]

openclaw/openclaw#65223, fetched 2026-04-12 13:25:00
Root Cause

Tools receive an AbortSignal parameter in their execute() signature but currently ignore it (typed as _signal). When a run is interrupted (abort, force-detach), the tool continues executing to natural completion, blocking the old run's cleanup and wasting resources.

Fix / Workaround

  • #65221 — Interrupt scheduling race fix (PR B, documents tool continuation as known limitation)
  • #65010 — Steer gate fix (PR A)
  • #65222 — Telegram delivery bypass (PR C, includes debouncer bypass)
  • #64951 — Inbound debouncer keyChains issue (closed, resolved by #65222)

PR fix notes

PR #65010: fix: use isActive+isStopped instead of isStreaming for steer message injection

Description (problem / solution / changelog)

Problem

The isStreaming() guard in queueEmbeddedPiMessage and queueReplyRunMessage is too restrictive — it blocks steer/interrupt messages during tool execution (when isStreaming is false but the agent loop is still running). This prevents legitimate message delivery in steer and interrupt queue modes.

Upstream PRs #52351 and #60604 address related issues but neither has been merged.

Root Cause

isStreaming only reflects whether the LLM is actively streaming tokens. During tool execution, isStreaming is false but the agent loop is still running and can drain steering messages between tool calls. The guard incorrectly rejects steer messages in this window.

Fix

Replace isStreaming() with a precise agent-loop lifecycle check via isStopped():

  • isStopped() returns true when the agent loop is NOT actively running — either before the first prompt starts (startup window) or after the prompt finishes (teardown window).
  • agentLoopStarted flag set immediately before activeSession.prompt() — after all pre-prompt work (image loading, hooks, overflow prechecks).
  • agentLoopStopped flag set in the prompt finally block.
  • isStopped = !started || stopped — guards both startup and teardown windows while allowing steer during the full active agent loop (streaming + tool execution).
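The flag logic above can be sketched roughly as follows. This is a minimal illustration and not the code from attempt.ts; the class name `AgentLoopLifecycle` and the methods `markStarted`, `markStopped`, and `canSteer` are hypothetical.

```typescript
// Hypothetical sketch of the lifecycle flags described above.
class AgentLoopLifecycle {
  private started = false;
  private stopped = false;

  // Call immediately before the prompt starts (after all pre-prompt work).
  markStarted(): void {
    this.started = true;
  }

  // Call from the prompt's finally block.
  markStopped(): void {
    this.stopped = true;
  }

  // True in the startup window (not started) and the teardown window (stopped).
  isStopped(): boolean {
    return !this.started || this.stopped;
  }
}

// Steer messages are accepted only while the agent loop is active,
// which covers both token streaming and tool execution.
function canSteer(lifecycle: AgentLoopLifecycle): boolean {
  return !lifecycle.isStopped();
}
```

Because the check no longer depends on whether tokens are streaming, a steer message that arrives during a tool call falls inside the accepted window.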

In agent-runner.ts, the steer guard changes from shouldSteer && isStreaming to shouldSteer && isActive. The now-dead isStreaming param is removed from runReplyAgent, along with its unused destructure in runPreparedReply that was triggering an oxlint CI failure.

Changes

File | Change
runs.ts | Add isStopped? to EmbeddedPiQueueHandle; add isStopped guard in queueEmbeddedPiMessage
attempt.ts | Add agentLoopStarted/agentLoopStopped flags; expose isStopped() on queue handle
agent-runner.ts | Change steer guard from isStreaming to isActive; remove dead isStreaming param
reply-run-registry.ts | Add isStopped?() to ReplyBackendHandle; use isStopped in queueReplyRunMessage
get-reply-run.ts | Remove dead isStreaming destructure (oxlint fix)
runs.test.ts | 6 test cases covering all steer acceptance/rejection states
agent-runner.steer-guard.test.ts | 3 test cases for the isActive steer guard
Test call-site cleanup | Drop isStreaming: params from 4 existing tests

Testing

  • Targeted tests pass: runs.test.ts (14), agent-runner.steer-guard.test.ts (3), get-reply-run-queue.test.ts (3)
  • oxlint clean (previously failed with unused isStreaming variable)
  • Import cycle / type checks pass
  • Lightly tested with live Telegram interrupt workflow

AI Disclosure

  • AI-assisted (OpenClaw agent + Codex review)
  • Author understands all changes

Fixes #48003

Related

Part of a 3-PR fix set for interrupt/steer message delivery:

  • PR A (this): #65010 — steer gate fix (isStreaming → isActive/isStopped)
  • PR B: #65221 — interrupt scheduling race fix (core)
  • PR C: #65222 — Telegram delivery bypass (depends on A + B)
  • Issue: #65223 — tool abort signal design (future work)

Changed files

  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +9/-0)
  • src/agents/pi-embedded-runner/runs.test.ts (modified, +78/-1)
  • src/agents/pi-embedded-runner/runs.ts (modified, +11/-4)
  • src/auto-reply/reply/agent-runner-direct-runtime-config.test.ts (modified, +0/-2)
  • src/auto-reply/reply/agent-runner.media-paths.test.ts (modified, +0/-1)
  • src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts (modified, +0/-19)
  • src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts (modified, +0/-1)
  • src/auto-reply/reply/agent-runner.steer-guard.test.ts (added, +235/-0)
  • src/auto-reply/reply/agent-runner.ts (modified, +1/-3)
  • src/auto-reply/reply/get-reply-run.ts (modified, +2/-3)
  • src/auto-reply/reply/reply-run-registry.ts (modified, +11/-1)

PR #65221: fix: resolve interrupt scheduling race between embedded run and ReplyOperation registries

Description (problem / solution / changelog)

Problem

When a user message arrives during an active run in interrupt queue mode, the scheduling layer rejects it with "⚠️ Previous run is still shutting down" and the message is lost.

Root cause

Two registries track run activity: ACTIVE_EMBEDDED_RUNS (scheduling) and ReplyOperation (delivery). After abort:

  1. waitForEmbeddedPiRunEnd resolves when ACTIVE_EMBEDDED_RUNS.delete() fires
  2. But isEmbeddedPiRunActive() checks both registries via OR
  3. ReplyOperation.clearState() runs in the old run's finally block, which has not yet completed
  4. New message scheduling calls resolveBusyState() → isActive still true → rejected
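The window described in the four steps above can be reproduced with a small model. This is a sketch under assumptions: the two `Set` objects stand in for ACTIVE_EMBEDDED_RUNS and the ReplyOperation registry, and `isRunActive` is a hypothetical simplification of the OR check, not the real isEmbeddedPiRunActive().

```typescript
// Hypothetical stand-ins for the two registries; names are illustrative.
const activeEmbeddedRuns = new Set<string>(); // scheduling registry
const activeReplyOperations = new Set<string>(); // delivery registry

// The busy check ORs both registries, as described in step 2.
function isRunActive(sessionId: string): boolean {
  return activeEmbeddedRuns.has(sessionId) || activeReplyOperations.has(sessionId);
}

// A run is active in both registries.
activeEmbeddedRuns.add("s1");
activeReplyOperations.add("s1");

// Step 1: the embedded handle is deleted, so the waiter resolves here.
activeEmbeddedRuns.delete("s1");

// Steps 2 and 4: the reply operation is still registered, so the session
// still looks busy and a new message would be rejected.
const busyAfterWait = isRunActive("s1");

// Step 3 completes later, in the old run's finally block.
activeReplyOperations.delete("s1");
const busyAfterCleanup = isRunActive("s1");
```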

Changes

1. Two-phase wait (waitForEmbeddedPiRunEnd)

After the embedded run handle is removed, chain waitForReplyRunEndBySessionId with remaining timeout budget so callers see a fully idle session.
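A two-phase wait with a shared timeout budget might look like the sketch below. The helper names `withTimeout` and `waitForFullyIdle` are hypothetical, and the two promises stand in for whatever the real waitForEmbeddedPiRunEnd and waitForReplyRunEndBySessionId observe.

```typescript
// Resolves true if `done` fulfils within `ms`, false on timeout or rejection.
function withTimeout(done: Promise<void>, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), Math.max(0, ms));
    done.then(
      () => {
        clearTimeout(timer);
        resolve(true);
      },
      () => {
        clearTimeout(timer);
        resolve(false);
      },
    );
  });
}

// Phase 1 waits for the embedded run; phase 2 spends whatever budget is left
// waiting for the reply operation, so callers only see a fully idle session.
async function waitForFullyIdle(
  embeddedRunEnded: Promise<void>,
  replyRunEnded: Promise<void>,
  timeoutMs: number,
): Promise<boolean> {
  const startedAt = Date.now();
  if (!(await withTimeout(embeddedRunEnded, timeoutMs))) return false;
  const remaining = timeoutMs - (Date.now() - startedAt);
  return withTimeout(replyRunEnded, remaining);
}
```

Sharing one budget keeps the total wait bounded by the caller's original timeout instead of doubling it.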

2. Force-detach fallback (forceDetachEmbeddedRun)

When abort+wait times out, remove the old run's handle from ACTIVE_EMBEDDED_RUNS so a new run can register immediately. The old run's finally block detects the handle mismatch and becomes a no-op.
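The handle identity guard can be illustrated as follows. This is a sketch, assuming the registry is a map keyed by session ID; `RunHandle`, `forceDetach`, and `clearIfOwner` are hypothetical names rather than the functions in runs.ts.

```typescript
interface RunHandle {
  readonly runId: string;
}

// Hypothetical stand-in for the scheduling registry, keyed by session ID.
const activeRuns = new Map<string, RunHandle>();

// Force-detach: drop the old handle so a new run can register immediately.
function forceDetach(sessionId: string): void {
  activeRuns.delete(sessionId);
}

// Cleanup executed from a run's finally block. It clears the entry only when
// the registered handle is still its own; otherwise it does nothing.
function clearIfOwner(sessionId: string, handle: RunHandle): boolean {
  if (activeRuns.get(sessionId) !== handle) return false;
  activeRuns.delete(sessionId);
  return true;
}
```

Comparing by object identity means a late cleanup from the detached run cannot remove the entry that belongs to its replacement.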

3. Retry with force (agent-runner.ts)

On ReplyRunAlreadyActiveError, wait 500ms for the reply-op to clear, then retry with force: true. Only in interrupt mode — collect/followup modes preserve the existing run to avoid dropped assistant progress.
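The retry policy could be sketched like this. The error class below is a local stand-in that only shares its name with the real one, and `StartRun`, `startWithRetry`, and the `delayMs` parameter are hypothetical.

```typescript
// Local stand-in for the real error type; only the name is taken from the PR.
class ReplyRunAlreadyActiveError extends Error {
  constructor(message = "reply run already active") {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

type QueueMode = "interrupt" | "collect" | "followup";
type StartRun = (opts: { force: boolean }) => Promise<string>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function startWithRetry(
  start: StartRun,
  mode: QueueMode,
  delayMs = 500,
): Promise<string> {
  try {
    return await start({ force: false });
  } catch (err) {
    // Only interrupt mode may supersede; other modes keep the existing run.
    if (!(err instanceof ReplyRunAlreadyActiveError) || mode !== "interrupt") throw err;
    await sleep(delayMs); // give the old reply operation time to clear
    return start({ force: true });
  }
}
```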

4. Reply-run-registry hardening

  • skipNotify option prevents premature waiter resolution during force-supersede
  • isReplaced branch returns immediately without mutating maps the replacement op may reference
  • Normal clear path iterates all waitKeysBySessionId aliases to remove stale entries after session rotation

Known Limitations

  • Tool execution continues after force-detach — tools run to natural completion. Framework-level abort race or per-tool signal handling is planned as a separate feature.
  • Steer messages during long tool execution are queued but not processed until the tool returns. Steer-aware tool yield is future work.
  • sessions_send HTTP connections cannot be cancelled after abort.

Tests

29 tests passing across 3 test files:

  • runs.test.ts: force-detach, isEmbeddedPiRunActiveForSessionKey, handle identity guard
  • get-reply-run-queue.test.ts: force-detach + direct return, force-detach failure fallback
  • reply-run-registry.test.ts: skipNotify, isReplaced branch, alias preservation during force-supersede

Related

Part of a 3-PR fix set for interrupt/steer message delivery:

  • PR A: #65010 — steer gate fix (isStreaming → isActive)
  • PR B (this): #65221 — interrupt scheduling race fix (core, independent of A)
  • PR C: #65222 — Telegram delivery bypass (depends on A + B)
  • Issue: #65223 — tool abort signal design (future work)
  • Supersedes #65021 (which mixed core + Telegram changes)

AI Disclosure

  • AI-assisted (OpenClaw agent + Codex review)
  • Fully tested (29 tests across 3 test files)
  • Author understands all changes

Changed files

  • src/agents/pi-embedded-runner.ts (modified, +2/-0)
  • src/agents/pi-embedded-runner/runs.test.ts (modified, +76/-0)
  • src/agents/pi-embedded-runner/runs.ts (modified, +87/-3)
  • src/agents/pi-embedded.runtime.ts (modified, +2/-0)
  • src/agents/pi-embedded.ts (modified, +2/-0)
  • src/auto-reply/reply/agent-runner.ts (modified, +73/-5)
  • src/auto-reply/reply/get-reply-run-queue.test.ts (modified, +133/-0)
  • src/auto-reply/reply/get-reply-run-queue.ts (modified, +42/-3)
  • src/auto-reply/reply/get-reply-run.ts (modified, +5/-0)
  • src/auto-reply/reply/reply-run-registry.test.ts (modified, +415/-0)
  • src/auto-reply/reply/reply-run-registry.ts (modified, +156/-17)

PR #65222: fix(telegram): bypass grammY sequential key for interrupt/steer message delivery

Description (problem / solution / changelog)

Problem

In Telegram, grammY's sequentialize middleware serializes all updates for the same chat. When a run is active (e.g., executing a tool call), new messages queue behind it. This prevents:

  • Interrupt mode: user message cannot reach the scheduling layer to trigger abort
  • Steer mode: user message cannot be injected into the active run's steering queue

Changes

1. Chat-session cache (chatSessionCache)

Maps Telegram chat IDs → OpenClaw session keys. Capped at 500 entries with LRU eviction. Populated when session state is resolved during message processing.
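A capped cache with LRU eviction can be built on the insertion order of a `Map`. The sketch below is an assumption about one reasonable shape, not the chatSessionCache implementation; the class name `LruCache` and the variable `chatSessions` are hypothetical.

```typescript
// Minimal LRU built on Map insertion order.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();

  constructor(private readonly maxEntries: number) {}

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Chat ID -> session key, capped at 500 entries as described above.
const chatSessions = new LruCache<number, string>(500);
```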

2. Per-message sequential key

When a run is active for the cached session key AND the queue mode is interrupt or steer/steer-backlog/steer+backlog, generate a unique per-message key instead of the per-chat key. This lets grammY process the new message concurrently with the running update.
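The key selection might be expressed as a pure function like the one below. It is a sketch under assumptions: the key format, the `KeyInput` shape, and the function name `sequentialKey` are hypothetical, and the real logic lives in sequential-key.ts.

```typescript
type QueueMode =
  | "interrupt"
  | "steer"
  | "steer-backlog"
  | "steer+backlog"
  | "collect"
  | "followup";

// Queue modes that need to bypass per-chat serialization.
const BYPASS_MODES: ReadonlySet<QueueMode> = new Set<QueueMode>([
  "interrupt",
  "steer",
  "steer-backlog",
  "steer+backlog",
]);

interface KeyInput {
  chatId: number;
  updateId: number;
  runActive: boolean; // result of the run-active check for the cached session key
  queueMode: QueueMode;
}

function sequentialKey(input: KeyInput): string {
  const perChat = `telegram:${input.chatId}`;
  if (input.runActive && BYPASS_MODES.has(input.queueMode)) {
    // Unique per update, so this message does not queue behind the running one.
    return `${perChat}:bypass:${input.updateId}`;
  }
  return perChat;
}
```

When no run is active, or the mode is not a bypass mode, the function falls back to the per-chat key and ordering within the chat is preserved.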

3. Lightweight run-active-check module

New re-export path (pi-embedded-runner/run-active-check.ts) avoids pulling the full embedded-agent module graph into reply-runtime.ts → channel plugins. Addresses module boundary concern where reply-runtime is a lightweight SDK entry.

4. Live config for queue mode decisions

The sequential key bypass reads live config via loadConfig() instead of the outer config snapshot, so queue mode changes take effect without restart.

Dependencies

Clean 3-commit stack on top of the latest prerequisite commits; will rebase away once both merge:

  • #65010 — steer gate fix (isStreaming → isActive/isStopped)
  • #65221 — interrupt scheduling race fix (force-detach + ReplyOperation hardening + seq+identity-gated retry)

Tests

  • Telegram scoped tests (sequential-key.test.ts, bot.create-telegram-bot.test.ts, etc.): pass
  • Full Telegram suite: 1281/1281 pass
  • Auto-reply scoped (runs.test.ts, get-reply-run-queue.test.ts, reply-run-registry.test.ts, agent-runner.steer-guard.test.ts): 52/52 pass
  • oxlint: 0 errors (previously failing on unused isStreaming/isInterruptMode)

Known Limitations

  • chatSessionCache maps chat ID → session key; group chats with multiple senders sharing one chat ID work correctly (cache is per-chat, not per-sender)
  • The bypass only activates when isRunActiveForSessionKey returns true; if the cache misses (first message in a chat), the fallback is normal per-chat serialization

Related

Part of a 3-PR fix set for interrupt/steer message delivery:

  • PR A: #65010 — steer gate fix (isStreaming → isActive/isStopped)
  • PR B: #65221 — interrupt scheduling race fix (core)
  • PR C (this): #65222 — Telegram delivery bypass
  • Issue: #65223 — tool abort signal design (future work)
  • Supersedes Telegram-specific parts of #65021

AI Disclosure

  • AI-assisted (OpenClaw agent + Codex review)
  • Fully tested (see Tests section above)
  • Author understands all changes

Changed files

  • extensions/telegram/src/bot-deps.ts (modified, +6/-0)
  • extensions/telegram/src/bot-handlers.runtime.ts (modified, +83/-0)
  • extensions/telegram/src/bot-native-commands.ts (modified, +2/-0)
  • extensions/telegram/src/bot.create-telegram-bot.test.ts (modified, +5/-6)
  • extensions/telegram/src/bot.ts (modified, +53/-3)
  • extensions/telegram/src/sequential-key.test.ts (modified, +139/-44)
  • extensions/telegram/src/sequential-key.ts (modified, +49/-1)
  • src/agents/pi-embedded-runner.ts (modified, +2/-0)
  • src/agents/pi-embedded-runner/run-active-check.ts (added, +7/-0)
  • src/agents/pi-embedded-runner/run/attempt.ts (modified, +9/-0)
  • src/agents/pi-embedded-runner/runs.test.ts (modified, +154/-1)
  • src/agents/pi-embedded-runner/runs.ts (modified, +98/-7)
  • src/agents/pi-embedded.runtime.ts (modified, +2/-0)
  • src/agents/pi-embedded.ts (modified, +2/-0)
  • src/auto-reply/reply/agent-runner-direct-runtime-config.test.ts (modified, +0/-2)
  • src/auto-reply/reply/agent-runner.media-paths.test.ts (modified, +0/-1)
  • src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts (modified, +0/-19)
  • src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts (modified, +0/-1)
  • src/auto-reply/reply/agent-runner.steer-guard.test.ts (added, +235/-0)
  • src/auto-reply/reply/agent-runner.ts (modified, +74/-8)
  • src/auto-reply/reply/get-reply-run-queue.test.ts (modified, +133/-0)
  • src/auto-reply/reply/get-reply-run-queue.ts (modified, +42/-3)
  • src/auto-reply/reply/get-reply-run.ts (modified, +7/-3)
  • src/auto-reply/reply/queue/settings.ts (modified, +28/-0)
  • src/auto-reply/reply/reply-run-registry.test.ts (modified, +415/-0)
  • src/auto-reply/reply/reply-run-registry.ts (modified, +167/-18)
  • src/plugin-sdk/core.ts (modified, +1/-0)
  • src/plugin-sdk/reply-runtime.ts (modified, +3/-2)

Context

Tools receive an AbortSignal parameter in their execute() signature but currently ignore it (typed as _signal). When a run is interrupted (abort, force-detach), the tool continues executing to natural completion, blocking the old run's cleanup and wasting resources.

Current Behavior

  • process poll --timeout 30000: 250ms tick loop continues for full duration after abort
  • sessions_send: HTTP connection to gateway hangs until timeout (callGateway has no cancel mechanism)
  • All other tools: blocked until natural return

The old run cannot complete its finally block until the tool returns, keeping the ReplyOperation alive.

Key Findings from PR A/B/C Implementation

Tool results persist in conversation history before abort check

pi-agent-core's executeToolCalls always pushes tool results to currentContext.messages before checking stopReason === "aborted" in the next loop iteration. This means:

  • New runs can see old tool results (session IDs, run IDs, file paths) in conversation history
  • Agents can theoretically resume operations by reading history — e.g., re-polling a process session ID from a previous run's tool result
  • This is an implicit "resume" mechanism that doesn't require explicit framework support
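The ordering described above can be shown with a deliberately simplified loop. This is not the pi-agent-core implementation; `runToolCalls`, `Message`, and the callback shapes are hypothetical and only model "append the result first, check for abort afterwards".

```typescript
interface Message {
  role: "toolResult";
  content: string;
}

// The result is appended to history first; the abort flag is consulted only
// afterwards, so an aborted run still leaves its last tool result behind.
function runToolCalls(
  calls: Array<() => string>,
  isAborted: () => boolean,
  history: Message[],
): "aborted" | "completed" {
  for (const call of calls) {
    history.push({ role: "toolResult", content: call() });
    if (isAborted()) return "aborted";
  }
  return "completed";
}
```

A later run that reads `history` would still find the session ID returned by the interrupted run's tool call, which is what makes the implicit resume possible.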

forceDetach no longer notifies global waiters

As of #65221, forceDetachEmbeddedRun no longer calls notifyEmbeddedRunEnded. Concurrent waiters (session-reset, session management) time out naturally rather than receiving a premature idle signal. This means the detached run's continued tool execution doesn't create false-idle signals for other code paths.

Tool-by-tool analysis after interrupt

Tool | Process survives? | New run can resume? | Worth adding signal handling?
exec (start) | Yes (bash registry) | Yes, via session ID in history | No — don't kill process
process poll | N/A (no process) | N/A | Yes — stop 250ms tick loop
sessions_send | Child run continues | Orphaned (no cancel for callGateway) | Known limitation
browser | State persistent | Yes | No special handling needed
Short-lived tools | Complete before abort | N/A | Not worth handling

Design Questions

1. Per-tool vs framework-level

Per-tool: Each tool manually checks the signal and returns early. Pros: can return meaningful partial results. Cons: every new tool needs explicit handling; easy to forget.

Framework-level (Promise.race wrapper): Wrap every tool.execute() call with a race against the abort signal at the agent-core layer. Pros: automatic coverage for all tools. Cons: tool continues running in background; no partial results; tool result in conversation history becomes "interrupted" rather than actual output.
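For contrast with the framework-level race, a per-tool implementation could check the signal on every tick and return what it has gathered so far. The sketch below assumes a poll-style tool; `pollProcess`, `sleepOrAbort`, `PollResult`, and the `readNewLines` callback are hypothetical and do not reflect the real process poll tool.

```typescript
// Waits `ms`, resolving early with true if the signal aborts first.
function sleepOrAbort(ms: number, signal: AbortSignal): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    if (signal.aborted) {
      resolve(true);
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      resolve(true);
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve(false);
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

interface PollResult {
  lines: string[];
  interrupted: boolean;
}

// `readNewLines` stands in for reading fresh output from a running process.
async function pollProcess(
  readNewLines: () => string[],
  timeoutMs: number,
  signal: AbortSignal,
  tickMs = 250,
): Promise<PollResult> {
  const lines: string[] = [];
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    lines.push(...readNewLines());
    if (await sleepOrAbort(tickMs, signal)) {
      // Stop ticking but keep the partial output; the process is left alone.
      return { lines, interrupted: true };
    }
  }
  return { lines, interrupted: false };
}
```

The tool returns promptly on abort and its partial output can still be written to history, which the framework-level race cannot offer on its own.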

2. Interrupt vs steer semantics

The current AbortSignal maps to interrupt semantics (kill old run, start new). For steer mode, tools need a different signal — "there is a steering message, yield back to the agent loop so it can be processed" — rather than abort.

Should tools receive a second signal/callback for steer awareness? Or should the framework race against the steering queue?

3. Partial results

When a tool is interrupted mid-execution:

  • Should the partial result (e.g., last 5 seconds of process poll output) be preserved in conversation history?
  • If using framework-level race, the real tool result is lost — only "Tool execution interrupted" appears in history
  • Note: even without explicit partial results, tool results from previous calls in the same run ARE preserved (see "Key Findings" above)

4. Resource cleanup

  • process poll: Stop tick loop, but keep underlying exec process alive (new run can resume polling)
  • sessions_send: No way to cancel callGateway HTTP call currently — connection leaks until timeout
  • exec (process start): Process should never be killed on abort — only the poll

Proposed Approach

Layered:

  1. Framework-level safety net (new): Race tool.execute() against abort signal at the agent-core wrapper level. Guarantees no tool can block the agent loop indefinitely. One change covers all tools.
  2. Per-tool optimization (opt-in): Tools that can meaningfully handle interruption check the signal themselves and return partial results. Framework race becomes a fallback.
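One way the two layers could fit together is sketched below. It assumes, purely for illustration, that the framework gives a tool a short grace period after abort to return on its own before falling back to an "interrupted" outcome; `executeWithAbortFallback`, `ToolOutcome`, and `graceMs` are hypothetical and not part of the proposal text.

```typescript
type ToolOutcome<T> =
  | { status: "completed"; value: T }
  | { status: "interrupted" };

type ToolExecute<T> = (signal: AbortSignal) => Promise<T>;

function executeWithAbortFallback<T>(
  execute: ToolExecute<T>,
  signal: AbortSignal,
  graceMs = 50,
): Promise<ToolOutcome<T>> {
  return new Promise<ToolOutcome<T>>((resolve, reject) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const onAbort = () => {
      // Layer 2 gets a brief window to return partial results by itself;
      // after that, layer 1 stops waiting (the tool may keep running).
      timer = setTimeout(() => resolve({ status: "interrupted" }), graceMs);
    };
    const cleanup = () => {
      signal.removeEventListener("abort", onAbort);
      if (timer !== undefined) clearTimeout(timer);
    };
    if (signal.aborted) onAbort();
    else signal.addEventListener("abort", onAbort, { once: true });
    execute(signal).then(
      (value) => {
        cleanup();
        resolve({ status: "completed", value });
      },
      (err) => {
        cleanup();
        reject(err);
      },
    );
  });
}
```

A tool that ignores the signal yields `interrupted` after the grace period, while a tool that handles the signal can still deliver its partial result as a normal completion.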

This does not depend on PR C (abort-signal-propagation branch) — it can be built directly on top of the merged PR A + PR B codebase.

Related

  • #65221 — Interrupt scheduling race fix (PR B, documents tool continuation as known limitation)
  • #65010 — Steer gate fix (PR A)
  • #65222 — Telegram delivery bypass (PR C, includes debouncer bypass)
  • #64951 — Inbound debouncer keyChains issue (closed, resolved by #65222)

extent analysis

TL;DR

Implement a framework-level safety net by wrapping every tool.execute() call with a Promise.race against the abort signal to prevent tools from blocking the agent loop indefinitely.

Guidance

  • Introduce a framework-level wrapper around tool.execute() calls to race against the abort signal, ensuring no tool can block the agent loop indefinitely.
  • Identify tools that can meaningfully handle interruption and opt-in for per-tool optimization, allowing them to return partial results.
  • Prioritize tools like process poll and sessions_send for signal handling, as they have significant resource implications when interrupted.
  • Consider preserving partial results in conversation history for interrupted tools, weighing the benefits against the potential for incomplete or misleading information.

Example

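// Race a tool call against the abort signal so the agent loop never blocks on it.
// Assumes execute() accepts (args, signal); adjust to the real tool signature.
const executeTool = (tool, args, signal) => {
  const interrupted = () => new Error('Tool execution interrupted');
  // An already-aborted signal never fires 'abort' again, so check it up front.
  if (signal.aborted) return Promise.reject(interrupted());
  let onAbort;
  const aborted = new Promise((_, reject) => {
    onAbort = () => reject(interrupted());
    signal.addEventListener('abort', onAbort, { once: true });
  });
  // Remove the listener once either side settles to avoid leaking handlers.
  return Promise.race([tool.execute(args, signal), aborted]).finally(() => {
    signal.removeEventListener('abort', onAbort);
  });
};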

Notes

The proposed approach does not depend on the abort-signal-propagation branch (PR C) and can be built directly on top of the merged PR A + PR B codebase. However, the effectiveness of this solution may vary depending on the specific tool implementations and their ability to handle interruption.

Recommendation

Apply the proposed layered approach, starting with the framework-level safety net, to ensure that no tool can block the agent loop indefinitely. This provides a solid foundation for further per-tool optimizations and improvements.
