openclaw - ✅(Solved) Fix Design: tool abort signal handling and framework-level execution interruption [3 pull requests, 2 comments, 2 participants]

openclaw · 2026-04-12 07:08:02

GitHub stats

openclaw/openclaw#65223 • Fetched 2026-04-12 13:25:00

Author

jetd1

Participants

jetd1

openclaw-barnacle[bot]

Timeline (top)

cross-referenced ×3 • mentioned ×3 • subscribed ×3 • commented ×2

Tools receive an AbortSignal parameter in their execute() signature but currently ignore it (typed as _signal). When a run is interrupted (abort, force-detach), the tool continues executing to natural completion, blocking the old run's cleanup and wasting resources.

Fix / Workaround

#65221 — Interrupt scheduling race fix (PR B, documents tool continuation as known limitation)
#65010 — Steer gate fix (PR A)
#65222 — Telegram delivery bypass (PR C, includes debouncer bypass)
#64951 — Inbound debouncer keyChains issue (closed, resolved by #65222)

PR fix notes

PR #65010: fix: use isActive+isStopped instead of isStreaming for steer message injection

Repository: openclaw/openclaw
Author: jetd1
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/65010

Description (problem / solution / changelog)

Problem

The isStreaming() guard in queueEmbeddedPiMessage and queueReplyRunMessage is too restrictive — it blocks steer/interrupt messages during tool execution (when isStreaming is false but the agent loop is still running). This prevents legitimate message delivery in steer and interrupt queue modes.

Upstream PRs #52351 and #60604 address related issues but neither has been merged.

Root Cause

isStreaming only reflects whether the LLM is actively streaming tokens. During tool execution, isStreaming is false but the agent loop is still running and can drain steering messages between tool calls. The guard incorrectly rejects steer messages in this window.

Fix

Replace isStreaming() with a precise agent-loop lifecycle check via isStopped():

isStopped() returns true when the agent loop is NOT actively running — either before the first prompt starts (startup window) or after the prompt finishes (teardown window).
agentLoopStarted flag set immediately before activeSession.prompt() — after all pre-prompt work (image loading, hooks, overflow prechecks).
agentLoopStopped flag set in the prompt finally block.
isStopped = !started || stopped — guards both startup and teardown windows while allowing steer during the full active agent loop (streaming + tool execution).
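
The flag logic above can be sketched as follows. This is a minimal illustration rather than the code in `attempt.ts`: `createLoopLifecycle` and `runPrompt` are hypothetical names, and only `agentLoopStarted`, `agentLoopStopped`, and `isStopped` are taken from the PR description.

```typescript
// Hypothetical sketch of the agent-loop lifecycle flags described above.
interface LoopLifecycle {
  isStopped(): boolean;
  runPrompt<T>(prompt: () => Promise<T>): Promise<T>;
}

function createLoopLifecycle(): LoopLifecycle {
  let agentLoopStarted = false;
  let agentLoopStopped = false;
  return {
    // True before the first prompt starts and after the prompt finishes.
    isStopped: () => !agentLoopStarted || agentLoopStopped,
    async runPrompt<T>(prompt: () => Promise<T>): Promise<T> {
      agentLoopStarted = true; // set immediately before the prompt call
      try {
        return await prompt();
      } finally {
        agentLoopStopped = true; // set in the prompt's finally block
      }
    },
  };
}
```

With this shape, a steer message would be accepted only while `isStopped()` returns false, which covers both token streaming and tool execution.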

In agent-runner.ts, the steer guard changes from shouldSteer && isStreaming to shouldSteer && isActive. The now-dead isStreaming param is removed from runReplyAgent, along with its unused destructure in runPreparedReply that was triggering an oxlint CI failure.

Changes

| File | Change |
| --- | --- |
| `runs.ts` | Add `isStopped?` to `EmbeddedPiQueueHandle`; add `isStopped` guard in `queueEmbeddedPiMessage` |
| `attempt.ts` | Add `agentLoopStarted`/`agentLoopStopped` flags; expose `isStopped()` on queue handle |
| `agent-runner.ts` | Change steer guard from `isStreaming` to `isActive`; remove dead `isStreaming` param |
| `reply-run-registry.ts` | Add `isStopped?()` to `ReplyBackendHandle`; use `isStopped` in `queueReplyRunMessage` |
| `get-reply-run.ts` | Remove dead `isStreaming` destructure (oxlint fix) |
| `runs.test.ts` | 6 test cases covering all steer acceptance/rejection states |
| `agent-runner.steer-guard.test.ts` | 3 test cases for the isActive steer guard |
| Test call-site cleanup | Drop `isStreaming:` params from 4 existing tests |

Testing

Targeted tests pass: runs.test.ts (14), agent-runner.steer-guard.test.ts (3), get-reply-run-queue.test.ts (3)
oxlint clean (previously failed with unused isStreaming variable)
Import cycle / type checks pass
Lightly tested with live Telegram interrupt workflow

AI Disclosure

AI-assisted (OpenClaw agent + Codex review)
Author understands all changes

Fixes #48003

Part of a 3-PR fix set for interrupt/steer message delivery:

PR A (this): #65010 — steer gate fix (isStreaming → isActive/isStopped)
PR B: #65221 — interrupt scheduling race fix (core)
PR C: #65222 — Telegram delivery bypass (depends on A + B)
Issue: #65223 — tool abort signal design (future work)

Changed files

src/agents/pi-embedded-runner/run/attempt.ts (modified, +9/-0)
src/agents/pi-embedded-runner/runs.test.ts (modified, +78/-1)
src/agents/pi-embedded-runner/runs.ts (modified, +11/-4)
src/auto-reply/reply/agent-runner-direct-runtime-config.test.ts (modified, +0/-2)
src/auto-reply/reply/agent-runner.media-paths.test.ts (modified, +0/-1)
src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts (modified, +0/-19)
src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts (modified, +0/-1)
src/auto-reply/reply/agent-runner.steer-guard.test.ts (added, +235/-0)
src/auto-reply/reply/agent-runner.ts (modified, +1/-3)
src/auto-reply/reply/get-reply-run.ts (modified, +2/-3)
src/auto-reply/reply/reply-run-registry.ts (modified, +11/-1)

PR #65221: fix: resolve interrupt scheduling race between embedded run and ReplyOperation registries

Repository: openclaw/openclaw
Author: jetd1
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/65221

Description (problem / solution / changelog)

Problem

When a user message arrives during an active run in interrupt queue mode, the scheduling layer rejects it with "⚠️ Previous run is still shutting down" and the message is lost.

Root cause

Two registries track run activity: ACTIVE_EMBEDDED_RUNS (scheduling) and ReplyOperation (delivery). After abort:

waitForEmbeddedPiRunEnd resolves when ACTIVE_EMBEDDED_RUNS.delete() fires
But isEmbeddedPiRunActive() checks both registries via OR
ReplyOperation.clearState() runs in the old run's finally block, which has not yet completed
New message scheduling calls resolveBusyState() → isActive still true → rejected

Changes

1. Two-phase wait (`waitForEmbeddedPiRunEnd`)

After the embedded run handle is removed, chain waitForReplyRunEndBySessionId with remaining timeout budget so callers see a fully idle session.
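
A minimal sketch of the two-phase wait, assuming both waits can be modeled as functions that take a timeout and resolve to `true` once idle. `waitForFullyIdle`, the `Waiter` type, and the injectable `now` clock are hypothetical and exist only for illustration.

```typescript
// Hypothetical sketch: wait on two registries in sequence under one deadline.
type Waiter = (timeoutMs: number) => Promise<boolean>; // resolves true once idle

async function waitForFullyIdle(
  waitEmbeddedRunEnd: Waiter,
  waitReplyRunEnd: Waiter,
  timeoutMs: number,
  now: () => number = Date.now,
): Promise<boolean> {
  const deadline = now() + timeoutMs;
  // Phase 1: the embedded run handle is removed from its registry.
  if (!(await waitEmbeddedRunEnd(timeoutMs))) return false;
  // Phase 2: spend only what is left of the budget on the reply operation.
  const remaining = deadline - now();
  if (remaining <= 0) return false;
  return waitReplyRunEnd(remaining);
}
```

Sharing one deadline keeps the total wait bounded by the caller's original timeout instead of doubling it.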

2. Force-detach fallback (`forceDetachEmbeddedRun`)

When abort+wait times out, remove the old run's handle from ACTIVE_EMBEDDED_RUNS so a new run can register immediately. When the old run's finally block eventually executes, it detects the handle mismatch and becomes a no-op.
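
The handle identity guard can be sketched as below. This is an illustration under assumptions: `activeRuns` stands in for ACTIVE_EMBEDDED_RUNS, and `RunHandle`, `registerRun`, `forceDetach`, and `clearRun` are hypothetical names, not the functions in `runs.ts`.

```typescript
// Hypothetical sketch of a registry whose cleanup is guarded by handle identity.
interface RunHandle {
  readonly runId: string;
}

const activeRuns = new Map<string, RunHandle>(); // stand-in for ACTIVE_EMBEDDED_RUNS

function registerRun(sessionId: string, handle: RunHandle): void {
  activeRuns.set(sessionId, handle);
}

// Called when abort + wait timed out: drop the old handle so a new run can register.
function forceDetach(sessionId: string): boolean {
  return activeRuns.delete(sessionId);
}

// Called from a run's finally block: clear the entry only if it is still ours.
function clearRun(sessionId: string, handle: RunHandle): boolean {
  if (activeRuns.get(sessionId) !== handle) return false; // superseded: no-op
  activeRuns.delete(sessionId);
  return true;
}
```

Comparing by object identity rather than by session ID is what keeps a late cleanup from removing the replacement run's entry.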

3. Retry with force (`agent-runner.ts`)

On ReplyRunAlreadyActiveError, wait 500ms for the reply-op to clear, then retry with force: true. Only in interrupt mode — collect/followup modes preserve the existing run to avoid dropped assistant progress.
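
A minimal sketch of this retry policy follows. `startWithRetry`, `isAlreadyActive`, and the injectable `sleep` are hypothetical, and matching the error by name is an assumption made to keep the example self-contained; the PR presumably checks the real `ReplyRunAlreadyActiveError` class.

```typescript
// Hypothetical sketch of retry-with-force; only interrupt mode may supersede a run.
type QueueMode = "interrupt" | "collect" | "followup";

// Assumption: the error is recognized by name instead of by class.
function isAlreadyActive(err: unknown): boolean {
  return err instanceof Error && err.name === "ReplyRunAlreadyActiveError";
}

async function startWithRetry<T>(
  start: (opts: { force: boolean }) => Promise<T>,
  mode: QueueMode,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  try {
    return await start({ force: false });
  } catch (err) {
    // Other modes keep the existing run so assistant progress is not dropped.
    if (!isAlreadyActive(err) || mode !== "interrupt") throw err;
    await sleep(500); // give the old reply operation time to clear
    return start({ force: true });
  }
}
```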

4. Reply-run-registry hardening

skipNotify option prevents premature waiter resolution during force-supersede
isReplaced branch returns immediately without mutating maps the replacement op may reference
Normal clear path iterates all waitKeysBySessionId aliases to remove stale entries after session rotation

Known Limitations

Tool execution continues after force-detach — tools run to natural completion. Framework-level abort race or per-tool signal handling is planned as a separate feature.
Steer messages during long tool execution are queued but not processed until the tool returns. Steer-aware tool yield is future work.
sessions_send HTTP connections cannot be cancelled after abort.

Tests

29 tests passing across 3 test files:

runs.test.ts: force-detach, isEmbeddedPiRunActiveForSessionKey, handle identity guard
get-reply-run-queue.test.ts: force-detach + direct return, force-detach failure fallback
reply-run-registry.test.ts: skipNotify, isReplaced branch, alias preservation during force-supersede

Part of a 3-PR fix set for interrupt/steer message delivery:

PR A: #65010 — steer gate fix (isStreaming → isActive)
PR B (this): #65221 — interrupt scheduling race fix (core, independent of A)
PR C: #65222 — Telegram delivery bypass (depends on A + B)
Issue: #65223 — tool abort signal design (future work)
Supersedes #65021 (which mixed core + Telegram changes)

AI Disclosure

AI-assisted (OpenClaw agent + Codex review)
Fully tested (29 tests across 3 test files)
Author understands all changes

Changed files

src/agents/pi-embedded-runner.ts (modified, +2/-0)
src/agents/pi-embedded-runner/runs.test.ts (modified, +76/-0)
src/agents/pi-embedded-runner/runs.ts (modified, +87/-3)
src/agents/pi-embedded.runtime.ts (modified, +2/-0)
src/agents/pi-embedded.ts (modified, +2/-0)
src/auto-reply/reply/agent-runner.ts (modified, +73/-5)
src/auto-reply/reply/get-reply-run-queue.test.ts (modified, +133/-0)
src/auto-reply/reply/get-reply-run-queue.ts (modified, +42/-3)
src/auto-reply/reply/get-reply-run.ts (modified, +5/-0)
src/auto-reply/reply/reply-run-registry.test.ts (modified, +415/-0)
src/auto-reply/reply/reply-run-registry.ts (modified, +156/-17)

PR #65222: fix(telegram): bypass grammY sequential key for interrupt/steer message delivery

Repository: openclaw/openclaw
Author: jetd1
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/65222

Description (problem / solution / changelog)

Problem

In Telegram, grammY's sequentialize middleware serializes all updates for the same chat. When a run is active (e.g., executing a tool call), new messages queue behind it. This prevents:

Interrupt mode: user message cannot reach the scheduling layer to trigger abort
Steer mode: user message cannot be injected into the active run's steering queue

Changes

1. Chat-session cache (`chatSessionCache`)

Maps Telegram chat IDs → OpenClaw session keys. Capped at 500 entries with LRU eviction. Populated when session state is resolved during message processing.
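
One common way to build a capped LRU map in TypeScript relies on `Map` preserving insertion order. The sketch below is an assumption about the general shape, not the PR's implementation; the `LruCache` class is hypothetical.

```typescript
// Hypothetical sketch of a capped LRU map; not the PR's actual chatSessionCache.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Chat ID → session key, capped at 500 entries as described above.
const chatSessionCache = new LruCache<number, string>(500);
```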

2. Per-message sequential key

When a run is active for the cached session key AND the queue mode is interrupt or steer/steer-backlog/steer+backlog, generate a unique per-message key instead of the per-chat key. This lets grammY process the new message concurrently with the running update.
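
The key selection can be sketched as a pure function. Everything here is illustrative: `sequentialKey`, `KeyInput`, and the key string formats are hypothetical, and the real code in `sequential-key.ts` may derive its keys differently.

```typescript
// Hypothetical sketch of the key selection; the key formats are illustrative.
const BYPASS_MODES = new Set(["interrupt", "steer", "steer-backlog", "steer+backlog"]);

interface KeyInput {
  chatId: number;
  updateId: number;
  queueMode: string;
  sessionKey: string | undefined; // result of the chat-session cache lookup
  isRunActive: (sessionKey: string) => boolean;
}

function sequentialKey(input: KeyInput): string {
  const { chatId, updateId, queueMode, sessionKey, isRunActive } = input;
  const bypass =
    sessionKey !== undefined && BYPASS_MODES.has(queueMode) && isRunActive(sessionKey);
  // A unique key lets the update run concurrently; the chat key keeps per-chat ordering.
  return bypass ? `telegram:${chatId}:update:${updateId}` : `telegram:${chatId}`;
}
```

A cache miss or an idle session falls through to the per-chat key, so ordinary traffic keeps its serialization guarantees.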

3. Lightweight run-active-check module

New re-export path (pi-embedded-runner/run-active-check.ts) avoids pulling the full embedded-agent module graph into reply-runtime.ts → channel plugins. Addresses module boundary concern where reply-runtime is a lightweight SDK entry.

4. Live config for queue mode decisions

The sequential key bypass reads live config via loadConfig() instead of the outer config snapshot, so queue mode changes take effect without restart.

Dependencies

Clean 3-commit stack on top of the latest prerequisite commits; will rebase away once both merge:

#65010 — steer gate fix (isStreaming → isActive/isStopped)
#65221 — interrupt scheduling race fix (force-detach + ReplyOperation hardening + seq+identity-gated retry)

Tests

Telegram scoped tests (sequential-key.test.ts, bot.create-telegram-bot.test.ts, etc.): pass
Full Telegram suite: 1281/1281 pass
Auto-reply scoped (runs.test.ts, get-reply-run-queue.test.ts, reply-run-registry.test.ts, agent-runner.steer-guard.test.ts): 52/52 pass
oxlint: 0 errors (previously failing on unused isStreaming/isInterruptMode)

Known Limitations

chatSessionCache maps chat ID → session key; group chats with multiple senders sharing one chat ID work correctly (cache is per-chat, not per-sender)
The bypass only activates when isRunActiveForSessionKey returns true; if the cache misses (first message in a chat), the fallback is normal per-chat serialization

Part of a 3-PR fix set for interrupt/steer message delivery:

PR A: #65010 — steer gate fix (isStreaming → isActive/isStopped)
PR B: #65221 — interrupt scheduling race fix (core)
PR C (this): #65222 — Telegram delivery bypass
Issue: #65223 — tool abort signal design (future work)
Supersedes Telegram-specific parts of #65021

AI Disclosure

AI-assisted (OpenClaw agent + Codex review)
Fully tested (see Tests section above)
Author understands all changes

Changed files

extensions/telegram/src/bot-deps.ts (modified, +6/-0)
extensions/telegram/src/bot-handlers.runtime.ts (modified, +83/-0)
extensions/telegram/src/bot-native-commands.ts (modified, +2/-0)
extensions/telegram/src/bot.create-telegram-bot.test.ts (modified, +5/-6)
extensions/telegram/src/bot.ts (modified, +53/-3)
extensions/telegram/src/sequential-key.test.ts (modified, +139/-44)
extensions/telegram/src/sequential-key.ts (modified, +49/-1)
src/agents/pi-embedded-runner.ts (modified, +2/-0)
src/agents/pi-embedded-runner/run-active-check.ts (added, +7/-0)
src/agents/pi-embedded-runner/run/attempt.ts (modified, +9/-0)
src/agents/pi-embedded-runner/runs.test.ts (modified, +154/-1)
src/agents/pi-embedded-runner/runs.ts (modified, +98/-7)
src/agents/pi-embedded.runtime.ts (modified, +2/-0)
src/agents/pi-embedded.ts (modified, +2/-0)
src/auto-reply/reply/agent-runner-direct-runtime-config.test.ts (modified, +0/-2)
src/auto-reply/reply/agent-runner.media-paths.test.ts (modified, +0/-1)
src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts (modified, +0/-19)
src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts (modified, +0/-1)
src/auto-reply/reply/agent-runner.steer-guard.test.ts (added, +235/-0)
src/auto-reply/reply/agent-runner.ts (modified, +74/-8)
src/auto-reply/reply/get-reply-run-queue.test.ts (modified, +133/-0)
src/auto-reply/reply/get-reply-run-queue.ts (modified, +42/-3)
src/auto-reply/reply/get-reply-run.ts (modified, +7/-3)
src/auto-reply/reply/queue/settings.ts (modified, +28/-0)
src/auto-reply/reply/reply-run-registry.test.ts (modified, +415/-0)
src/auto-reply/reply/reply-run-registry.ts (modified, +167/-18)
src/plugin-sdk/core.ts (modified, +1/-0)
src/plugin-sdk/reply-runtime.ts (modified, +3/-2)

Context

Current Behavior

process poll --timeout 30000: 250ms tick loop continues for full duration after abort
sessions_send: HTTP connection to gateway hangs until timeout (callGateway has no cancel mechanism)
All other tools: blocked until natural return

The old run cannot complete its finally block until the tool returns, keeping the ReplyOperation alive.

Key Findings from PR A/B/C Implementation

Tool results persist in conversation history before abort check

pi-agent-core's executeToolCalls always pushes tool results to currentContext.messages before checking stopReason === "aborted" in the next loop iteration. This means:

New runs can see old tool results (session IDs, run IDs, file paths) in conversation history
Agents can theoretically resume operations by reading history — e.g., re-polling a process session ID from a previous run's tool result
This is an implicit "resume" mechanism that doesn't require explicit framework support

forceDetach no longer notifies global waiters

As of #65221, forceDetachEmbeddedRun no longer calls notifyEmbeddedRunEnded. Concurrent waiters (session-reset, session management) time out naturally rather than receiving a premature idle signal. This means the detached run's continued tool execution doesn't create false-idle signals for other code paths.

Tool-by-tool analysis after interrupt

| Tool | Process survives? | New run can resume? | Worth adding signal handling? |
| --- | --- | --- | --- |
| `exec` (start) | Yes (bash registry) | Yes, via session ID in history | No — don't kill process |
| `process poll` | N/A (no process) | N/A | Yes — stop 250ms tick loop |
| `sessions_send` | Child run continues | Orphaned (no cancel for `callGateway`) | Known limitation |
| `browser` | State persistent | Yes | No special handling needed |
| Short-lived tools | Complete before abort | N/A | Not worth handling |

Design Questions

1. Per-tool vs framework-level

Per-tool: Each tool manually checks the signal and returns early. Pros: can return meaningful partial results. Cons: every new tool needs explicit handling; easy to forget.

Framework-level (Promise.race wrapper): Wrap every tool.execute() call with a race against the abort signal at the agent-core layer. Pros: automatic coverage for all tools. Cons: tool continues running in background; no partial results; tool result in conversation history becomes "interrupted" rather than actual output.

2. Interrupt vs steer semantics

The current AbortSignal maps to interrupt semantics (kill old run, start new). For steer mode, tools need a different signal — "there is a steering message, yield back to the agent loop so it can be processed" — rather than abort.

Should tools receive a second signal/callback for steer awareness? Or should the framework race against the steering queue?

3. Partial results

When a tool is interrupted mid-execution:

Should the partial result (e.g., last 5 seconds of process poll output) be preserved in conversation history?
If using framework-level race, the real tool result is lost — only "Tool execution interrupted" appears in history
Note: even without explicit partial results, tool results from previous calls in the same run ARE preserved (see "Key Findings" above)

4. Resource cleanup

process poll: Stop tick loop, but keep underlying exec process alive (new run can resume polling)
sessions_send: No way to cancel callGateway HTTP call currently — connection leaks until timeout
exec (process start): Process should never be killed on abort — only the poll
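
For the `process poll` case, a signal-aware tick loop might look like the sketch below. It is an illustration only: `pollProcess`, `abortableDelay`, `PollResult`, and the `readChunk` callback are hypothetical, and the real tool's result shape is not shown in this issue.

```typescript
// Hypothetical sketch of a signal-aware poll loop; names and shapes are illustrative.
interface PollResult {
  output: string[];
  interrupted: boolean;
}

// Sleep that resolves early when the signal aborts.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    if (signal.aborted) return resolve();
    const onAbort = () => {
      clearTimeout(timer);
      resolve();
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

async function pollProcess(
  readChunk: () => string | undefined, // reads buffered output, if any
  timeoutMs: number,
  signal: AbortSignal,
  tickMs = 250,
): Promise<PollResult> {
  const output: string[] = [];
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline && !signal.aborted) {
    const chunk = readChunk();
    if (chunk !== undefined) output.push(chunk);
    await abortableDelay(tickMs, signal);
  }
  // Only the loop stops; the underlying process is left running.
  return { output, interrupted: signal.aborted };
}
```

Returning the collected output together with an `interrupted` flag is one way to preserve partial results while leaving the underlying process alive for a later run to resume polling.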

Proposed Approach

Layered:

Framework-level safety net (new): Race tool.execute() against abort signal at the agent-core wrapper level. Guarantees no tool can block the agent loop indefinitely. One change covers all tools.
Per-tool optimization (opt-in): Tools that can meaningfully handle interruption check the signal themselves and return partial results. Framework race becomes a fallback.

This does not depend on PR C (abort-signal-propagation branch) — it can be built directly on top of the merged PR A + PR B codebase.

Related

#65221 — Interrupt scheduling race fix (PR B, documents tool continuation as known limitation)
#65010 — Steer gate fix (PR A)
#65222 — Telegram delivery bypass (PR C, includes debouncer bypass)
#64951 — Inbound debouncer keyChains issue (closed, resolved by #65222)

extent analysis

TL;DR

Implement a framework-level safety net by wrapping every tool.execute() call with a Promise.race against the abort signal to prevent tools from blocking the agent loop indefinitely.

Guidance

Introduce a framework-level wrapper around tool.execute() calls to race against the abort signal, ensuring no tool can block the agent loop indefinitely.
Identify tools that can meaningfully handle interruption and opt them in to per-tool signal handling so they can return partial results.
Prioritize tools like process poll and sessions_send for signal handling, as they have significant resource implications when interrupted.
Consider preserving partial results in conversation history for interrupted tools, weighing the benefits against the potential for incomplete or misleading information.

Example

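const executeTool = (tool, signal) => {
  // If the run was already aborted, do not start the tool at all.
  if (signal.aborted) return Promise.reject(new Error('Tool execution interrupted'));
  let onAbort;
  const aborted = new Promise((_, reject) => {
    onAbort = () => reject(new Error('Tool execution interrupted'));
    signal.addEventListener('abort', onAbort, { once: true });
  });
  // Tool arguments are omitted here, as in the original sketch.
  // The race only unblocks the caller; the tool itself keeps running in the background.
  return Promise.race([tool.execute(), aborted])
    // Detach the listener so completed calls do not accumulate handlers.
    .finally(() => signal.removeEventListener('abort', onAbort));
};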

Notes

The proposed approach does not depend on the abort-signal-propagation branch (PR C) and can be built directly on top of the merged PR A + PR B codebase. However, the effectiveness of this solution may vary depending on the specific tool implementations and their ability to handle interruption.

Recommendation

Apply the proposed layered approach, starting with the framework-level safety net, to ensure that no tool can block the agent loop indefinitely. This provides a solid foundation for further per-tool optimizations and improvements.
