hermes - feat: cooperative cancellation token for concurrent tool execution

NousResearch/hermes-agent#28809 · Fetched 2026-05-20 04:01:46

Problem

When tools execute concurrently via ThreadPoolExecutor, user-initiated interrupt and /steer signals have limited effect on already-running tool workers. The concurrent execution wait loop (tool_executor.py L288–L380) can only f.cancel() futures that haven't been picked up by a worker yet. Since max_workers = min(32, num_tools+4), all tools typically start almost immediately, making f.cancel() a no-op for running futures.

This means:

  • /steer mid-execution → steer breakout is effectively cosmetic for concurrent paths; all tools run to completion before the model sees the guidance (addressed partially in #28808)
  • Ctrl+C / stop command → interrupt sets _interrupt_requested and fans out per-thread interrupt signals, but tools without is_interrupted() checks (web_search, read_file, write_file, patch, etc.) run to completion, blocking the conversation loop

Root Cause

f.cancel() on a running Future returns False and has no effect:

# tool_executor.py L314–L323 (interrupt path)
if agent._interrupt_requested:
    for f in not_done:
        f.cancel()  # ← only cancels PENDING futures, not RUNNING ones
    concurrent.futures.wait(not_done, timeout=3.0)
    break  # ← breaks the wait loop, but executor.shutdown(wait=True) in __exit__ still blocks
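
The behaviour described above can be reproduced outside the codebase. The following is a minimal sketch, not code from tool_executor.py: `demo_cancel_on_running` and `slow_tool` are hypothetical names, and only the pool sizing rule `min(32, num_tools + 4)` is taken from the issue. It starts a few blocking tasks, waits until every worker is running, and then records what `Future.cancel()` returns.

```python
import concurrent.futures
import threading


def demo_cancel_on_running(num_tools: int = 3) -> list:
    """Return the result of Future.cancel() for each already-running task."""
    release = threading.Event()                 # lets the workers finish at the end
    started = threading.Barrier(num_tools + 1)  # all workers plus the main thread

    def slow_tool() -> str:
        started.wait()   # proves this task has been picked up by a worker
        release.wait()   # simulates a long-running tool
        return "done"

    # Sizing rule quoted in the issue: the pool is never smaller than the batch.
    max_workers = min(32, num_tools + 4)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(slow_tool) for _ in range(num_tools)]
        started.wait()                          # every task is now RUNNING
        results = [f.cancel() for f in futures]
        release.set()    # without this, the pool's __exit__ would block
    return results


if __name__ == "__main__":
    print(demo_cancel_on_running())  # [False, False, False]
```

Because the pool has at least as many workers as tasks, no future is still pending by the time `cancel()` is called, so every call returns `False` and the tasks keep running until they are released.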

The infrastructure for cooperative cancellation already exists:

  • tools/interrupt.py → is_interrupted() checks per-thread interrupt state
  • tools/environments/base.py → _wait_for_process() polls is_interrupted() in its main loop
  • tools/code_execution_tool.py L1325 → checks is_interrupted() mid-execution

But only 3 tool files actually check is_interrupted(). Most tools (20+) have zero interrupt awareness:

| Tool | Has interrupt check? | Long-running? |
| --- | --- | --- |
| terminal (environments/base.py) | ✅ Yes | Yes |
| execute_code (code_execution_tool.py) | ✅ Yes | Yes |
| modal (modal_utils.py) | ✅ Yes | Yes |
| mcp_tool | ⚠️ Partial (asyncio cancel) | Yes |
| web_search | ❌ No | Can be (rate limits) |
| web_extract | ❌ No | Can be (large pages) |
| read_file | ❌ No | Rarely |
| write_file | ❌ No | No |
| patch | ❌ No | No |
| browser_* | ❌ No | Yes (page loads) |
| image_gen | ❌ No | Yes |
| tts | ❌ No | Yes |
| ... 15+ more | ❌ No | Varies |
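
For reference, the polling pattern that the interrupt-aware tools rely on can be sketched roughly as follows. This is an illustration only and not the actual body of _wait_for_process(): the function name `wait_for_process`, its parameters, and the poll interval are assumptions, and the interrupt check is passed in as a plain callable rather than imported from tools/interrupt.py.

```python
import subprocess
import sys
import threading
import time


def wait_for_process(proc, is_interrupted, poll_interval=0.05):
    """Wait for proc to exit, terminating it early if an interrupt is requested.

    Returns the exit code, or None when the process was stopped by an interrupt.
    """
    while proc.poll() is None:
        if is_interrupted():
            proc.terminate()   # ask the child process to stop
            proc.wait()        # reap it so no zombie process is left behind
            return None
        time.sleep(poll_interval)
    return proc.returncode


if __name__ == "__main__":
    flag = threading.Event()
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    threading.Timer(0.2, flag.set).start()      # simulate Ctrl+C after 200 ms
    print(wait_for_process(child, flag.is_set))
```

The key property is that the loop checks the flag on every iteration, so the worst-case delay between the interrupt and the early exit is about one poll interval.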

Proposed Solution: Cooperative Cancellation Token

Add a CancellationToken that propagates through the concurrent tool execution pipeline:

1. Token Object

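# agent/cancellation_token.py
import threading


# Not defined in the original proposal; a plain Exception subclass is assumed here.
class ToolCancelledError(Exception):
    """Raised at a yield point when the tool's token has been cancelled."""


class CancellationToken:
    """Cooperative cancellation for concurrent tool execution."""
    def __init__(self):
        self._cancelled = threading.Event()
        self._reason: str | None = None  # "interrupt" | "steer"

    def cancel(self, reason: str = "interrupt"):
        # Store the reason before setting the event, so any thread that
        # observes is_cancelled == True also observes the reason.
        self._reason = reason
        self._cancelled.set()

    @property
    def is_cancelled(self) -> bool:
        return self._cancelled.is_set()

    @property
    def reason(self) -> str | None:
        return self._reason

    def raise_if_cancelled(self):
        if self._cancelled.is_set():
            raise ToolCancelledError(self._reason or "unknown")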
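
A self-contained variant of the token can be exercised on its own. The sketch below makes two assumptions that are not in the proposal: `ToolCancelledError` is given a concrete definition, and `cancel()` keeps the first reason instead of letting a later call overwrite it (in the version above, a second `cancel()` replaces the reason). Treat that second point as a design option rather than the intended behaviour.

```python
import threading
from typing import Optional


class ToolCancelledError(Exception):
    """Hypothetical definition; the proposal names this type but does not define it."""

    def __init__(self, reason: str):
        super().__init__(f"tool cancelled: {reason}")
        self.reason = reason


class CancellationToken:
    """Cooperative cancellation flag that remembers why it was cancelled."""

    def __init__(self) -> None:
        self._cancelled = threading.Event()
        self._reason: Optional[str] = None
        self._lock = threading.Lock()

    def cancel(self, reason: str = "interrupt") -> bool:
        """Cancel once; return True only for the call that actually cancelled."""
        with self._lock:
            if self._cancelled.is_set():
                return False          # first reason wins (assumption, see lead-in)
            self._reason = reason
            self._cancelled.set()
            return True

    @property
    def is_cancelled(self) -> bool:
        return self._cancelled.is_set()

    @property
    def reason(self) -> Optional[str]:
        return self._reason

    def raise_if_cancelled(self) -> None:
        if self._cancelled.is_set():
            raise ToolCancelledError(self._reason or "unknown")


if __name__ == "__main__":
    token = CancellationToken()
    token.raise_if_cancelled()        # no-op while the token is live
    token.cancel("steer")
    token.cancel("interrupt")         # ignored: the token is already cancelled
    try:
        token.raise_if_cancelled()
    except ToolCancelledError as exc:
        print(exc.reason)             # steer
```

Keeping the first reason avoids a situation where a worker reports "steer" for a cancellation that was actually triggered by an interrupt a moment earlier, at the cost of one lock acquisition per `cancel()` call.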

2. Thread-Local Injection

# In _run_tool():
_thread_local.token = token
try:
    result = agent._invoke_tool(...)
finally:
    _thread_local.token = None
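
The proposal refers to `_thread_local` and `get_token()` without showing their definitions. One way they could look, assuming a module-level `threading.local()`, is sketched below. The `bound_token` context manager is a hypothetical helper that wraps the try/finally shown above and restores whatever token was bound before, rather than always resetting to None.

```python
import threading
from contextlib import contextmanager

# Per-thread storage: each worker thread sees only the token bound in that thread.
_thread_local = threading.local()


def get_token():
    """Return the token bound to the calling thread, or None if there is none."""
    return getattr(_thread_local, "token", None)


@contextmanager
def bound_token(token):
    """Bind token to the current thread for the duration of the with-block."""
    previous = get_token()
    _thread_local.token = token
    try:
        yield token
    finally:
        _thread_local.token = previous   # restore, even if the tool raised


if __name__ == "__main__":
    marker = object()                    # stands in for a CancellationToken

    def worker():
        with bound_token(marker):
            assert get_token() is marker
        assert get_token() is None

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    print(get_token())                   # None: the main thread never bound a token
```

Since the executor call in the proposal already goes through `ctx.run`, a `contextvars.ContextVar` might be an alternative to `threading.local()`; that choice is left open here.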

3. Tool Integration Points

Tools can check cancellation at natural yield points:

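# In tools that do I/O loops, either poll the existing per-thread flag:
from tools.interrupt import is_interrupted
# ... or read the cancellation token bound to the current worker thread:
from agent.cancellation_token import get_token

def some_tool(streaming_response):
    token = get_token()  # may be None, hence the guard below
    for chunk in streaming_response:
        if token and token.is_cancelled:
            return f"[Cancelled: {token.reason}]"
        ...  # process chunk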
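
Not every yield point is a loop over chunks. For tools that sleep between retries (the rate-limit case noted for web_search), a blocking `time.sleep()` would delay cancellation by the full backoff. A possible pattern, sketched here with a bare `threading.Event` standing in for the token, is to wait on the event itself. `fetch_with_backoff` and its parameters are hypothetical, and exposing such a wait through the token would be an addition to the proposed API.

```python
import threading


def fetch_with_backoff(fetch, cancelled, retries=5, base_delay=0.5):
    """Call fetch() until it returns a value, backing off between attempts.

    fetch returns None when it should be retried (e.g. rate limited).
    cancelled is a threading.Event that is set when the tool should stop.
    """
    for attempt in range(retries):
        if cancelled.is_set():
            return "[Cancelled]"
        result = fetch()
        if result is not None:
            return result
        # Event.wait() returns True as soon as the event is set, so this
        # backoff ends immediately on cancellation instead of sleeping it out.
        if cancelled.wait(timeout=base_delay * 2 ** attempt):
            return "[Cancelled]"
    return "[Gave up]"


if __name__ == "__main__":
    stop = threading.Event()
    threading.Timer(0.1, stop.set).start()           # cancel after 100 ms
    print(fetch_with_backoff(lambda: None, stop, base_delay=5.0))
```

With a 5-second base delay the call would normally block for a long time, but the event wakes the wait as soon as it is set.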

4. Executor Integration

# In execute_tool_calls_concurrent():
token = CancellationToken()
for i, (tc, name, args, ...) in enumerate(parsed_calls):
    f = executor.submit(ctx.run, _run_tool, i, tc, name, args, token=token)

# In wait loop:
if agent._interrupt_requested:
    token.cancel(reason="interrupt")
if getattr(agent, "_pending_steer", None) is not None:
    token.cancel(reason="steer")
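
Put together, the flow could look like the runnable sketch below. It is a simplified model rather than the real execute_tool_calls_concurrent(): `Token` is a minimal stand-in for the proposed CancellationToken, `run_tool` simulates a tool with periodic yield points, and the interrupt request is modelled as a `threading.Event` instead of `agent._interrupt_requested`.

```python
import concurrent.futures
import threading
import time


class Token:
    """Minimal stand-in for the proposed CancellationToken."""

    def __init__(self):
        self._event = threading.Event()
        self.reason = None

    def cancel(self, reason):
        if not self._event.is_set():   # only the wait loop cancels in this sketch
            self.reason = reason
            self._event.set()

    @property
    def is_cancelled(self):
        return self._event.is_set()


def run_tool(index, token, steps=50):
    """Simulated tool: does about 1 s of work, checking the token at each step."""
    for step in range(steps):
        if token.is_cancelled:
            return f"tool {index}: [Cancelled: {token.reason}]"
        time.sleep(0.02)
    return f"tool {index}: finished"


def execute_concurrently(num_tools, interrupt_requested):
    token = Token()
    max_workers = min(32, num_tools + 4)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_tool, i, token) for i in range(num_tools)]
        not_done = set(futures)
        while not_done:
            _, not_done = concurrent.futures.wait(not_done, timeout=0.05)
            if interrupt_requested.is_set():
                token.cancel("interrupt")   # workers notice at their next check
    return [f.result() for f in futures]


if __name__ == "__main__":
    interrupt = threading.Event()
    threading.Timer(0.1, interrupt.set).start()
    for line in execute_concurrently(3, interrupt):
        print(line)
```

Unlike the `f.cancel()` path, the executor's `__exit__` returns promptly here because each worker returns on its own shortly after the token is cancelled.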

Impact Analysis

| Code Path | Current Behavior | After Cancellation Token |
| --- | --- | --- |
| Interrupt during concurrent terminal() | Terminal exits early via is_interrupted() | Same (already works) |
| Interrupt during concurrent web_search() | Runs to completion | Early return with partial results |
| Steer during concurrent browser_navigate | Runs to completion | Early return, steer delivered immediately |
| Steer during concurrent execute_code | Code exits early via is_interrupted() | Same (already works) |

Implementation Scope

  • New file: agent/cancellation_token.py (~60 lines)
  • Modified: tool_executor.py (token creation + propagation, ~30 lines)
  • Modified: 5-8 high-impact tools (web_search, web_extract, browser_*, image_gen, tts) — add token.raise_if_cancelled() at 2-3 yield points each (~5 lines per tool)
  • No changes: Short/fast tools (read_file, write_file, patch) — they complete in <100ms, no point cancelling
  • Tests: ~20 new tests (token lifecycle, cooperative cancellation, edge cases)
  • Zero breaking changes: Token is optional; tools that don't check it behave exactly as before

Related Issues

  • #28172 — Steer breakout for tool execution (PR #28808, partial fix)
  • The concurrent interrupt path has the same limitation — f.cancel() cannot stop running futures

Evidence

PR #28808 added steer breakout to the concurrent wait loop. Testing revealed that f.cancel() returns False for all 3 running futures because ThreadPoolExecutor starts them immediately. The steer was only drained post-completion, not mid-execution. This confirms that cooperative cancellation is the only reliable mechanism for concurrent tool interruption.
