hermes - 💡(How to fix) Fix [Bug]: ThreadPoolExecutor leaks _interrupted_threads state when _run_tool raises an unhandled exception

Error Message

Affected tools (such as browser_snapshot or send_message) abort immediately, in 0.00s, with:

{"success": false, "error": "Interrupted"}

The cleanup call _set_interrupt(False, _worker_tid) in _run_tool is not wrapped in a finally block. If _invoke_tool or any of the logging/result-assignment steps raises an unhandled exception (e.g. CancelledError or KeyboardInterrupt), the worker never clears its TID from the global _interrupted_threads set. When a later, completely unrelated tool is scheduled onto the same thread ID, is_interrupted() returns True at once and the tool fails before doing any work, which blocks the agent from using that tool.

Root Cause

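The cleanup call _set_interrupt(False, _worker_tid) sits at the end of _run_tool rather than in a finally block, and _run_tool only catches Exception. Anything that escapes as a BaseException skips the cleanup and leaves the worker's TID in the global _interrupted_threads set. Because concurrent.futures.ThreadPoolExecutor reuses its worker threads, and thread IDs can be recycled, the next unrelated tool scheduled onto that thread ID finds is_interrupted() already returning True. Tools like browser_snapshot or send_message then abort in 0.00s with {"success": false, "error": "Interrupted"}.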

Fix / Workaround

The leak can be reproduced, and a fix verified, with the following steps:

1. Start a Hermes agent session.
2. Modify any tool (e.g., send_message or a dummy tool) to artificially raise a BaseException during its execution, e.g. raise BaseException("Simulated thread crash"). This mimics a severe timeout, a KeyboardInterrupt propagating to the worker, or an unhandled system exit.
3. Prompt the agent to run multiple tools concurrently so that it uses execute_tool_calls_concurrent. The patched tool crashes the worker thread. Because _run_tool only catches Exception and lacks a finally block for cleanup, the thread exits abruptly and its thread ID is leaked into the global _interrupted_threads set.
4. In the same session, prompt the agent to use a tool that checks is_interrupted() (such as browser_snapshot, send_message, or web_search). Since ThreadPoolExecutor recycles thread IDs, the new tool execution will eventually be assigned to the same "poisoned" thread ID.

Observation: The tool instantly aborts in 0.00s with {"success": false, "error": "Interrupted"}, completely blocking the agent from using that tool.


---

Bug Description

In agent/tool_executor.py, the _run_tool worker function for execute_tool_calls_concurrent handles its own thread-local interrupt state cleanup (calling _set_interrupt(False, _worker_tid)) at the end of the function.

The Fix: Wrap the execution logic in _run_tool inside a try...finally block, moving the _tool_worker_threads.discard and _set_interrupt(False) calls into the finally block so they are guaranteed to run regardless of how the thread exits.

Steps to Reproduce

In real-world usage, this occurs when an asynchronous network timeout, a KeyboardInterrupt (SIGINT) from the terminal, or a gateway-level cancellation aggressively terminates a concurrent tool worker thread before it can naturally reach the cleanup lines at the bottom of

Expected Behavior

Even if a concurrent tool worker thread crashes, raises an unhandled exception (like BaseException), or is aggressively terminated, its per-thread interrupt state must be reliably cleaned up.

When the ThreadPoolExecutor later recycles that Thread ID for a completely new task, the new tool should start with a clean slate and execute normally. It should never inherit a stale, leftover "Interrupted" state from a previous task that causes it to instantly fail in 0.00s.

Actual Behavior

Because the thread cleanup code (_set_interrupt(False, _worker_tid)) is placed at the end of the _run_tool function without being wrapped in a finally block, it is bypassed entirely if the thread crashes via an unhandled exception (like a BaseException, strict timeout, or hard termination).

As a result, the crashed thread's ID remains stuck in the global _interrupted_threads set. Later, when the ThreadPoolExecutor reuses that same thread ID to execute a new, completely unrelated tool, the new tool checks the global set and sees a stale interrupt flag. This causes tools (like browser_snapshot or send_message) to abort immediately in 0.00s and fail with: {"success": false, "error": "Interrupted"}

Affected Component

Tools (terminal, file ops, web, code execution, etc.)

Messaging Platform (if gateway-related)

No response

Debug Report

⚠️  This will upload the following to a public paste service:
  • System info (OS, Python version, Hermes version, provider, which API keys
    are configured — NOT the actual keys)
  • Recent log lines (agent.log, errors.log, gateway.log — may contain
    conversation fragments and file paths)
  • Full agent.log and gateway.log (up to 512 KB each — likely contains
    conversation content, tool outputs, and file paths)

Pastes auto-delete after 6 hours.

Collecting debug report...
Uploading...

Debug report uploaded:
  Report       https://paste.rs/rcrsx
  agent.log    https://paste.rs/kSPOw
  gateway.log  https://paste.rs/fwtnL

⏱  Pastes will auto-delete in 6 hours.
To delete now:  hermes debug delete <url>

Operating System

Windows

Python Version

3.11

Hermes Version

0.15

Additional Logs / Traceback (optional)

Root Cause Analysis (optional)

No response

Proposed Fix (optional)

No response

Are you willing to submit a PR for this?

I'd like to fix this myself and submit a PR
