hermes - ✅(Solved) Fix [Delegation] Subagent `run_conversation()` has no timeout — can block indefinitely on slow API/network [2 pull requests, 2 comments, 2 participants]

iamagenius00 · 2026-04-22T02:04:54Z

NousResearch/hermes-agent#13768•Fetched 2026-04-22 08:04:15

Author

iamagenius00

Participants

alt-glitch

iamagenius00

_run_single_child() in delegate_tool.py calls child.run_conversation() (line 507) with no timeout protection. If a subagent's LLM API call or tool-level HTTP request hangs, the entire delegation blocks indefinitely. The heartbeat mechanism (lines 468–498) keeps reporting the parent as active, preventing the gateway inactivity watchdog from intervening.

Root Cause

1. No hard timeout on run_conversation()

# delegate_tool.py line 507
result = child.run_conversation(user_message=goal)

This is a synchronous call with no timeout wrapper. If the child agent's API provider is slow or a tool-level HTTP request hangs (e.g., browser/web tools hitting unresponsive endpoints), this line never returns.

2. Heartbeat masks the hang

The heartbeat thread (lines 468–498) touches parent._last_activity_ts every 30 seconds. It reads the child's api_call_count and current_tool for the log description, but does not use this data to detect stalls. Whether or not the child is making progress, the heartbeat keeps telling the gateway everything is fine.

3. Both single-task and batch modes are equally vulnerable

Single-task mode (line 800) calls _run_single_child() directly with no interrupt check. Batch mode (lines 828–868) has an interrupt check loop with wait(timeout=0.5), but this is a false safety net: even after the loop breaks and prepares status: "interrupted" results, ThreadPoolExecutor.__exit__ calls shutdown(wait=True), which blocks until all running threads finish. If a child is stuck on a blocking I/O call (e.g., socket read), the interrupt flag is set but the child never gets a chance to check it — so shutdown never returns, and the prepared results are never delivered.
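
The shutdown behavior described above can be reproduced in isolation. The following is a minimal sketch, not code from hermes-agent: `stuck_call` is a hypothetical stand-in for a child blocked on I/O, and the durations are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def stuck_call(seconds: float) -> str:
    """Hypothetical stand-in for a child blocked on I/O."""
    time.sleep(seconds)
    return "done"


def wait_with_context_manager(block_for: float, timeout: float) -> float:
    """Return elapsed seconds when the executor is used as a context manager."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(stuck_call, block_for)
        try:
            future.result(timeout=timeout)
        except TimeoutError:
            pass  # the timeout fired, but __exit__ still waits for the worker
    return time.monotonic() - start


def wait_without_blocking_shutdown(block_for: float, timeout: float) -> float:
    """Return elapsed seconds when shutdown(wait=False) is called explicitly."""
    start = time.monotonic()
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(stuck_call, block_for)
    try:
        future.result(timeout=timeout)
    except TimeoutError:
        pass
    finally:
        executor.shutdown(wait=False)  # do not join the stuck worker
    return time.monotonic() - start
```

With `block_for=1.0` and `timeout=0.1`, the first function should return only after the worker finishes, while the second returns shortly after the timeout. Note that the worker thread keeps running in the background in both cases, and the interpreter still waits for it at exit.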

PR fix notes

PR #13770: fix(delegation): add hard timeout and stale detection for subagent execution

Repository: NousResearch/hermes-agent
Author: iamagenius00
State: closed | merged: True
Link: https://github.com/NousResearch/hermes-agent/pull/13770

Description (problem / solution / changelog)

Summary

Add a configurable hard timeout and heartbeat stale detection for subagent execution in delegate_tool.py, preventing indefinite blocking when a child agent's API call or tool-level HTTP request hangs.

Closes #13768

Problem

_run_single_child() calls child.run_conversation() with no timeout protection. If the child's LLM API or a tool HTTP request hangs, the entire delegation blocks indefinitely. The heartbeat mechanism masks the issue by continuously reporting the parent as active, preventing the gateway inactivity watchdog from intervening.

Both single-task and batch modes are equally vulnerable — batch mode's interrupt check loop prepares interrupted results but ThreadPoolExecutor.__exit__ calls shutdown(wait=True), which blocks until stuck threads finish.

Changes

1. Hard timeout on run_conversation(): Wraps the call in a ThreadPoolExecutor with future.result(timeout=child_timeout). Default 300s (5 min), configurable via delegation.child_timeout_seconds in config.yaml or the DELEGATION_CHILD_TIMEOUT_SECONDS env var. Minimum 30s.
2. Heartbeat stale detection: Tracks whether the child's api_call_count advances between heartbeat cycles. After 5 consecutive cycles (~2.5 min) with no progress, stops touching the parent's activity timestamp so the gateway timeout can fire as a last resort.
3. timeout exit_reason: New status/exit_reason value alongside completed, max_iterations, and interrupted, giving the parent agent a clear signal that the result is unreliable.
4. shutdown(wait=False): Avoids the ThreadPoolExecutor.__exit__ deadlock when a child thread is stuck on blocking I/O.
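
The stale detection in the list above could look roughly like this. This is a sketch under assumptions, not the merged code: the class name `StaleDetector` and its method are hypothetical, and only the 5-cycle threshold and the use of `api_call_count` come from the PR description.

```python
from dataclasses import dataclass


@dataclass
class StaleDetector:
    """Counts consecutive heartbeat cycles in which the child made no progress."""

    stale_after_cycles: int = 5  # threshold stated in the PR description
    _last_count: int = -1        # sentinel: no observation yet
    _idle_cycles: int = 0

    def should_touch_parent(self, api_call_count: int) -> bool:
        """Return True while the heartbeat should keep refreshing the parent."""
        if api_call_count != self._last_count:
            # The child advanced since the last cycle, so reset the counter.
            self._last_count = api_call_count
            self._idle_cycles = 0
        else:
            self._idle_cycles += 1
        return self._idle_cycles < self.stale_after_cycles
```

A heartbeat loop would call `should_touch_parent(child.api_call_count)` once per cycle and skip the activity-timestamp update once it returns False, which lets the gateway inactivity timeout take over.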

Config

delegation:
  child_timeout_seconds: 300  # default, minimum 30
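
One way to resolve this setting is sketched below. The helper name `resolve_child_timeout` is hypothetical, and the precedence (environment variable over config.yaml) is an assumption, since the PR description does not state which source wins. The default of 300 and the minimum of 30 are taken from the PR.

```python
import os
from typing import Mapping, Optional

DEFAULT_CHILD_TIMEOUT = 300.0  # seconds, default stated in the PR
MIN_CHILD_TIMEOUT = 30.0       # seconds, minimum stated in the PR


def resolve_child_timeout(
    config: Mapping[str, object],
    environ: Optional[Mapping[str, str]] = None,
) -> float:
    """Hypothetical helper: pick the child timeout and clamp it to the minimum.

    Assumption: the environment variable wins over config.yaml.
    """
    environ = os.environ if environ is None else environ
    delegation = config.get("delegation") or {}
    raw = environ.get("DELEGATION_CHILD_TIMEOUT_SECONDS")
    if raw is None and isinstance(delegation, dict):
        raw = delegation.get("child_timeout_seconds")
    try:
        value = float(raw) if raw is not None else DEFAULT_CHILD_TIMEOUT
    except (TypeError, ValueError):
        value = DEFAULT_CHILD_TIMEOUT  # unparsable input falls back to the default
    return max(value, MIN_CHILD_TIMEOUT)
```

Passing `environ` explicitly keeps the helper easy to test without mutating the process environment.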

Known Limitation

After a timeout, the main thread calls child.close() in the finally block while the child thread may still be running run_conversation() on the same object. This is a race condition — but the practical impact is minimal: the stuck child was going to be cleaned up anyway, and any exceptions from the concurrent close are caught by the existing except handler. Worst case is a few extra debug-level log lines.

Testing

py_compile passes
Follows existing patterns for config reading (_get_max_concurrent_children, _get_max_spawn_depth)
Timeout path returns the same result dict structure as existing error/interrupted paths

Changed files

scripts/release.py (modified, +1/-0)
tools/delegate_tool.py (modified, +106/-2)

PR #13797: fix(delegate): add timeout to subagent run_conversation() (Closes #13768)

Repository: NousResearch/hermes-agent
Author: ms-alan
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/13797

Description (problem / solution / changelog)

Summary

child.run_conversation() in _run_single_child was a raw blocking call — if the subagent got stuck on a slow API or network issue, the parent agent would hang indefinitely.

Fix: Run run_conversation in a dedicated thread with join(timeout=300). On timeout, call child.interrupt() for graceful shutdown, then set a synthetic timeout result.
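
The thread-and-join approach described here can be sketched as follows. This is an illustration, not the code from PR #13797: the helper name, the daemon-thread choice, and the shape of the error result are assumptions.

```python
import threading
from typing import Any, Callable, Dict


def run_with_join_timeout(
    run: Callable[[], Dict[str, Any]],
    interrupt: Callable[[], None],
    timeout: float = 300.0,
) -> Dict[str, Any]:
    """Hypothetical helper: run `run` in a thread and give up after `timeout`."""
    outcome: Dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["result"] = run()
        except Exception as exc:  # surface child failures to the caller
            outcome["error"] = exc

    # Assumption: a daemon thread, so a hung child cannot block process exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout=timeout)

    if thread.is_alive():
        interrupt()  # ask the child to stop; it may not notice while blocked
        return {"status": "timeout", "exit_reason": "timeout"}
    if "error" in outcome:
        return {"status": "error", "error": str(outcome["error"])}
    return outcome["result"]
```

A call site would pass something like `lambda: child.run_conversation(user_message=goal)` as `run` and `child.interrupt` as `interrupt`.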

Changes

tools/delegate_tool.py: Wrap child.run_conversation() in a thread, enforce 300s timeout, interrupt + return error result on timeout

Testing

Simulate a stuck subagent (e.g. mock API delay > timeout) → parent returns error instead of hanging
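
A test along these lines might look like the sketch below. `StuckChild` is a hypothetical test double and `run_child_with_timeout` is a stand-in for the patched call site, not a function from hermes-agent.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


class StuckChild:
    """Hypothetical test double: run_conversation blocks until interrupted."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        self.interrupted = False

    def run_conversation(self, user_message: str) -> dict:
        self._stop.wait()  # simulates an API call that never returns
        return {"status": "interrupted"}

    def interrupt(self) -> None:
        self.interrupted = True
        self._stop.set()  # releases the blocked worker so the test exits cleanly


def run_child_with_timeout(child, goal: str, timeout: float) -> dict:
    """Stand-in for the patched call site, not the hermes-agent function."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(child.run_conversation, user_message=goal)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        child.interrupt()
        return {"status": "timeout", "exit_reason": "timeout"}
    finally:
        executor.shutdown(wait=False)


def test_stuck_child_returns_timeout() -> None:
    child = StuckChild()
    result = run_child_with_timeout(child, "research task", timeout=0.2)
    assert result["exit_reason"] == "timeout"
    assert child.interrupted
```

Using an event instead of a long sleep means the worker thread ends as soon as `interrupt()` is called, so the test does not leave a thread hanging.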

Closes #13768

Changed files

cli.py (modified, +2/-1)
tools/delegate_tool.py (modified, +38/-1)

Observed Impact

As a daily user of delegate_task for research (web search, GitHub exploration, arxiv), I encounter this roughly once every three delegation calls:

Indefinite hang: Subagent dispatched for web research gets stuck on an unresponsive endpoint. Parent waits indefinitely. User sees no error, no progress, no timeout — just silence.
Cascading unreliability: When multiple subagents are dispatched in batch and one hangs, the parent must wait for all to finish before returning any results. (The batch loop does handle interrupt, but only if one is explicitly sent.)
Silent failure: Because the heartbeat keeps the parent "alive," the gateway's 30-minute inactivity timeout — the last line of defense — never fires.

Suggested Fix

Add a configurable hard timeout to _run_single_child():

# Pseudocode: wrap run_conversation in a future with a timeout
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def _run_single_child(...):
    ...
    # Avoid `with ThreadPoolExecutor(...)`: its __exit__ calls
    # shutdown(wait=True), which would block on the stuck worker.
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(child.run_conversation, user_message=goal)
    try:
        result = future.result(timeout=child_timeout_seconds)
    except TimeoutError:
        child.interrupt()  # signal child to stop
        return {"status": "timeout", "exit_reason": "timeout", ...}
    finally:
        executor.shutdown(wait=False)  # never join a hung thread

Key considerations:

delegation.child_timeout_seconds config option (default: 300s seems reasonable)
Heartbeat stale detection (optional enhancement): if api_call_count hasn't advanced in N heartbeat cycles, mark the child as stale and stop masking the hang
timeout as a new exit_reason value alongside the existing completed / max_iterations / interrupted
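
The exit_reason values listed above could be modeled as an enum so the parent can branch on them. This is a sketch: the four string values come from the issue, while `ExitReason`, `is_result_reliable`, and the rule that only `completed` counts as reliable are assumptions made for illustration.

```python
from enum import Enum


class ExitReason(str, Enum):
    COMPLETED = "completed"
    MAX_ITERATIONS = "max_iterations"
    INTERRUPTED = "interrupted"
    TIMEOUT = "timeout"  # the new value proposed in this issue


def is_result_reliable(result: dict) -> bool:
    """Hypothetical helper: treat only a completed child as trustworthy."""
    try:
        reason = ExitReason(result.get("exit_reason"))
    except ValueError:
        return False  # unknown or missing exit_reason
    return reason is ExitReason.COMPLETED
```

A parent agent could use a check like this to decide whether to retry a delegated task or report the failure to the user.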

Environment

Hermes Agent (gateway mode, Telegram)
Model: claude-opus-4-6 via Anthropic
delegation config: default (max_concurrent_children=3, max_iterations=50)

extent analysis

TL;DR

Add a configurable hard timeout to _run_single_child() to prevent indefinite hangs when a subagent's API call or tool-level HTTP request hangs.

Guidance

Implement a timeout wrapper around child.run_conversation() using concurrent.futures to prevent blocking indefinitely.
Introduce a delegation.child_timeout_seconds config option to control the timeout duration.
Consider enhancing the heartbeat mechanism to detect stale children by monitoring api_call_count advancements.
Update the exit_reason values to include timeout as a new option.

Notes

The suggested fix focuses on introducing a timeout mechanism to prevent indefinite hangs. However, the heartbeat stale detection enhancement is optional and may require additional development.

Recommendation

Apply the workaround by adding a configurable hard timeout to _run_single_child() to prevent indefinite hangs and improve the overall reliability of the delegation mechanism. This change will allow the system to detect and recover from stuck subagents, reducing the likelihood of cascading unreliability and silent failures.
