hermes - ✅(Solved) Fix [Delegation] Subagent `run_conversation()` has no timeout — can block indefinitely on slow API/network [2 pull requests, 2 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#13768 (fetched 2026-04-22 08:04:15)
Comments: 2
Participants: 2
Timeline events: 11
Reactions: 0
Timeline (top): labeled ×4, commented ×2, cross-referenced ×2, referenced ×2

_run_single_child() in delegate_tool.py calls child.run_conversation() (line 507) with no timeout protection. If a subagent's LLM API call or tool-level HTTP request hangs, the entire delegation blocks indefinitely. The heartbeat mechanism (lines 468–498) keeps reporting the parent as active, preventing the gateway inactivity watchdog from intervening.


Root Cause

1. No hard timeout on run_conversation()

# delegate_tool.py line 507
result = child.run_conversation(user_message=goal)

This is a synchronous call with no timeout wrapper. If the child agent's API provider is slow or a tool-level HTTP request hangs (e.g., browser/web tools hitting unresponsive endpoints), this line never returns.

2. Heartbeat masks the hang

The heartbeat thread (lines 468–498) touches parent._last_activity_ts every 30 seconds. It reads the child's api_call_count and current_tool for the log description, but does not use this data to detect stalls. Whether or not the child is making progress, the heartbeat keeps telling the gateway everything is fine.

3. Both single-task and batch modes are equally vulnerable

Single-task mode (line 800) calls _run_single_child() directly with no interrupt check. Batch mode (lines 828–868) has an interrupt check loop with wait(timeout=0.5), but this is a false safety net: even after the loop breaks and prepares status: "interrupted" results, ThreadPoolExecutor.__exit__ calls shutdown(wait=True), which blocks until all running threads finish. If a child is stuck on a blocking I/O call (e.g., socket read), the interrupt flag is set but the child never gets a chance to check it — so shutdown never returns, and the prepared results are never delivered.
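The batch-mode deadlock described above can be reproduced in isolation. This is a minimal sketch, not code from delegate_tool.py: `measure_exit_delay` and `stuck_child` are hypothetical names, and `time.sleep` stands in for a child blocked on a socket read.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def measure_exit_delay(block_seconds: float) -> tuple[bool, float]:
    """Return (finished_within_wait, seconds spent inside the with-block)."""

    def stuck_child() -> str:
        # Stand-in for a child blocked on I/O; it never checks an interrupt flag.
        time.sleep(block_seconds)
        return "done"

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(stuck_child)
        # Short poll, similar in spirit to the batch loop's wait(timeout=0.5).
        done, _pending = wait([future], timeout=0.1)
        finished_in_time = future in done
        # Leaving the with-block calls shutdown(wait=True), which joins the
        # worker thread and therefore waits for stuck_child() to return.
    elapsed = time.monotonic() - start
    return finished_in_time, elapsed
```

Calling `measure_exit_delay(2.0)` should report that the future did not finish within the poll, yet the call itself takes about as long as the child stays blocked. With a child that never returns, the with-block would never exit.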


PR fix notes

PR #13770: fix(delegation): add hard timeout and stale detection for subagent execution

Description (problem / solution / changelog)

Summary

Add a configurable hard timeout and heartbeat stale detection for subagent execution in delegate_tool.py, preventing indefinite blocking when a child agent's API call or tool-level HTTP request hangs.

Closes #13768

Problem

_run_single_child() calls child.run_conversation() with no timeout protection. If the child's LLM API or a tool HTTP request hangs, the entire delegation blocks indefinitely. The heartbeat mechanism masks the issue by continuously reporting the parent as active, preventing the gateway inactivity watchdog from intervening.

Both single-task and batch modes are equally vulnerable — batch mode's interrupt check loop prepares interrupted results but ThreadPoolExecutor.__exit__ calls shutdown(wait=True), which blocks until stuck threads finish.

Changes

  1. Hard timeout on run_conversation(): Wraps the call in a ThreadPoolExecutor with future.result(timeout=child_timeout). Default 300s (5 min), configurable via delegation.child_timeout_seconds in config.yaml or DELEGATION_CHILD_TIMEOUT_SECONDS env var. Minimum 30s.

  2. Heartbeat stale detection: Tracks whether the child's api_call_count advances between heartbeat cycles. After 5 consecutive cycles (~2.5 min) with no progress, stops touching the parent's activity timestamp so the gateway timeout can fire as a last resort.

  3. timeout exit_reason: New status/exit_reason value alongside completed, max_iterations, and interrupted, giving the parent agent a clear signal that the result is unreliable.

  4. shutdown(wait=False): Avoids the ThreadPoolExecutor.__exit__ deadlock when a child thread is stuck on blocking I/O.
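The stale detection in change 2 can be sketched as a small counter that the heartbeat consults on every cycle. This is an illustration under assumptions, not the PR's actual code: `StaleDetector` and `should_touch_parent` are hypothetical names, and only the 5-cycle threshold and the use of `api_call_count` come from the description above.

```python
from dataclasses import dataclass


@dataclass
class StaleDetector:
    """Tracks whether a child's api_call_count advances between heartbeats."""

    stale_after_cycles: int = 5  # threshold named in the PR description
    last_count: int = -1         # sentinel: no heartbeat observed yet
    idle_cycles: int = 0

    def should_touch_parent(self, api_call_count: int) -> bool:
        """Call once per heartbeat cycle.

        Returns False once the count has not advanced for
        stale_after_cycles consecutive cycles, meaning the heartbeat
        should stop refreshing the parent's activity timestamp.
        """
        if api_call_count != self.last_count:
            # Progress observed: remember the new count and reset the streak.
            self.last_count = api_call_count
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
        return self.idle_cycles < self.stale_after_cycles
```

With a 30-second heartbeat, five idle cycles correspond to the roughly 2.5 minutes mentioned above. Any later progress resets the streak, so a slow but working child is not treated as stale.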

Config

delegation:
  child_timeout_seconds: 300  # default, minimum 30
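One way the timeout could be resolved from these settings is sketched below. The PR text does not state which source wins or how bad values are handled, so several details are assumptions: the env var takes precedence over config.yaml, unparsable values fall back to the default, and values below the minimum are clamped to 30. `resolve_child_timeout` is a hypothetical helper name.

```python
import os

DEFAULT_CHILD_TIMEOUT = 300  # seconds, default per the PR description
MIN_CHILD_TIMEOUT = 30       # seconds, minimum per the PR description


def resolve_child_timeout(config: dict, env=None) -> int:
    """Pick the child timeout from env or config, then enforce the minimum."""
    env = os.environ if env is None else env
    # Assumption: the environment variable overrides config.yaml.
    raw = env.get("DELEGATION_CHILD_TIMEOUT_SECONDS")
    if raw is None:
        raw = (config.get("delegation") or {}).get("child_timeout_seconds")
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Missing or unparsable setting: fall back to the default.
        value = DEFAULT_CHILD_TIMEOUT
    # Assumption: too-small values are clamped rather than rejected.
    return max(MIN_CHILD_TIMEOUT, value)
```

For example, `resolve_child_timeout({"delegation": {"child_timeout_seconds": 10}}, env={})` yields 30 under these assumptions, because the configured value is below the minimum.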

Known Limitation

After a timeout, the main thread calls child.close() in the finally block while the child thread may still be running run_conversation() on the same object. This is a race condition — but the practical impact is minimal: the stuck child was going to be cleaned up anyway, and any exceptions from the concurrent close are caught by the existing except handler. Worst case is a few extra debug-level log lines.

Testing

  • py_compile passes
  • Follows existing patterns for config reading (_get_max_concurrent_children, _get_max_spawn_depth)
  • Timeout path returns the same result dict structure as existing error/interrupted paths

Changed files

  • scripts/release.py (modified, +1/-0)
  • tools/delegate_tool.py (modified, +106/-2)

PR #13797: fix(delegate): add timeout to subagent run_conversation() (Closes #13768)

Description (problem / solution / changelog)

Summary

child.run_conversation() in _run_single_child was a raw blocking call — if the subagent got stuck on a slow API or network issue, the parent agent would hang indefinitely.

Fix: Run run_conversation in a dedicated thread with join(timeout=300). On timeout, call child.interrupt() for graceful shutdown, then set a synthetic timeout result.

Changes

  • tools/delegate_tool.py: Wrap child.run_conversation() in a thread, enforce 300s timeout, interrupt + return error result on timeout

Testing

Simulate a stuck subagent (e.g. mock API delay > timeout) → parent returns error instead of hanging

Closes #13768

Changed files

  • cli.py (modified, +2/-1)
  • tools/delegate_tool.py (modified, +38/-1)


Observed Impact

As a daily user of delegate_task for research (web search, GitHub exploration, arxiv), I encounter this roughly once every three delegation calls:

  • Indefinite hang: Subagent dispatched for web research gets stuck on an unresponsive endpoint. Parent waits indefinitely. User sees no error, no progress, no timeout — just silence.
  • Cascading unreliability: When multiple subagents are dispatched in batch and one hangs, the parent must wait for all to finish before returning any results. (The batch loop does handle interrupt, but only if one is explicitly sent.)
  • Silent failure: Because the heartbeat keeps the parent "alive," the gateway's 30-minute inactivity timeout — the last line of defense — never fires.

Suggested Fix

Add a configurable hard timeout to _run_single_child():

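# Sketch: wrap run_conversation in a future with a hard timeout
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

# Simplified signature for illustration; the real one is elided in the issue.
def _run_single_child(child, goal, child_timeout_seconds=300):
    # No `with` block here: ThreadPoolExecutor.__exit__ calls
    # shutdown(wait=True), which would block on the stuck worker.
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(child.run_conversation, user_message=goal)
    try:
        return future.result(timeout=child_timeout_seconds)
    except FutureTimeoutError:
        child.interrupt()  # signal child to stop
        # Remaining result fields are elided in the issue.
        return {"status": "timeout", "exit_reason": "timeout"}
    finally:
        executor.shutdown(wait=False)  # do not wait on a hung child thread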

Key considerations:

  • delegation.child_timeout_seconds config option (default: 300s seems reasonable)
  • Heartbeat stale detection (optional enhancement): if api_call_count hasn't advanced in N heartbeat cycles, mark the child as stale and stop masking the hang
  • timeout as a new exit_reason value alongside the existing completed / max_iterations / interrupted

Environment

  • Hermes Agent (gateway mode, Telegram)
  • Model: claude-opus-4-6 via Anthropic
  • delegation config: default (max_concurrent_children=3, max_iterations=50)

extent analysis

TL;DR

Add a configurable hard timeout to _run_single_child() to prevent indefinite hangs when a subagent's API call or tool-level HTTP request hangs.

Guidance

  • Implement a timeout wrapper around child.run_conversation() using concurrent.futures to prevent blocking indefinitely.
  • Introduce a delegation.child_timeout_seconds config option to control the timeout duration.
  • Consider enhancing the heartbeat mechanism to detect stale children by monitoring api_call_count advancements.
  • Update the exit_reason values to include timeout as a new option.


Notes

The suggested fix focuses on introducing a timeout mechanism to prevent indefinite hangs. However, the heartbeat stale detection enhancement is optional and may require additional development.

Recommendation

Apply the workaround by adding a configurable hard timeout to _run_single_child() to prevent indefinite hangs and improve the overall reliability of the delegation mechanism. This change will allow the system to detect and recover from stuck subagents, reducing the likelihood of cascading unreliability and silent failures.
