hermes - ✅(Solved) Fix [Bug]: N-API-call subagent timeout lacks tool_trace diagnostics — cannot identify last stuck tool [2 pull requests, 1 participant]

GitHub stats
NousResearch/hermes-agent#17308 (fetched 2026-04-30 06:48:32)
Comments: 0 · Participants: 1 · Timeline: 6 · Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×2

Error Message

When a subagent running via delegate_task times out after making N>0 API calls, the lead agent receives only a vague error message with no diagnostic information about which tool was last executing. This makes it impossible to distinguish a tool that completed before the next LLM request hung from a tool call that itself never returned.

Fix Action

Fixed

PR fix notes

PR #17329: fix(delegate): surface tool_trace on N-API-call subagent timeouts (#17308)

Description (problem / solution / changelog)

Closes #17308.

Problem

When a subagent under delegate_task times out after making >0 API calls, the lead agent gets a vague string and nothing else:

Subagent timed out after 120s with 3 API call(s) completed — likely stuck on a slow API call or unresponsive network request.

There's no way to tell apart the two failure modes:

  1. Tool finished, next LLM request hung — the tool itself is fine; the provider froze.
  2. Tool itself hung — network partition, blocked I/O, etc.

This was the gap between the two existing diagnostic paths:

Path | Coverage
Normal completion (#1175) | tool_trace in return dict
0-API-call timeout (#15105) | diagnostic_path with structured log
N-API-call timeout | None ← this PR

Fix

Three pieces:

1. Extract a shared trace builder

The normal-completion branch already reconstructs tool_trace from result['messages']. Pulled that loop out into a module-level _build_tool_trace_from_messages() helper so both branches use one implementation.
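
A minimal sketch of what such a shared helper might look like. It assumes the messages follow the OpenAI-style chat format (assistant entries carrying a tool_calls list, tool entries carrying a tool_call_id). The name build_tool_trace_sketch, the trace entry keys (tool, status), and the "starts with error" heuristic are illustrative assumptions, not the code merged in the PR.

```python
from typing import Any


def build_tool_trace_sketch(messages: Any) -> list:
    """Pair assistant tool calls with tool-role responses by tool_call_id."""
    if not isinstance(messages, list):
        return []  # tolerate a missing or malformed message list

    trace = []    # one entry per tool call, in call order
    pending = {}  # tool_call_id -> entry still waiting for a response
    for msg in messages:
        if not isinstance(msg, dict):
            continue  # skip non-dict entries
        if msg.get("role") == "assistant":
            for call in msg.get("tool_calls") or []:
                if not isinstance(call, dict):
                    continue
                function = call.get("function") or {}
                # Every call starts as in_progress until a response is seen.
                entry = {"tool": function.get("name"), "status": "in_progress"}
                trace.append(entry)
                if call.get("id"):
                    pending[call["id"]] = entry
        elif msg.get("role") == "tool":
            # Responses without a known tool_call_id are ignored.
            entry = pending.pop(msg.get("tool_call_id"), None)
            if entry is not None:
                # Assumed heuristic: content starting with "error" means failure.
                failed = str(msg.get("content", "")).lower().startswith("error")
                entry["status"] = "error" if failed else "ok"
    return trace
```

Pairing by id rather than by position means out-of-order replies to parallel tool calls still land on the right entry, and any call left as in_progress at the end never received a response.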

2. Reconstruct trace on the N-API-call timeout branch

In _run_single_child's timeout branch (when is_timeout and child_api_calls > 0):

  • Read child._session_messages and run it through the helper.
  • If the trace tail has no matching tool-role response → mark status='in_progress' (the tool itself is hung).
  • Read get_activity_summary().current_tool. If it disagrees with the trace tail, prefer it — the tool-role write can lag because the agent writes the assistant message first and the tool response only after the tool returns.
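
The reconciliation between the trace tail and current_tool described in the bullets above can be sketched as follows. This assumes trace entries are dicts with tool and status keys; resolve_last_tool is a hypothetical name, and the real branch in _run_single_child may be structured differently.

```python
from typing import Optional


def resolve_last_tool(trace: list, current_tool: Optional[str]) -> tuple:
    """Return (last_tool, last_tool_status) for a timed-out child."""
    tail = trace[-1] if trace else None
    tail_tool = tail.get("tool") if tail else None
    tail_status = tail.get("status") if tail else None

    # The activity summary wins on disagreement: the tool-role message is
    # written only after the tool returns, so the trace can lag behind.
    if current_tool and current_tool != tail_tool:
        return current_tool, "in_progress"
    return tail_tool, tail_status
```

With an empty trace and no current_tool, both values come back as None, which matches the intent of degrading gracefully when no session messages are available.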

3. Surface the diagnostics

Return dict now carries tool_trace, last_tool, last_tool_status, current_tool. Error message gets a last_tool=X (status=Y) suffix so it shows up in logs and the lead's prompt:

Subagent timed out after 120s with 3 API call(s) completed — likely stuck on a slow API call or unresponsive network request. last_tool=terminal (status=in_progress)

0-API-call timeouts (diagnostic_path branch) and non-timeout errors leave the new fields empty/None so consumers don't read stale data.
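
As a rough illustration of how the suffix and the extra fields could be attached, the sketch below takes the existing message as input and appends the suffix only when a last tool is known. The function names and the "error" key are assumptions; the field names tool_trace, last_tool, last_tool_status, and current_tool come from the PR description.

```python
from typing import Optional


def append_tool_suffix(message: str, last_tool: Optional[str], status: Optional[str]) -> str:
    """Append 'last_tool=X (status=Y)' when a last tool is known."""
    if not last_tool:
        return message  # 0-API-call and non-timeout paths stay unchanged
    return f"{message} last_tool={last_tool} (status={status})"


def timeout_result(message: str, trace: list, last_tool: Optional[str],
                   status: Optional[str], current_tool: Optional[str]) -> dict:
    """Build the diagnostic part of the timeout return dict."""
    return {
        "error": append_tool_suffix(message, last_tool, status),  # assumed key
        "tool_trace": trace,
        "last_tool": last_tool,
        "last_tool_status": status,
        "current_tool": current_tool,
    }
```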

Tests

Added two test classes in tests/tools/test_delegate_subagent_timeout_diagnostic.py:

TestRunSingleChildTimeoutToolTrace — end-to-end through _run_single_child with a tiny timeout:

  • test_timeout_after_completed_tool_marks_status_ok — tool returned cleanly → status=ok, current_tool=None
  • test_timeout_inside_running_tool_marks_status_in_progress — tool never returned → status=in_progress, current_tool set
  • test_timeout_with_tool_error_preserves_error_status — error responses keep status=error
  • test_timeout_with_parallel_tool_calls_pairs_by_id — out-of-order replies still pair correctly
  • test_zero_api_call_timeout_skips_tool_trace — 0-API branch keeps the new fields empty (no stale data alongside diagnostic_path)
  • test_timeout_with_no_session_messages_attr_does_not_crash — degrades to empty trace if _session_messages is absent

TestBuildToolTraceFromMessages — direct unit tests for the extracted helper (non-list input, non-dict entries, assistants without tool_calls, tool responses without tool_call_id).

$ python -m pytest tests/tools/test_delegate_subagent_timeout_diagnostic.py -q
.................                                                       [100%]
17 passed in 3.88s

Combined with the existing test_delegate.py suite: 137/137 pass.

Changed files

  • tests/tools/test_delegate_subagent_timeout_diagnostic.py (modified, +254/-0)
  • tools/delegate_tool.py (modified, +113/-32)

PR #17340: fix(auxiliary): preserve raw_base_url for Anthropic SDK wrapping in resolve_provider_client

Description (problem / solution / changelog)

Problem

After #17308 (f3371c39) threaded main_runtime through to auto_title_session, title generation started returning HTTP 404 for users on minimax-cn (and any other provider whose inference_base_url ends with /anthropic).

Root cause chain:

  1. resolve_api_key_provider_credentials("minimax-cn") returns base_url = "https://api.minimaxi.com/anthropic"
  2. _to_openai_base_url() converts it to "https://api.minimaxi.com/v1" (correct for the OpenAI SDK client)
  3. _wrap_if_needed is called with this already-converted /v1 URL
  4. Because main_runtime now carries api_mode="anthropic_messages", _maybe_wrap_anthropic decides to wrap → calls build_anthropic_client(key, "https://api.minimaxi.com/v1")
  5. The Anthropic SDK appends /v1/messages → actual request hits https://api.minimaxi.com/v1/messages and returns HTTP 404

The correct endpoint is https://api.minimaxi.com/anthropic/v1/messages.

Fix

Preserve raw_base_url (pre-_to_openai_base_url conversion) and pass it to _wrap_if_needed, so _maybe_wrap_anthropic receives the original /anthropic-suffixed URL for both:

  • endpoint detection (_endpoint_speaks_anthropic_messages)
  • Anthropic SDK client construction (build_anthropic_client)

The OpenAI client continues to use the /v1-converted base_url.
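
The URL arithmetic behind the bug can be modelled with two small stand-in functions. Both are simplified assumptions: to_openai_base_url only mimics the /anthropic to /v1 rewrite described above, and anthropic_request_url assumes a trailing /v1 is normalised away before /v1/messages is appended, which is what the URLs quoted in the root-cause chain imply. Neither is the real _to_openai_base_url or the Anthropic SDK.

```python
def to_openai_base_url(base_url: str) -> str:
    """Stand-in: swap a trailing /anthropic for /v1."""
    trimmed = base_url.rstrip("/")
    suffix = "/anthropic"
    if trimmed.endswith(suffix):
        return trimmed[: -len(suffix)] + "/v1"
    return trimmed


def anthropic_request_url(base_url: str) -> str:
    """Stand-in: final messages URL for a given Anthropic base URL."""
    trimmed = base_url.rstrip("/")
    if trimmed.endswith("/v1"):
        trimmed = trimmed[: -len("/v1")]  # assumed normalisation
    return trimmed + "/v1/messages"


raw_base_url = "https://api.minimaxi.com/anthropic"
openai_base_url = to_openai_base_url(raw_base_url)

# Before the fix: the converted URL reaches the Anthropic client.
buggy = anthropic_request_url(openai_base_url)
# After the fix: the raw URL is preserved for the Anthropic client.
fixed = anthropic_request_url(raw_base_url)
```

Under these assumptions, buggy loses the /anthropic path segment while fixed keeps it, which is the difference between the 404 and the correct endpoint.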

Affected providers

All API-key providers whose inference_base_url ends with /anthropic when a main_runtime with api_mode=anthropic_messages is available: minimax, minimax-cn, and any compatible third-party gateway.

Changed files

  • agent/auxiliary_client.py (modified, +6/-4)

Bug Description

When a subagent running via delegate_task times out after making N>0 API calls, the lead agent receives only a vague error message with no diagnostic information about which tool was last executing. This makes it impossible to distinguish between:

  1. 'Tool completed but next LLM request stuck' — the tool itself finished but the provider's response handling froze
  2. 'Tool itself hung' — the tool call never returned (network partition, blocked I/O, etc.)

Impact

  • Lead agents cannot triage delegation failures programmatically
  • Users see: 'Subagent timed out after 120s with 3 API call(s) completed — likely stuck on a slow API call or unresponsive network request.' — no actionable detail
  • The 0-API-call timeout case (#15105) and normal completion case (#1175) both have structured diagnostics; only the N-API-call timeout gap was left without observability

Existing Coverage

Scenario | Issue/PR | Diagnostics
Normal completion | #1175 | tool_trace in return dict
0-API-call timeout | #15105 | diagnostic_path with structured log
N-API-call timeout | This issue | None ← gap

Proposed Fix

Add the same tool_trace / last_tool / last_tool_status instrumentation to the N-API-call timeout path in _run_single_child(), mirroring what #1175 added for the normal completion path:

  1. Extract current_tool from _summary for the error message
  2. Reconstruct tool_trace from result['messages'] (assistant tool_calls + tool role responses)
  3. Return tool_trace, last_tool, last_tool_status in the timeout response dict

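Three-path coverage overview

Path | Trigger condition | Diagnostics
#1175 normal completion | completed=True | tool_trace
#15105 0-API-call timeout | is_timeout and api_calls == 0 | diagnostic_path
This issue, N-API-call timeout | is_timeout and api_calls > 0 | tool_trace + last_tool + last_tool_status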

extent analysis

TL;DR

Add tool_trace, last_tool, and last_tool_status to the N-API-call timeout path in _run_single_child() to provide diagnostic information.

Guidance

  • Extract current_tool from _summary for the error message to identify the last tool executing.
  • Reconstruct tool_trace from result['messages'] to provide a record of tool calls and responses.
  • Return tool_trace, last_tool, and last_tool_status in the timeout response dict to enable programmatic triage of delegation failures.
  • Verify the fix by checking the response dict for the added diagnostic information in the N-API-call timeout scenario.

Example

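def timeout_diagnostics(is_timeout, api_calls, summary, messages, reconstruct_tool_trace):
    """Sketch of the N-API-call timeout branch; all names here are illustrative."""
    if not (is_timeout and api_calls > 0):
        return {}  # other paths leave the diagnostic fields unset
    # Rebuild the trace from assistant tool_calls and tool-role responses.
    tool_trace = reconstruct_tool_trace(messages)
    # Tool still running according to the activity summary, if any.
    current_tool = summary.get('current_tool')
    # Assumes trace entries are dicts with 'tool' and 'status' keys.
    last = tool_trace[-1] if tool_trace else {}
    return {
        'tool_trace': tool_trace,
        'last_tool': current_tool or last.get('tool'),
        # A tool that is still running is in_progress; otherwise keep its recorded status.
        'last_tool_status': 'in_progress' if current_tool else last.get('status'),
    }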

Notes

This fix assumes that the reconstruct_tool_trace function can accurately rebuild the tool_trace from the result['messages']. Additional error handling may be necessary to ensure the fix is robust.

Recommendation

Apply the proposed fix to add diagnostic information to the N-API-call timeout path, enabling lead agents to triage delegation failures programmatically.
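
Once the fields are present, a lead agent could triage a timeout along these lines. This is only a sketch of a possible consumer: triage_timeout and its return labels are hypothetical, and it assumes last_tool_status takes the values ok, error, or in_progress as described in the PR notes.

```python
def triage_timeout(result: dict) -> str:
    """Classify a timed-out delegate_task result by its last tool status."""
    tool = result.get("last_tool")
    status = result.get("last_tool_status")
    if not tool:
        return "unknown"  # no trace available, e.g. the 0-API-call path
    if status == "in_progress":
        return "tool_hung"  # the tool call never returned
    if status in ("ok", "error"):
        return "llm_request_hung"  # the tool returned; the next request stalled
    return "unknown"
```

For example, a result carrying last_tool=terminal and last_tool_status=in_progress would be classified as tool_hung, pointing at the tool rather than the provider.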
