hermes - ✅(Solved) Fix [Bug]: N-API-call subagent timeout lacks tool_trace diagnostics — cannot identify last stuck tool [2 pull requests, 1 participants]

hermes2026-04-29 06:25:04

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

NousResearch/hermes-agent#17308•Fetched 2026-04-30 06:48:32

View on GitHub

Comments

Participants

Timeline

Reactions

Author

YeahloYip

Participants

YeahloYip

Timeline (top)

labeled ×4cross-referenced ×2

Error Message

When a subagent running via delegate_task times out after making N>0 API calls, the lead agent receives only a vague error message with no diagnostic information about which tool was last executing. This makes it impossible to distinguish between:

Extract current_tool from _summary for the error message

Fix Action

Fixed

Fixed by PR: fix(delegate): surface tool_trace on N-API-call subagent timeouts (#17308) (https://github.com/NousResearch/hermes-agent/pull/17329)
Fixed by PR: fix(auxiliary): preserve raw_base_url for Anthropic SDK wrapping in resolve_provider_client (https://github.com/NousResearch/hermes-agent/pull/17340)

PR fix notes

PR #17329: fix(delegate): surface tool_trace on N-API-call subagent timeouts (#17308)

Repository: NousResearch/hermes-agent
Author: Sanjays2402
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17329

Description (problem / solution / changelog)

Closes #17308.

Problem

When a subagent under delegate_task times out after making >0 API calls, the lead agent gets a vague string and nothing else:

Subagent timed out after 120s with 3 API call(s) completed — likely stuck on a slow API call or unresponsive network request.

There's no way to tell apart the two failure modes:

Tool finished, next LLM request hung — the tool itself is fine; the provider froze.
Tool itself hung — network partition, blocked I/O, etc.

This was the gap between the two existing diagnostic paths:

Path	Coverage
Normal completion (#1175)	`tool_trace` in return dict
0-API-call timeout (#15105)	`diagnostic_path` with structured log
N-API-call timeout	None ← this PR

Fix

Three pieces:

1. Extract a shared trace builder

The normal-completion branch already reconstructs tool_trace from result['messages']. Pulled that loop out into a module-level _build_tool_trace_from_messages() helper so both branches use one implementation.

2. Reconstruct trace on the N-API-call timeout branch

In _run_single_child's timeout branch (when is_timeout and child_api_calls > 0):

Read child._session_messages and run it through the helper.
If the trace tail has no matching tool-role response → mark status='in_progress' (the tool itself is hung).
Read get_activity_summary().current_tool. If it disagrees with the trace tail, prefer it — the tool-role write can lag because the agent writes the assistant message first and the tool response only after the tool returns.

3. Surface the diagnostics

Return dict now carries tool_trace, last_tool, last_tool_status, current_tool. Error message gets a last_tool=X (status=Y) suffix so it shows up in logs and the lead's prompt:

Subagent timed out after 120s with 3 API call(s) completed — likely stuck on a slow API call or unresponsive network request. last_tool=terminal (status=in_progress)

0-API-call timeouts (diagnostic_path branch) and non-timeout errors leave the new fields empty/None so consumers don't read stale data.

Tests

Added two test classes in tests/tools/test_delegate_subagent_timeout_diagnostic.py:

TestRunSingleChildTimeoutToolTrace — end-to-end through _run_single_child with a tiny timeout:

test_timeout_after_completed_tool_marks_status_ok — tool returned cleanly → status=ok, current_tool=None
test_timeout_inside_running_tool_marks_status_in_progress — tool never returned → status=in_progress, current_tool set
test_timeout_with_tool_error_preserves_error_status — error responses keep status=error
test_timeout_with_parallel_tool_calls_pairs_by_id — out-of-order replies still pair correctly
test_zero_api_call_timeout_skips_tool_trace — 0-API branch keeps the new fields empty (no stale data alongside diagnostic_path)
test_timeout_with_no_session_messages_attr_does_not_crash — degrades to empty trace if _session_messages is absent

TestBuildToolTraceFromMessages — direct unit tests for the extracted helper (non-list input, non-dict entries, assistants without tool_calls, tool responses without tool_call_id).

$ python -m pytest tests/tools/test_delegate_subagent_timeout_diagnostic.py -q
.................                                                       [100%]
17 passed in 3.88s

Combined with the existing test_delegate.py suite: 137/137 pass.

Changed files

tests/tools/test_delegate_subagent_timeout_diagnostic.py (modified, +254/-0)
tools/delegate_tool.py (modified, +113/-32)

PR #17340: fix(auxiliary): preserve raw_base_url for Anthropic SDK wrapping in resolve_provider_client

Repository: NousResearch/hermes-agent
Author: lzsnolimit
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17340

Description (problem / solution / changelog)

Problem

After #17308 (f3371c39) threaded main_runtime through to auto_title_session, title generation started returning HTTP 404 for users on minimax-cn (and any other provider whose inference_base_url ends with /anthropic).

Root cause chain:

resolve_api_key_provider_credentials("minimax-cn") returns base_url = "https://api.minimaxi.com/anthropic"
_to_openai_base_url() converts it to "https://api.minimaxi.com/v1" (correct for the OpenAI SDK client)
_wrap_if_needed is called with this already-converted /v1 URL
Because main_runtime now carries api_mode="anthropic_messages", _maybe_wrap_anthropic decides to wrap → calls build_anthropic_client(key, "https://api.minimaxi.com/v1")
The Anthropic SDK appends /v1/messages → actual request hits https://api.minimaxi.com/v1/messages → 404

The correct endpoint is https://api.minimaxi.com/anthropic/v1/messages.

Fix

Preserve raw_base_url (pre-_to_openai_base_url conversion) and pass it to _wrap_if_needed, so _maybe_wrap_anthropic receives the original /anthropic-suffixed URL for both:

endpoint detection (_endpoint_speaks_anthropic_messages)
Anthropic SDK client construction (build_anthropic_client)

The OpenAI client continues to use the /v1-converted base_url.

Affected providers

All API-key providers whose inference_base_url ends with /anthropic when a main_runtime with api_mode=anthropic_messages is available: minimax, minimax-cn, and any compatible third-party gateway.

Changed files

agent/auxiliary_client.py (modified, +6/-4)

RAW_BUFFERClick to expand / collapse

Bug Description

'Tool completed but next LLM request stuck' — the tool itself finished but the provider's response handling froze
'Tool itself hung' — the tool call never returned (network partition, blocked I/O, etc.)

Impact

Lead agents cannot triage delegation failures programmatically
Users see: 'Subagent timed out after 120s with 3 API call(s) completed — likely stuck on a slow API call or unresponsive network request.' — no actionable detail
The 0-API-call timeout case (#15105) and normal completion case (#1175) both have structured diagnostics; only the N-API-call timeout gap was left without observability

Existing Coverage

Scenario	Issue/PR	Diagnostics
Normal completion	#1175	`tool_trace` in return dict
0-API-call timeout	#15105	`diagnostic_path` with structured log
N-API-call timeout	This issue	None ← gap

Proposed Fix

Add the same tool_trace / last_tool / last_tool_status instrumentation to the N-API-call timeout path in _run_single_child(), mirroring what #1175 added for the normal completion path:

Extract current_tool from _summary for the error message
Reconstruct tool_trace from result['messages'] (assistant tool_calls + tool role responses)
Return tool_trace, last_tool, last_tool_status in the timeout response dict

三段式覆盖全景

路径	触发条件	诊断信息
#1175 正常完成	`completed=True`	`tool_trace`
#15105 0-API超时	`is_timeout and api_calls == 0`	`diagnostic_path`
This issue N-API超时	`is_timeout and api_calls > 0`	`tool_trace` + `last_tool` + `last_tool_status`

extent analysis

TL;DR

Add tool_trace, last_tool, and last_tool_status to the N-API-call timeout path in _run_single_child() to provide diagnostic information.

Guidance

Extract current_tool from _summary for the error message to identify the last tool executing.
Reconstruct tool_trace from result['messages'] to provide a record of tool calls and responses.
Return tool_trace, last_tool, and last_tool_status in the timeout response dict to enable programmatic triage of delegation failures.
Verify the fix by checking the response dict for the added diagnostic information in the N-API-call timeout scenario.

Example

def _run_single_child():
    # ...
    if is_timeout and api_calls > 0:
        current_tool = _summary['current_tool']
        tool_trace = reconstruct_tool_trace(result['messages'])
        return {
            'tool_trace': tool_trace,
            'last_tool': current_tool,
            'last_tool_status': 'timeout'
        }

Notes

This fix assumes that the reconstruct_tool_trace function can accurately rebuild the tool_trace from the result['messages']. Additional error handling may be necessary to ensure the fix is robust.

Recommendation

Apply the proposed fix to add diagnostic information to the N-API-call timeout path, enabling lead agents to triage delegation failures programmatically.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #optimization #mixed precision #training loop #device allocation

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

hermes - ✅(Solved) Fix [Bug]: N-API-call subagent timeout lacks tool_trace diagnostics — cannot identify last stuck tool [2 pull requests, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fixed

PR fix notes

PR #17329: fix(delegate): surface tool_trace on N-API-call subagent timeouts (#17308)

Description (problem / solution / changelog)

Problem

Fix

1. Extract a shared trace builder

2. Reconstruct trace on the N-API-call timeout branch

3. Surface the diagnostics

Tests

Changed files

PR #17340: fix(auxiliary): preserve raw_base_url for Anthropic SDK wrapping in resolve_provider_client

Description (problem / solution / changelog)

Problem

Fix

Affected providers

Changed files

三段式覆盖全景

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING