hermes - ✅(Solved) Fix [Bug]: Agent tool calls freeze mid-execution — output stops at [Calling tool: ...] with no response, /goal unblocks [1 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#28834 (fetched 2026-05-20 04:01:36)
Comments: 0
Participants: 1
Timeline: 10
Reactions: 0
Timeline (top): labeled ×4, referenced ×3, subscribed ×2, cross-referenced ×1

Error Message

The agent’s tool call output consistently freezes at the [Calling tool: tool_name with arguments: {...}] marker. The response never completes — no tool result, no follow-up message, no error. The agent has made the call but the response never arrives back.

Fix Action

Fixed

PR fix notes

PR #29004: fix(run_agent): unfreeze first tool call after idle on macOS (#28834)

Description (problem / solution / changelog)

What does this PR do?

Fixes #28834 — first tool call after a 2-3 minute idle pause freezes at [Calling tool: ...] until the user kicks the loop (e.g. with /goal).

Root cause is in AIAgent._build_keepalive_http_client (run_agent.py), which configures the TCP keepalive socket options applied to every provider connection. Two adjacent bugs combine to produce the freeze:

  1. The Linux branch sets TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT together, so a dead peer is detected in ~60 s. The macOS branch sets only TCP_KEEPALIVE (the macOS name for the idle knob) and falls through, leaving KEEPINTVL and KEEPCNT at the kernel defaults of 75 s × 8 ≈ 10 minutes. After a 2-3 min idle, an intermediate NAT or firewall silently drops the provider socket, but macOS doesn't notice for nearly 10 more minutes.
  2. Even with the keepalive fix, there is still a narrow window in which httpx's keepalive pool hands a zombie connection to the next request before the keepalive timer has had a chance to mark it dead. Without a connection-level retry, that request hangs or errors with no automatic recovery.
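The detection budgets quoted in the first point follow from simple arithmetic: idle time before the first probe, plus probe interval times probe count. A minimal sketch, assuming the 30 s idle value the test descriptions in this PR mention; the function name is hypothetical and does not appear in run_agent.py:

```python
def dead_peer_budget(idle_s: int, interval_s: int, probe_count: int) -> int:
    """Worst-case seconds before the kernel declares a silent peer dead."""
    return idle_s + interval_s * probe_count


# All three knobs set explicitly (30 / 10 / 3, the values the PR pins).
tuned = dead_peer_budget(idle_s=30, interval_s=10, probe_count=3)

# Only the idle knob set; interval and count left at the 75 s x 8
# defaults that the PR description quotes for macOS.
idle_only = dead_peer_budget(idle_s=30, interval_s=75, probe_count=8)

print(tuned, idle_only)  # 60 630
```

The second figure is 30 s of idle plus 600 s of probing, which matches the "nearly 10 more minutes" described above.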

The fix is two small changes inside _build_keepalive_http_client, each in its own commit:

  • Split TCP_KEEPINTVL / TCP_KEEPCNT out of the TCP_KEEPIDLE branch and gate them on their own hasattr checks — both are exposed on macOS in Python ≥ 3.10. macOS now matches Linux's ~60 s detection budget.
  • Pass retries=1 to httpx.HTTPTransport so a stale-pool connection that beats the keepalive timer triggers a single transparent re-dial. httpx only retries connection-establishment failures, so this can't double-submit a half-sent request.
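The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in run_agent.py: the function names, the `sock` parameter, and the defaults are hypothetical, and the only httpx API relied on is the `retries` and `socket_options` keyword arguments of `httpx.HTTPTransport`.

```python
import socket
from typing import Any, List, Tuple

SocketOption = Tuple[int, int, int]


def keepalive_socket_options(
    sock: Any = socket,
    idle_s: int = 30,
    interval_s: int = 10,
    probe_count: int = 3,
) -> List[SocketOption]:
    """Build keepalive options, gating every knob on its own hasattr check."""
    opts: List[SocketOption] = [(sock.SOL_SOCKET, sock.SO_KEEPALIVE, 1)]

    # Idle knob: TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS.
    if hasattr(sock, "TCP_KEEPIDLE"):
        opts.append((sock.IPPROTO_TCP, sock.TCP_KEEPIDLE, idle_s))
    elif hasattr(sock, "TCP_KEEPALIVE"):
        opts.append((sock.IPPROTO_TCP, sock.TCP_KEEPALIVE, idle_s))

    # Interval and count are checked independently of the idle knob,
    # so a platform without TCP_KEEPIDLE still gets them.
    if hasattr(sock, "TCP_KEEPINTVL"):
        opts.append((sock.IPPROTO_TCP, sock.TCP_KEEPINTVL, interval_s))
    if hasattr(sock, "TCP_KEEPCNT"):
        opts.append((sock.IPPROTO_TCP, sock.TCP_KEEPCNT, probe_count))
    return opts


def build_transport(retries: int = 1):
    """Hypothetical wrapper; needs the third-party httpx package."""
    import httpx  # imported lazily so the builder above stays stdlib-only

    return httpx.HTTPTransport(
        socket_options=keepalive_socket_options(),
        retries=retries,
    )
```

Passing the socket module as a parameter is a choice made for this sketch so the builder can be exercised with a fake module; the PR's tests instead stub sys.modules['socket'].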

Related Issue

Fixes #28834.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • run_agent.py (+27/-3) — _build_keepalive_http_client:
    • Set TCP_KEEPINTVL=10 and TCP_KEEPCNT=3 on both the Linux and macOS branches via independent hasattr checks (was Linux-only).
    • Pass retries=1 to httpx.HTTPTransport so connection-establishment failures (stale pool connections) re-dial transparently.
  • tests/run_agent/test_keepalive_socket_options.py (+179, new) — three test classes covering the new contract:
    • TestKeepaliveSharedKnobs — 4 cases running against the real host's socket module: SO_KEEPALIVE on, TCP_KEEPINTVL=10, TCP_KEEPCNT=3, and a 30 s idle warm-up under whichever of TCP_KEEPIDLE / TCP_KEEPALIVE the platform exposes.
    • TestMacOSKeepaliveParity — 1 case stubbing sys.modules['socket'] with a macOS-flavored facade (no TCP_KEEPIDLE) so the test exercises the macOS branch even on Linux CI runners. Pins all three values (30 / 10 / 3) to lock the budget.
    • TestStalePoolRetry — 1 case asserting the constructed httpx.HTTPTransport carries retries=1 via the underlying httpcore pool's _retries attribute.
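The macOS-parity idea, exercising the branch without TCP_KEEPIDLE on any host, can be sketched with a fake socket namespace. Everything here is illustrative: `build_options` is a hypothetical stand-in for the helper under test, the facade is passed as an argument rather than installed in sys.modules as the PR describes, and the numeric constants are placeholders chosen to resemble macOS values rather than copied from the real test file.

```python
from types import SimpleNamespace


def build_options(sock):
    """Hypothetical stand-in for the helper under test."""
    opts = [(sock.SOL_SOCKET, sock.SO_KEEPALIVE, 1)]
    idle_name = "TCP_KEEPIDLE" if hasattr(sock, "TCP_KEEPIDLE") else "TCP_KEEPALIVE"
    for name, value in ((idle_name, 30), ("TCP_KEEPINTVL", 10), ("TCP_KEEPCNT", 3)):
        if hasattr(sock, name):
            opts.append((sock.IPPROTO_TCP, getattr(sock, name), value))
    return opts


# macOS-flavored facade: deliberately has no TCP_KEEPIDLE attribute.
MACOS_FACADE = SimpleNamespace(
    SOL_SOCKET=0xFFFF,
    SO_KEEPALIVE=0x0008,
    IPPROTO_TCP=6,
    TCP_KEEPALIVE=0x10,
    TCP_KEEPINTVL=0x101,
    TCP_KEEPCNT=0x102,
)


def test_macos_parity():
    assert not hasattr(MACOS_FACADE, "TCP_KEEPIDLE")
    tcp_values = {
        opt: val
        for level, opt, val in build_options(MACOS_FACADE)
        if level == MACOS_FACADE.IPPROTO_TCP
    }
    # Pin all three values (30 / 10 / 3) so the budget cannot drift.
    assert tcp_values == {0x10: 30, 0x101: 10, 0x102: 3}


test_macos_parity()
```

Because the facade is plain data, the check behaves the same on a Linux CI runner as on a Mac.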

No other production files touched. No config schema changes, no new env vars, no public-API surface change.

How to Test

  1. Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
  2. Run the new tests on their own:
    scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py -v
    Expected: 6 passed.
  3. Run the wider OpenAI-client transport suite to confirm no cross-file regressions:
    scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py \
      tests/run_agent/test_create_openai_client_reuse.py \
      tests/run_agent/test_create_openai_client_proxy_env.py \
      tests/run_agent/test_create_openai_client_kwargs_isolation.py \
      tests/run_agent/test_async_httpx_del_neuter.py \
      tests/run_agent/test_sequential_chats_live.py
    Expected: 27 passed, 1 skipped.
  4. (Optional, on macOS only — reproduces the original issue) Open hermes in interactive mode, run any tool call, idle 3 minutes, then send a request that requires another tool call. Before the fix: hangs at [Calling tool: ...]. After the fix: completes within the usual latency.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(run_agent): ... × 2, test(run_agent): ... × 1)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A (no public-API change; inline comments updated to call out the macOS gap + #28834)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — macOS branch fixed to match Linux; Windows path is unchanged (the helper short-circuits via try/except, and Windows has no TCP_KEEPINTVL); fix is the documented behaviour on both supported platforms
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py -v
4 workers [6 items]
============================== 6 passed in 1.59s ===============================

$ scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py \
    tests/run_agent/test_create_openai_client_reuse.py \
    tests/run_agent/test_create_openai_client_proxy_env.py \
    tests/run_agent/test_create_openai_client_kwargs_isolation.py \
    tests/run_agent/test_async_httpx_del_neuter.py \
    tests/run_agent/test_sequential_chats_live.py
4 workers [28 items]
======================== 27 passed, 1 skipped in 3.94s =========================

Changed files

  • run_agent.py (modified, +27/-3)
  • tests/run_agent/test_keepalive_socket_options.py (added, +179/-0)

Code Example

Debug report uploaded:
  Report       https://paste.rs/21p96
  agent.log    https://paste.rs/j2IzB
  gateway.log  https://paste.rs/OADPl


Bug Description

Description:

The agent’s tool call output consistently freezes at the [Calling tool: tool_name with arguments: {...}] marker. The response never completes — no tool result, no follow-up message, no error. The agent has made the call but the response never arrives back.

The command /goal reliably unblocks the frozen state and the agent resumes producing output within 2-5 seconds.

Steps to Reproduce

Steps to reproduce the behavior:

  1. Have a normal conversation with the agent (any model, any provider).
  2. Stop sending messages for 2-3 minutes.
  3. Send a new request that requires tool calls (e.g., "read this file", "run this command").
  4. Observe: the response stops at [Calling tool: tool_name with arguments: {...}] and never completes.
  5. Send /goal; the agent resumes within seconds.

The issue is not sporadic: during active back-and-forth conversation, tool calls work fine. It is specifically the first tool call after a pause that hangs.

Expected Behavior

Tool call results should be returned and displayed regardless of idle time between messages.

Actual Behavior

The symptom is purely textual: the conversation wedges at the [Calling tool: ...] marker and produces no further output until /goal is sent. For example: [Calling tool: execute_code with arguments={"code": "from hermes_tools import read_file\n\n# Try reading in chunks\nfor offset in range(1, 1000, 200):\n r = read_file(path="/Users/eb/.hermes/config.yaml", offset=offset, limit=200)\n print(f"— offset {offset} —")\n print(r["content"])\n if "truncated" not in r or not r["truncated"]:\n break"}]

Affected Component

CLI (interactive chat)

Messaging Platform (if gateway-related)

N/A (CLI only)

Operating System

macOS (26.3.1)

Python Version

3.9.6

Hermes Version

0.14.0

Additional Logs / Traceback (optional)

Root Cause Analysis (optional)

No response

Proposed Fix (optional)

No response

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR
