Hermes: Auxiliary LLM calls (session_search, skills_hub, etc.) use a short 30s default timeout with no local-endpoint detection, causing retry storms on slow local inference


Summary

When using a slow local inference server (vLLM, llama.cpp, Ollama), auxiliary tasks like session_search make LLM API calls with a 30-second default timeout (_DEFAULT_AUX_TIMEOUT in agent/auxiliary_client.py:3140). Unlike the main agent loop (which auto-detects local endpoints and bumps timeouts to 1800s), the auxiliary client has no is_local_endpoint() check. This causes ReadTimeout → retry → new request while the original is still processing → vLLM queue saturation.

Reproduction

  1. Configure Hermes to use a local inference server (e.g., vLLM at http://10.0.0.1:8088/v1-openai/) with a model that takes >30s for large-context responses
  2. Ask the agent a question that triggers session_search (e.g., "search my past sessions for X")
  3. The agent calls the LLM → vLLM processes it → agent gets tool response → executes session_search
  4. session_search calls _summarize_session() → async_call_llm() → auxiliary_client.py:3916
  5. The auxiliary client uses the config value auxiliary.session_search.timeout (default 30s) or _DEFAULT_AUX_TIMEOUT (30s) — see auxiliary_client.py:3140
  6. vLLM takes >30s → httpcore.ReadTimeout → the tool retries (3 attempts built into session_search_tool.py:225-259)
  7. Each retry spawns a new /v1/chat/completions request while the original is still being processed by vLLM
  8. With max_concurrency: 3, up to 3 concurrent summarizations + the main agent's next call saturate the inference server queue

Log Evidence

tools/session_search_tool.py:228 → _summarize_session()
agent/auxiliary_client.py:3916 → async_call_llm()
httpcore.ReadTimeout → httpx.ReadTimeout

18 occurrences observed in a single session.
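The amplification described in steps 6 to 8 can be illustrated with a small, self-contained simulation. This is a sketch, not Hermes code: FakeServer, call_with_retries, and simulate are hypothetical names, the latencies are scaled down to fractions of a second, and the server is assumed to keep processing a request after the client has timed out, as the issue describes.

```python
import asyncio


class FakeServer:
    """Hypothetical stand-in for a slow inference server.

    Assumption: the server keeps working on a request even after the
    client has given up on it.
    """

    def __init__(self, latency):
        self.latency = latency
        self.tasks = []      # every request ever submitted
        self.started = 0     # total requests that began processing
        self.in_flight = 0   # requests currently being processed
        self.peak = 0        # highest in_flight value observed

    def submit(self):
        task = asyncio.create_task(self._handle())
        self.tasks.append(task)
        return task

    async def _handle(self):
        self.started += 1
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            await asyncio.sleep(self.latency)
            return "summary"
        finally:
            self.in_flight -= 1


async def call_with_retries(server, timeout, attempts):
    """Client side: on timeout, send a brand-new request."""
    for _ in range(attempts):
        task = server.submit()
        try:
            # shield() keeps the server-side task alive when the client times out
            return await asyncio.wait_for(asyncio.shield(task), timeout)
        except asyncio.TimeoutError:
            continue
    return None


async def simulate(latency, timeout, attempts, concurrency):
    server = FakeServer(latency)
    results = await asyncio.gather(
        *(call_with_retries(server, timeout, attempts) for _ in range(concurrency))
    )
    await asyncio.gather(*server.tasks)  # let abandoned requests drain
    return server.started, server.peak, results


if __name__ == "__main__":
    # Timeout far below latency: 3 callers x 3 attempts pile up on the server.
    print(asyncio.run(simulate(latency=0.5, timeout=0.05, attempts=3, concurrency=3)))
    # Timeout above latency: one request per caller.
    print(asyncio.run(simulate(latency=0.05, timeout=1.0, attempts=3, concurrency=3)))
```

Under these assumptions, a timeout shorter than the server latency turns 3 logical summarizations into 9 server-side requests, none of which returns a result to the caller. With a timeout above the latency, the same workload produces 3 requests and 3 results.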

Root Cause

The main agent loop has local-endpoint awareness in three places:

  • run_agent.py:6812 — streaming read timeout auto-raised from 120s to 1800s for local endpoints
  • run_agent.py:7371 — streaming stale timeout disabled (float("inf")) for local endpoints
  • run_agent.py:2860 — non-streaming stale timeout disabled for local endpoints

The auxiliary client (agent/auxiliary_client.py) has no equivalent checks. Tasks affected include session_search, skills_hub, approval, mcp, and title_generation; all default to 30s via auxiliary.<task>.timeout in config.yaml.

Workaround (config-only, immediate)

In config.yaml:

    auxiliary:
      session_search:
        timeout: 600        # was 30
        max_concurrency: 1  # was 3

And add a providers.custom.request_timeout_seconds as a floor for all calls.

Suggested Fix (code)

Add is_local_endpoint() detection to the auxiliary client timeout resolution in agent/auxiliary_client.py, matching the pattern in run_agent.py. Specifically:
  1. _get_task_timeout() (line 3157) should call is_local_endpoint() and raise the timeout when the resolved base URL is local and the timeout is the implicit default
  2. async_call_llm() / call_llm() should auto-bump _DEFAULT_AUX_TIMEOUT for local providers

Affected Versions

Observed on Hermes Agent v0.12.0. The _DEFAULT_AUX_TIMEOUT = 30.0 has been present since the auxiliary client was introduced.
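One way the suggested timeout resolution could look is sketched below. It is not the Hermes implementation: looks_like_local_endpoint is a hypothetical stand-in for is_local_endpoint() (whose real signature and heuristics are not shown in this issue), resolve_aux_timeout stands in for _get_task_timeout(), and the 1800s local value is an assumption borrowed from the main loop's streaming read timeout.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse

DEFAULT_AUX_TIMEOUT = 30.0   # mirrors _DEFAULT_AUX_TIMEOUT from the issue
LOCAL_AUX_TIMEOUT = 1800.0   # assumption: reuse the main loop's local value


def looks_like_local_endpoint(base_url: str) -> bool:
    """Hypothetical heuristic: loopback, private-range, or localhost-style hosts."""
    host = urlparse(base_url).hostname
    if not host:
        return False
    if host == "localhost" or host.endswith(".local"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a regular DNS name, treated as remote
    return ip.is_loopback or ip.is_private


def resolve_aux_timeout(configured: Optional[float], base_url: str) -> float:
    """Raise the timeout only when the user has not set one explicitly."""
    if configured is not None:
        return float(configured)      # explicit config always wins
    if looks_like_local_endpoint(base_url):
        return LOCAL_AUX_TIMEOUT      # implicit default + local endpoint
    return DEFAULT_AUX_TIMEOUT        # implicit default + remote endpoint


if __name__ == "__main__":
    print(resolve_aux_timeout(None, "http://10.0.0.1:8088/v1-openai/"))
    print(resolve_aux_timeout(None, "https://api.example.com/v1"))
    print(resolve_aux_timeout(45, "http://localhost:8000/v1"))
```

The key design choice in this sketch is that only the implicit default is raised. A value the user has set explicitly in config.yaml is left untouched, even for a local endpoint.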

Related: #21525

