hermes - 💡(How to fix) Fix Auxiliary LLM calls (session_search, skills_hub, etc.) use short 30s default timeout with no local-endpoint detection — causes retry storms on slow local inference

Summary

When using a slow local inference server (vLLM, llama.cpp, Ollama), auxiliary tasks such as session_search make LLM API calls with a 30-second default timeout (_DEFAULT_AUX_TIMEOUT in agent/auxiliary_client.py:3140). Unlike the main agent loop, which auto-detects local endpoints and raises timeouts to 1800s, the auxiliary client has no is_local_endpoint() check. This causes ReadTimeout → retry → new request while the original is still processing → vLLM queue saturation.

Reproduction

Configure Hermes to use a local inference server (e.g., vLLM at http://10.0.0.1:8088/v1-openai/) with a model that takes >30s for large-context responses
Ask the agent a question that triggers session_search (e.g., "search my past sessions for X")
The agent calls the LLM → vLLM processes it → agent gets tool response → executes session_search
session_search calls _summarize_session() → async_call_llm() → auxiliary_client.py:3916
The auxiliary client uses the config value auxiliary.session_search.timeout (default 30s) or _DEFAULT_AUX_TIMEOUT (30s) — see auxiliary_client.py:3140
vLLM takes >30s → httpcore.ReadTimeout → the tool retries (3 attempts built into session_search_tool.py:225-259)
Each retry spawns a new /v1/chat/completions request while the original is still being processed by vLLM
With max_concurrency: 3, up to 3 concurrent summarizations plus the main agent's next call saturate the inference server queue

Log Evidence

tools/session_search_tool.py:228 → _summarize_session()
agent/auxiliary_client.py:3916 → async_call_llm()
httpcore.ReadTimeout → httpx.ReadTimeout

18 occurrences observed in a single session.
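The retry amplification described in the reproduction steps can be illustrated with a small asyncio simulation. This is only a sketch under stated assumptions: `FakeServer`, `call_with_retries`, and `simulate` are hypothetical names, not Hermes or vLLM code, and the latencies are scaled down from seconds to fractions of a second. The one behavior it models is that the server keeps processing a request after the client has stopped waiting for it.

```python
import asyncio


class FakeServer:
    """Hypothetical stand-in for a slow inference server."""

    def __init__(self, latency: float) -> None:
        self.latency = latency
        self.in_flight = 0  # requests currently being processed
        self.peak = 0       # highest in_flight value seen
        self.total = 0      # requests received overall

    async def _process(self) -> None:
        self.in_flight += 1
        self.total += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            await asyncio.sleep(self.latency)
        finally:
            self.in_flight -= 1

    def submit(self) -> "asyncio.Task[None]":
        # The job runs as its own task, so it continues even when the
        # caller gives up waiting (as a real server would).
        return asyncio.create_task(self._process())


async def call_with_retries(server: FakeServer, timeout: float, attempts: int = 3) -> bool:
    """Client-side timeout plus retry, loosely mirroring the 3-attempt loop."""
    for _ in range(attempts):
        job = server.submit()
        try:
            # shield() keeps the server-side job alive when the wait times out.
            await asyncio.wait_for(asyncio.shield(job), timeout)
            return True
        except asyncio.TimeoutError:
            continue  # retry while the abandoned job is still running
    return False


async def simulate(latency: float, timeout: float, concurrency: int = 3):
    server = FakeServer(latency)
    results = await asyncio.gather(
        *(call_with_retries(server, timeout) for _ in range(concurrency))
    )
    return server.total, server.peak, results


if __name__ == "__main__":
    # Timeout shorter than latency: every attempt is abandoned and retried.
    print(asyncio.run(simulate(latency=0.3, timeout=0.05)))
    # Timeout longer than latency: one request per caller.
    print(asyncio.run(simulate(latency=0.05, timeout=1.0)))
```

With a timeout shorter than the server latency, three callers with three attempts each produce nine requests that overlap on the server, and none of them is ever read by the client. With a sufficiently long timeout the same workload produces three requests.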

Root Cause

The main agent loop has local-endpoint awareness in three places:

run_agent.py:6812 — streaming read timeout auto-raised from 120s to 1800s for local endpoints
run_agent.py:7371 — streaming stale timeout disabled (float("inf")) for local endpoints
run_agent.py:2860 — non-streaming stale timeout disabled for local endpoints

The auxiliary client (agent/auxiliary_client.py) has no equivalent checks. Tasks affected include session_search, skills_hub, approval, mcp, and title_generation; all default to 30s via auxiliary.<task>.timeout in config.yaml.

Workaround (config-only, immediate)

In config.yaml:

    auxiliary:
      session_search:
        timeout: 600        # was 30
        max_concurrency: 1  # was 3

Also add a providers.custom.request_timeout_seconds as a floor for all calls.

Suggested Fix (code)

Add is_local_endpoint() detection to the auxiliary client timeout resolution in agent/auxiliary_client.py, matching the pattern in run_agent.py. Specifically:

_get_task_timeout() (line 3157) should call is_local_endpoint() and raise the timeout when the resolved base URL is local and the timeout is the implicit default
async_call_llm() / call_llm() should auto-bump _DEFAULT_AUX_TIMEOUT for local providers

Affected Versions

Observed on Hermes Agent v0.12.0. The _DEFAULT_AUX_TIMEOUT = 30.0 has been present since the auxiliary client was introduced.
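The timeout resolution proposed in the suggested fix might look roughly like the following. This is a minimal sketch, not the Hermes implementation: `is_local_endpoint` here is a hypothetical stand-in written for illustration (the real helper used by run_agent.py may apply different rules), `resolve_aux_timeout` is an invented name rather than the actual `_get_task_timeout()` signature, and the 1800s local value is an assumption borrowed from the main loop's streaming read timeout. A `configured` value of `None` is assumed to mean the user did not set auxiliary.<task>.timeout explicitly.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse

DEFAULT_AUX_TIMEOUT = 30.0   # mirrors _DEFAULT_AUX_TIMEOUT from the issue
LOCAL_AUX_TIMEOUT = 1800.0   # assumption: reuse the main loop's local value


def is_local_endpoint(base_url: str) -> bool:
    """Hypothetical check: loopback, private-range, or localhost hosts."""
    host = urlparse(base_url).hostname
    if not host:
        return False
    if host == "localhost" or host.endswith(".local"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name that is not obviously local
    return addr.is_loopback or addr.is_private


def resolve_aux_timeout(configured: Optional[float], base_url: str) -> float:
    """Pick a timeout for an auxiliary LLM call (illustrative only)."""
    if configured is not None:
        return float(configured)   # an explicit user setting always wins
    if is_local_endpoint(base_url):
        return LOCAL_AUX_TIMEOUT   # implicit default raised for local servers
    return DEFAULT_AUX_TIMEOUT


if __name__ == "__main__":
    print(resolve_aux_timeout(None, "http://10.0.0.1:8088/v1-openai/"))
    print(resolve_aux_timeout(600, "http://10.0.0.1:8088/v1-openai/"))
    print(resolve_aux_timeout(None, "https://api.openai.com/v1"))
```

Raising only the implicit default keeps explicit configuration authoritative, so a user who deliberately sets a short timeout against a local server is not overridden.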

Related: #21525

