hermes: `session_search` summarization is untunable for slow/local backends; multiple hard-coded cost knobs cause timeouts



Summary

tools/session_search_tool.py hard-codes four cost-relevant values via module-level constants and inline literals:

  • MAX_SESSION_CHARS = 100_000 (transcript prefill per summarization call)
  • MAX_SUMMARY_TOKENS = 10_000 (generation cap per call)
  • default limit = 3 (number of sessions summarized per query)
  • max_retries = 3 inside _summarize_session() (retry count on transient/empty-content failures)

For users routing auxiliary.session_search to a local quantized model (llama.cpp, vLLM, Ollama), these defaults make session_search calls take 60–180s and frequently time out under automated workloads. After locally reducing the first two constants by 3–5×, the retry loop becomes the dominant cost; see "Symptom (after partial mitigation)" below.

The framework already exposes auxiliary.session_search.* for provider/model/timeout/concurrency — but the actual cost knobs are not surfaced. This is a config-surface gap, not a code bug.

Environment observed

  • Hermes Agent main @ 622c27e55 (also reproducible on current origin/main — no commits touch this file in the last 77).
  • Backend: llama.cpp OpenAI-compatible server (4 slots, n_ctx=131072), Qwen3.6-35B-A3B Q4_K_XL.
  • auxiliary.session_search.provider: auto (inherits main chat model), timeout: 30, max_concurrency: 3.
  • Session DB: 96 sessions / 384 messages / 8.2 MB. FTS5 indexed (messages_fts, messages_fts_trigram).

Symptom

Automated test harness running session_search with a typical query ("Search past conversations for any mention of 'Hermes Agent configuration'. Summarize what you find.") — 5 runs, 120s subprocess timeout:

Run  Result   Time
1    success  9.2s
2    timeout  120.1s
3    success  99.6s
4    timeout  120.0s
5    timeout  120.1s

60% timeout rate. Other tools against the same backend (terminal, file ops, memory save/retrieve) complete in 5–10s. FTS5 search itself is sub-millisecond — the cost is entirely in the LLM summarization step.

Symptom (after partial mitigation)

After locally patching MAX_SESSION_CHARS = 30_000 and MAX_SUMMARY_TOKENS = 2_000 (3–5× reduction in per-call work), three further test runs of five attempts each, with a raised 180s subprocess timeout, still timed out on every attempt:

Run  Successful  Timed out at 180s
#5   0/5         5/5
#6   0/5         5/5
#7   0/5         5/5

llama.cpp /metrics confirmed sustained inference activity throughout these timeouts (not a stuck/hung condition). The reduced budget made individual prefill+generate cycles much cheaper, but the 3-attempt retry loop with linear backoff now dominates: 3 attempts × 30s timeout + (1+2)s backoff ≈ 93s per failed call (the loop shown below does not sleep after the final attempt), with the parallel gather() waiting for the slowest.

The retries are especially common on local reasoning-mode models (e.g. Qwen3 with reasoning_in_content: false): extract_content_or_reasoning() returns empty when only thinking tokens were emitted, which triggers the retry path even though nothing is wrong:

# tools/session_search_tool.py ~line 230, in _summarize_session():
content = extract_content_or_reasoning(response)
if content:
    return content
# Reasoning-only / empty — let the retry loop handle it
logging.warning("Session search LLM returned empty content (attempt %d/%d)", attempt + 1, max_retries)
if attempt < max_retries - 1:
    await asyncio.sleep(1 * (attempt + 1))
    continue

For users on local reasoning models, the retry count is the single highest-impact knob once the per-call budget is right.
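
To make the retry cost concrete, here is a minimal sketch of the worst case for one failed call. It assumes every attempt runs to the configured timeout and that the backoff follows the snippet above (a sleep of 1 × attempt number, with no sleep after the final attempt). The name `worst_case_seconds` is hypothetical and not part of Hermes.

```python
def worst_case_seconds(timeout_s: float, max_retries: int, backoff_s: float = 1.0) -> float:
    """Upper bound on wall time for one call that fails on every attempt."""
    attempts = max(0, max_retries)
    # Each attempt can run for the full request timeout.
    attempt_time = timeout_s * attempts
    # Sleeps follow every attempt except the last: backoff_s * (1 + 2 + ...).
    backoff_time = backoff_s * sum(range(1, attempts))
    return attempt_time + backoff_time


if __name__ == "__main__":
    for retries in (3, 2, 1):
        print(f"max_retries={retries}: {worst_case_seconds(30, retries):.0f}s")
```

Under these assumptions a single attempt caps a failed call at one timeout (30s here), while three attempts allow roughly three times that plus backoff.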

Root cause

In tools/session_search_tool.py:

MAX_SESSION_CHARS  = 100_000   # line 28
MAX_SUMMARY_TOKENS = 10_000    # line 29
# ...inside _summarize_session() at line ~220:
max_retries = 3

And in SESSION_SEARCH_SCHEMA["parameters"]["properties"]["limit"]["default"] = 3 (line ~510), with the public session_search(limit=3, ...) default.

Per-call cost on a quantized local model:

  • Input: up to 100k chars (~25k tokens) of conversation transcript per matched session, prefilled.
  • Output: up to 10k tokens generated per session.
  • Fan-out: up to 3 parallel calls (max_concurrency default).
  • Retry amplification: up to 3 attempts per call, with linear backoff. Empty-content responses (common with reasoning-mode local models) trigger the same retry path as transient errors.

Three parallel prefills of 25k tokens + 10k-token generations saturate even a 4-slot llama.cpp serving a 35B Q4 model. Cloud frontier models hide this cost; local backends do not. Once per-call cost is reduced, the retry loop is what's left.
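
The per-query load implied by these numbers can be estimated with a few lines of arithmetic. This is a rough sketch that assumes about 4 characters per token (the same ratio behind the ~25k-token figure above); `estimate_query_budget` is a hypothetical helper, not part of the codebase.

```python
from typing import NamedTuple


class QueryBudget(NamedTuple):
    prefill_tokens: int     # transcript tokens sent across all summarized sessions
    generation_tokens: int  # maximum tokens generated across all summarized sessions


def estimate_query_budget(
    max_session_chars: int,
    max_summary_tokens: int,
    limit: int,
    chars_per_token: float = 4.0,  # assumption: rough average, varies by tokenizer
) -> QueryBudget:
    """Upper bound on the work one session_search query sends to the backend."""
    prefill_per_session = int(max_session_chars / chars_per_token)
    return QueryBudget(
        prefill_tokens=prefill_per_session * limit,
        generation_tokens=max_summary_tokens * limit,
    )


if __name__ == "__main__":
    print(estimate_query_budget(100_000, 10_000, 3))  # current defaults
    print(estimate_query_budget(30_000, 2_000, 2))    # a reduced local profile
```

With the current defaults this comes to as much as 75k prefill tokens and 30k generated tokens per query, before any retries.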

Why this is a config gap, not a bug

The framework's auxiliary.session_search block already supports:

  • provider, model, base_url, api_key
  • timeout
  • extra_body
  • max_concurrency

…which is the right mental model: per-task LLM cost controls. The four hard-coded values above logically belong in the same block but aren't surfaced. Users on slower backends currently have no remediation short of editing session_search_tool.py directly (which conflicts on every git pull).

Proposed change

Read these from auxiliary.session_search.* with the current constants as defaults, preserving today's behavior for everyone:

auxiliary:
  session_search:
    provider: auto
    model: ''
    timeout: 30
    max_concurrency: 3
    # NEW knobs, all optional:
    max_session_chars: 100000   # transcript truncation per session (default unchanged)
    max_summary_tokens: 10000   # LLM output cap per session (default unchanged)
    default_limit: 3            # default for the `limit` argument (default unchanged)
    max_retries: 3              # retries on empty-content / transient failure (default unchanged)

Suggested implementation sketch (mirrors the existing _get_session_search_max_concurrency() helper at line 32):

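def _get_session_search_int(key: str, default: int, min_v: int = 0, max_v: int | None = None) -> int:
    """Read an integer from auxiliary.session_search.<key>, clamped to [min_v, max_v]."""
    try:
        from hermes_cli.config import load_config
        cfg = load_config()
    except ImportError:
        # Config module unavailable: keep the built-in default.
        return default
    aux = (cfg.get("auxiliary") or {}).get("session_search") or {}
    raw = aux.get(key)
    if raw is None:
        # Key not set: keep today's behavior.
        return default
    try:
        v = int(raw)
    except (TypeError, ValueError):
        # Non-numeric value: ignore it rather than fail the tool call.
        return default
    v = max(min_v, v)
    if max_v is not None:
        v = min(max_v, v)
    return v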

Then in _truncate_around_matches() and _summarize_session():

max_chars = _get_session_search_int("max_session_chars", MAX_SESSION_CHARS, min_v=1000)
max_tokens = _get_session_search_int("max_summary_tokens", MAX_SUMMARY_TOKENS, min_v=128, max_v=32000)
max_retries = _get_session_search_int("max_retries", 3, min_v=0, max_v=5)

And in session_search()'s limit clamp:

default_limit = _get_session_search_int("default_limit", 3, min_v=1, max_v=5)
if not isinstance(limit, int):
    try:
        limit = int(limit)
    except (TypeError, ValueError):
        limit = default_limit
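
A fuller version of that coercion might look like the sketch below. The upper bound of 5 and the name `resolve_limit` are assumptions for illustration; the issue itself only shows the fallback to `default_limit`.

```python
def resolve_limit(limit: object, default_limit: int = 3, max_limit: int = 5) -> int:
    """Coerce a caller-supplied limit to an int and clamp it to [1, max_limit]."""
    if isinstance(limit, int) and not isinstance(limit, bool):
        value = limit
    else:
        try:
            value = int(limit)  # accepts values such as "2" or 2.0
        except (TypeError, ValueError, OverflowError):
            # None, "abc", float("inf"), ...: fall back to the configured default.
            value = default_limit
    # max_limit=5 is an assumed ceiling, not a value confirmed by the issue.
    return max(1, min(max_limit, value))


if __name__ == "__main__":
    for raw in ("2", None, 99, 0, "abc"):
        print(repr(raw), "->", resolve_limit(raw))
```

Keeping the coercion and the clamp in one function makes the fallback path easy to unit-test separately from the search itself.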

The max_retries value replaces the literal max_retries = 3 inside _summarize_session(). Allowing 0 is intentional — for local reasoning-mode backends where empty-content responses are the norm rather than the exception, users may legitimately want to disable retries entirely and surface the raw-preview fallback that already exists at line ~510 ("[Raw preview — summarization unavailable]").
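
One detail worth pinning down is what a value of 0 means in practice. The sketch below assumes the retry loop has the shape `for attempt in range(max_retries)`, which the `attempt + 1, max_retries` log arguments suggest but the issue does not show. Under that assumption, 0 makes no LLM call at all and goes straight to the fallback, while 1 makes a single attempt with no retry. `summarize` and `count_calls` are hypothetical stand-ins, not the real functions.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


async def summarize(
    call_llm: Callable[[], Awaitable[str]],
    max_retries: int,
    backoff_s: float = 0.0,  # zero here so the demo runs instantly
) -> Optional[str]:
    """Assumed loop shape: max_retries is the total number of attempts."""
    for attempt in range(max_retries):
        content = await call_llm()
        if content:
            return content
        if attempt < max_retries - 1:
            await asyncio.sleep(backoff_s * (attempt + 1))
    return None  # the caller would show the raw-preview fallback


def count_calls(max_retries: int) -> Tuple[int, Optional[str]]:
    """Run summarize() against a backend that always returns empty content."""
    calls = 0

    async def always_empty() -> str:
        nonlocal calls
        calls += 1
        return ""

    result = asyncio.run(summarize(always_empty, max_retries))
    return calls, result


if __name__ == "__main__":
    for n in (0, 1, 3):
        print(f"max_retries={n}: calls={count_calls(n)[0]}")
```

If the real loop counts retries after a first attempt instead, 0 would still make one call, so the PR would need to state which meaning applies.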

Behavior preservation

  • All defaults match the current hard-coded values — zero behavior change for users who don't set the new keys.
  • Bounds clamping prevents pathological values (same pattern as the existing max_concurrency helper, which clamps to [1, 5]).
  • No schema change required — these are additive optional keys in an existing auxiliary.* block.
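
The defaults-and-clamping behavior listed above can be checked in isolation. The sketch below is a standalone variant of the proposed helper that takes the config mapping as an argument instead of calling `load_config()`, so it runs without Hermes installed; `get_int_setting` is a hypothetical name used only for this illustration.

```python
from typing import Any, Mapping, Optional


def get_int_setting(
    cfg: Mapping[str, Any],
    key: str,
    default: int,
    min_v: int = 0,
    max_v: Optional[int] = None,
) -> int:
    """Read auxiliary.session_search.<key> from cfg, clamped to [min_v, max_v]."""
    aux = (cfg.get("auxiliary") or {}).get("session_search") or {}
    raw = aux.get(key)
    if raw is None:
        return default  # key absent: behavior unchanged
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default  # non-numeric value: ignore it
    value = max(min_v, value)
    if max_v is not None:
        value = min(max_v, value)
    return value


if __name__ == "__main__":
    cfg = {"auxiliary": {"session_search": {
        "max_retries": 9,              # above the ceiling, clamped down
        "max_session_chars": "30000",  # numeric string, accepted
        "max_summary_tokens": "lots",  # invalid, default kept
    }}}
    print(get_int_setting(cfg, "max_retries", 3, min_v=0, max_v=5))
    print(get_int_setting(cfg, "max_session_chars", 100_000, min_v=1000))
    print(get_int_setting(cfg, "max_summary_tokens", 10_000, min_v=128, max_v=32000))
    print(get_int_setting({}, "default_limit", 3, min_v=1, max_v=5))
```

Passing the mapping in keeps the clamping logic testable without patching the config loader, which may help with the unit tests mentioned at the end of this issue.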

Concrete impact for local-backend users

Two-step tuning that brings session_search reliably under typical test/automation timeouts:

  1. Budget: max_session_chars=30000, max_summary_tokens=2000, default_limit=2, which cuts per-query prefill about 5× (300k → 60k chars) and per-query generation about 7.5× (30k → 4k tokens) relative to the defaults.
  2. Retries: max_retries=1 (or 0 for reasoning-mode backends that legitimately produce empty content with the current extract_content_or_reasoning() path).

In our measurements, step 1 alone reduced per-call cost but left retry-loop overhead as the new ceiling (100% timeouts at 180s across 15 attempts). Step 2 is what eliminates the retry-amplification path. Both knobs are needed; either alone leaves a tail.

Alternatives considered

  1. Lower the constants in-tree. Cleaner for new users on slow backends, but degrades summary quality for users on frontier models who don't need tuning. The config-knob approach is opt-in.
  2. Auto-detect backend speed and adapt. Possible but a much larger change; brittle across providers.
  3. Per-call override via tool arguments. The model would have to guess good values; a profile-level config is the right scope.
  4. Treat empty-content responses as success, not retry-trigger. Would help the reasoning-mode case without a config knob, but _summarize_session() legitimately retries on transient errors too, and discriminating between "model returned reasoning-only output" and "model failed transiently" requires more nuance than a one-line behavior change. The config knob is the lower-risk path.

Happy to PR?

Yes — if the maintainers agree on the config-key names and bounds, I can put together a PR with the helper, the four call-site changes (max_session_chars, max_summary_tokens, default_limit, max_retries), and unit tests exercising each override path. Wanted to confirm the design before writing it.


Reproducer environment + measurements available on request.
