hermes - Feature: Hybrid Tool Pre-Selection (Semantic + Keyword), RAG-style schema injection to reduce token overhead without extra LLM round trips

GitHub stats
NousResearch/hermes-agent#13332 (fetched 2026-04-22 08:06:58)
Comments: 0
Participants: 1
Timeline: 5
Reactions: 0
Timeline (top): labeled ×4, renamed ×1


Problem

Every API call injects full tool schemas for ALL enabled tools (~14,000 tokens measured on a default hermes-cli setup with 30+ tools), regardless of whether those tools are relevant to the current message. Existing proposal #6839 addresses this with lazy loading (two-pass: name list first, full schema on model request), but it adds an extra LLM round trip every time a tool is needed.

Proposed Solution: Hybrid Tool Pre-Selection (Semantic + Keyword)

Before sending the user message to the LLM, run a fast hybrid search against a precomputed tool index to select only the top-K relevant tool schemas. Inject only those schemas into the prompt — single LLM call, no extra round trip.

Pure semantic search has a blind spot: if the user says "use browser_navigate" or "run terminal command", embeddings may still rank a differently worded tool above the one that was named. Pure keyword search misses intent: "check what's running" won't match the process tool on keywords alone. Hybrid retrieval catches both cases.

Flow

  1. User sends message
  2. Hermes runs hybrid search against precomputed tool index:
    • Keyword (BM25): exact tool name mentions, parameter names, direct intent words — pure Python, <1ms
    • Semantic (embeddings): fuzzy intent, synonyms, natural language — ~50ms embed query
    • Score fusion via RRF: final_score = 1/(k + semantic_rank) + 1/(k + keyword_rank)
  3. Top-K tools selected (e.g. K=8), plus a small fixed set of always-included core tools
  4. Only those schemas injected into the system prompt
  5. Single LLM call as normal — no extra round trip
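
Steps 2 and 3 (fusion, then top-K plus pinned tools) can be sketched as below. This is a minimal illustration, not Hermes code: the function name `rrf_select`, the `pinned` argument, and the toy rankings are all hypothetical. Ranks are counted from 1, and pinned tools are assumed not to consume top-K slots.

```python
from collections import defaultdict


def rrf_select(rankings, top_k, pinned=(), k=60):
    """Fuse ranked lists of tool names with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best match first.
    pinned:   tool names that are always included (hypothetical core set).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] += 1.0 / (k + rank)

    # Highest fused score first; ties are broken by name for determinism.
    fused = sorted(scores, key=lambda n: (-scores[n], n))

    # Pinned tools come first, then up to top_k additional tools.
    selected = list(pinned)
    for name in fused:
        if len(selected) >= len(pinned) + top_k:
            break
        if name not in selected:
            selected.append(name)
    return selected


# Toy rankings (assumed data): the keyword leg favours the literal name,
# the semantic leg favours intent.
keyword = ["browser_navigate", "terminal", "read_file"]
semantic = ["process", "browser_navigate", "terminal"]
print(rrf_select([keyword, semantic], top_k=2, pinned=["terminal"]))
# ['terminal', 'browser_navigate', 'process']
```

A tool that appears in both rankings (browser_navigate here) outranks one that tops only a single list, which is the behaviour the hybrid approach relies on.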

Index Storage

  • Two lightweight indexes built from the same tool name + description corpus
  • Keyword: inverted index (BM25 via rank-bm25, ~500 lines, no model needed)
  • Semantic: precomputed vectors in ~/.hermes/tool_embeddings.npz (~77KB for 50 tools × 384 dims stored as float32)
  • Both built once on startup, loaded into memory in milliseconds
  • Re-indexed automatically on checksum mismatch (tool names+descriptions change)
  • Re-index triggers: hermes tools enable/disable, MCP server added/removed, Hermes update
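
One way the checksum check could work is sketched below, using only the standard library. The helper names (`tool_checksum`, `needs_reindex`, `embedding_bytes`) and the name-to-description mapping are assumptions for illustration; the issue does not specify how the checksum is computed or stored.

```python
import hashlib
import json


def tool_checksum(tools):
    """Stable SHA-256 over tool names and descriptions.

    tools: mapping of tool name -> description (assumed shape).
    """
    # Sort keys so that registration order does not change the digest.
    payload = json.dumps(tools, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_reindex(tools, stored_checksum):
    """True when the cached index was built from a different tool corpus."""
    return tool_checksum(tools) != stored_checksum


def embedding_bytes(n_tools, dims, bytes_per_value=4):
    """Raw size of an n_tools x dims matrix (float32 by default)."""
    return n_tools * dims * bytes_per_value


tools = {"terminal": "Run a shell command", "read_file": "Read a file from disk"}
cached = tool_checksum(tools)
print(needs_reindex(tools, cached))                  # False

tools["read_file"] = "Read a text file from disk"    # description changed
print(needs_reindex(tools, cached))                  # True

print(embedding_bytes(50, 384))                      # 76800
```

The last line reproduces the ~77KB estimate above: 50 × 384 values at 4 bytes each is 76,800 bytes before any file-format overhead.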

Implementation Sketch

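from collections import defaultdict

import numpy as np

# At startup: build/load both indexes (rebuilds on checksum mismatch).
# load_or_build_indexes, embed, and registry are placeholders from this proposal;
# tool_names (row i of tool_vectors -> tool name) is an added assumption.
bm25_index, tool_vectors, tool_names = load_or_build_indexes()

# Per turn: hybrid retrieval
keyword_ranks = bm25_index.rank(user_message)        # tool names, best match first
q = embed(user_message)                              # <50ms
# argsort returns row indices, so map them back to tool names before fusing
semantic_ranks = [tool_names[i] for i in np.argsort(tool_vectors @ q)[::-1]]

# Reciprocal Rank Fusion
k = 60  # RRF constant
scores = defaultdict(float)
for ranking in (keyword_ranks, semantic_ranks):
    for rank, name in enumerate(ranking, start=1):   # 1-based ranks
        scores[name] += 1 / (k + rank)

top_k = sorted(scores, key=scores.get, reverse=True)[:K]  # K = rag_top_k
schemas = [registry.get_schema(n) for n in top_k]
# inject schemas into prompt as normal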
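
The `bm25_index.rank(...)` call above is a placeholder. To show what the keyword leg might look like, here is a small pure-Python Okapi BM25 stand-in. It is not the rank-bm25 API; the class name `TinyBM25`, the tokenizer, and the toy tool descriptions are assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics: 'browser_navigate' -> ['browser', 'navigate']."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        # docs: mapping of tool name -> "name + description" text
        self.names = list(docs)
        self.tokens = [tokenize(docs[n]) for n in self.names]
        self.tf = [Counter(t) for t in self.tokens]
        self.avgdl = sum(len(t) for t in self.tokens) / len(self.tokens)
        self.k1, self.b = k1, b
        n_docs = len(self.names)
        df = Counter(term for t in self.tokens for term in set(t))
        # Non-negative IDF variant: ln(1 + (N - df + 0.5) / (df + 0.5))
        self.idf = {t: math.log(1 + (n_docs - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query_terms, i):
        total = 0.0
        # Length normalisation term for document i.
        norm = self.k1 * (1 - self.b + self.b * len(self.tokens[i]) / self.avgdl)
        for term in query_terms:
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def rank(self, query):
        """Tool names with at least one keyword hit, best match first."""
        terms = tokenize(query)
        scored = [(self.score(terms, i), name) for i, name in enumerate(self.names)]
        return [name for s, name in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]


# Toy corpus (assumed descriptions, not the real Hermes tool text).
docs = {
    "browser_navigate": "browser_navigate Open a URL in the browser",
    "terminal": "terminal Run a shell command in the terminal",
    "process": "process List and manage background jobs",
}
index = TinyBM25(docs)
print(index.rank("use browser_navigate"))    # ['browser_navigate']
print(index.rank("check what's running"))    # []
```

The second query returns nothing because it shares no terms with any description, which is the gap the semantic leg is meant to fill.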

Config

tools:
  selection: eager      # current default — all tools always injected
  # selection: hybrid   # semantic + keyword fusion (recommended)
  # selection: semantic # embedding only
  # selection: keyword  # BM25 only
  # rag_top_k: 8
  # rag_embed_model: nomic-embed-text  # or "auxiliary" to reuse existing provider
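
Since `selection` would accept only four values, the parsed `tools:` mapping could be validated up front. The sketch below assumes the YAML has already been loaded into a dict; the class name `ToolSelectionConfig` and the defaults for `rag_top_k` and `rag_embed_model` are taken from the commented example values, not from any confirmed Hermes behaviour.

```python
from dataclasses import dataclass

VALID_MODES = ("eager", "hybrid", "semantic", "keyword")


@dataclass(frozen=True)
class ToolSelectionConfig:
    selection: str = "eager"                    # current default per the proposal
    rag_top_k: int = 8                          # assumed default
    rag_embed_model: str = "nomic-embed-text"   # assumed default

    @classmethod
    def from_dict(cls, raw):
        """Build from the parsed `tools:` mapping, rejecting unknown values."""
        known = ("selection", "rag_top_k", "rag_embed_model")
        cfg = cls(**{key: raw[key] for key in known if key in raw})
        if cfg.selection not in VALID_MODES:
            raise ValueError(
                f"tools.selection must be one of {VALID_MODES}, got {cfg.selection!r}"
            )
        if not isinstance(cfg.rag_top_k, int) or cfg.rag_top_k < 1:
            raise ValueError("tools.rag_top_k must be a positive integer")
        return cfg


cfg = ToolSelectionConfig.from_dict({"selection": "hybrid", "rag_top_k": 8})
print(cfg.selection, cfg.rag_top_k)   # hybrid 8
```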

Comparison with Existing Proposals

| | Current | #6839 Lazy Loading | This proposal (Hybrid) |
|---|---|---|---|
| Schema tokens/call | ~14,000 | ~500 + full schema on request | ~1,400 (top-8 schemas) |
| Extra LLM round trip | 0 | +1 per tool use | 0 |
| Latency penalty | 0 | ~1-2s per tool use | ~50ms embed + <1ms BM25 |
| Token savings | 0% | ~70% schema, but extra call cost | ~90% schema, no extra call |
| Handles exact tool name | yes | yes | yes (keyword leg) |
| Handles fuzzy intent | n/a | n/a | yes (semantic leg) |
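
The savings percentages follow from simple arithmetic on the token counts, sketched below. The first calculation uses the table's own figures. The second is a sanity check under an assumption the issue does not make: exactly 30 tools with schemas of equal size. The helper names are hypothetical, and pinned core tools are not counted in either estimate.

```python
def schema_tokens(total_tokens, n_tools, selected):
    """Estimate injected schema tokens if every tool schema were the same size."""
    per_tool = total_tokens / n_tools
    return per_tool * selected


def savings(total_tokens, injected_tokens):
    """Fraction of schema tokens saved relative to injecting everything."""
    return 1 - injected_tokens / total_tokens


# Figures from the table: ~14,000 tokens in total, ~1,400 for the top 8.
print(f"{savings(14_000, 1_400):.0%}")        # 90%

# Assumption: exactly 30 tools of uniform size (the issue says "30+").
uniform = schema_tokens(14_000, 30, 8)
print(round(uniform))                         # 3733
print(f"{savings(14_000, uniform):.0%}")      # 73%
```

Under the uniform-size assumption the saving is closer to 73% than 90%, so the ~1,400 estimate holds only if the selected schemas are smaller than average or the tool count is well above 30. Measuring per-tool schema sizes would settle which case applies.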

The two approaches are also composable: hybrid pre-selects likely tools, lazy loading (#6839) handles edge cases where the model needs a tool outside the top-K.

Trade-offs

  • Pro: ~90% schema token reduction with no extra LLM round trip (only ~50ms of local retrieval latency per turn)
  • Pro: No change to agent loop structure
  • Pro: No external DB — BM25 inverted index + numpy vectors, both tiny
  • Pro: Automatic re-indexing on tool set changes
  • Pro: Hybrid covers both exact and fuzzy queries, where pure semantic or pure keyword search each handles only one of them well
  • Con: Requires an embedding model (can reuse auxiliary provider already in Hermes)
  • Con: Risk of missing a needed tool if K is too small (mitigated by always including a pinned core set: terminal, read_file, search_files)

Related

  • #6839 — Lazy Tool Schema Loading (two-pass, extra round trip)
  • #11115 — Lean default tool exposure profile

extent analysis

TL;DR

Implementing a hybrid tool pre-selection approach using semantic and keyword search can significantly reduce schema tokens injected into API calls.

Guidance

  • To reduce schema tokens, combine keyword and semantic search to select the relevant tool schemas before sending the user message to the LLM.
  • Under this proposal, the selection mode would be set with the tools.selection parameter in the configuration file: eager, hybrid, semantic, or keyword.
  • The hybrid mode uses a precomputed tool index and score fusion via RRF to select the top-K relevant tool schemas; K would be set with the rag_top_k parameter.
  • The implementation requires two lightweight indexes, one keyword and one semantic, which are built or loaded at startup and rebuilt automatically on checksum mismatch.

Example

# Example configuration
tools:
  selection: hybrid
  rag_top_k: 8

Notes

  • The proposed approach requires an embedding model, which can be reused from an auxiliary provider already in Hermes.
  • There is a risk of missing a needed tool if the top-K selection is too small, which can be mitigated by always including a pinned core set of tools.

Recommendation

Hybrid tool pre-selection is a proposal, not a shipped option. If adopted, it would reduce schema tokens without the extra LLM round trip that lazy loading needs, at the cost of roughly 50ms of retrieval latency per turn.
