hermes - 💡(How to fix) Fix Feature: Hybrid Tool Pre-Selection (Semantic + Keyword) — RAG-style schema injection to reduce token overhead without extra LLM round trips [1 participants]

jack2684 · 2026-04-21T05:03:48Z

Problem

Every API call injects full tool schemas for ALL enabled tools (~14,000 tokens measured on a default hermes-cli setup with 30+ tools), regardless of whether those tools are relevant to the current message. Existing proposal #6839 addresses this with lazy loading (two-pass: name list first, full schema on model request), but it adds an extra LLM round trip every time a tool is needed.

Proposed Solution: Hybrid Tool Pre-Selection (Semantic + Keyword)

Before sending the user message to the LLM, run a fast hybrid search against a precomputed tool index to select only the top-K relevant tool schemas. Inject only those schemas into the prompt — single LLM call, no extra round trip.

Pure semantic search has a blind spot: if the user says "use browser_navigate" or "run terminal command", embeddings may still rank other tools higher, because they match on meaning rather than on the exact tool name. Pure keyword search misses intent: "check what's running" won't match the process tool on keywords alone. Hybrid retrieval catches both cases.

Flow

1. User sends a message.
2. Hermes runs a hybrid search against the precomputed tool index:
   - Keyword (BM25): exact tool name mentions, parameter names, direct intent words; pure Python, <1ms
   - Semantic (embeddings): fuzzy intent, synonyms, natural language; ~50ms to embed the query
   - Score fusion via RRF: final_score = 1/(k + semantic_rank) + 1/(k + keyword_rank)
3. Top-K tools are selected (e.g. K=8), plus a small fixed set of always-included core tools.
4. Only those schemas are injected into the system prompt.
5. A single LLM call proceeds as normal, with no extra round trip.
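The fusion and selection steps can be sketched as a small standalone function. This is a minimal illustration, not Hermes code: the function name `rrf_select`, the `pinned` parameter, and the alphabetical tie-break are assumptions made for the example. It treats the pinned core tools as additional to the top-K, as the flow describes.

```python
from collections import defaultdict


def rrf_select(keyword_ranked, semantic_ranked, top_k=8, pinned=(), k=60):
    """Fuse two best-first lists of tool names with Reciprocal Rank Fusion.

    Hypothetical helper: pinned tools are always returned first, followed by
    up to top_k further tools ordered by fused score.
    """
    scores = defaultdict(float)
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, name in enumerate(ranked, start=1):  # ranks are 1-based
            scores[name] += 1.0 / (k + rank)

    # Highest fused score first; ties break alphabetically for determinism.
    fused = sorted(scores, key=lambda name: (-scores[name], name))

    selected = list(pinned)
    for name in fused:
        if len(selected) >= len(pinned) + top_k:
            break
        if name not in selected:
            selected.append(name)
    return selected


keyword_ranked = ["browser_navigate", "terminal", "process"]
semantic_ranked = ["process", "terminal", "browser_navigate", "read_file"]
print(rrf_select(keyword_ranked, semantic_ranked, top_k=2, pinned=("read_file",)))
# ['read_file', 'browser_navigate', 'process']
```

A tool that appears in only one list still receives a score from that list, so an exact name match from the keyword leg is not lost when the semantic leg ranks it low.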

Index Storage

- Two lightweight indexes built from the same tool name + description corpus
- Keyword: inverted index (BM25 via rank-bm25, ~500 lines, no model needed)
- Semantic: precomputed vectors in ~/.hermes/tool_embeddings.npz (~77KB for 50 tools × 384 dims)
- Both built once on startup, loaded into memory in milliseconds
- Re-indexed automatically on checksum mismatch (tool names + descriptions change)
- Re-index triggers: hermes tools enable/disable, MCP server added/removed, Hermes update
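One way the checksum-based re-index check could work is sketched below. The function names and the choice of SHA-256 over a canonical JSON dump are assumptions for illustration; the proposal only states that a checksum over tool names and descriptions triggers a rebuild.

```python
import hashlib
import json


def corpus_checksum(tools):
    """Stable SHA-256 over tool names and descriptions, independent of dict order."""
    canonical = json.dumps(sorted(tools.items()), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_reindex(tools, stored_checksum):
    """True when no checksum is stored yet or the tool corpus has changed."""
    return stored_checksum != corpus_checksum(tools)


tools = {
    "terminal": "Run a shell command",
    "read_file": "Read a file from disk",
}
stored = corpus_checksum(tools)
print(needs_reindex(tools, stored))   # False: nothing changed

tools["browser_navigate"] = "Open a URL in the browser"
print(needs_reindex(tools, stored))   # True: a tool was added
```

Because the checksum covers only names and descriptions, enabling or disabling a tool or adding an MCP server changes the hash, while unrelated config changes do not force a rebuild.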

Implementation Sketch

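from collections import defaultdict

import numpy as np

# At startup: build/load both indexes (rebuilds on checksum mismatch).
# load_or_build_indexes, embed, and registry are Hermes-side helpers not shown here.
# Assumed: tool_names[i] is the tool whose embedding is row i of tool_vectors.
bm25_index, tool_vectors, tool_names = load_or_build_indexes()

# Per turn: hybrid retrieval. Both legs produce tool names, best match first.
keyword_ranked = bm25_index.rank(user_message)       # hypothetical wrapper around BM25 scoring
q = embed(user_message)                              # <50ms
semantic_ranked = [tool_names[i] for i in np.argsort(tool_vectors @ q)[::-1]]

# Reciprocal Rank Fusion
k = 60  # RRF constant
K = 8   # tools.rag_top_k
scores = defaultdict(float)
for ranked in (keyword_ranked, semantic_ranked):
    for rank, name in enumerate(ranked, start=1):    # RRF ranks are 1-based
        scores[name] += 1 / (k + rank)

top_k = sorted(scores, key=scores.get, reverse=True)[:K]
schemas = [registry.get_schema(n) for n in top_k]
# inject schemas into prompt as normal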
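The `bm25_index.rank(...)` call in the sketch above is not part of the rank-bm25 library, so some wrapper is needed. As a rough illustration of what the keyword leg does, here is a self-contained Okapi BM25 scorer in pure Python. The class name `BM25Index`, the tokenizer, and the Lucene-style IDF are assumptions for this example; rank-bm25 uses a slightly different IDF formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumerics: "browser_navigate" -> ["browser", "navigate"].
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        # docs maps tool name -> indexed text (name + description).
        self.names = list(docs)
        self.tfs = [Counter(tokenize(docs[n])) for n in self.names]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        self.k1, self.b = k1, b
        n = len(self.names)
        df = Counter(term for tf in self.tfs for term in tf)
        # Lucene-style IDF, which is never negative.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def rank(self, query):
        """Return names of tools with a non-zero score, best match first."""
        terms = tokenize(query)
        scored = []
        for name, tf, length in zip(self.names, self.tfs, self.lengths):
            norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms
                if t in tf
            )
            if score > 0:
                scored.append((score, name))
        return [name for _, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


index = BM25Index({
    "browser_navigate": "browser_navigate: open a URL in the browser",
    "terminal": "terminal: run a shell command",
    "process": "process: list and manage background processes",
})
print(index.rank("use browser_navigate"))   # ['browser_navigate']
print(index.rank("check what's running"))   # []
```

The second query returns nothing because none of its tokens occur in any tool text, which is the keyword blind spot the semantic leg is meant to cover.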

Config

tools:
  selection: eager      # current default — all tools always injected
  # selection: hybrid   # semantic + keyword fusion (recommended)
  # selection: semantic # embedding only
  # selection: keyword  # BM25 only
  # rag_top_k: 8
  # rag_embed_model: nomic-embed-text  # or "auxiliary" to reuse existing provider
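A loader for these options might validate the selection mode up front so that a typo fails at startup rather than silently falling back. The sketch below is an assumption about how that could look; `ToolSelectionConfig` and `parse_tools_config` are hypothetical names, and the defaults mirror the commented values above.

```python
from dataclasses import dataclass

VALID_MODES = ("eager", "hybrid", "semantic", "keyword")


@dataclass(frozen=True)
class ToolSelectionConfig:
    selection: str = "eager"                    # current default
    rag_top_k: int = 8
    rag_embed_model: str = "nomic-embed-text"   # or "auxiliary"

    def __post_init__(self):
        if self.selection not in VALID_MODES:
            raise ValueError(
                f"tools.selection must be one of {VALID_MODES}, got {self.selection!r}"
            )
        if not isinstance(self.rag_top_k, int) or self.rag_top_k < 1:
            raise ValueError("tools.rag_top_k must be a positive integer")


def parse_tools_config(raw):
    """Build a config from the already-parsed `tools:` mapping (e.g. from a YAML loader)."""
    return ToolSelectionConfig(**(raw or {}))


cfg = parse_tools_config({"selection": "hybrid", "rag_top_k": 8})
print(cfg.selection, cfg.rag_top_k)   # hybrid 8
```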

Comparison with Existing Proposals

| Metric | Current | #6839 Lazy Loading | This proposal (Hybrid) |
|---|---|---|---|
| Schema tokens/call | ~14,000 | ~500 + full schema on request | ~1,400 (top-8 schemas) |
| Extra LLM round trip | 0 | +1 per tool use | 0 |
| Latency penalty | 0 | ~1-2s per tool use | ~50ms embed + <1ms BM25 |
| Token savings | 0% | ~70% schema, but extra call cost | ~90% schema, no extra call |
| Handles exact tool name | yes | yes | yes (keyword leg) |
| Handles fuzzy intent | n/a | n/a | yes (semantic leg) |
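The ~90% figure follows directly from the two token counts in the table. How many tokens the top-8 schemas actually cost depends on the size of the selected schemas, so it is worth measuring on a real setup. The sketch below shows the arithmetic; the helper names are hypothetical, and `uniform_estimate` assumes exactly 30 equally sized schemas purely for illustration.

```python
def savings_fraction(full_tokens, selected_tokens):
    """Fraction of schema tokens saved relative to injecting every schema."""
    return 1 - selected_tokens / full_tokens


def uniform_estimate(full_tokens, n_tools, n_selected):
    """Tokens injected if every schema were the same size (a simplifying assumption)."""
    return full_tokens / n_tools * n_selected


print(f"{savings_fraction(14_000, 1_400):.0%}")   # 90%
print(round(uniform_estimate(14_000, 30, 8)))     # 3733
```

Under the equal-size assumption, eight of 30 schemas would cost about 3,700 tokens, so the ~1,400 estimate implies either more than 30 tools in the measured setup or selected schemas that are smaller than average. Any pinned core tools add to the total as well.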

The two approaches are also composable: hybrid pre-selects likely tools, lazy loading (#6839) handles edge cases where the model needs a tool outside the top-K.

Trade-offs

- Pro: ~90% schema token reduction with no extra LLM round trip (retrieval adds roughly 50ms per turn)
- Pro: No change to agent loop structure
- Pro: No external DB; the BM25 inverted index and numpy vectors are both tiny
- Pro: Automatic re-indexing on tool set changes
- Pro: Hybrid is expected to beat pure semantic or pure keyword retrieval across exact and fuzzy queries
- Con: Requires an embedding model (can reuse the auxiliary provider already in Hermes)
- Con: Risk of missing a needed tool if K is too small (mitigated by always including a pinned core set: terminal, read_file, search_files)

Related

- #6839: Lazy Tool Schema Loading (two-pass, extra round trip)
- #11115: Lean default tool exposure profile

extent analysis

TL;DR

Implementing a hybrid tool pre-selection approach using semantic and keyword search can significantly reduce schema tokens injected into API calls.

Guidance

To reduce schema tokens, consider implementing a hybrid search approach that combines keyword and semantic search to select relevant tool schemas before sending the user message to the LLM.
The proposed hybrid approach can be configured using the tools.selection parameter in the configuration file, with options for eager, hybrid, semantic, or keyword selection.
The hybrid approach uses a precomputed tool index and score fusion via RRF to select top-K relevant tool schemas, which can be adjusted using the rag_top_k parameter.
The implementation requires building and loading two lightweight indexes, a keyword index and a semantic index, which can be done at startup and re-indexed automatically on checksum mismatch.

Example

# Example configuration
tools:
  selection: hybrid
  rag_top_k: 8

Notes

The proposed approach requires an embedding model, which can be reused from an auxiliary provider already in Hermes.
There is a risk of missing a needed tool if the top-K selection is too small, which can be mitigated by always including a pinned core set of tools.

Recommendation

Adopt hybrid tool pre-selection to cut schema tokens: it trades roughly 50ms of retrieval latency per turn for about a 90% reduction in schema tokens, with no extra LLM round trip.
