hermes - RFC: On-demand tool/skill/MCP discovery (decouple schema registration from process lifecycle)

Problem

Currently, Hermes eagerly loads everything at startup:

  • All built-in tools — tool schemas for all 26 CLI tools are baked into the system prompt (~12K tokens)
  • All MCP servers — 7+ servers are started as subprocesses and kept alive for the session duration, each contributing tool schemas (77 tools, ~25K tokens)
  • All installed skills — 24 skill directories are scanned and their content is available for injection

This means a first API call carries ~37K tokens of tool definitions alone, before any actual conversation context. With models like deepseek-v4-flash, processing this many input tokens adds 5-10 seconds to every first response, even for trivial queries.

The real cost: in a typical session, the LLM uses 3-5 tools out of 103 registered, and 1-2 skills out of 24 available. The other 95% of schema content is dead weight on every LLM call.
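The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the numbers quoted in this section; the variable names are illustrative, and `tools_used` takes the upper end of the 3-5 estimate.

```python
# Figures quoted in the Problem section above.
builtin_tools, builtin_tokens = 26, 12_000
mcp_tools, mcp_tokens = 77, 25_000

total_tools = builtin_tools + mcp_tools      # 103 registered tools
total_tokens = builtin_tokens + mcp_tokens   # ~37K tokens of tool schemas

tools_used = 5  # assumption: upper end of the "3-5 tools" estimate
unused_fraction = 1 - tools_used / total_tools

print(f"{total_tools} tools, ~{total_tokens // 1000}K schema tokens")
print(f"unused tool schemas: {unused_fraction:.0%}")
```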

Root Cause

The architecture tightly couples two concerns that should be separate:

  1. Schema registration — telling the LLM "these capabilities exist and here's how to call them"
  2. Process lifecycle — keeping server processes/connections alive

These are conflated for MCP servers (start → connect → discover → keep-alive), and for built-in tools/skills they're conflated via the system prompt (register all schemas upfront because there's no on-demand discovery mechanism).

Proposal: Capability Registry with On-Demand Loading

Core idea

Replace the current "register everything at startup" model with a two-phase discovery pattern:

Phase 1: Index (startup, fast):

```
For each MCP server:
  connect → tools/list → cache schema → disconnect (or keep minimal heartbeat)
  register tool NAME + DESCRIPTION in a lightweight index

For built-in tools:
  register tool NAME + DESCRIPTION in the same index

For skills:
  register skill NAME + DESCRIPTION in an index
```

The LLM gets a compact index (~200 tokens) of available capabilities — just names and one-line descriptions, not full JSON schemas. It can then request the full schema for specific tools it decides to use.
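As a rough illustration of the index phase, the sketch below reduces full OpenAI-format tool schemas to name plus first description line and renders them as compact text. `IndexEntry`, `build_index`, and `render_index` are hypothetical names, not existing Hermes APIs, and the example description text is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One line of the capability index: identity only, no parameter schema."""
    kind: str         # "tool", "mcp", or "skill" (hypothetical categories)
    name: str
    description: str


def build_index(schemas: list[dict], kind: str) -> list[IndexEntry]:
    """Reduce OpenAI-format tool schemas to name + first description line."""
    entries = []
    for schema in schemas:
        fn = schema["function"]
        description = fn.get("description", "").strip()
        first_line = description.splitlines()[0] if description else ""
        entries.append(IndexEntry(kind, fn["name"], first_line))
    return entries


def render_index(entries: list[IndexEntry]) -> str:
    """Render the index as one short line per capability, sorted by name."""
    ordered = sorted(entries, key=lambda e: e.name)
    return "\n".join(f"- {e.name}: {e.description}" for e in ordered)


# Placeholder schema; the description text is invented for the example.
example_schemas = [{
    "type": "function",
    "function": {
        "name": "mcp_zhihu_search_content",
        "description": "Search Zhihu content.\nReturns matching posts.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
        },
    },
}]
example_index = build_index(example_schemas, "mcp")
print(render_index(example_index))
```

The full `parameters` object never reaches the rendered index, which is where the token savings would come from.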

Phase 2: On-demand (when the LLM decides to use a capability):

```
LLM:   calls tool_get_schema("mcp_zhihu_search_content")
Agent: returns full JSON schema for that tool
       then if the LLM calls it, starts zhihu MCP server → forwards the tool call

LLM:   calls skill_get_content("github-code-review")
Agent: reads SKILL.md content and injects it into conversation context
```
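The skill half of this flow could look roughly like the sketch below. It is named `skill_get_content` to match the Migration Path, and it assumes one directory per skill containing a `SKILL.md` file; the `skills_root` parameter and the error types are assumptions for illustration.

```python
from pathlib import Path


def skill_get_content(skills_root: Path, skill_name: str) -> str:
    """Return the SKILL.md text for one skill, read only when requested."""
    # Reject names that could escape the skills directory (e.g. "../secrets").
    if skill_name in {"", ".", ".."} or Path(skill_name).name != skill_name:
        raise ValueError(f"invalid skill name: {skill_name!r}")
    skill_file = skills_root / skill_name / "SKILL.md"
    if not skill_file.is_file():
        raise KeyError(f"unknown skill: {skill_name}")
    return skill_file.read_text(encoding="utf-8")
```

Nothing is read from disk until the LLM asks for a specific skill, so unused skills cost only their index line.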

Benefits

| Aspect | Current | Proposed |
| --- | --- | --- |
| Startup time | ~12s (20s before zhihu fix) | ~2s (only index, no MCP process start) |
| Input tokens on first call | ~37K tool schemas | ~200 (compact index) |
| MCP server processes | 7 running all session | 0-1 running at a time (on-demand) |
| Session memory pressure | 100+ tool schemas in context | Only schemas for tools LLM actually uses |
| Skill loading | All scanned at startup, injected when loaded | Index only, loaded on `skill_get_content` call |

Design Sketch

New built-in tool: `tool_get_schema(tool_name: str)`

Returns the full OpenAI-format JSON schema for a specific tool by name. The LLM calls this when a tool in the index looks promising and it needs the full parameter list.

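On-demand MCP server start:

```python
# _servers, _connect_server, and config are module-level names in the
# surrounding code (e.g. tools/mcp_tool.py).
async def handle_mcp_tool_call(server_name, tool_name, args):
    server = _servers.get(server_name)
    if server is None:
        # Cold start: connect on first use, then reuse the connection
        server = await _connect_server(server_name, config)
        _servers[server_name] = server
    return await server.session.call_tool(tool_name, arguments=args)
```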

Concerns & Mitigations

| Concern | Mitigation |
| --- | --- |
| LLM might not discover tools it needs | LLMs are good at asking; the index makes ALL tools visible by name+description |
| Cold-start latency on first MCP tool call | Acceptable trade-off: happens once per MCP server per session, vs every startup |
| LLM might call `tool_get_schema` too often | Cache schema in context after first fetch |
| Stateful MCP servers (memory, databases) | Always-on flag per-server config: `keepalive: true` |
| Prompt cache invalidation | Dynamic tools are injected mid-session, same as existing `/reload-mcp` pattern |
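To make the stateful-server mitigation concrete, here is one possible shutdown policy that honors the per-server flag. The idle timeout is an assumption (this RFC does not specify when on-demand servers are stopped), and `ServerState`, `servers_to_stop`, and the example server names other than zhihu are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    name: str
    keepalive: bool    # mirrors the per-server `keepalive: true` flag
    last_used: float   # timestamp (seconds) of the last tool call


def servers_to_stop(states: list[ServerState], now: float,
                    idle_timeout: float) -> list[str]:
    """Names of running servers idle past the timeout and not pinned."""
    return [
        s.name
        for s in states
        if not s.keepalive and now - s.last_used >= idle_timeout
    ]


example_states = [
    ServerState("zhihu", keepalive=False, last_used=0.0),
    ServerState("memory", keepalive=True, last_used=0.0),
    ServerState("github", keepalive=False, last_used=290.0),
]
print(servers_to_stop(example_states, now=300.0, idle_timeout=60.0))
```

Servers marked keepalive are never returned, so stateful servers keep their in-process state for the whole session.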

Migration Path

  1. Phase 0 (now): Add `tool_get_schema` built-in tool alongside existing tools
  2. Phase 1: MCP servers start in "probe" mode by default (connect → index → disconnect), controllable via `mcp_servers.<name>.mode: probe` (default) vs `keepalive`
  3. Phase 2: Build tool index into the system prompt instead of full schemas
  4. Phase 3: Extend the same pattern to skills (index in system prompt, load on `skill_get_content`)

Each phase is independently shippable and valuable on its own.
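For the Phase 1 config switch, mode resolution might be sketched as follows, assuming the config has already been parsed into a nested dict. `resolve_mode` and `VALID_MODES` are hypothetical names, and the `memory` server entry is only an example.

```python
VALID_MODES = {"probe", "keepalive"}


def resolve_mode(config: dict, server_name: str) -> str:
    """Return the startup mode for one MCP server, defaulting to "probe"."""
    server_cfg = config.get("mcp_servers", {}).get(server_name, {})
    mode = server_cfg.get("mode", "probe")
    if mode not in VALID_MODES:
        raise ValueError(f"{server_name}: unknown mode {mode!r}")
    return mode


# Example parsed config: zhihu uses the default, memory is pinned.
example_config = {
    "mcp_servers": {
        "zhihu": {},
        "memory": {"mode": "keepalive"},
    }
}
print(resolve_mode(example_config, "zhihu"), resolve_mode(example_config, "memory"))
```

Defaulting to `probe` means existing configs pick up the new behavior without edits, while stateful servers opt out explicitly.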

Related

  • Existing issue #30754 (skill progressive disclosure — complementary; reduces skill token waste)
  • Slash command `/reload-mcp` (existing hot-reload mechanism)
  • `tools/mcp_tool.py` — current eager-connect implementation
