hermes - 💡(How to fix) Fix Model Presets: per-turn expert-on-demand model escalation [3 comments, 2 participants]

NousResearch/hermes-agent#20249 (fetched 2026-05-06 06:37:52)

Problem

Right now, Hermes runs one model for the entire session. If you're on a cheap/fast model (Gemini Flash, DeepSeek V4 Flash), and a turn comes up that genuinely needs Opus-level reasoning — designing a complex system, debugging a multi-module race condition, or reasoning about a tricky crypto protocol — you have two bad options:

  1. /model to a bigger model, take the hit on all subsequent turns (waste of tokens and money), then /model back manually.
  2. Power through on the weak model, get a mediocre answer, then re-prompt with /retry.

Neither is good. What we need: the ability to temporarily escalate to an expert model for a single turn, then snap back to the default.

Proposed Solution: Model Presets

A new config section model_presets that defines named provider+model combinations, plus a lightweight mechanism to invoke them for one turn.

model_presets:
  expert:
    provider: openrouter
    model: anthropic/claude-opus-4-6
    reasoning_effort: high
  deepthink:
    provider: deepseek
    model: deepseek-v4-pro
    reasoning_effort: high
  cheap:
    provider: openrouter
    model: google/gemini-3-flash-preview
    reasoning_effort: none
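
As a rough illustration of how this section might be consumed once the YAML has been parsed into a dict, here is a minimal validation sketch. `ModelPreset` and `load_presets` are hypothetical names, not existing Hermes code; the accepted keys follow the field list this issue proposes for each preset (provider, model, and optional reasoning_effort, base_url, api_key).

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional

# Keys the proposal lists for a preset; provider and model are treated as required.
ALLOWED_KEYS = {"provider", "model", "reasoning_effort", "base_url", "api_key"}


@dataclass(frozen=True)
class ModelPreset:  # hypothetical container, not an existing Hermes class
    name: str
    provider: str
    model: str
    reasoning_effort: Optional[str] = None
    base_url: Optional[str] = None
    api_key: Optional[str] = None


def load_presets(config: Mapping[str, Any]) -> Dict[str, ModelPreset]:
    """Validate config['model_presets'] and return presets keyed by name."""
    presets: Dict[str, ModelPreset] = {}
    for name, raw in (config.get("model_presets") or {}).items():
        unknown = set(raw) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"preset {name!r}: unknown keys {sorted(unknown)}")
        missing = {"provider", "model"} - set(raw)
        if missing:
            raise ValueError(f"preset {name!r}: missing {sorted(missing)}")
        presets[name] = ModelPreset(name=name, **raw)
    return presets


# Same shape as the YAML above, already parsed into a dict.
config = {
    "model_presets": {
        "expert": {
            "provider": "openrouter",
            "model": "anthropic/claude-opus-4-6",
            "reasoning_effort": "high",
        },
        "cheap": {
            "provider": "openrouter",
            "model": "google/gemini-3-flash-preview",
            "reasoning_effort": "none",
        },
    }
}
presets = load_presets(config)
```

Rejecting unknown keys up front is a design choice made here for illustration; a real implementation might prefer to warn and ignore them.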

Interaction modes

1. Explicit: /expert (or /preset deepthink)

Before the next turn, you type /expert and the agent switches to the expert preset for that single call. After the model responds, the runtime swaps back to the default. If no preset is specified, model_presets.expert is the default target. /preset <name> lets you pick any named preset.

2. Implicit: auto-detect complexity

The agent loop sniffs signals that suggest this turn needs more horsepower:

  • User asks for a design decision, architecture review, or systems-level refactor
  • The previous assistant response was long (suggests deep multi-step reasoning in progress)
  • The user explicitly says "think carefully", "analyze deeply", "design", "propose architecture"
  • The task touches multiple files across different packages (cross-context reasoning signal)

When confidence is high enough, the loop transparently routes the turn through the expert preset before falling back. The user shouldn't need to know the mechanism — the agent just "gives it the good model" for the hard parts.
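
The signals above could be combined into a score in many ways. The sketch below is one possibility, using a regex for the explicit phrases plus two numeric heuristics. The function names, the point weights, the 4000-character cutoff, and the threshold are all illustrative assumptions, not values from Hermes.

```python
import re

# Phrases drawn from the signal list above; everything numeric below is a guess.
CUE_PATTERN = re.compile(
    r"\b(think carefully|analyze deeply|design|propose architecture"
    r"|architecture review|refactor)\b",
    re.IGNORECASE,
)


def escalation_score(user_message, prev_assistant_chars=0, packages_touched=0):
    """Return a 0-100 score; higher means the turn looks more demanding."""
    score = 0
    if CUE_PATTERN.search(user_message):
        score += 60  # explicit request for careful or architectural work
    if prev_assistant_chars > 4000:
        score += 20  # previous assistant response was long
    if packages_touched >= 2:
        score += 20  # task spans more than one package
    return score


def should_escalate(user_message, prev_assistant_chars=0, packages_touched=0,
                    threshold=60):
    """Never raise: on any classifier error, stay on the default model."""
    try:
        score = escalation_score(user_message, prev_assistant_chars, packages_touched)
        return score >= threshold
    except Exception:
        return False
```

With these weights the two weaker signals together (40 points) do not trigger escalation on their own, which keeps false positives low at the cost of missing some hard turns.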

Slash command argument

  • /expert: one-shot escalate to model_presets.expert
  • /expert deepseek-v4-pro: one-shot escalate with an inline model override
  • /preset deepthink: one-shot escalate to model_presets.deepthink
  • /preset list: show available presets with their models
  • /preset current status: shows the active preset, if any
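
A minimal sketch of how these forms might be parsed into a structured request follows. `PresetRequest` and `parse_preset_command` are hypothetical names and do not reflect the real CommandDef machinery. Two readings are assumptions on my part: a bare /preset falls back to the expert preset, and /preset current is treated as the status query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PresetRequest:  # hypothetical result type
    action: str  # "escalate", "list", or "status"
    preset: Optional[str] = None
    model_override: Optional[str] = None


def parse_preset_command(line: str) -> Optional[PresetRequest]:
    """Parse /expert and /preset lines; return None for anything else."""
    parts = line.strip().split()
    if not parts:
        return None
    command, args = parts[0], parts[1:]
    if command == "/expert":
        # Optional first argument is an inline model override.
        return PresetRequest("escalate", "expert", args[0] if args else None)
    if command == "/preset":
        if not args:
            return PresetRequest("escalate", "expert")  # assumed default target
        if args[0] == "list":
            return PresetRequest("list")
        if args[0] == "current":
            return PresetRequest("status")  # assumed spelling of the status query
        return PresetRequest("escalate", args[0])
    return None
```

One consequence of this layout is that presets named "list" or "current" could not be selected through /preset, so a real implementation would probably reserve those names.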

Implementation sketch

AIAgent._call_llm() in run_agent.py (line ~8386) currently takes reasoning_config, max_tokens, etc. The runtime carries self.model, self.provider, self.base_url, and self.api_mode, and _switch_session_model() (line ~2290) already exists for swapping them.

The minimal change:

  1. Config: Add model_presets dict to DEFAULT_CONFIG in hermes_cli/config.py. Each key = name, each value = {provider, model, reasoning_effort?, base_url?, api_key?}.

  2. Slash command: Add CommandDef("expert", ...) and CommandDef("preset", ...) to hermes_cli/commands.py. Both set a one-shot flag on the AIAgent (self._one_shot_preset or similar).

  3. Turn dispatch in run_conversation(): Before _call_llm(), check if _one_shot_preset is set. If so:

    • Snapshot the current runtime (self.model, self.provider, etc.)
    • Load the preset's model (via _switch_session_model() or a lighter-weight override that avoids rebuilding prompt cache if the provider is the same)
    • Call LLM
    • After response arrives, restore the snapshot
  4. Auto-detect (optional Phase 2): A lightweight classifier — could be as simple as a regex pattern match on the user's message + prior context length heuristic, or a cheap model call (Gemini Flash costs pennies) that judges "does this need the big model?". The classifier runs before the LLM call in the main loop, returns a confidence score, and if above threshold, triggers the preset. This should be tunable/non-blocking — if the classifier fails, the default model runs, no harm.
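
The snapshot/restore in step 3 can be sketched as a context manager. This is a self-contained illustration against a stand-in class, not the real AIAgent: `AgentStub`, `call_llm`, `one_shot_runtime`, and `RUNTIME_FIELDS` are hypothetical, and the stub assigns attributes directly where a real implementation would presumably go through _switch_session_model() when the provider changes.

```python
from contextlib import contextmanager
from typing import Iterator, Optional

# Runtime attributes the issue says the agent carries.
RUNTIME_FIELDS = ("model", "provider", "base_url", "api_mode")


class AgentStub:
    """Stand-in for AIAgent with only the fields this sketch needs."""

    def __init__(self) -> None:
        self.model = "google/gemini-3-flash-preview"
        self.provider = "openrouter"
        self.base_url = None
        self.api_mode = None
        self._one_shot_preset: Optional[dict] = None

    def call_llm(self, prompt: str) -> str:
        # Placeholder for the real LLM call; reports which runtime served it.
        return f"[{self.provider}:{self.model}] {prompt}"


@contextmanager
def one_shot_runtime(agent: AgentStub) -> Iterator[None]:
    """Apply a pending one-shot preset for one block, then restore."""
    # Consume the flag first so it can never leak into a later turn.
    preset, agent._one_shot_preset = agent._one_shot_preset, None
    if preset is None:
        yield
        return
    snapshot = {field: getattr(agent, field) for field in RUNTIME_FIELDS}
    try:
        for field in RUNTIME_FIELDS:
            if field in preset:
                setattr(agent, field, preset[field])
        yield
    finally:
        # Restore even if the LLM call raised.
        for field, value in snapshot.items():
            setattr(agent, field, value)


agent = AgentStub()
agent._one_shot_preset = {"provider": "openrouter", "model": "anthropic/claude-opus-4-6"}
with one_shot_runtime(agent):
    expert_reply = agent.call_llm("design the sync layer")
default_reply = agent.call_llm("rename this variable")
```

Putting the restore in a `finally` block means a failed expert call still leaves the session on its default model for the next turn.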

Edge cases

  • Nesting: If /expert is invoked and the call itself makes tool calls, all tool calls in that turn also run under the expert model (because the entire turn is one LLM loop). Only the next user turn restores the default.
  • Concurrent presets: last one wins.
  • Provider mismatch: If the expert preset uses a different provider than the default (e.g. Flash via OpenRouter → Opus via Anthropic native), the API mode might differ. _switch_session_model() already handles this for the /model command — reuse that path.
  • Fallback interaction: If the expert model itself fails, the existing fallback chain should kick in (for that turn), then restore to default on next turn.
  • Subagents: /expert applies only to the parent agent's turn. Children via delegate_task should not inherit the one-shot preset unless explicitly configured — they have their own delegation.model config.
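
Three of these rules (last one wins, one model for the whole turn including tool rounds, and no inheritance by subagents) can be pinned down in a few lines. The sketch below is a toy model of the intended behaviour only. `PresetSlot`, `run_turn`, and `model_for_child` are hypothetical names, and falling back to the default model for an unknown preset name is my assumption.

```python
from typing import Dict, List, Optional


class PresetSlot:
    """Holds at most one pending one-shot preset; a later request overwrites it."""

    def __init__(self) -> None:
        self._pending: Optional[str] = None

    def request(self, name: str) -> None:
        self._pending = name  # last one wins

    def take(self) -> Optional[str]:
        name, self._pending = self._pending, None
        return name


def run_turn(slot: PresetSlot, default_model: str,
             presets: Dict[str, str], tool_rounds: int) -> List[str]:
    """Return the model used by each LLM call within one user turn."""
    name = slot.take()  # resolved once per turn, then cleared
    model = presets.get(name, default_model) if name else default_model
    # The first call plus one follow-up call per tool round share the model.
    return [model] * (tool_rounds + 1)


def model_for_child(delegation_model: str, parent_turn_model: str) -> str:
    """Subagents use delegation.model; the parent's one-shot is ignored."""
    return delegation_model


presets = {"expert": "anthropic/claude-opus-4-6", "deepthink": "deepseek-v4-pro"}
slot = PresetSlot()
slot.request("expert")
slot.request("deepthink")  # overrides the earlier request
first_turn = run_turn(slot, "google/gemini-3-flash-preview", presets, tool_rounds=2)
second_turn = run_turn(slot, "google/gemini-3-flash-preview", presets, tool_rounds=0)
```

Here the first turn makes three calls, all on the deepthink model, and the second turn is back on the default without any explicit reset.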

What this is NOT

  • Not a replacement for the fallback chain (which is reactive, for failures)
  • Not a replacement for delegation.model (which is per-subagent, not per-turn)
  • Not a full routing layer — it's a simple snapshot/restore around a single LLM call
  • Not automatic model selection on every turn (that's a separate, much harder problem — this is the pragmatic first step)

Related code references

  • run_agent.py:AIAgent.__init__() line ~1211: self.reasoning_config
  • run_agent.py:AIAgent._switch_session_model() line ~2290: runtime model swap
  • run_agent.py:AIAgent._try_activate_fallback() line ~7750: per-turn fallback (useful pattern for restore logic)
  • run_agent.py:AIAgent._restore_primary_runtime() line ~7772: existing per-turn restoration
  • hermes_cli/commands.py: slash command registry (add CommandDef entries)
  • hermes_cli/config.py:DEFAULT_CONFIG line ~386: where new config keys go
  • agent/models_dev.py: ModelInfo.reasoning flag already exists for metadata
  • agent/lmstudio_reasoning.py: reasoning-effort resolution pattern to reuse
  • cli-config.yaml.example: document the new section

Alternatives considered

  • Two persistent profiles and switch between them: Profiles are for fully isolated Hermes instances (config, skills, sessions). Overkill for a single-turn model swap.
  • delegate_task to the big model: That spawns a subagent with no conversation history. The point of /expert is to bring the full context to the powerful model.
  • Just run Opus all the time: Expensive and slow for the 90% of turns that don't need it.
  • Router middleware: Too heavy for v1. A simple snapshot/restore with a regex-based auto-detector is a 200-line change. Router can be built on top later.

extent analysis

TL;DR

Implement a model_presets config section and add slash commands to temporarily escalate to an expert model for a single turn.

Guidance

  1. Add model_presets to config: Define named provider+model combinations in a new model_presets section in hermes_cli/config.py.
  2. Implement slash commands: Add CommandDef entries for /expert and /preset in hermes_cli/commands.py to set a one-shot flag on the AIAgent.
  3. Modify turn dispatch: Check for the one-shot flag in run_conversation() and snapshot the current runtime before loading the preset's model and calling the LLM.
  4. Restore default model: After the response arrives, restore the snapshot to revert to the default model.

Example

# hermes_cli/config.py
DEFAULT_CONFIG = {
    # ...
    'model_presets': {
        'expert': {
            'provider': 'openrouter',
            'model': 'anthropic/claude-opus-4-6',
            'reasoning_effort': 'high'
        }
    }
}

# hermes_cli/commands.py
CommandDef('expert', ...)

# run_agent.py
class AIAgent:
    def run_conversation(self):
        # ...
        if self._one_shot_preset:
            # Snapshot current runtime
            # Load preset's model
            # Call LLM
            # Restore default model after response
            self._one_shot_preset = None  # clear the flag so it applies to one turn only
        # ...

Notes

This implementation assumes that the AIAgent class has the necessary methods for switching models and restoring the runtime. The auto-detect feature for complexity can be added in a separate phase.

Recommendation

Apply the workaround by implementing the model_presets config section and slash commands to temporarily escalate to an expert model for a single turn. This approach provides a flexible and efficient solution for handling complex tasks without incurring the cost of
