hermes - 💡(How to fix) Fix [Feature]: Add fallback routing from local open-source models to closed-source models after repeated failures [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#15176Fetched 2026-04-25 06:24:02
View on GitHub
Comments
0
Participants
1
Timeline
4
Reactions
0
Participants
Timeline (top)
labeled ×4

Root Cause

Because of this, many users prefer to use locally hosted open-source models first, such as models served by Ollama, vLLM, LM Studio, or other local inference backends. This helps reduce API costs and makes Hermes Agent more affordable for daily use.

RAW_BUFFERClick to expand / collapse

Problem or Use Case

Hermes Agent can consume a large number of tokens during task execution, especially in multi-step tasks, long conversations, retries, or agentic workflows.

Because of this, many users prefer to use locally hosted open-source models first, such as models served by Ollama, vLLM, LM Studio, or other local inference backends. This helps reduce API costs and makes Hermes Agent more affordable for daily use.

However, due to hardware limitations, local open-source models may not always be capable enough to complete complex tasks reliably. Smaller local models may misunderstand instructions, get stuck in loops, produce invalid outputs, fail tool calls repeatedly, or stop making progress after several rounds of iteration.

In these cases, users often want to escalate the task to a stronger closed-source model such as Google Gemini, OpenAI, or Claude. The goal is to use local models whenever possible, but only spend paid API tokens when the local model is clearly unable to finish the task.

This would help maximize token efficiency and avoid unnecessary spending.

Proposed Solution

Add a configurable fallback or escalation mechanism for model routing.

Hermes Agent could start with a local/open-source model as the primary model, and automatically switch to a stronger cloud model after repeated failures or lack of progress.

Example configuration:

model_routing: primary: provider: ollama model: qwen3.6:9b

fallback: provider: openai model: gpt-5.4

fallback_policy: enabled: true max_local_attempts: 3 trigger_on: - repeated_tool_failure - invalid_output - no_progress - task_timeout - user_request

Expected behavior:

Hermes Agent first attempts to complete the task using the configured local/open-source model. If the local model succeeds, no paid cloud model tokens are used. If the local model fails repeatedly, produces invalid outputs, gets stuck, or cannot make progress after several iterations, Hermes Agent automatically escalates to the configured fallback model. The fallback model should continue from the current task state, using the existing conversation context, tool results, and intermediate progress. Logs should clearly indicate when fallback happened and why.

Useful configuration options could include:

Primary model provider and model name Fallback model provider and model name Maximum local retry attempts before fallback Automatic fallback triggers Manual fallback trigger Per-task-type fallback rules Optional cost limits or confirmation before using paid models Clear logging for fallback decisions

This would allow users to balance cost, performance, and reliability more effectively.

Alternatives Considered

  1. Always use a closed-source model

Users could configure Hermes Agent to always use OpenAI, Claude, or Gemini.

This provides better reliability, but it can be expensive because every task consumes paid API tokens, even simple tasks that could have been handled locally.

  1. Always use a local open-source model

Users could configure Hermes Agent to only use local models.

This minimizes API cost, but complex tasks may fail, loop, or produce poor results depending on the user’s hardware and model size.

  1. Manually switch models when a task fails

Users could manually restart or reconfigure Hermes Agent with a stronger model when the local model fails.

This works, but it is inconvenient. It may also lose task state, conversation context, intermediate tool results, or debugging progress.

  1. Use a larger local model

Users could run a larger local model to improve reliability.

However, many users are limited by GPU VRAM, RAM, CPU performance, or inference speed. Running a larger model is not always practical.

  1. Reduce context length or simplify prompts

Users could reduce token usage by shortening prompts or splitting tasks manually.

This helps in some cases, but it does not solve the core issue: local models may still fail on complex tasks, while using paid cloud models from the beginning may waste money.

Summary

A configurable fallback mechanism would let Hermes Agent use cheap local inference first and only escalate to stronger paid models when necessary.

This would improve cost efficiency, reduce wasted tokens, and make Hermes Agent more practical for users who rely on local open-source models but still need reliable completion for difficult tasks.

Feature Type

New tool

Scope

None

Contribution

  • I'd like to implement this myself and submit a PR

Debug Report (optional)

extent analysis

TL;DR

Implement a configurable fallback mechanism to automatically switch from a local open-source model to a stronger cloud model after repeated failures or lack of progress.

Guidance

  • Introduce a model_routing configuration with primary and fallback models, allowing users to specify local and cloud models.
  • Develop a fallback_policy with triggers such as repeated_tool_failure, invalid_output, and no_progress to determine when to escalate to the fallback model.
  • Implement logic to continue the task from the current state when switching to the fallback model, preserving conversation context and intermediate progress.
  • Add logging to clearly indicate when fallback occurs and why, providing transparency into the decision-making process.

Example

model_routing:
  primary:
    provider: ollama
    model: qwen3.6:9b
  fallback:
    provider: openai
    model: gpt-5.4
  fallback_policy:
    enabled: true
    max_local_attempts: 3
    trigger_on:
      - repeated_tool_failure
      - invalid_output
      - no_progress

Notes

The proposed solution aims to balance cost efficiency and reliability by leveraging local models for simple tasks and escalating to cloud models when necessary. However, the implementation details, such as the specific triggers and fallback logic, may require further refinement based on user feedback and testing.

Recommendation

Apply a workaround by implementing a basic fallback mechanism, allowing users to configure primary and fallback models, and triggering escalation based on repeated failures or lack of progress. This will provide a foundation for further development and refinement of the feature.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix [Feature]: Add fallback routing from local open-source models to closed-source models after repeated failures [1 participants]