hermes - ✅(Solved) Fix Auxiliary client cache routes wrong model when multiple tasks share provider/base_url [1 pull requests, 2 comments, 3 participants]

GitHub stats
NousResearch/hermes-agent#16387 · Fetched 2026-04-28 06:53:43
Comments: 2
Participants: 3
Timeline: 9
Reactions: 0
Timeline (top): labeled ×4, commented ×2, cross-referenced ×2, unsubscribed ×1

agent/auxiliary_client.py caches auxiliary clients keyed by (provider, async_mode, base_url, api_key, api_mode, runtime_key); model is intentionally omitted. On cache hits, _compat_model() then drops any caller-supplied model that contains / for non-OpenRouter clients and falls back to cached_default (the model that happened to be configured the first time the client was created).

Net effect: when several auxiliary tasks point at the same custom provider but different models, all tasks end up using whichever model was set on the first task that warmed the cache, regardless of auxiliary.<task>.model in config.yaml.

Tested on Hermes Agent v0.11.0 (post-update, includes the #15033 / commit b29287258 fix).


PR fix notes

PR #16410: fix(auxiliary): include model in client cache key

Description (problem / solution / changelog)

Problem

agent/auxiliary_client._client_cache_key() omits model from the cache key. When multiple auxiliary tasks (vision, compression, title_generation) share the same provider but specify different models, all tasks receive whichever model first warmed the cache.

Reproduction (from #16387):

auxiliary:
  vision:
    provider: myrelay
    model: google/gemini-3.1-flash-image-preview
  compression:
    provider: myrelay
    model: google/gemini-3-flash-preview
  title_generation:
    provider: myrelay
    model: google/gemini-3.1-flash-lite-preview

After vision warms the cache, compression and title_generation also use flash-image-preview.

Fix

Add model to the cache key tuple so each unique (provider, model, ...) combination gets its own cache entry.

Before: (provider, async_mode, base_url, api_key, api_mode, runtime_key)
After: (provider, model, async_mode, base_url, api_key, api_mode, runtime_key)

Tests

New regression test in tests/agent/test_auxiliary_cache_key_model.py:

  • Different models produce different cache keys
  • Same model produces same key
  • None model equals empty string
  • Model is independent of provider
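The four properties listed above could be checked roughly as follows. This is a sketch, not the contents of tests/agent/test_auxiliary_cache_key_model.py: `cache_key` is a hypothetical stand-in for `_client_cache_key()` that mirrors the "After" tuple from this PR, leaves the runtime key empty, and assumes a None model is normalised to an empty string. The reading of "independent of provider" in the last test is also an assumption.

```python
def cache_key(provider, *, model=None, async_mode=False, base_url=None,
              api_key=None, api_mode=None):
    """Hypothetical stand-in for _client_cache_key(); runtime key left empty."""
    return (provider, model or "", async_mode, base_url or "",
            api_key or "", api_mode or "", ())


def test_different_models_produce_different_keys():
    a = cache_key("myrelay", model="google/gemini-3-flash-preview")
    b = cache_key("myrelay", model="google/gemini-3.1-flash-lite-preview")
    assert a != b


def test_same_model_produces_same_key():
    a = cache_key("myrelay", model="google/gemini-3-flash-preview")
    b = cache_key("myrelay", model="google/gemini-3-flash-preview")
    assert a == b


def test_none_model_equals_empty_string():
    assert cache_key("myrelay", model=None) == cache_key("myrelay", model="")


def test_model_is_independent_of_provider():
    # Assumed reading: the model slot does not depend on the provider slot,
    # so the same model under two providers still yields two distinct keys.
    a = cache_key("myrelay", model="google/gemini-3-flash-preview")
    b = cache_key("otherrelay", model="google/gemini-3-flash-preview")
    assert a != b and a[1] == b[1]
```

The functions follow the pytest naming convention, so a runner such as pytest would collect them without extra setup.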

Fixes #16387
Fixes #14249

Changed files

  • agent/auxiliary_client.py (modified, +4/-1)
  • tests/agent/test_auxiliary_cache_key_model.py (added, +68/-0)


Auxiliary client cache routes wrong model when multiple tasks share provider/base_url

Summary

agent/auxiliary_client.py caches auxiliary clients keyed by (provider, async_mode, base_url, api_key, api_mode, runtime_key); model is intentionally omitted. On cache hits, _compat_model() then drops any caller-supplied model that contains / for non-OpenRouter clients and falls back to cached_default (the model that happened to be configured the first time the client was created).

Net effect: when several auxiliary tasks point at the same custom provider but different models, all tasks end up using whichever model was set on the first task that warmed the cache, regardless of auxiliary.<task>.model in config.yaml.

Tested on Hermes Agent v0.11.0 (post-update, includes the #15033 / commit b29287258 fix).

Reproduction

config.yaml:

providers:
  myrelay:
    name: myrelay
    base_url: https://example-relay.test/v1
    key_env: MYRELAY_API_KEY
    api_mode: chat_completions

auxiliary:
  vision:
    provider: myrelay
    model: google/gemini-3.1-flash-image-preview     # vision needs image-capable
  compression:
    provider: myrelay
    model: google/gemini-3-flash-preview              # text-only
  title_generation:
    provider: myrelay
    model: google/gemini-3.1-flash-lite-preview       # cheapest text-only

Once an auxiliary call goes through vision first (cache warmed with flash-image-preview), every subsequent compression and title_generation call also hits flash-image-preview. Logs confirm:

INFO agent.auxiliary_client: Auxiliary title_generation: using myrelay (google/gemini-3.1-flash-image-preview)
INFO agent.auxiliary_client: Auxiliary compression: using myrelay (google/gemini-3.1-flash-image-preview)

A fresh Python interpreter (no warm cache) confirms the config is read correctly:

title_generation: provider=myrelay model=google/gemini-3.1-flash-lite-preview
compression:      provider=myrelay model=google/gemini-3-flash-preview
vision:           provider=myrelay model=google/gemini-3.1-flash-image-preview

So the YAML is fine — the live gateway disagrees with the live config.

Root Cause — two issues compounding

Issue 1: _client_cache_key does not include model

agent/auxiliary_client.py:2186-2197:

def _client_cache_key(provider, *, async_mode, base_url=None, api_key=None,
                     api_mode=None, main_runtime=None) -> tuple:
    runtime = _normalize_main_runtime(main_runtime)
    runtime_key = tuple(...) if provider == "auto" else ()
    return (provider, async_mode, base_url or "", api_key or "", api_mode or "", runtime_key)
    # model field absent

Two distinct logical clients (different model selection) collapse to the same cache entry.
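The collision can be seen in isolation with a small sketch. `client_cache_key` below is a simplified stand-in that keeps the same fields as the snippet above but leaves the runtime key empty (the real function derives it from `main_runtime`); the API key value is a placeholder.

```python
def client_cache_key(provider, *, async_mode, base_url=None, api_key=None,
                     api_mode=None):
    """Simplified stand-in: same fields as the reported key, no model."""
    return (provider, async_mode, base_url or "", api_key or "",
            api_mode or "", ())


# Settings shared by every auxiliary task that points at the same provider.
shared = dict(
    async_mode=False,
    base_url="https://example-relay.test/v1",
    api_key="placeholder-key",
    api_mode="chat_completions",
)

# vision and title_generation differ only in their model, and the key
# function has no way to receive that difference.
vision_key = client_cache_key("myrelay", **shared)
title_key = client_cache_key("myrelay", **shared)

print(vision_key == title_key)  # True: both tasks map to one cache entry
```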

Issue 2: _compat_model silently swaps the requested model for cached_default

agent/auxiliary_client.py:2362-2369:

def _compat_model(client, model, cached_default) -> Optional[str]:
    """Drop OpenRouter-format model slugs (with '/') for non-OpenRouter clients.

    Mirrors the guard in resolve_provider_client() which is skipped on cache hits.
    """
    if model and "/" in model and not _is_openrouter_client(client):
        return cached_default     # user-requested model thrown away silently
    return model or cached_default

The docstring acknowledges the design rationale (cache hits skip the resolve_provider_client guard). But the fallback is too aggressive: any aggregator-style model slug (e.g. google/gemini-3.1-flash-lite-preview on a non-OpenRouter OpenAI-compatible base_url) gets reverted to whatever the first warm-up call left in cached_default, with no warning logged.

_is_openrouter_client only matches openrouter.ai, so legitimate OpenAI-compatible aggregators that do use vendor/model slugs in their public model IDs (LiteLLM passthrough, OpenRouter-format mirrors, third-party gateways like ofox, etc.) are all penalised.

Combined symptom

  1. First aux call: vision → builds OpenAI client at myrelay, cached_default=google/gemini-3.1-flash-image-preview.
  2. Second call: title_generation requests google/gemini-3.1-flash-lite-preview. Cache key matches (same provider/base/key/api_mode). Cache hit returns the same (client, cached_default) pair. _compat_model sees / and returns cached_default → caller silently uses flash-image-preview.
  3. Same for compression and any other task pointed at the provider.
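The three steps above can be replayed with an in-memory simulation. This is a sketch under assumptions: `FakeClient`, `is_openrouter_client`, and `simulate` are hypothetical stand-ins, the cache key is reduced to provider and base_url, and the cache-miss path is simplified to "store the first requested model as the default". Only `compat_model` copies the logic quoted in this issue.

```python
from typing import Optional


class FakeClient:
    """Stand-in for an OpenAI-compatible client; only base_url matters here."""

    def __init__(self, base_url: str):
        self.base_url = base_url


def is_openrouter_client(client: FakeClient) -> bool:
    # Assumed stand-in for _is_openrouter_client(): match openrouter.ai only.
    return "openrouter.ai" in client.base_url


def compat_model(client, model, cached_default) -> Optional[str]:
    # Same logic as the _compat_model() snippet quoted in this issue.
    if model and "/" in model and not is_openrouter_client(client):
        return cached_default
    return model or cached_default


def simulate(calls, base_url="https://example-relay.test/v1"):
    """Return the model each task ends up using, in call order."""
    cache = {}
    used = {}
    for task, model in calls:
        key = ("myrelay", base_url)  # model is not part of the key
        if key in cache:
            client, cached_default = cache[key]
            used[task] = compat_model(client, model, cached_default)
        else:
            # First call builds the client and fixes cached_default.
            cache[key] = (FakeClient(base_url), model)
            used[task] = model
    return used


calls = [
    ("vision", "google/gemini-3.1-flash-image-preview"),
    ("title_generation", "google/gemini-3.1-flash-lite-preview"),
    ("compression", "google/gemini-3-flash-preview"),
]
for task, model in simulate(calls).items():
    print(f"{task}: {model}")
```

With vision called first, all three tasks report google/gemini-3.1-flash-image-preview. Reordering `calls` changes which model wins, which matches the "whichever task warmed the cache first" behaviour described above.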

Suggested Fix (preferred)

Include model in the cache key. Different model selections deserve different client entries.

def _client_cache_key(provider, *, async_mode, base_url=None, api_key=None,
                     api_mode=None, model=None, main_runtime=None) -> tuple:
    runtime = _normalize_main_runtime(main_runtime)
    runtime_key = tuple(...) if provider == "auto" else ()
    return (provider, async_mode, base_url or "", api_key or "", api_mode or "",
            (model or "").strip().lower(), runtime_key)

Update both call sites (around lines 2244 and 2409) to pass model=. The _compat_model guard then becomes redundant (each cache entry already corresponds to a single model) and can be removed for simplicity.
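As a rough illustration of the effect, the sketch below puts the normalised model into the key and resolves three tasks through a dictionary cache. `get_or_create` is a hypothetical helper, the runtime key is left empty, and `object()` stands in for a real client.

```python
def client_cache_key(provider, *, async_mode, base_url=None, api_key=None,
                     api_mode=None, model=None):
    """Simplified stand-in for the suggested key; runtime key left empty."""
    return (provider, async_mode, base_url or "", api_key or "",
            api_mode or "", (model or "").strip().lower(), ())


def get_or_create(cache, provider, model, **settings):
    """Hypothetical helper: return the cached (client, default_model) pair."""
    key = client_cache_key(provider, model=model, **settings)
    if key not in cache:
        cache[key] = (object(), model)  # object() stands in for a client
    return cache[key]


settings = dict(
    async_mode=False,
    base_url="https://example-relay.test/v1",
    api_mode="chat_completions",
)
models = {
    "vision": "google/gemini-3.1-flash-image-preview",
    "compression": "google/gemini-3-flash-preview",
    "title_generation": "google/gemini-3.1-flash-lite-preview",
}

cache = {}
resolved = {
    task: get_or_create(cache, "myrelay", model, **settings)[1]
    for task, model in models.items()
}
print(len(cache), resolved == models)
```

One design point worth noting: because the key lowercases and strips the model, two spellings that differ only in case or surrounding whitespace share a cache entry. That is only safe if the provider treats model IDs case-insensitively, which this sketch assumes rather than verifies.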

Alternative

Keep cache key as-is but make _compat_model honour the caller-supplied model when the underlying client's base_url is an OpenAI-compatible aggregator that legitimately uses vendor/model slugs. This requires a maintained allowlist (or detection heuristic) for "aggregator-style" base URLs, which is fragile. The cache-key fix is cleaner.
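To make the fragility concrete, here is a minimal sketch of the allowlist variant. `AGGREGATOR_HOSTS`, `accepts_vendor_slugs`, and the host names are hypothetical, and `compat_model` takes a base_url directly instead of a client object.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical allowlist; every new gateway would need an entry here.
AGGREGATOR_HOSTS = {"openrouter.ai", "example-relay.test"}


def accepts_vendor_slugs(base_url: str) -> bool:
    """True if the host (or a parent domain) is on the allowlist."""
    host = (urlparse(base_url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in AGGREGATOR_HOSTS)


def compat_model(base_url: str, model: Optional[str],
                 cached_default: Optional[str]) -> Optional[str]:
    """Keep the caller's vendor/model slug only for allowlisted hosts."""
    if model and "/" in model and not accepts_vendor_slugs(base_url):
        return cached_default
    return model or cached_default


wanted = "google/gemini-3.1-flash-lite-preview"
default = "google/gemini-3.1-flash-image-preview"

# Listed host: the requested model is honoured.
print(compat_model("https://example-relay.test/v1", wanted, default))
# Unlisted host: the request silently reverts to the cached default.
print(compat_model("https://other-gateway.test/v1", wanted, default))
```

The second call shows the maintenance cost: a gateway that is missing from the list falls back to the old behaviour without any warning.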

Impact

Any user with multiple auxiliary tasks pointing at the same custom provider but different models silently runs all of them against whichever model warmed the cache first. Side effects:

  • The wrong model is billed for the task (cost / capability mismatch, e.g. paying image-preview rates for title generation).
  • Per-task cost/latency tuning in config.yaml has no effect on any task that hits the cache after the first call.
  • Hard to diagnose: logs show the wrong model being used, but the config file looks correct.

Workaround

Restart the gateway whenever the cache may have been primed with the wrong default. This helps only temporarily: the first auxiliary call after the restart (whichever task wins the race) re-poisons the cache. The only durable fix is the source change above.

Extent analysis

TL;DR

The most likely fix is to include the model in the cache key to prevent different model selections from collapsing to the same cache entry.

Guidance

  • Add a model parameter to _client_cache_key, include it in the returned tuple as shown in the suggested fix, and pass model= at both call sites.
  • Optionally remove the _compat_model guard, which the issue describes as redundant once each cache entry corresponds to a single model.
  • Verify the fix by checking that each "Auxiliary <task>: using ..." log line reports the model configured for that task.
  • Test the fix with multiple auxiliary tasks pointing to the same custom provider but different models.
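The log check can be scripted. The sketch below assumes the log lines keep the shape quoted in this issue ("Auxiliary <task>: using <provider> (<model>)"); `find_mismatches` and the `expected` mapping are hypothetical names, with the mapping filled in by hand from config.yaml.

```python
import re

# Matches the log shape quoted in this issue, e.g.
# "Auxiliary title_generation: using myrelay (google/...)".
LOG_PATTERN = re.compile(
    r"Auxiliary (?P<task>\w+): using (?P<provider>\S+) \((?P<model>[^)]+)\)"
)


def find_mismatches(log_lines, expected):
    """Return (task, configured_model, logged_model) for every wrong line."""
    mismatches = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # not an auxiliary routing line
        task, logged = match["task"], match["model"]
        configured = expected.get(task)
        if configured is not None and logged != configured:
            mismatches.append((task, configured, logged))
    return mismatches


expected = {
    "vision": "google/gemini-3.1-flash-image-preview",
    "compression": "google/gemini-3-flash-preview",
    "title_generation": "google/gemini-3.1-flash-lite-preview",
}
log_lines = [
    "INFO agent.auxiliary_client: Auxiliary title_generation: using myrelay (google/gemini-3.1-flash-image-preview)",
    "INFO agent.auxiliary_client: Auxiliary compression: using myrelay (google/gemini-3.1-flash-image-preview)",
]

for task, configured, logged in find_mismatches(log_lines, expected):
    print(f"{task}: configured {configured}, logged {logged}")
```

Run against the two log lines from the reproduction, this flags both title_generation and compression. After the fix, the same check should return an empty list.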

Example

def _client_cache_key(provider, *, async_mode, base_url=None, api_key=None,
                     api_mode=None, model=None, main_runtime=None) -> tuple:
    runtime = _normalize_main_runtime(main_runtime)
    runtime_key = tuple(...) if provider == "auto" else ()
    return (provider, async_mode, base_url or "", api_key or "", api_mode or "",
            (model or "").strip().lower(), runtime_key)

Notes

The suggested fix assumes that the model is the only per-task setting missing from the cache key. If other per-task settings also change how a client is built, they would need to be added to the key as well.

Recommendation

Apply the suggested fix to include the model in the cache key, as it is a cleaner and more durable solution compared to maintaining an allowlist of "aggregator-style" base URLs.
