openclaw - ✅(Solved) Fix Session-level provider caching bypasses declared model chain after any fallback success (silent provider drift) [1 pull requests, 2 comments, 2 participants]

openclaw 2026-04-21 22:30:15

GitHub stats

openclaw/openclaw#69855 • Fetched 2026-04-22 07:47:24

Author

maxramsay

Participants

maxramsay

rafiki270

Timeline (top)

commented ×2, cross-referenced ×2, closed ×1

When a session's declared primary + fallback chain all fail and a request succeeds against an undeclared provider (one that exists in models.providers but isn't listed in the agent's .model.fallbacks), the gateway caches that provider as the session's effective model. Every subsequent request in that session goes directly to the cached provider, bypassing the declared chain entirely. This persists until the gateway restarts — sessions are not re-evaluated.

The only indicator is the model-snapshot events in ~/.openclaw/agents/*/sessions/*.jsonl. There is no visible gateway log entry, no Discord notification, and no dashboard indicator. For long-running sessions (heartbeats, always-on agents), the drift can go undetected for weeks.

Related to but distinct from #43945 (which is about subagent credential lookup specifically). This issue is about main-session drift after a fallback cascade.

Fix Action

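Linked PR (open, not merged at fetch time)

Linked PR: Ollama: gate synthetic auth on local baseUrl classification (realigns #59954) (https://github.com/openclaw/openclaw/pull/69857). The PR notes below list it as open and unmerged and describe #69855 as a separate bug, so it may not resolve the session-level drift on its own.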

PR fix notes

PR #69857: Ollama: gate synthetic auth on local baseUrl classification (realigns #59954)

Repository: openclaw/openclaw
Author: maxramsay
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/69857

Description (problem / solution / changelog)

Summary

Tightens resolveSyntheticAuth in the Ollama extension so that the synthetic local API key ("ollama-local") is only applied when the configured baseUrl points at a local/LAN endpoint. Remote URLs (Ollama Cloud, publicly-hosted Ollama behind a TLS proxy, etc.) must now supply real credentials via env var, auth profile, or explicit config — the synthetic key no longer papers over misconfiguration or routes to endpoints that will reject it.

Partially addresses #43945 (for the Cloud-auth-specific path; the broader subagent credential lookup issue still needs the fix proposed by @Meli73).

Supersedes #59954.

What changes

New isLocalOllamaBaseUrl(baseUrl) helper in extensions/ollama/src/discovery-shared.ts with RFC1918 IPv4, IPv6 ULA/link-local, .local mDNS, bare hostname, and loopback classification.
resolveSyntheticAuth now checks this classifier before returning the synthetic key. Undefined return means the auth pipeline falls through to real-credential resolution (env, profile, or explicit).
Test coverage for the classifier (11 positive + 8 negative URL cases) and hook integration (3 new scenarios).
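The classification rules listed above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual isLocalOllamaBaseUrl code: the function name isLocalBaseUrlSketch is hypothetical, and treating an empty or unparseable baseUrl the way shown here is an assumption.

```typescript
// Hypothetical sketch of a "local or LAN endpoint" classifier.
// It is NOT the PR's implementation; names and edge-case handling are assumptions.
function isLocalBaseUrlSketch(baseUrl: string | undefined): boolean {
  // A missing baseUrl means ambient discovery, which the PR treats as local.
  if (baseUrl === undefined || baseUrl.trim() === "") return true;

  let host: string;
  try {
    host = new URL(baseUrl).hostname.toLowerCase();
  } catch {
    return false; // assumption: unparseable URLs are not trusted as local
  }
  // WHATWG URL keeps the brackets around IPv6 literals; strip them.
  if (host.startsWith("[") && host.endsWith("]")) host = host.slice(1, -1);

  if (host === "localhost" || host === "::1") return true; // loopback names
  if (host.endsWith(".local")) return true; // mDNS

  const v4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (v4) {
    const a = Number(v4[1]);
    const b = Number(v4[2]);
    if (a === 127) return true; // 127.0.0.0/8 loopback
    if (a === 10) return true; // 10.0.0.0/8 (RFC1918)
    if (a === 172 && b >= 16 && b <= 31) return true; // 172.16.0.0/12 (RFC1918)
    if (a === 192 && b === 168) return true; // 192.168.0.0/16 (RFC1918)
    return false;
  }

  if (host.includes(":")) {
    // IPv6 unique local (fc00::/7) and link-local (fe80::/10) prefixes.
    return /^f[cd][0-9a-f]{2}:/.test(host) || /^fe[89ab][0-9a-f]:/.test(host);
  }

  // Bare hostname with no dots, e.g. the homelab pattern gpu-node-1.
  return !host.includes(".");
}
```

The sketch fails closed: anything it cannot positively classify as local is treated as remote, which matches the PR's stated goal of no longer papering over misconfiguration.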

What does NOT change

Local Ollama (localhost, 127.0.0.1, ::1) — synthetic auth applies as before.
LAN Ollama (192.168.x.x, 10.x.x.x, 172.16-31.x.x) — synthetic auth applies as before.
.local mDNS hosts — synthetic auth applies as before.
Missing baseUrl (ambient discovery) — synthetic auth applies as before.
Bare hostnames (homelab pattern gpu-node-1) — synthetic auth applies as before.
Any other hook or pipeline logic — minimal surgical change.

Backwards compatibility

Breaking for users who were configuring a remote Ollama endpoint with "apiKey": "ollama-local" and expecting it to work. This was never correct behavior (remote Ollama endpoints reject the synthetic marker with 401 anyway) — it just failed silently and cascaded to a different provider. After this change the failure is explicit: "Ollama remote endpoint requires real credentials — set OLLAMA_API_KEY or configure an auth profile."
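To make the before/after behavior concrete, here is a hedged sketch of the gating described in the PR. The names resolveSyntheticAuthSketch, resolveApiKeySketch, and loopbackOnly are hypothetical, the classifier is passed in as a parameter so the block stands alone, and the order in which real credentials are consulted is an assumption rather than the extension's actual pipeline.

```typescript
const SYNTHETIC_KEY = "ollama-local"; // marker string quoted in the PR description

type LocalClassifier = (baseUrl: string | undefined) => boolean;

// Returns the synthetic key only for endpoints classified as local.
// Returning undefined lets the caller fall through to real credentials.
function resolveSyntheticAuthSketch(
  baseUrl: string | undefined,
  isLocal: LocalClassifier,
): string | undefined {
  return isLocal(baseUrl) ? SYNTHETIC_KEY : undefined;
}

// Assumed fall-through order: synthetic key for local endpoints,
// otherwise a real key (env var, auth profile, or explicit config).
function resolveApiKeySketch(
  baseUrl: string | undefined,
  realKey: string | undefined,
  isLocal: LocalClassifier,
): string {
  const synthetic = resolveSyntheticAuthSketch(baseUrl, isLocal);
  if (synthetic !== undefined) return synthetic;
  if (realKey !== undefined && realKey !== "") return realKey;
  // Fail loudly instead of cascading to a different provider.
  throw new Error("Ollama remote endpoint requires real credentials");
}

// Deliberately crude stand-in classifier, for illustration only.
const loopbackOnly: LocalClassifier = (u) =>
  u === undefined || u.includes("127.0.0.1") || u.includes("localhost");
```

With this shape, a remote endpoint without real credentials produces an explicit error at auth resolution time rather than a 401 that silently triggers a fallback.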

Related

#43945: subagent credential lookup regression, partially addressed by this change for the remote-Cloud path.
#69855: session-level provider drift after fallback, filed separately as a distinct bug in the same "silent cloud fallback" family.

Changed files

extensions/ollama/index.test.ts (modified, +61/-0)
extensions/ollama/index.ts (modified, +4/-0)
extensions/ollama/src/discovery-shared.test.ts (added, +57/-0)
extensions/ollama/src/discovery-shared.ts (modified, +65/-0)

RAW_BUFFER

Severity: privacy regression

An agent configured to use only local/private models can silently route content to any provider configured in models.providers, regardless of whether that provider was intended as a fallback. Users who configure models.providers for manual model selection via the control UI have no reason to expect those providers to become fallback targets after a network blip.

Steps to reproduce

Minimal config:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/gemma4:31b",
        "fallbacks": ["ollama/qwen3:32b"]
      }
    }
  },
  "models": {
    "providers": {
      "ollama": {
        "apiKey": "ollama-local",
        "baseUrl": "http://127.0.0.1:11434",
        "api": "ollama"
      },
      "external-provider": {
        "apiKey": "<REAL_KEY>",
        "baseUrl": "https://some.external.api/v1",
        "api": "openai-completions"
      }
    }
  }
}

Note: external-provider is in models.providers but NOT in any .model.fallbacks. User intent: "I want this available for manual openclaw model set external-provider/foo but not as an automatic fallback."
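As a quick way to see which providers are exposed to this risk, the following sketch lists every provider that is configured under models.providers but not referenced by the declared chain. It assumes the config shape shown above, reads only agents.defaults (per-agent overrides are not handled), and the name undeclaredProviders is hypothetical.

```typescript
// Minimal config shape, mirroring only the fields used in the example config.
interface GatewayConfigSketch {
  agents: { defaults: { model: { primary: string; fallbacks: string[] } } };
  models: { providers: Record<string, unknown> };
}

// Lists providers that exist in models.providers but are not part of
// the declared primary + fallbacks chain.
function undeclaredProviders(config: GatewayConfigSketch): string[] {
  const { primary, fallbacks } = config.agents.defaults.model;
  // Model refs look like "provider/model"; keep the part before the first slash.
  const declared = new Set(
    [primary, ...fallbacks].map((ref) => ref.split("/")[0]),
  );
  return Object.keys(config.models.providers).filter(
    (name) => !declared.has(name),
  );
}

const exampleConfig: GatewayConfigSketch = {
  agents: {
    defaults: {
      model: { primary: "ollama/gemma4:31b", fallbacks: ["ollama/qwen3:32b"] },
    },
  },
  models: {
    providers: {
      ollama: { baseUrl: "http://127.0.0.1:11434", api: "ollama" },
      "external-provider": {
        baseUrl: "https://some.external.api/v1",
        api: "openai-completions",
      },
    },
  },
};

const exposed = undeclaredProviders(exampleConfig);
```

For the minimal config, exposed contains only external-provider, which is exactly the provider the session drifted to in the report.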

Reproduction:

1. Start the gateway. Primary + fallback calls succeed against Ollama. Good.
2. Induce an Ollama failure (block 127.0.0.1:11434 temporarily, or kill Ollama, or simulate DNS failure with /etc/hosts).
3. Send a request to the agent. Primary fails. Fallback fails. Gateway reaches into models.providers and finds external-provider. Calls it. Success.
4. Restore Ollama. Send another request to the agent in the same session.
Observed: request goes directly to external-provider, no Ollama attempt.
Expected: gateway re-tries declared primary + fallback chain on each request (or at minimum on a new turn boundary).

Repeated turns continue to hit external-provider. Only a gateway restart (which creates new sessions) restores declared behavior.
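The reproduction can be modeled with a small toy simulation. This is not the gateway's code: it is a sketch of the behavior as reported, simplified to provider level (the two Ollama models collapse into one "ollama" entry), and every name in it is hypothetical.

```typescript
// Toy model of the reported behavior; not the gateway's real code.
interface ToySession {
  cached: string | undefined; // provider that last succeeded
}

function sendBuggy(
  session: ToySession,
  declared: string[], // providers named in primary + fallbacks
  allProviders: string[], // every key under models.providers
  isUp: (provider: string) => boolean,
): string | undefined {
  // Reported bug: a cached provider is reused without consulting the declared chain.
  if (session.cached !== undefined && isUp(session.cached)) {
    return session.cached;
  }
  const undeclared = allProviders.filter((p) => !declared.includes(p));
  for (const p of [...declared, ...undeclared]) {
    if (isUp(p)) {
      session.cached = p; // cached even when p is undeclared
      return p;
    }
  }
  return undefined;
}

const toySession: ToySession = { cached: undefined };
let ollamaUp = true;
const probe = (p: string): boolean => (p === "ollama" ? ollamaUp : true);
const declaredChain = ["ollama"];
const providers = ["ollama", "external-provider"];

const before = sendBuggy(toySession, declaredChain, providers, probe); // "ollama"
ollamaUp = false; // step 2: induce the Ollama failure
const during = sendBuggy(toySession, declaredChain, providers, probe); // "external-provider"
ollamaUp = true; // step 4: restore Ollama
const after = sendBuggy(toySession, declaredChain, providers, probe); // "external-provider" (drift)
```

The third call never probes Ollama because the cached provider answers first, which is the same pattern as the observed result in step 4.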

Evidence from production incident

Key data points from a recent production occurrence (full forensics on request):

Session JSONL file accumulated 585 calls to the cached non-Ollama provider over 7 days.
Material cost impact routed through a third-party cloud model path we explicitly did not intend.
model-snapshot events in the session file clearly record the drift (provider: ollama → provider: <other>), but no observability surface reflected it.
Triggering failure was a single transient DNS blip during a regional WAN outage. The drift persisted for the entire 7 days because the long-running heartbeat session kept appending rather than rotating.

Proposed fix shape

Three possible behaviors, in order of caution:

(a) Re-probe declared chain on every new turn. On each user-visible message boundary, re-evaluate the declared primary → fallbacks chain before reusing the cached provider. Cheap (existing model probe), zero silent drift.

(b) Emit a visible provider-drift event when drift persists. Every turn where the effective model differs from the declared primary, emit a model_change or provider_drift event to the gateway log, Discord indicator, and control UI. Doesn't fix the behavior but surfaces it.

(c) Require explicit opt-in for session-level caching. Add gateway.sessionModelCaching: "declared-chain-only" | "last-success" | "off" config flag. Default to "declared-chain-only" — caching applies only to providers that are in the declared chain. Undeclared providers never persist across turns.

(c) would have prevented our incident entirely. (a) would have limited blast radius to a single turn. (b) alone wouldn't prevent the drift but would have made detection take hours instead of days.
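Behaviors (a) and (c) can be combined in one selection routine, sketched below under stated assumptions. The CachingMode values come from the proposal above. Everything else is hypothetical: the function name, the provider-level simplification, and the choice to still try undeclared providers once the declared chain is exhausted, which the issue leaves open as a policy question.

```typescript
type CachingMode = "declared-chain-only" | "last-success" | "off";

interface SessionState {
  cached: string | undefined;
}

// Hypothetical selection routine illustrating proposals (a) and (c).
function sendWithReprobe(
  session: SessionState,
  declared: string[],
  allProviders: string[],
  isUp: (provider: string) => boolean,
  mode: CachingMode = "declared-chain-only",
): string | undefined {
  // (a) Re-probe the declared chain first on every turn, ignoring the cache.
  for (const p of declared) {
    if (isUp(p)) {
      session.cached = mode === "off" ? undefined : p;
      return p;
    }
  }
  // Declared chain exhausted: prefer a cached provider, then scan the rest.
  const rest = allProviders.filter((p) => !declared.includes(p));
  const candidates =
    session.cached !== undefined ? [session.cached, ...rest] : rest;
  for (const p of candidates) {
    if (isUp(p)) {
      // (c) An undeclared provider persists only under "last-success".
      session.cached = mode === "last-success" ? p : undefined;
      return p;
    }
  }
  return undefined;
}
```

Because the declared chain is walked before the cache is consulted, a recovered primary is picked up on the next turn in every mode. The mode only controls whether an undeclared provider is remembered between turns.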

Diagnostic tooling suggestion (independent of fix)

A command like openclaw sessions drift could grep all session files for model-snapshot events where the provider differs from the agent's declared primary. It would surface existing stuck sessions without requiring a fix.
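Until such a command exists, the same check can be approximated with a short script. The sketch below scans the text of one session JSONL file. The event field names type and provider are assumptions, since the issue only says that model-snapshot events record the provider, and findDrift is a hypothetical name.

```typescript
interface DriftHit {
  line: number; // 1-based line number in the JSONL text
  provider: string; // provider recorded by the snapshot event
}

// Scans JSONL text for model-snapshot events whose provider differs from
// the declared one. Field names "type" and "provider" are assumed.
function findDrift(jsonl: string, declaredProvider: string): DriftHit[] {
  const hits: DriftHit[] = [];
  jsonl.split("\n").forEach((raw, index) => {
    if (raw.trim() === "") return; // skip blank lines
    let event: unknown;
    try {
      event = JSON.parse(raw);
    } catch {
      return; // skip lines that are not valid JSON
    }
    if (typeof event !== "object" || event === null) return;
    const e = event as { type?: unknown; provider?: unknown };
    if (
      e.type === "model-snapshot" &&
      typeof e.provider === "string" &&
      e.provider !== declaredProvider
    ) {
      hits.push({ line: index + 1, provider: e.provider });
    }
  });
  return hits;
}

const sampleLog = [
  '{"type":"model-snapshot","provider":"ollama"}',
  '{"type":"message","text":"hello"}',
  '{"type":"model-snapshot","provider":"external-provider"}',
].join("\n");

const driftHits = findDrift(sampleLog, "ollama");
```

Running this over each file under ~/.openclaw/agents/*/sessions/ and reporting any non-empty result would flag sessions that are already stuck on an undeclared provider.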

Other notes

Not a duplicate of #43945 (that's about subagent auth specifically and hits a different code path).
Related family of bugs: "silent fallback to cloud" — see #43945 for another instance.

Happy to provide the session JSONL excerpt (with API keys redacted) and the specific openclaw.json configuration that triggered the drift if helpful for a repro case.

extent analysis

TL;DR

Re-probe the declared chain on every new turn (behavior (a)) so that a session cannot keep silently routing requests to an undeclared provider.

Guidance

Implement behavior (a) Re-probe declared chain on every new turn: On each user-visible message boundary, re-evaluate the declared primary → fallbacks chain before reusing the cached provider to prevent silent drift.
Consider adding a provider-drift event to surface drift occurrences, making it easier to detect and respond to such events.
Evaluate the proposed gateway.sessionModelCaching config flag to control session-level caching behavior, potentially setting it to "declared-chain-only" to prevent caching of undeclared providers.
Utilize the suggested openclaw sessions drift diagnostic tool to identify existing stuck sessions where the provider differs from the agent's declared primary.

Example

The issue does not include a code-level patch. Implementing behavior (a) would mean changing the logic that handles a new turn so that it re-evaluates the declared chain before using a cached provider.

Notes

The proposed fixes aim to address the silent drift issue by either preventing it (behavior (a)), making it more visible (behavior (b)), or controlling caching behavior (behavior (c)). Each approach has its benefits and should be considered based on the specific requirements and constraints of the system.

Recommendation

Implement behavior (a), re-probing the declared chain on every new turn, as the first step: it reuses the existing model probe and limits any drift to a single turn. Per the issue author, only behavior (c) would have prevented the incident entirely, so consider pairing (a) with the "declared-chain-only" caching default.
