openclaw - 💡(How to fix) Fix [Bug]: Failover writes permanent providerOverride/modelOverride to sessions.json with no self-healing — primary never re-tried [2 comments, 2 participants]

openclaw · 2026-04-27 07:58:41


GitHub stats

openclaw/openclaw#72697 • Fetched 2026-04-28 06:33:22

Author: kibedu

Participants: kibedu, steipete

Timeline (top): closed ×2, commented ×2, labeled ×1, renamed ×1

When the primary model in a session's agent fails (e.g. ollama timeout, transient network issue, context overflow), OpenClaw 2026.4.8 writes four override fields to the affected session's entry in agents/main/sessions/sessions.json:

{
  "agent:main:main": {
    "providerOverride": "openai",
    "modelOverride":    "gpt-4o-mini",
    "modelProvider":    "openai",
    "model":            "gpt-4o-mini"
  }
}

These values are never cleared by the runtime, even when the primary becomes healthy again. The resolver path in dist/model-selection-DYx7So9J.js (resolveStoredModelOverride → resolvePersistedOverrideModelRef) reads these fields on every subsequent run and silently bypasses the configured agents.defaults.model.primary and agents.list[].model.primary.

The session is permanently pinned to the failover model with no in-product way to recover other than:

Restarting the gateway with --reset flags (loses unrelated session state), OR
Hand-editing sessions.json (what we ended up doing).

Root Cause

The drift is silent because:

No log line is emitted when the override is written
No log line is emitted when subsequent runs read the override
The startup log says agent model: ollama/gemma4:... (from config), which masks the real per-run routing decision
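Because neither the write nor the reads are logged, the drift only becomes visible by inspecting the session store directly. A minimal sketch of such an audit follows, assuming sessions.json parses to an object keyed by session id whose entries carry the fields shown in the excerpt above. `findPinnedSessions` is a hypothetical helper, not part of OpenClaw, and comparing on a "provider/model" string is an assumption.

```javascript
// Hypothetical audit helper; not part of OpenClaw.
// Assumes the store is an object keyed by session id, as in the excerpt above.
function findPinnedSessions(sessions, configuredPrimary) {
  const pinned = [];
  for (const [sessionKey, entry] of Object.entries(sessions)) {
    // Sessions without both override fields follow the configured primary.
    if (!entry || !entry.providerOverride || !entry.modelOverride) continue;
    const routedTo = `${entry.providerOverride}/${entry.modelOverride}`;
    // Report only overrides that differ from the configured primary.
    if (routedTo !== configuredPrimary) pinned.push({ sessionKey, routedTo });
  }
  return pinned;
}

const sample = {
  "agent:main:main": {
    providerOverride: "openai",
    modelOverride: "gpt-4o-mini",
    modelProvider: "openai",
    model: "gpt-4o-mini",
  },
  "agent:main:other": {}, // no override fields
};

console.log(findPinnedSessions(sample, "ollama/gemma4:26b-a4b-it-q4_K_M"));
```

In a real audit the `sessions` object would come from parsing each bot's sessions.json; the sample here is inline so the sketch runs on its own.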

Fix / Workaround

The issue reports two ways to restore primary routing until the runtime clears overrides itself:

Manually deleting the four override fields from the affected session entry in sessions.json (no gateway restart needed).
Running an idempotent cleanup script, cron'd hourly, that scans all bots' sessions.json and nulls non-empty override fields.


### Steps to reproduce

 1. Configure a bot with ollama as primary and a working OpenAI key in env:
       agents.defaults.model.primary = "ollama/<any-model>"
       agents.defaults.model.fallbacks = []
       models.providers.ollama.{baseUrl, api: "ollama", models: [...]}
       env.OPENAI_API_KEY = "sk-..."  (any valid key)

  2. Start the gateway. Confirm startup log shows:
       [gateway] agent model: ollama/<model>

  3. Send one message via any channel (Telegram/Discord/CLI).

  4. Force a single primary failure during step 3 — easiest method:
     put the ollama host in sleep mode / disconnect the network briefly /
     or use an ollama timeout config below the prompt-eval time on a
     large prompt. Gateway times out on the primary call.

  5. The gateway falls back to openai/gpt-4o-mini and replies. Wait for
     the primary to recover (wake the host / reconnect / etc.). Verify it
     is reachable: `curl http://<ollama-host>:11434/api/tags` → 200 OK.

  6. Send another message in the same session. Expect the primary to be
     used again now that it is healthy.

  7. Observed: the run still routes to openai/gpt-4o-mini, no attempt
     on the configured primary, no failover-decision log line.

  8. Inspect ~/.openclaw-<bot>/agents/main/sessions/sessions.json:
     the affected session entry contains
       "providerOverride": "openai",
       "modelOverride":    "gpt-4o-mini",
       "modelProvider":    "openai",
       "model":            "gpt-4o-mini"
     These are written silently on first failover and never cleared.

  9. Restart the gateway. Override persists across restarts (it is
     session state, not in-memory).

  10. Manually delete the four override fields from sessions.json
      (without gateway restart). Send a new message. Primary is used
      again. Confirms the override is the sole stuck-state mechanism.

### Expected behavior

  After a primary-model failure that triggers a fallback, the resolver
  should retry the configured primary on subsequent runs once the
  primary is healthy again — either:
    (a) unconditionally on the next run, OR
    (b) after a configurable cooldown (e.g. 5 min), OR
    (c) after a TTL on the override (e.g. 1 h)

  On a successful primary run the override fields in sessions.json
  (`providerOverride`, `modelOverride`, `modelProvider`, `model`) should
  be cleared automatically. A user should never need to hand-edit
  sessions.json or restart the gateway to recover from a transient
  primary failure.
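Option (b) and the clear-on-success rule can be sketched as two small pure functions. This is an illustration only: `chooseModel`, `clearOverride`, `lastFailoverAt` and `cooldownMs` are hypothetical names, and nothing here reflects how OpenClaw's resolver is actually structured.

```javascript
// The four fields the issue says should be cleared after a successful primary run.
const OVERRIDE_FIELDS = ["providerOverride", "modelOverride", "modelProvider", "model"];

// Decide which model to try for this run.
function chooseModel(entry, primary, now, cooldownMs = 5 * 60 * 1000) {
  const hasOverride = Boolean(entry.providerOverride && entry.modelOverride);
  if (!hasOverride) return primary;
  // Assumed field: epoch ms of the failover that wrote the override.
  // A missing timestamp is treated as "long ago", so old overrides get retried.
  const age = now - (entry.lastFailoverAt ?? 0);
  if (age >= cooldownMs) return primary;
  return `${entry.providerOverride}/${entry.modelOverride}`;
}

// Call after a run on the primary succeeds; returns a copy without the override.
function clearOverride(entry) {
  const cleaned = { ...entry };
  for (const field of OVERRIDE_FIELDS) delete cleaned[field];
  delete cleaned.lastFailoverAt;
  return cleaned;
}
```

Treating a missing timestamp as expired is a deliberate choice in this sketch: it lets sessions that were pinned before the timestamp existed recover on their next run.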

### Actual behavior

  On the first primary failure, OpenClaw writes four override fields
  to the session entry in sessions.json:
    "providerOverride": "<fallback-provider>",
    "modelOverride":    "<fallback-model>",
    "modelProvider":    "<fallback-provider>",
    "model":            "<fallback-model>"

  These are read by `resolveStoredModelOverride` /
  `resolvePersistedOverrideModelRef` (in dist/model-selection-DYx7So9J.js)
  on every subsequent run and force the fallback regardless of the
  configured primary in `agents.defaults.model.primary` and
  `agents.list[].model.primary`.

  The override:
    - is never cleared by the runtime, even when the primary recovers
    - persists across gateway restarts (it is session state, not memory)
    - is silent: no log line on write, no log line on subsequent reads,
      and the startup log still says `agent model: <primary>` from the
      config, which masks the per-run routing decision

  Only manual deletion of the four fields from sessions.json restores
  primary routing. In our deployment, this drift accumulated to 39 of 41
  sessions on one bot and 1 misrouted production session on another
  before discovery via a sweeping audit.

### OpenClaw version

2026.4.8 (build 9ece252) 

### Operating system

Windows 11

### Install method

_No response_

### Model

Gemma 4

### Provider / routing chain

  Configured (in openclaw.gemma.json):
    primary:    ollama/gemma4:26b-a4b-it-q4_K_M
    fallbacks:  []   (explicitly empty in agents.list[0].model.fallbacks
                      and agents.defaults.model.fallbacks)

  Available auth profiles (resolvable but not in the fallback array):
    ollama:local        (provider: ollama,    mode: token)
    openai:manual       (provider: openai,    mode: token)
    + env.OPENAI_API_KEY = sk-proj-...

  Resolved at runtime after the bug triggers:
    observed at session 3b23745d (Joanna agent:main:main):
      sessions.json.providerOverride = "openai"
      sessions.json.modelOverride    = "gpt-4o-mini"
    → resolver returns openai/gpt-4o-mini for every subsequent run

    observed at session agent:main:main (Fritz):
      sessions.json.providerOverride = "openai"
      sessions.json.modelOverride    = "gpt-5.4-mini"
    → resolver returns openai/gpt-5.4-mini for every subsequent run
      despite primary configured as anthropic/claude-sonnet-4-6

  Note: openai is selected as the override target even though it is not
  in the configured fallbacks array. The presence of auth.profiles.openai:manual
  + env.OPENAI_API_KEY appears to make it an implicit fallback target.
  This is a related but separate question (how openai gets picked even
  without being declared as a fallback); the core bug being reported is
  that once written, the override is never cleared.

### Additional provider/model setup details

_No response_

### Logs, screenshots, and evidence


Bug type

Crash (process/app exits or hangs)

Beta release blocker

Yes


Severity / Real-world impact

In our deployment (5 active gateways, ~30-50 sessions per agent on average):

Joanna had 39 of 41 sessions silently pinned to the OpenAI failover model after a single ollama failure, accumulated over ~36 hours. We thought the primary was running; the user-visible symptom was higher-than-expected OpenAI cost and slower responses.
Fritz has 1 session pinned to openai/gpt-5.4-mini despite primary being explicitly switched to anthropic/claude-sonnet-4-6. We only found this through a sweeping audit with the workaround script.


Reproduction

Tested on OpenClaw 2026.4.8.

Configure a bot with:

"agents": {
  "defaults": {
    "model": { "primary": "ollama/gemma4:26b-a4b-it-q4_K_M", "fallbacks": [] }
  }
},
"models": {
  "providers": {
    "ollama": { "baseUrl": "http://192.168.10.127:11434", "api": "ollama", "models": [...] }
  }
},
"auth": {
  "profiles": {
    "openai:manual": { "provider": "openai", "mode": "token" }
  }
},
"env": { "OPENAI_API_KEY": "sk-..." }

Force a single primary failure on a session — e.g. pmset sleepnow on the Mac running ollama, then send a message; gateway times out → falls back to openai/gpt-4o-mini (despite fallbacks: [] being declared empty).
Wake the ollama host. Verify ollama is reachable: curl http://192.168.10.127:11434/api/tags → 200 OK.
Send another message in the same session. Expected: primary used. Actual: still routes to openai/gpt-4o-mini because of providerOverride/modelOverride in sessions.json.
Inspect ~/.openclaw-<bot>/agents/main/sessions/sessions.json → confirm the four override fields are present on that session entry.
Restart the gateway. Override persists across restarts (it's session state, not in-memory).

Expected behavior

After a configurable cooldown (or unconditionally on next run), the resolver should attempt the configured primary again and, on success, clear the override fields. Alternatively, a TTL on the override (e.g. 1 hour) would be sufficient self-healing for most cases.

Workaround we deployed

Idempotent cleanup script that scans all bots' sessions.json and nulls non-empty override fields. Cron'd to run hourly. Source available at user request.
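The reporter's script is not included in the issue. As a rough idea of what such a cleanup pass could look like, here is a hypothetical sketch that works on an already-parsed session store; `clearStickyOverrides` is an invented name, and setting the fields to null (rather than deleting them) simply follows the description above.

```javascript
// Hypothetical cleanup pass; the reporter's actual script is not shown in the issue.
const OVERRIDE_FIELDS = ["providerOverride", "modelOverride", "modelProvider", "model"];

// Nulls non-empty override fields in place and returns how many sessions changed.
function clearStickyOverrides(sessions) {
  let cleared = 0;
  for (const entry of Object.values(sessions)) {
    if (!entry || typeof entry !== "object") continue;
    let touched = false;
    for (const field of OVERRIDE_FIELDS) {
      if (entry[field]) {
        entry[field] = null;
        touched = true;
      }
    }
    if (touched) cleared += 1;
  }
  return cleared;
}
```

Running it a second time on the same store changes nothing and returns 0, which is what makes it safe to schedule hourly. A real script would read and rewrite each sessions.json around this call and would need to consider concurrent writes by a running gateway, which this sketch ignores.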

Files (best guesses)

dist/model-selection-DYx7So9J.js — resolveStoredModelOverride, resolvePersistedOverrideModelRef
dist/store.runtime-BUcpw0Z0.js — session store mutator, presumed write-site of overrides
agents/main/sessions/sessions.json — drift sink

Why it's distinct from the related issues

#65200 proposes that /new and /reset clear overrides — this is a UX workaround, not a fix; users have to know to invoke it.
#47705 is about config-file-level sticky-fallback (different storage layer).
This bug is specifically: automatic write on failover + no automatic cleanup ever.

Environment

OpenClaw 2026.4.8 (9ece252)
Node.js 24.13.0
WSL2 on Windows 11
Ollama backend (separate macOS host on LAN), but reproducible with any provider that can transiently fail

Optional: PR sketch

If a PR is welcome, the minimal change in resolvePersistedOverrideModelRef would be:

function resolvePersistedOverrideModelRef({ defaultProvider, overrideProvider, overrideModel, /* new */ overrideTimestamp, ttlMs = 3600_000 }) {
  if (!overrideProvider || !overrideModel) return null;
  if (overrideTimestamp && Date.now() - overrideTimestamp > ttlMs) return null;  // self-healing TTL
  return { provider: overrideProvider, model: overrideModel };
}
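To see how the proposed TTL would behave, the sketch below restates the function under a different name with an injectable `now` parameter (an addition for testability, not part of the proposal) and runs it on three cases. One consequence of the `overrideTimestamp &&` guard is worth noting: an override with no timestamp, such as one written before the field existed, never expires under this logic.

```javascript
// Restatement of the proposed TTL check; `resolveWithTtl` and `now` are illustrative names.
function resolveWithTtl({ overrideProvider, overrideModel, overrideTimestamp, ttlMs = 3_600_000, now = Date.now() }) {
  if (!overrideProvider || !overrideModel) return null;
  // Expired override: fall through to the configured primary.
  if (overrideTimestamp && now - overrideTimestamp > ttlMs) return null;
  return { provider: overrideProvider, model: overrideModel };
}

const base = { overrideProvider: "openai", overrideModel: "gpt-4o-mini" };
const t0 = 1_700_000_000_000; // arbitrary epoch ms used as the write time

const fresh = resolveWithTtl({ ...base, overrideTimestamp: t0, now: t0 + 60_000 });
const expired = resolveWithTtl({ ...base, overrideTimestamp: t0, now: t0 + 3_600_001 });
const legacy = resolveWithTtl({ ...base, now: t0 }); // no timestamp recorded

console.log({ fresh, expired, legacy });
```

Here `fresh` and `legacy` still resolve to the fallback while `expired` resolves to null, so existing pinned sessions would need a one-time cleanup or a different default for the missing-timestamp case.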


Impact and severity

Affected users/systems/channels:
- Bot gateway (Telegram main channel + Discord channel-1490284280746868796): 39 of 41 sessions in sessions.json accumulated sticky override fields over ~36 hours (between 2026-04-25 08:15 symlink-switch and 2026-04-27 09:36 rollback). All assistant replies in that window were silently routed to openai/gpt-4o-mini despite primary being configured as ollama/gemma4:26b-a4b-it-q4_K_M.
- Bot gateway (production): 1 session (agent:main:main) sticky on openai/gpt-5.4-mini despite primary configured as anthropic/sonnet-4.6, found via cross-bot audit on 2026-04-27. Time of original failover not observed.
- Reproducible across at least: ollama, anthropic, openai providers as primaries (any provider whose call path can transiently fail).
- Audit scope: 5 OpenClaw deployments scanned. 2 of 5 had sticky overrides at observation time.

Severity:
- Blocks the configured-primary workflow indefinitely without user intervention. No in-product way to recover other than manual sessions.json edit.
- Cost risk: observed $0.0053 USD per reply on gpt-4o-mini fallback (35k input tokens × $0.15/1M = ~$0.005/reply, confirmed via session usage.cost.total field). Primary in the Bot case was a flat-fee subscription model, so every silent fallback is direct out-of-pocket cost. Multiplied by ~30+ runs/day × 36h, we observed a non-trivial unanticipated billing impact.
- Observability risk: startup log emits [gateway] agent model: <primary> even when every actual run routes to the fallback. Operators believe the primary is in use when it is not.
- No data loss or corruption observed.

Frequency:
- Always sticky once written. The sessions.json override fields are not cleared by any code path observed in dist/: no TTL, no auto-recovery on primary health check, no log of subsequent reads.
- Triggered by every single primary failure (timeout, network interruption, context overflow). Probability of being triggered over a multi-week deployment approaches 1.
- Distribution skews toward older sessions: in the Bot audit, older sessions (across full 36h window) had higher drift than sessions created post-rollback.

Consequence:
- Silent unanticipated cloud-API cost when primary is a paid-Sub or free local model (observed: 1 Bot).
- Misrouted production traffic to a different model family than configured (observed: Fritz on openai instead of anthropic), causing tool-calling-style and persona-drift mismatch with the deployment's intended configuration.
- Operator confidence collapse: hours spent diagnosing why a "configured" primary appears unused; observability gives wrong signal.
- Without a manual cleanup script, drift accumulates monotonically. No self-healing. Recommended workaround (until upstream fix): a cron'd cleanup script that nulls override fields not matching the config primary.
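The per-reply figure quoted under Severity can be checked in a few lines. The token count and price are the ones given in the report; `inputCostUsd` is a hypothetical helper, and output tokens are left out because the report only quotes input pricing.

```javascript
// Input-token cost only; output tokens are not priced in the report.
function inputCostUsd(inputTokens, usdPerMillionTokens) {
  return (inputTokens / 1_000_000) * usdPerMillionTokens;
}

// 35k input tokens at $0.15 per 1M tokens, as quoted in the report.
const perReply = inputCostUsd(35_000, 0.15);
console.log(perReply.toFixed(5));
```

The result is about half a cent per reply, which matches the $0.0053 read from the session's usage.cost.total field once rounding is taken into account.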

Additional information

No response

extent analysis

TL;DR

The most likely fix is to introduce a TTL (time-to-live) mechanism for the override fields in sessions.json to automatically clear them after a configurable time period, allowing the resolver to retry the configured primary model.

Guidance

Introduce a TTL (e.g., 1 hour) for the override fields in sessions.json to enable automatic cleanup.
Modify the resolvePersistedOverrideModelRef function to check the TTL and clear the override fields if exceeded.
Consider adding a cooldown period or a health check for the primary model before retrying it.
Implement a logging mechanism to track when override fields are written and read to improve observability.
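The logging point could be approached as a thin wrapper around whatever function resolves the persisted override. Everything in this sketch is hypothetical: `withOverrideLogging`, the log line format, and the `log` callback are illustrative, and the wrapped resolver only stands in for the real one.

```javascript
// Wraps a resolver so every run that honors a persisted override leaves a log line.
function withOverrideLogging(resolve, log = console.warn) {
  return function loggedResolve(sessionKey, params) {
    const ref = resolve(params);
    if (ref) {
      // Make the per-run routing decision visible to operators.
      log(`[model-selection] session ${sessionKey}: using persisted override ${ref.provider}/${ref.model}`);
    }
    return ref;
  };
}

// Stand-in resolver for demonstration; mirrors the shape used in the issue.
const standInResolver = ({ overrideProvider, overrideModel }) =>
  overrideProvider && overrideModel ? { provider: overrideProvider, model: overrideModel } : null;

const resolveLogged = withOverrideLogging(standInResolver);
resolveLogged("agent:main:main", { overrideProvider: "openai", overrideModel: "gpt-4o-mini" });
```

A matching log line at the write site would cover the other half of the gap; the wrapper above only addresses reads.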

Example

function resolvePersistedOverrideModelRef({ defaultProvider, overrideProvider, overrideModel, overrideTimestamp, ttlMs = 3600_000 }) {
  if (!overrideProvider || !overrideModel) return null;
  if (overrideTimestamp && Date.now() - overrideTimestamp > ttlMs) {
    // TTL exceeded: ignore the stale override (returning null does not delete the fields from sessions.json)
    return null;
  }
  return { provider: overrideProvider, model: overrideModel };
}

Notes

The introduction of a TTL mechanism requires careful consideration of the trade-off between automatic cleanup and potential false positives (i.e., clearing override fields too soon). The chosen TTL value should balance the need for self-healing with the risk of premature override field clearance.

Recommendation

Introduce a TTL for the override fields so that stale overrides expire automatically and the configured primary is retried. This is preferable to hand-editing sessions.json or restarting the gateway with --reset flags, which loses unrelated session state.



#api #model loading #dependency error #configuration error #network issue
