openclaw: [Bug] Automatic model fallback does not recover cleanly from an invalid primary model ID; a user-facing HTTP 400 is surfaced instead of silent failover (Solved) [1 pull request, 1 comment, 2 participants]

GitHub stats: openclaw/openclaw#50017 (fetched 2026-04-08 01:00:15)
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): labeled ×2, commented ×1, cross-referenced ×1

When OpenClaw is configured with a primary model plus valid fallback models, and the primary is intentionally or accidentally set to an invalid model ID, OpenClaw surfaces the provider error directly to the user:

HTTP 400: openrouter/invalid_test_model is not a valid model ID

Instead of transparently failing over to the first valid fallback model, the chat interaction is interrupted and repeated user messages continue to fail until the config is manually restored.

Error Message

HTTP 400: openrouter/invalid_test_model is not a valid model ID

  • the provider error is returned directly in the chat reply
  • fallback does not rescue the interaction before the error is surfaced

Root Cause

Invalid primary model IDs that the provider returned as HTTP 400/422 "… is not a valid model ID" responses were classified as generic format failures instead of model_not_found. The fallback system, warning path, and retry semantics all treat model_not_found as a first-class reason, so this misclassification made invalid-model failover less reliable and less observable (per PR #50028).

Fix Action

Fixed

PR fix notes

PR #50028: fix: classify invalid-model fallback errors

Description (problem / solution / changelog)

Summary

  • Problem: invalid primary model IDs returned as HTTP 400/422 were classified as generic format failures instead of model_not_found. This weakened the automatic fallback path for responses like the "… is not a valid model ID" error reported in the issue.
  • Why it matters: the fallback system, warning path, and retry semantics all treat model_not_found as a first-class reason; misclassifying these provider responses makes invalid-model failover behavior less reliable and less observable.
  • What changed: isModelNotFoundErrorMessage() now recognizes "not a valid model ID", and HTTP 400/422 classification now prefers model_not_found before the generic format fallback.
  • What did NOT change (scope boundary): generic 400/422 request-shape errors still stay in the format lane, and billing overrides on 400/422 are still preserved.
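The PR text describes an ordering rather than showing code. The sketch below illustrates that ordering for 400/422 responses: the billing override is checked first, then model_not_found, then the generic format lane. It is not the code in src/agents/pi-embedded-helpers/errors.ts. FailoverReason, classifyHttpFailure, looksLikeModelNotFound (a simplified stand-in for isModelNotFoundErrorMessage()), and every message pattern are hypothetical.

```typescript
// Hypothetical sketch of the 400/422 classification order described in PR #50028.
// Names, patterns, and the billing check are illustrative assumptions.
type FailoverReason = "billing" | "model_not_found" | "format" | "other";

// Assumed phrasings; the real helper may recognize more (or different) messages.
const MODEL_NOT_FOUND_PATTERNS: RegExp[] = [/model not found/i, /not a valid model id/i];
const BILLING_PATTERNS: RegExp[] = [/insufficient credits/i, /payment required/i];

// Simplified stand-in for isModelNotFoundErrorMessage().
function looksLikeModelNotFound(message: string): boolean {
  return MODEL_NOT_FOUND_PATTERNS.some((re) => re.test(message));
}

function looksLikeBilling(message: string): boolean {
  return BILLING_PATTERNS.some((re) => re.test(message));
}

function classifyHttpFailure(status: number, message: string): FailoverReason {
  if (status !== 400 && status !== 422) return "other"; // other statuses are out of scope here
  if (looksLikeBilling(message)) return "billing"; // billing override is preserved
  if (looksLikeModelNotFound(message)) return "model_not_found"; // new: checked before format
  return "format"; // generic request-shape errors, including status-only 400/422
}

console.log(classifyHttpFailure(400, "HTTP 400: openrouter/__invalid_test_model__ is not a valid model ID")); // prints model_not_found
console.log(classifyHttpFailure(422, "")); // prints format
```

Putting the model-not-found check ahead of the format fallback keeps the matcher the only thing that decides the lane. A 400/422 with no recognizable message still lands in format.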

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #50017
  • Related #

User-visible / Behavior Changes

Invalid primary model IDs that come back as HTTP 400/422 "... is not a valid model ID" responses now stay in the model_not_found failover lane instead of being downgraded to a generic format error.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Windows 11
  • Runtime/container: Node 20 / pnpm 10 in local worktree validation
  • Model/provider: invalid model IDs under openrouter-style HTTP 400/422 responses
  • Integration/channel (if any): agent model failover
  • Relevant config (redacted): primary model set to an invalid model ID with valid configured fallbacks

Steps

  1. Configure a primary model ref that does not exist and at least one valid fallback.
  2. Trigger a provider error shaped like HTTP 400: openrouter/__invalid_test_model__ is not a valid model ID.
  3. Observe failover classification and fallback attempt recording.
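Steps 2 and 3 can be simulated without a real provider. This sketch assumes a hypothetical fake provider (callModel) that rejects the invalid test ID with an HTTP 400, a simplified classify function, and a runWithFallback loop that records each failed attempt with its reason. None of these names are OpenClaw APIs, and the provider is synchronous for brevity.

```typescript
// Hypothetical repro harness; none of these names are OpenClaw APIs.
// The fake provider is synchronous for brevity.
type Attempt = { model: string; reason: string };
type StatusError = Error & { status: number };

function providerError(status: number, message: string): StatusError {
  return Object.assign(new Error(message), { status });
}

// Simplified classifier: 400/422 + "not a valid model ID" => model_not_found.
function classify(err: unknown): string {
  const status = (err as Partial<StatusError> | null)?.status;
  const message = err instanceof Error ? err.message : String(err);
  if (status === 400 || status === 422) {
    return /not a valid model id/i.test(message) ? "model_not_found" : "format";
  }
  return "unknown";
}

// Fake provider: rejects the invalid test model, answers for any other ID.
function callModel(model: string, prompt: string): string {
  if (model === "openrouter/__invalid_test_model__") {
    throw providerError(400, `HTTP 400: ${model} is not a valid model ID`);
  }
  return `[${model}] ${prompt}`;
}

function runWithFallback(models: string[], prompt: string): { text: string; attempts: Attempt[] } {
  const attempts: Attempt[] = [];
  for (const model of models) {
    try {
      return { text: callModel(model, prompt), attempts };
    } catch (err) {
      // Record why this candidate failed before trying the next one.
      attempts.push({ model, reason: classify(err) });
    }
  }
  throw new Error(`all models failed: ${JSON.stringify(attempts)}`);
}

const run = runWithFallback(["openrouter/__invalid_test_model__", "openrouter/fallback-model"], "hello");
console.log(run.attempts); // one attempt, recorded as model_not_found
console.log(run.text); // prints [openrouter/fallback-model] hello
```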

Expected

  • Invalid-model failures are classified as model_not_found.
  • Automatic fallback keeps the not-found semantics instead of treating the error as a generic format failure.

Actual

  • Before this change, the same HTTP 400/422 responses were classified as format.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: added a regression in failover-error.test.ts for raw and HTTP 400/422 invalid-model messages, plus a model-fallback.test.ts case that now records the attempt as model_not_found during automatic fallback.
  • Edge cases checked: status-only 400/422 still map to format; 400/422 billing payloads still map to billing.
  • What you did not verify: full pnpm check / full repo test lanes did not complete locally in this Windows environment before the local 10s guard, so full coverage is left to CI.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this commit to restore the previous 400/422 -> format behavior.
  • Files/config to restore: src/agents/pi-embedded-helpers/errors.ts, src/agents/failover-error.test.ts, src/agents/model-fallback.test.ts
  • Known bad symptoms reviewers should watch for: generic request-shape 400/422 errors being misclassified as model_not_found.

Risks and Mitigations

  • Risk: broadening the matcher too far could misclassify unrelated 400/422 payloads.
  • Mitigation: the new match is narrow (not a valid model ID) and is still gated behind the existing model-not-found helper.
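The risk above is easier to see by comparing the narrow phrase match with a hypothetical broad matcher on a few sample messages. The samples are invented for illustration and are not captured provider output.

```typescript
// Illustrative comparison of a narrow vs. a broad invalid-model matcher.
// The sample messages are made up; they are not real provider responses.
const narrow = (msg: string): boolean => /not a valid model id/i.test(msg);
const broad = (msg: string): boolean => /invalid|not valid/i.test(msg);

const samples: Array<{ msg: string; isModelError: boolean }> = [
  { msg: "openrouter/__invalid_test_model__ is not a valid model ID", isModelError: true },
  { msg: "Invalid request: 'messages' must be an array", isModelError: false },
  { msg: "temperature is not valid for this endpoint", isModelError: false },
];

for (const { msg, isModelError } of samples) {
  // A false positive would push a request-shape error into the model_not_found lane.
  const narrowOk = narrow(msg) === isModelError;
  const broadOk = broad(msg) === isModelError;
  console.log(`narrow ${narrowOk ? "ok" : "WRONG"} | broad ${broadOk ? "ok" : "WRONG"} | ${msg}`);
}
```

The broad matcher misroutes both request-shape samples. Those are exactly the "generic request-shape 400/422 errors being misclassified as model_not_found" symptom listed under Failure Recovery.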

AI-assisted: yes. I verified the changed code paths and targeted tests locally.

Changed files

  • src/agents/failover-error.test.ts (modified, +20/-0)
  • src/agents/model-fallback.test.ts (modified, +26/-0)
  • src/agents/pi-embedded-helpers/errors.ts (modified, +4/-0)
  • src/commands/models/list.probe.test.ts (modified, +2/-1)
  • src/commands/models/list.probe.ts (modified, +3/-0)
Original issue report (#50017)

Bug type

Behavior bug (incorrect output/state without crash)

Summary

When OpenClaw is configured with a primary model plus valid fallback models, and the primary is intentionally or accidentally set to an invalid model ID, OpenClaw surfaces the provider error directly to the user:

HTTP 400: openrouter/invalid_test_model is not a valid model ID

Instead of transparently failing over to the first valid fallback model, the chat interaction is interrupted and repeated user messages continue to fail until the config is manually restored.

Steps to reproduce

Reproduction steps:

  1. Configure valid fallback models in agents.defaults.model.fallbacks.
  2. Temporarily set the primary to an invalid model ID, e.g. openrouter/invalid_test_model.
  3. Send a normal user request.
  4. Observe the reply.

Expected behavior

If a primary model fails, OpenClaw should:

  1. detect the primary failure
  2. attempt fallback automatically
  3. suppress the primary error from the user-facing reply if fallback succeeds
  4. return the fallback model’s answer normally

At minimum, if certain failure classes are intentionally not fallback-eligible, that behavior should be documented clearly.

Actual behavior


  • invalid primary model ID returns a user-visible HTTP 400
  • fallback does not rescue the interaction before the error is surfaced
  • subsequent messages continue failing until config is manually corrected

OpenClaw version

v2026.3.13

Operating system

Ubuntu 24.04.4 LTS

Install method

npm global

Model

openai-codex/gpt-5.4

Provider / routing chain

openclaw -> openai-codex (primary) -> openrouter (fallbacks)

Config file / key location

~/.openclaw/openclaw.json

Additional provider/model setup details

Additional observations

  • Explicitly selecting a fallback model for a session works fine
  • Fallback model path itself is functional
  • The failure appears to be in automatic failover UX/behavior, not in the fallback models themselves
  • Previous testing also suggested user-visible errors may leak through during other fallback-triggering cases such as 429s

Logs, screenshots, and evidence

Impact and severity

In practice this occurs when the primary model's tokens have been exhausted and OpenClaw needs to fall back to the fallback models.

The severity is not critical because there is a workaround: manually reorder the models in the openclaw.json config file.

Additional information

Suggested improvement

A more robust behavior would be:

  • if primary fails and a fallback succeeds, only return the fallback result
  • log the primary failure internally
  • optionally expose failure details in logs/status, not in the user-facing chat path
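The suggested behavior could look roughly like the sketch below: return only the fallback result, log the primary failure internally, and keep details out of the chat reply. respondWithFailover, ModelCall, FailoverResult, the diagnostics array, and the log callback are hypothetical names. The sketch only shows where the primary error goes, not how OpenClaw actually wires logging or status.

```typescript
// Hypothetical sketch: the user-facing reply contains only the successful answer;
// failures from earlier candidates go to an internal log and a diagnostics list.
type ModelCall = (model: string, prompt: string) => Promise<string>;

interface FailoverResult {
  reply: string; // what the chat user sees
  answeredBy: string; // which model produced the reply
  diagnostics: string[]; // internal-only failure details (logs/status)
}

async function respondWithFailover(
  models: string[],
  prompt: string,
  call: ModelCall,
  log: (line: string) => void,
): Promise<FailoverResult> {
  const diagnostics: string[] = [];
  for (const model of models) {
    try {
      const reply = await call(model, prompt);
      return { reply, answeredBy: model, diagnostics };
    } catch (err) {
      const line = `model ${model} failed: ${err instanceof Error ? err.message : String(err)}`;
      diagnostics.push(line);
      log(line); // internal log / status surface, never the chat reply
    }
  }
  // Only when every candidate fails does an error reach the user.
  throw new Error(`no model could answer (${diagnostics.length} failures)`);
}

// Usage with a fake provider that rejects the invalid primary.
const fakeCall: ModelCall = async (model, prompt) => {
  if (model.includes("__invalid_test_model__")) {
    throw new Error(`HTTP 400: ${model} is not a valid model ID`);
  }
  return `answer to "${prompt}"`;
};

respondWithFailover(
  ["openrouter/__invalid_test_model__", "openrouter/some-fallback"],
  "hi",
  fakeCall,
  (line) => console.error(line),
).then((r) => console.log(r.answeredBy, r.reply));
```

Returning the diagnostics alongside the reply lets a caller decide whether to show them in a status view. The chat path itself never sees them.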

Extended analysis

Fix Plan

To address the issue of OpenClaw not automatically failing over to a valid fallback model when the primary model ID is invalid, we need to modify the code to handle this scenario. Here are the steps:

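  • Modify the model-call path to catch the error when the primary model fails and try the configured fallback models in order. The example below uses a hypothetical openclaw.js; the merged fix in PR #50028 changed src/agents/pi-embedded-helpers/errors.ts instead.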
  • Add a check to ensure that the primary model ID is valid before attempting to use it.
  • If the primary model ID is invalid, log the error internally and use the first valid fallback model.

Example code changes:

// openclaw.js (illustrative only: getModel() and model.respond() are placeholders, not OpenClaw APIs)
const primaryModelId = 'openrouter/__invalid_test_model__';
const fallbackModels = ['openai-codex/gpt-5.4', 'other-fallback-model'];

async function respondWithFallback(getModel, userInput) {
  // Try the primary first, then every fallback in order (not just the first one)
  const candidates = [primaryModelId, ...fallbackModels];
  let lastError;
  for (const modelId of candidates) {
    try {
      const model = getModel(modelId);
      // Return the first successful answer; earlier failures never reach the user
      return await model.respond(userInput);
    } catch (error) {
      // Log the failure internally and move on to the next candidate
      console.error(`Model ${modelId} failed: ${error}`);
      lastError = error;
    }
  }
  // Every candidate failed: only now surface an error to the caller
  throw lastError;
}
  • Update the openclaw.json config file so that agents.defaults.model lists valid fallback models.
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "openrouter/__invalid_test_model__",
        "fallbacks": ["openai-codex/gpt-5.4", "other-fallback-model"]
      }
    }
  }
}

Verification

To verify that the fix worked, follow these steps:

  • Set the primary model ID to an invalid value in the openclaw.json config file.
  • Send a user request to OpenClaw.
  • Verify that the response is generated by the first valid fallback model.
  • Check the logs to ensure that the primary model failure is logged internally.

Extra Tips

  • Make sure to test the fallback behavior with different types of primary model failures, such as 429 errors.
  • Consider adding additional logging and monitoring to detect and respond to primary model failures.
  • Review the OpenClaw documentation to ensure that the fallback behavior is clearly documented.
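One way to follow the first tip is a table-driven check. It feeds several simulated primary failures through the same failover path and confirms the fallback answers each time. The failure list, the status codes chosen, and the helpers makeError and answerWith are assumptions for illustration. OpenClaw's own tests (for example model-fallback.test.ts) may be structured differently.

```typescript
// Hypothetical table-driven check: every simulated primary failure should
// end with the fallback answering and no provider error in the reply.
type SimulatedFailure = { label: string; status: number; message: string };

const failures: SimulatedFailure[] = [
  { label: "invalid model ID", status: 400, message: "openrouter/__invalid_test_model__ is not a valid model ID" },
  { label: "rate limited", status: 429, message: "Rate limit exceeded" },
  { label: "quota exhausted", status: 402, message: "Insufficient credits" },
];

function makeError(status: number, message: string): Error {
  return Object.assign(new Error(`HTTP ${status}: ${message}`), { status });
}

// Minimal synchronous failover loop, simplified for the check.
function answerWith(models: string[], call: (model: string) => string): { model: string; reply: string } {
  for (const model of models) {
    try {
      return { model, reply: call(model) };
    } catch {
      // Try the next candidate; real code would log and classify the failure.
    }
  }
  throw new Error("all candidates failed");
}

for (const f of failures) {
  const result = answerWith(["primary", "fallback"], (model) => {
    if (model === "primary") throw makeError(f.status, f.message);
    return "fallback answer";
  });
  const ok = result.model === "fallback" && !result.reply.includes("HTTP");
  console.log(`${f.label}: ${ok ? "fallback answered" : "FAILED"}`);
}
```

In a real test the loop would call the actual failover runner, not answerWith. Keeping the failure table separate makes it cheap to add cases such as 5xx or timeouts.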


