openclaw: before_model_resolve hook fires once per fallback iteration in runWithModelFallback, defeating runtime failover for routing plugins [1 comment, 2 participants]

openclaw/openclaw#63139 · Fetched 2026-04-09 07:57:57



Summary

The before_model_resolve hook is invoked once per fallback iteration inside runWithModelFallback → runAgentAttempt → runEmbeddedPiAgent, not once per logical agent run. Routing plugins that unconditionally rewrite providerOverride / modelOverride therefore defeat the runtime's model fallback chain entirely: every candidate the iterator advances to is silently rewritten back to the plugin's chosen model, and the user sees All models failed (N) even when downstream candidates are healthy.

The hook's public docstring (PluginHookBeforeModelResolveEvent) says "User prompt for this run", reinforcing the natural reading that the hook fires once per turn. The runtime's actual behavior contradicts that contract.

This is the inverse symptom of #41487: that issue reports plugin overrides being silently dropped at execution, while this issue reports plugin overrides being applied so aggressively (re-requested on every iteration) that they defeat fallback. Both issues touch the same before_model_resolve ↔ runWithModelFallback boundary, and a clean fix would likely resolve both.

Affects: [email protected] (current). Build hash 7ffe7e4.

Code path

runWithModelFallback is the outer fallback loop. It iterates over resolveFallbackCandidates(...) and invokes a run() callback per candidate:

pi-embedded-CbCYZxIb.js:84829-84957 (runWithModelFallback):

async function runWithModelFallback(params) {
  const candidates = resolveFallbackCandidates({ ... });
  ...
  for (let i = 0; i < candidates.length; i += 1) {
    const candidate = candidates[i];
    ...
    const attemptRun = await runFallbackAttempt({
      run: params.run,
      ...candidate,
      ...
    });
    ...
  }
}

The run callback is supplied by agentCommandInternal and constructs a fresh runAgentAttempt per candidate:

pi-embedded-CbCYZxIb.js:126916-126957 (caller wiring):

const fallbackResult = await runWithModelFallback({
  cfg, provider, model, runId, agentDir,
  fallbacksOverride: effectiveFallbacksOverride,
  run: (providerOverride, modelOverride, runOptions) => {
    const isFallbackRetry = fallbackAttemptIndex > 0;
    fallbackAttemptIndex += 1;
    return runAgentAttempt({
      providerOverride,    // ← candidate from the fallback iterator
      modelOverride,       // ← candidate from the fallback iterator
      ...
      isFallbackRetry,
      ...
    });
  }
});

runAgentAttempt (line 126354) forwards the override into runEmbeddedPiAgent as provider / model:

pi-embedded-CbCYZxIb.js:126427-126470:

return runEmbeddedPiAgent({
  ...
  provider: params.providerOverride,
  model: params.modelOverride,
  ...
});

runEmbeddedPiAgent then unconditionally calls runBeforeModelResolve on every entry, before applying the override:

pi-embedded-CbCYZxIb.js:177235-177309:

async function runEmbeddedPiAgent(params) {
  ...
  let provider = (params.provider ?? "anthropic").trim() || "anthropic";
  let modelId = (params.model ?? "claude-opus-4-6").trim() || "claude-opus-4-6";
  ...
  const hookCtx = { agentId, sessionKey, sessionId, workspaceDir,
                    messageProvider, trigger, channelId };  // ← no runId, no isFallbackRetry, no attempt index

  if (hookRunner?.hasHooks("before_model_resolve")) try {
    modelResolveOverride = await hookRunner.runBeforeModelResolve(
      { prompt: params.prompt }, hookCtx);
  } catch (hookErr) { ... }
  ...
  if (modelResolveOverride?.providerOverride) {
    provider = modelResolveOverride.providerOverride;
    log$26.info(`[hooks] provider overridden to ${provider}`);
  }
  if (modelResolveOverride?.modelOverride) {
    modelId = modelResolveOverride.modelOverride;
    log$26.info(`[hooks] model overridden to ${modelId}`);
  }
  ...
}

Net effect: The fallback iterator picks candidate N and calls runAgentAttempt(N) → runEmbeddedPiAgent(N). Inside runEmbeddedPiAgent, the plugin's before_model_resolve runs, rewrites the model back to its router-chosen model M, and the actual API call goes to M instead of N. The fallback iterator's intent is invisibly discarded. When M fails again, the iterator advances to candidate N+1 → same dance → still M. Eventually the iterator exhausts and surfaces All models failed listing N, N+1, N+2, ..., none of which were actually called.
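The net effect can be reproduced in a small synchronous model of the call chain. This is a sketch for illustration only: runAttempt and runWithFallback are simplified stand-ins for runEmbeddedPiAgent and runWithModelFallback, the failure check is reduced to a set of failing provider names, and none of it is the openclaw source.

```typescript
// Simplified model of the per-iteration hook call described above.
// All function names here are illustrative stand-ins, not openclaw code.
type Candidate = { provider: string; model: string };
type Override = { providerOverride?: string; modelOverride?: string };
type Hook = (event: { prompt: string }) => Override | undefined;

// Stand-in for runEmbeddedPiAgent: the hook runs on every entry and its
// override replaces whatever candidate the fallback loop passed in.
function runAttempt(
  candidate: Candidate,
  prompt: string,
  hook: Hook,
  failingProviders: Set<string>,
  called: string[],
): boolean {
  const override = hook({ prompt });
  const provider = override?.providerOverride ?? candidate.provider;
  const model = override?.modelOverride ?? candidate.model;
  called.push(`${provider}/${model}`);
  return !failingProviders.has(provider); // true means the attempt succeeded
}

// Stand-in for runWithModelFallback: try each candidate until one succeeds.
function runWithFallback(
  candidates: Candidate[],
  prompt: string,
  hook: Hook,
  failingProviders: Set<string>,
): { ok: boolean; called: string[] } {
  const called: string[] = [];
  for (const candidate of candidates) {
    if (runAttempt(candidate, prompt, hook, failingProviders, called)) {
      return { ok: true, called };
    }
  }
  return { ok: false, called };
}

const chain: Candidate[] = [
  { provider: "openai-codex", model: "gpt-5.4" },
  { provider: "openai-codex", model: "gpt-5.2-codex" },
  { provider: "anthropic", model: "claude-haiku-4-5" },
];

// A router hook that always returns the same override.
const routerHook: Hook = () => ({
  providerOverride: "openai-codex",
  modelOverride: "gpt-5.4-mini",
});

const result = runWithFallback(chain, "hello", routerHook, new Set(["openai-codex"]));
console.log(result.ok, result.called);
```

With openai-codex marked as failing, every entry in result.called is openai-codex/gpt-5.4-mini and the healthy anthropic candidate is never reached, which mirrors the observed behavior.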

Hook context is missing the fields plugins would need to detect retries

The ctx object built at pi-embedded-CbCYZxIb.js:177279-177287 carries:

agentId, sessionKey, sessionId, workspaceDir, messageProvider, trigger, channelId

It does not carry:

  • runId, although params.runId is in scope right there
  • attempt / fallbackAttemptIndex, although the run callback that wraps runAgentAttempt already tracks this
  • isFallbackRetry, although runAgentAttempt already passes it explicitly to runEmbeddedPiAgent

The public type declaration confirms this is the documented surface:

plugin-sdk/src/plugins/types.d.ts:1241-1255:

export type PluginHookAgentContext = {
    agentId?: string;
    sessionKey?: string;
    sessionId?: string;
    workspaceDir?: string;
    messageProvider?: string;
    /** What initiated this agent run: "user", "heartbeat", "cron", or "memory". */
    trigger?: string;
    /** Channel identifier (e.g. "telegram", "discord", "whatsapp"). */
    channelId?: string;
};
export type PluginHookBeforeModelResolveEvent = {
    /** User prompt for this run. No session messages are available yet in this phase. */
    prompt: string;
};

The docstring on PluginHookBeforeModelResolveEvent.prompt ("User prompt for this run") is the misleading part — it implies one invocation per run.

Reproduction

  1. Configure a fallback chain with at least two distinct providers, e.g.:
    "agents.defaults.subagents.model.fallbacks": [
      "openai-codex/gpt-5.4",
      "openai-codex/gpt-5.2-codex",
      "anthropic/claude-haiku-4-5"
    ]
  2. Install a before_model_resolve plugin that unconditionally returns { providerOverride: "openai-codex", modelOverride: "gpt-5.4-mini" } for any call (the simplest possible router plugin).
  3. Force the openai-codex provider into a failure state (rate-limited, ChatGPT Plus weekly cap reached, OAuth refresh failure, etc.).
  4. Send any user message.
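For step 2, the handler itself can be as small as the sketch below. The event and return shapes follow the types quoted in this issue. The function name alwaysRouteToMini is hypothetical, and how the handler is registered with the plugin SDK is not shown here because the issue does not include it.

```typescript
// Minimal router handler for the reproduction. Only the handler body is shown;
// plugin registration is intentionally omitted (not covered in this issue).
type BeforeModelResolveEvent = { prompt: string };
type ModelResolveOverride = { providerOverride?: string; modelOverride?: string };

// Hypothetical name. Returns the same override for every call, ignoring the prompt.
function alwaysRouteToMini(_event: BeforeModelResolveEvent): ModelResolveOverride {
  return { providerOverride: "openai-codex", modelOverride: "gpt-5.4-mini" };
}

console.log(alwaysRouteToMini({ prompt: "any message" }));
```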

Observed:

  • Gateway journal shows model-fallback decision events advancing through the configured chain (candidate=openai-codex/gpt-5.2-codex ... next=anthropic/claude-haiku-4-5)
  • [hooks] model overridden to gpt-5.4-mini fires multiple times for the same runId
  • All agent end events for that runId report model=gpt-5.4-mini provider=openai-codex — never anthropic, never the candidates the iterator advanced through
  • Final error: All models failed (N): openai-codex/gpt-5.2-codex ... | anthropic/claude-haiku-4-5: ... — but anthropic was never actually called

Expected:

  • before_model_resolve should fire once per logical agent run, before runWithModelFallback's loop begins. The plugin's choice becomes the primary model. If that fails, the runtime's fallback iterator advances to the next configured candidate without re-asking the plugin.
  • Alternatively: the hook may fire per iteration but the context must give plugins enough information to distinguish retries (runId + isFallbackRetry + attempt) so they can opt into single-shot behavior.

Suggested fixes (any one works; 1 is cleanest)

Option 1 — Move runBeforeModelResolve outside runWithModelFallback (preferred)

Hoist the hook call to the layer that owns the logical run boundary (agentCommandInternal or wherever runId is minted), capture the override there, and pass the resolved (provider, model) into runWithModelFallback as the starting candidate. The fallback iterator then operates on the configured chain anchored at the plugin's choice — failover works as designed.

This matches the natural reading of the docstring ("for this run") and eliminates an entire class of plugin foot-guns.
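A minimal sketch of Option 1, using the same kind of simplified synchronous model rather than the real runtime: the hook is consulted once, its answer becomes the first candidate, and the loop then walks the configured chain without asking the plugin again. runLogicalTurn is a hypothetical name, and the primary model used in the usage lines is an assumption borrowed from the default shown in runEmbeddedPiAgent.

```typescript
// Sketch of Option 1: resolve the plugin override once, before the fallback loop.
// runLogicalTurn is a hypothetical stand-in, not an openclaw function.
type Candidate = { provider: string; model: string };
type Override = { providerOverride?: string; modelOverride?: string };
type Hook = (event: { prompt: string }) => Override | undefined;

function runLogicalTurn(
  primary: Candidate,
  fallbacks: Candidate[],
  prompt: string,
  hook: Hook,
  failingProviders: Set<string>,
): { ok: boolean; called: string[] } {
  // The hook fires exactly once per logical run.
  const override = hook({ prompt });
  const first: Candidate = {
    provider: override?.providerOverride ?? primary.provider,
    model: override?.modelOverride ?? primary.model,
  };

  const called: string[] = [];
  // The configured chain follows the plugin's choice and is never rewritten.
  for (const candidate of [first, ...fallbacks]) {
    called.push(`${candidate.provider}/${candidate.model}`);
    if (!failingProviders.has(candidate.provider)) {
      return { ok: true, called };
    }
  }
  return { ok: false, called };
}

let hookCalls = 0;
const countingRouterHook: Hook = () => {
  hookCalls += 1;
  return { providerOverride: "openai-codex", modelOverride: "gpt-5.4-mini" };
};

const outcome = runLogicalTurn(
  { provider: "anthropic", model: "claude-opus-4-6" }, // assumed primary
  [
    { provider: "openai-codex", model: "gpt-5.4" },
    { provider: "openai-codex", model: "gpt-5.2-codex" },
    { provider: "anthropic", model: "claude-haiku-4-5" },
  ],
  "hello",
  countingRouterHook,
  new Set(["openai-codex"]),
);
console.log(hookCalls, outcome.ok, outcome.called);
```

In this model the hook runs once, the three openai-codex attempts fail, and the run succeeds on anthropic/claude-haiku-4-5.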

Option 2 — Add runId, attempt, isFallbackRetry to PluginHookAgentContext

Plugins can then memoize per-run or skip on retries. Still backwards-compatible because the new fields are optional. Concretely:

export type PluginHookAgentContext = {
    agentId?: string;
    sessionKey?: string;
    sessionId?: string;
    workspaceDir?: string;
    messageProvider?: string;
    trigger?: string;
    channelId?: string;
    /** Unique identifier for this logical agent run. Stable across fallback retries. */
    runId?: string;
    /** Zero-based index of the current fallback attempt. 0 = first attempt. */
    attempt?: number;
    /** True when this hook invocation is a model-fallback retry, not the first attempt. */
    isFallbackRetry?: boolean;
};

The values are already in scope at the call site (pi-embedded-CbCYZxIb.js:177279). Wiring them through is a one-line change to the hookCtx object literal.
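A sketch of both sides of Option 2, assuming the proposed optional fields are added as written above. buildHookCtx and routeOnFirstAttemptOnly are hypothetical names used only for illustration. The plugin side relies on the behavior visible in the runEmbeddedPiAgent excerpt, where the candidate is left untouched when the hook returns no override.

```typescript
// Sketch of Option 2. The context type is trimmed to the fields used here.
type HookAgentContext = {
  agentId?: string;
  sessionId?: string;
  runId?: string;
  attempt?: number;
  isFallbackRetry?: boolean;
};
type ModelOverride = { providerOverride?: string; modelOverride?: string };

// Runtime side (hypothetical helper): add the retry fields to the existing context.
function buildHookCtx(
  base: { agentId?: string; sessionId?: string },
  runId: string,
  attempt: number,
): HookAgentContext {
  return { ...base, runId, attempt, isFallbackRetry: attempt > 0 };
}

// Plugin side (hypothetical handler): route only on the first attempt so that
// fallback candidates chosen by the runtime are left alone on retries.
function routeOnFirstAttemptOnly(
  _event: { prompt: string },
  ctx: HookAgentContext,
): ModelOverride | undefined {
  if (ctx.isFallbackRetry) {
    return undefined; // no override, so the fallback candidate stands
  }
  return { providerOverride: "openai-codex", modelOverride: "gpt-5.4-mini" };
}

const firstCtx = buildHookCtx({ sessionId: "s1" }, "run-1", 0);
const retryCtx = buildHookCtx({ sessionId: "s1" }, "run-1", 1);
console.log(routeOnFirstAttemptOnly({ prompt: "hi" }, firstCtx));
console.log(routeOnFirstAttemptOnly({ prompt: "hi" }, retryCtx));
```

Because the new fields are optional, a plugin that ignores them behaves exactly as it does today.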

Option 3 — Both

Option 1 fixes the architectural smell; Option 2 hardens the contract for plugins that want per-iteration visibility (e.g. for telemetry, cost tracking, or A/B testing). They're not mutually exclusive.

Plugin-side workaround we shipped (not blocking, FYI for the maintainer)

We mitigated this in our task-router plugin via dual memoization:

  • Strategy A (forward-compat, dormant today): Map<runId, {ts}> keyed on runId with 10-minute TTL. Wired and ready for the day the runtime starts passing runId in hookCtx.
  • Strategy B (active today): (sessionId, promptKey) memoization with a 30-second window. Fallback iterations of the same user turn share session ID and prompt text, so they collapse to a single route_applied event. logSkip("already_routed_for_run") fires on suppressed iterations.

Code reference (downstream):

The workaround works, but it's load-bearing on a 30-second timing assumption that wouldn't be needed if the runtime called the hook at the right level. Strategy A would activate automatically on any release that adds runId to hookCtx (Option 2 above).
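The dual memoization described above could look roughly like the following. This is a sketch under stated assumptions, not the plugin's actual code: the RouteMemo class and its method names are hypothetical, the prompt text is used directly as the prompt key because the issue does not say how promptKey is derived, and expired entries are overwritten on reuse rather than evicted.

```typescript
// Sketch of the two memoization strategies. Names are hypothetical.
const RUN_TTL_MS = 10 * 60 * 1000; // Strategy A: 10-minute TTL keyed on runId
const TURN_WINDOW_MS = 30 * 1000; // Strategy B: 30-second window per (sessionId, prompt)

type MemoCtx = { runId?: string; sessionId?: string };

class RouteMemo {
  private byRun = new Map<string, { ts: number }>();
  private byTurn = new Map<string, { ts: number }>();

  /** Returns true when this invocation should be routed, false when it repeats a routed turn. */
  shouldRoute(ctx: MemoCtx, prompt: string, now: number = Date.now()): boolean {
    if (ctx.runId !== undefined) {
      // Strategy A: only active once the runtime passes runId in the hook context.
      return this.claim(this.byRun, ctx.runId, RUN_TTL_MS, now);
    }
    // Strategy B: fallback iterations share session ID and prompt text.
    const turnKey = JSON.stringify([ctx.sessionId ?? "", prompt]);
    return this.claim(this.byTurn, turnKey, TURN_WINDOW_MS, now);
  }

  private claim(store: Map<string, { ts: number }>, key: string, ttlMs: number, now: number): boolean {
    const seen = store.get(key);
    if (seen !== undefined && now - seen.ts < ttlMs) {
      return false; // already routed within the window
    }
    store.set(key, { ts: now });
    return true;
  }
}

const memo = new RouteMemo();
console.log(memo.shouldRoute({ sessionId: "s1" }, "hi", 0)); // first call for this turn
console.log(memo.shouldRoute({ sessionId: "s1" }, "hi", 1_000)); // fallback retry one second later
```

The sketch also makes the timing dependency visible: under Strategy B, a user who sends the identical prompt in the same session within 30 seconds is treated as a retry, which is the assumption that keying on runId would remove.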

Related

  • #41487 — providerOverride/modelOverride is not consistently honored at final execution. Same hook ↔ fallback boundary, opposite symptom (override silently dropped vs override over-applied). Both could be fixed by Option 1.

extent analysis

TL;DR

The most likely fix for the issue is to move the runBeforeModelResolve hook call outside the runWithModelFallback loop, so it fires once per logical agent run, allowing plugins to make a single choice for the primary model.

Guidance

  • Identify the layer that owns the logical run boundary (agentCommandInternal or wherever runId is minted) and hoist the runBeforeModelResolve hook call to that layer.
  • Capture the override at that layer and pass the resolved (provider, model) into runWithModelFallback as the starting candidate.
  • Alternatively, add runId, attempt, and isFallbackRetry to PluginHookAgentContext to allow plugins to distinguish retries and opt into single-shot behavior.
  • Verify the fix by checking that the before_model_resolve hook fires only once per logical agent run and that the fallback iterator operates correctly.

Example

// Option 1 (sketch): call the hook once, before the fallback loop starts.
// Assumes params.prompt and hookCtx are available at this layer.
const override = hookRunner?.hasHooks("before_model_resolve")
  ? await hookRunner.runBeforeModelResolve({ prompt: params.prompt }, hookCtx)
  : undefined;
const fallbackResult = await runWithModelFallback({
  cfg,
  // The plugin's choice, if any, becomes the primary candidate.
  provider: override?.providerOverride ?? provider,
  model: override?.modelOverride ?? model,
  runId, agentDir,
  fallbacksOverride: effectiveFallbacksOverride,
  run: (providerOverride, modelOverride, runOptions) => {
    // ... runAgentAttempt as before; runEmbeddedPiAgent no longer calls the hook
  },
});

Notes

  • runBeforeModelResolve is currently called inside runEmbeddedPiAgent, which runWithModelFallback enters once per candidate, so the hook fires on every fallback iteration and rewrites each candidate back to the plugin's model.
  • Option 1 restores failover without any plugin changes. Option 2 keeps per-iteration invocation but lets plugins detect retries through runId, attempt, and isFallbackRetry.

Recommendation

Apply Option 1: move the runBeforeModelResolve hook call outside the runWithModelFallback loop. It is the cleanest fix because the hook then fires once per logical agent run, the plugin's choice becomes the primary model, and the fallback iterator can still advance through the configured chain if that model fails.
