openclaw - 💡(How to fix) Fix [Bug]: media-understanding silently routes images to user-declared vision models without validating declared capabilities [3 comments, 2 participants]

openclaw/openclaw#81525 • Fetched 2026-05-14 03:31:13
Comments: 3 · Participants: 2 · Timeline events: 3 (commented ×3) · Reactions: 2


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

The media-understanding runtime silently routes inbound images to a user-declared vision model without ever validating that the model actually supports images, so a single incorrect "input": ["text", "image"] entry in models.providers.* causes every inbound image to fail with the unhelpful warning Image model failed (<provider>/<model>) and 0/1 attempts.

What this report is about (and what it is not)

This is not another instance of the Unknown model cluster tracked in #33185. The recently merged fixes #71500 and #71806 resolve the lookup/resolution gap (the chosen model is now correctly found and dispatched). That part works in 2026.5.7.

This report covers the next layer: the chosen model is found, dispatched, and immediately rejected by the provider because the locally declared "input" capability disagrees with reality. There is no catalog cross-check, no capability probe, no deterministic ordering, and the warning gives no actionable hint.

The unmerged draft PR #71458 (closed as superseded by #71806) did not address this layer either.

Steps to reproduce

  1. Run OpenClaw 2026.5.7 (build eeef486) on Linux (Ubuntu 24.04, Node v22.22.2).
  2. Configure the bundled NVIDIA provider in ~/.openclaw/openclaw.json, but declare the text-only Kimi K2.5 model as vision-capable:
    {
      "models": {
        "providers": {
          "nvidia": {
            "baseUrl": "https://integrate.api.nvidia.com/v1",
            "apiKey": "...",
            "api": "openai-completions",
            "models": [
              {
                "id": "moonshotai/kimi-k2.5",
                "input": ["text", "image"]
              }
            ]
          }
        }
      }
    }
    Note: the bundled provider catalog (dist/provider-catalog-D7DMM0zI2.js) ships this model with "input": ["text"] only, and the user override silently wins.
  3. Send any image to an agent whose imageModel is not explicitly set, so resolveDefaultMediaModel / resolveAutoMediaKeyProviders fall back to user-configured providers.
  4. Observe the warning in journalctl:
    [media-understanding] image: failed (0/1) reason=Image model failed (nvidia/moonshotai/kimi-k2.5)

Expected behavior

At least one of the following safety nets should kick in, in priority order:

  1. Catalog cross-check at load time: if a configured model id exists in the bundled provider catalog with a narrower input set, emit a startup warning (or refuse the override unless an explicit opt-in flag is set).
  2. Capability probe before routing: before selecting a user-declared vision model in resolveAutoMediaKeyProviders, verify the provider actually advertises image support for that model id.
  3. Deterministic ordering across multiple user-declared vision models, not the current "first hit in JSON insertion order" of Object.entries(providers).
  4. Actionable error message: currently the warning reads Image model failed (...). It should distinguish "model rejected the image payload" from "transport/auth error" and point at the offending input declaration.
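
The first safety net could look roughly like the sketch below. It is an illustration under stated assumptions, not OpenClaw code: findCapabilityOverrides and the two in-memory objects are hypothetical, and only the allowCapabilityOverride flag name is taken from the fix suggestions later in this report.

```javascript
// Hypothetical helper: names and data shapes are assumptions, not OpenClaw APIs.
// Returns one finding per user-declared model whose "input" list contains
// modalities that the bundled catalog does not list for the same id.
function findCapabilityOverrides(userProviders, catalogProviders) {
  const findings = [];
  for (const [providerId, providerCfg] of Object.entries(userProviders ?? {})) {
    const catalogModels = catalogProviders?.[providerId]?.models ?? [];
    for (const model of providerCfg?.models ?? []) {
      const catalogEntry = catalogModels.find((entry) => entry.id === model?.id);
      // Models unknown to the catalog cannot be cross-checked; skip them.
      if (!catalogEntry || !Array.isArray(catalogEntry.input)) continue;
      if (!Array.isArray(model?.input)) continue;
      const extra = model.input.filter((kind) => !catalogEntry.input.includes(kind));
      // An explicit opt-in on the model silences the finding.
      if (extra.length > 0 && model.allowCapabilityOverride !== true) {
        findings.push({ providerId, modelId: model.id, extra });
      }
    }
  }
  return findings;
}

// Data mirroring the reproducer: user config widens what the catalog ships.
const userProviders = {
  nvidia: { models: [{ id: "moonshotai/kimi-k2.5", input: ["text", "image"] }] },
};
const catalogProviders = {
  nvidia: { models: [{ id: "moonshotai/kimi-k2.5", name: "Kimi K2.5", input: ["text"] }] },
};

const findings = findCapabilityOverrides(userProviders, catalogProviders);
for (const finding of findings) {
  console.warn(
    `[config] ${finding.providerId}/${finding.modelId} declares input ` +
      `${JSON.stringify(finding.extra)} that the bundled catalog does not list`
  );
}
```

With the reproducer's data this yields a single finding for nvidia/moonshotai/kimi-k2.5 with the extra modality "image". Whether such a finding should only warn or should block the override is a policy choice the sketch leaves open.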

Actual behavior

resolveDefaultMediaModel in dist/defaults-DF3zyacx.js (lines ~60–105 in 2026.5.7) walks the configured providers and accepts any model where Array.isArray(model.input) && model.input.includes("image") is true. The first such match wins. There is no cross-validation against the shipped catalog and no capability probe, and the runtime gives up after the single attempt that the log reports as 0/1:

return normalizeOptionalString(
  (providerCfg?.models ?? [])
    .find((model) =>
      Boolean(normalizeOptionalString(model?.id)) &&
      Array.isArray(model?.input) &&
      model.input.includes("image")
    )?.id
);

The shipped catalog (dist/provider-catalog-D7DMM0zI2.js) for the same provider:

{
  "id": "moonshotai/kimi-k2.5",
  "name": "Kimi K2.5",
  "input": ["text"],
  ...
}

User config wins, the request is dispatched, NVIDIA rejects it because Kimi K2.5 is text-only, and OpenClaw surfaces only the opaque Image model failed warning. The downstream agent then never receives a media-understanding text description for the inbound image. In our case the agent had an independent Vision path available, so the user did not notice a functional regression, only the recurring warning in logs.
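
The selection step can be reproduced in isolation with a few lines of JavaScript. This is a minimal sketch: the predicate is copied from the excerpt above, but normalizeOptionalString is a stand-in whose behavior is assumed (trim, and treat empty or non-string values as undefined), and pickFirstDeclaredImageModel is a hypothetical wrapper name.

```javascript
// Stand-in for OpenClaw's normalizeOptionalString; its real behavior is an
// assumption here (trim, and treat empty or non-string values as undefined).
function normalizeOptionalString(value) {
  if (typeof value !== "string") return undefined;
  const trimmed = value.trim();
  return trimmed.length > 0 ? trimmed : undefined;
}

// Same predicate as the excerpt above: the declared "input" array is
// trusted as-is, with no check against the catalog or the provider.
function pickFirstDeclaredImageModel(providerCfg) {
  return normalizeOptionalString(
    (providerCfg?.models ?? []).find(
      (model) =>
        Boolean(normalizeOptionalString(model?.id)) &&
        Array.isArray(model?.input) &&
        model.input.includes("image")
    )?.id
  );
}

const misdeclared = { models: [{ id: "moonshotai/kimi-k2.5", input: ["text", "image"] }] };
const corrected = { models: [{ id: "moonshotai/kimi-k2.5", input: ["text"] }] };

console.log(pickFirstDeclaredImageModel(misdeclared)); // prints moonshotai/kimi-k2.5
console.log(pickFirstDeclaredImageModel(corrected)); // prints undefined
```

In this sketch, correcting the declaration to ["text"] is enough to stop the provider from yielding an image candidate, which suggests that fixing the "input" entry is the immediate operator-side remedy until validation exists.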

OpenClaw version

2026.5.7 (build eeef486)

Operating system

Ubuntu 24.04 (Linux x86_64), Node v22.22.2

Install method

npm global (/usr/lib/node_modules/openclaw)

Model

nvidia/moonshotai/kimi-k2.5 (auto-selected by resolveAutoMediaKeyProviders for the image capability)

Provider / routing chain

openclaw → media-understanding registry → user-configured nvidia provider → NVIDIA Integrate API (https://integrate.api.nvidia.com/v1)

Additional provider/model setup details

  • agents.defaults.imageModel.primary is not set, so the runtime relies entirely on resolveAutoMediaKeyProviders fallback ordering.
  • Other configured providers (openai, openrouter, anthropic) either have no image-capable entries in models.providers.* or are not declared as image-capable, so NVIDIA wins by default.
  • The agent's primary text model is unrelated (openai/gpt-5.5).
  • Redacted full config available on request; relevant excerpt shown above.

Logs, screenshots, and evidence

May 13 21:45:19 hog-xcloud-server-1 node[1006456]: [media-understanding] image: failed (0/1) reason=Image model failed (nvidia/moonshotai/kimi-k2.5)

Surrounding context (2-second window):

May 13 21:45:17 ... [ws] ⇄ res ✓ logs.tail 256ms ...
May 13 21:45:19 ... [media-understanding] image: failed (0/1) reason=Image model failed (nvidia/moonshotai/kimi-k2.5)

Selection algorithm reference (verbatim from dist/defaults-DF3zyacx.js, 2026.5.7):

function resolveConfiguredImageProviderModel({ cfg, providerId }) {
  // ...
  return normalizeOptionalString(
    (providerCfg?.models ?? [])
      .find((model) => Boolean(normalizeOptionalString(model?.id)) &&
                       Array.isArray(model?.input) &&
                       model.input.includes("image"))?.id
  );
}

function resolveAutoMediaKeyProviders(params) {
  // priority registry first, then:
  return [...new Set([...prioritized, ...resolveConfiguredImageProviderIds(params.cfg)])];
}
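
One way to make the fallback order independent of key order in the config file is sketched below. It is only an illustration: orderImageProviders and its preferred parameter are hypothetical and do not exist in OpenClaw, and the tie-break rule (alphabetical by provider id) is an arbitrary choice.

```javascript
// Hypothetical helper, not an OpenClaw API: orders candidate provider ids by
// an explicit operator-supplied preference list, then alphabetically, so the
// result does not depend on insertion order in the config file.
function orderImageProviders(providerIds, preferred = []) {
  // Providers missing from the preference list all share the lowest rank.
  const rank = (id) => {
    const index = preferred.indexOf(id);
    return index === -1 ? preferred.length : index;
  };
  // Plain code-unit comparison keeps the tie-break locale independent.
  const byId = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
  return [...new Set(providerIds)].sort((a, b) => rank(a) - rank(b) || byId(a, b));
}

// Two configs listing the same providers in different JSON order.
const fromConfigA = ["openrouter", "nvidia", "openai"];
const fromConfigB = ["nvidia", "openai", "openrouter"];

// Both calls yield the order: openai, nvidia, openrouter.
console.log(orderImageProviders(fromConfigA, ["openai"]));
console.log(orderImageProviders(fromConfigB, ["openai"]));
```

Deterministic ordering does not by itself prevent a mis-declared model from being picked; it only makes the pick predictable and documentable.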

Catalog reference (verbatim from dist/provider-catalog-D7DMM0zI2.js, 2026.5.7):

{ "id": "moonshotai/kimi-k2.5", "input": ["text"], ... }

Impact and severity

  • Affected: any user who maintains a custom models.providers.* block and accidentally declares "input": ["text", "image"] on a text-only model (copy-paste from a vision-capable entry, stale catalog snapshot, third-party provider docs). With the NVIDIA Integrate provider this is particularly easy because the bundled NVIDIA entry contains only text-only models today, so any operator who manually added vision support has no in-catalog reference.
  • Severity: medium. Image understanding silently breaks; the agent never receives the auto-generated description. Users with a separately wired Vision path (e.g. Claude/Gemini direct) may not notice degraded behavior for a long time; the only sign is the recurring warning in logs.
  • Frequency: every inbound image, deterministically, until the operator notices.
  • Consequence: lost media-understanding context, misleading log noise, no actionable diagnostic.

Additional information

Related work and prior art:

  • #33185: canonical cluster issue for imageModel resolution problems (closed). The merged fixes only address the lookup side of the cluster.
  • #71500 (merged 2026-04-25): "fix(image): prepare dynamic models before image tool registry lookup". Resolves the dynamic-models-not-ready timing window.
  • #71806 (merged 2026-04-25): "fix(image): resolve provider-prefixed configured models" (steipete). Fixes namespace-prefix matching for configured custom vision models. Does not validate declared capabilities.
  • #71458 (closed 2026-04-25, never merged): "fix(image): resolve configured provider models" (vincentkoc). Closed as superseded by #71806. Also did not address the capability-validation gap.
  • #62924 (open): "Expose actual media-understanding chosen model in inbound body to avoid guessed media model reporting". Closely related: better observability would have surfaced this misconfiguration much earlier.
  • #68272 (closed): "Image attachments dropped even when model supports images". Different symptom (parse-time drop), but same underlying theme: capability metadata and runtime behavior disagree.
  • #77090 (open): "Auto-revert to primary model after image analysis". Adjacent feature in the same selection pathway.

Possible fix directions (suggestions, not prescriptions):

  1. In resolveConfiguredImageProviderModel, after picking a candidate, cross-check against the bundled provider catalog: if the catalog has the same provider/model id with a narrower input set, emit a clear warning at config-load time and (optionally) refuse the override unless allowCapabilityOverride: true is set on the model.
  2. Replace the silent 0/1 failure with a structured error code (e.g. IMAGE_MODEL_REJECTED_PAYLOAD vs IMAGE_MODEL_TRANSPORT) so operators can grep meaningfully.
  3. Document the precedence rules for imageModel selection (priority registry → user-configured providers → first JSON-order match) in docs/gateway/configuration-reference.md. Today the behavior is only observable from the source.
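
The structured error codes from the second suggestion could be produced by a small classifier such as the one sketched here. The two code names come from the suggestion above; everything else is an assumption, in particular the classifyImageModelFailure name, the input shape, and the mapping from HTTP status to code (this report does not record which status NVIDIA actually returned).

```javascript
// Hypothetical classifier. The error codes come from the suggestion above; the
// mapping from HTTP status to code is an assumption, not observed provider behavior.
function classifyImageModelFailure({ status, providerId, modelId }) {
  const ref = `${providerId}/${modelId}`;
  // Assumed: 400, 415 and 422 mean the provider refused the request body.
  if (status === 400 || status === 415 || status === 422) {
    return {
      code: "IMAGE_MODEL_REJECTED_PAYLOAD",
      message:
        `Image model ${ref} rejected the image payload; check the "input" ` +
        `declaration for "${modelId}" under models.providers.${providerId}`,
    };
  }
  // Everything else (auth failures, 5xx, no HTTP response at all) is
  // treated as a transport-level problem.
  return {
    code: "IMAGE_MODEL_TRANSPORT",
    message: `Image model ${ref} could not be reached or authenticated (status: ${status ?? "none"})`,
  };
}

const rejected = classifyImageModelFailure({
  status: 400,
  providerId: "nvidia",
  modelId: "moonshotai/kimi-k2.5",
});
console.warn(`[media-understanding] image: failed code=${rejected.code} ${rejected.message}`);
```

A log line carrying the code lets operators grep for IMAGE_MODEL_REJECTED_PAYLOAD and go straight to the offending declaration, instead of reading an undifferentiated Image model failed warning.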

Happy to test patches against this reproducer. Thanks for the great work on OpenClaw. The recent #71500/#71806 fixes already improved a lot in this area, and closing this last validation gap would make the configuration loop much friendlier.
