openclaw: [Bug]: Ollama models with vision capability not recognized as supporting images [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#62519 (fetched 2026-04-08 03:03:09)
Comments: 1
Participants: 2
Timeline: 4
Reactions: 0
Timeline (top): labeled ×2, closed ×1, commented ×1

Root Cause

buildOllamaModelDefinition (in compiled dist: stream-DaZ9JB7F.js) hardcodes input: ["text"] for all Ollama models:

function buildOllamaModelDefinition(modelId, contextWindow) {
    return {
        input: ["text"],  // ← always text-only, never checks vision
        // ... (remaining fields elided)
    };
}

Additionally, enrichOllamaModelsWithContext already calls Ollama's /api/show for each model to get context_length, but it ignores the capabilities field, which contains "vision" for vision-capable models.

The mergeProviderModels function always overwrites input with the implicit (auto-discovered) value, so manual edits to models.json are also lost on every startup.

Suggested Fix

1. When calling /api/show in the enrichment phase, also read data.capabilities.
2. If capabilities includes "vision", set input: ["text", "image"].
3. Pass supportsVision through enrichOllamaModelsWithContext → buildOllamaModelDefinition.

No extra HTTP requests are needed: the context window enrichment already hits /api/show.
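
The steps above can be sketched as a small pure function. This is a minimal illustration, not OpenClaw's actual code: `inputTypesFromShowResponse` is a hypothetical name, and it assumes `data` is the already-parsed JSON body of /api/show with an optional `capabilities` array of strings, as the report describes.

```javascript
// Hypothetical helper (not from OpenClaw's source): derive the model's input
// types from an already-parsed /api/show response body.
function inputTypesFromShowResponse(data) {
    // Assumption: `capabilities` is an array of strings such as "vision".
    // If the field is missing or malformed, treat the model as text-only.
    const capabilities = Array.isArray(data?.capabilities) ? data.capabilities : [];
    const supportsVision = capabilities.includes("vision");
    return supportsVision ? ["text", "image"] : ["text"];
}

console.log(inputTypesFromShowResponse({ capabilities: ["completion", "vision"] }));
console.log(inputTypesFromShowResponse({}));
```

Defaulting to text-only when the field is absent keeps today's behavior for models or servers that do not report capabilities.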

Environment:

  • OpenClaw version: 2026.4.5
  • Ollama version: 0.17.1
  • Affected models: any Ollama model with vision capability

### Steps to reproduce

1. Use a local Ollama model that is known to have the "vision" capability.
2. Attach an image to a message.
3. OpenClaw acts as if it never saw the image; the reply reads "I can not find any attached image in your message".

### Expected behavior

Ollama models that support vision capability should be able to parse images.

### Actual behavior

When using local Ollama models that support `vision` capability (e.g. `qwen3.5:35b`, `gemma4:31b`), OpenClaw drops all image attachments with the warning:

[gateway] parseMessageWithAttachments: 1 attachment(s) dropped — model does not support images

### OpenClaw version

2026.4.5

### Operating system

Windows 11

### Install method

npm global

### Model

any Ollama model with vision capability

### Provider / routing chain

ollama

### Additional provider/model setup details

_No response_

### Logs, screenshots, and evidence

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Bug Description

When using local Ollama models that support vision capability (e.g. qwen3.5:35b, gemma4:31b), OpenClaw drops all image attachments with the warning:

[gateway] parseMessageWithAttachments: 1 attachment(s) dropped — model does not support images

The model correctly reports vision support via ollama show:

  Capabilities
    completion
    vision
    tools
    thinking
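
The same information can be read over Ollama's HTTP API, which is the endpoint the report says the enrichment phase already queries. A rough sketch, assuming Ollama's default local address http://localhost:11434 and a runtime with a global fetch (Node 18 or later); `fetchOllamaCapabilities` and its `fetchImpl` option are hypothetical names introduced here so the call can be exercised without a running server.

```javascript
// Hypothetical helper: ask Ollama's /api/show which capabilities a model reports.
async function fetchOllamaCapabilities(modelId, options = {}) {
    const baseUrl = options.baseUrl ?? "http://localhost:11434"; // Ollama's default address
    const fetchImpl = options.fetchImpl ?? fetch; // injectable so tests need no server
    const response = await fetchImpl(`${baseUrl}/api/show`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: modelId }),
    });
    if (!response.ok) {
        throw new Error(`/api/show failed for ${modelId}: HTTP ${response.status}`);
    }
    const data = await response.json();
    // Assumption: `capabilities` is an array of strings; treat anything else as empty.
    return Array.isArray(data.capabilities) ? data.capabilities : [];
}
```

With a local server running, `fetchOllamaCapabilities("qwen3.5:35b").then((caps) => console.log(caps.includes("vision")))` would show whether the HTTP response agrees with the CLI output.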

Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

Modify the buildOllamaModelDefinition function to check the model's capabilities and set input to ["text", "image"] if the model supports vision.

Guidance

  • Check the capabilities field in the response from Ollama's /api/show endpoint to determine if the model supports vision.
  • Pass the supportsVision flag from enrichOllamaModelsWithContext to buildOllamaModelDefinition to set the correct input type.
  • Update the buildOllamaModelDefinition function to conditionally set input to ["text", "image"] if the model supports vision.
  • Verify that the mergeProviderModels function does not overwrite the updated input value.

Example

function buildOllamaModelDefinition(modelId, contextWindow, supportsVision) {
    const input = supportsVision ? ["text", "image"] : ["text"];
    return {
        input,
        // ...
    };
}
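
One way the flag could reach that function is sketched below. It is an illustration under stated assumptions rather than OpenClaw's real enrichment code: `enrichModels` and the injected `showModel` callback are hypothetical, the `id` and `contextWindow` fields stand in for the elided ones, the context length is assumed to sit under a `model_info` key ending in `.context_length`, and the default of 4096 is a placeholder.

```javascript
// Builder in the shape proposed in the example; `id` and `contextWindow` are
// illustrative stand-ins for the fields the report elides.
function buildOllamaModelDefinition(modelId, contextWindow, supportsVision) {
    const input = supportsVision ? ["text", "image"] : ["text"];
    return { id: modelId, contextWindow, input };
}

// Hypothetical enrichment step: a single showModel() call per model supplies
// both the context window and the vision flag, so no extra request is made.
async function enrichModels(modelIds, showModel, defaultContextWindow = 4096) {
    return Promise.all(modelIds.map(async (modelId) => {
        let data = {};
        try {
            data = (await showModel(modelId)) ?? {};
        } catch {
            // Lookup failed: fall back to text-only with the default context window.
        }
        const info = data.model_info ?? {};
        // Assumption: the key looks like "<architecture>.context_length".
        const key = Object.keys(info).find((k) => k.endsWith(".context_length"));
        const contextWindow = key ? info[key] : defaultContextWindow;
        const supportsVision =
            Array.isArray(data.capabilities) && data.capabilities.includes("vision");
        return buildOllamaModelDefinition(modelId, contextWindow, supportsVision);
    }));
}
```

Falling back to text-only when the lookup fails means a transient /api/show error cannot mark a model as image-capable by mistake.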

Notes

This fix assumes that the enrichOllamaModelsWithContext function is correctly calling Ollama's /api/show endpoint and parsing the response. Additional logging or debugging may be necessary to verify this.

Recommendation

Apply the suggested fix to update the buildOllamaModelDefinition function to correctly handle models with vision capability. This will allow OpenClaw to parse image attachments for models that support vision.
