openclaw - 💡(How to fix) Fix Feature: Support local vision model preprocessing for inbound images [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#51483 · Fetched 2026-04-08 01:10:38
Comments: 1 · Participants: 2 · Timeline events: 3 · Reactions: 0
Timeline (top): commented ×1, mentioned ×1, subscribed ×1


Problem

Currently, when users upload images via webchat (or any channel), OpenClaw forwards them directly to the configured LLM as base64 image data. This creates two issues:

  1. Vision-limited models can't process images — If the configured model doesn't support vision input (e.g., MiniMax-M2.7, many text-only models), uploaded images are effectively invisible to the agent.

  2. No interception point for image preprocessing — There's no hook or plugin mechanism to intercept inbound images and process them through a different pipeline (e.g., a local vision model like LLaVA or Qwen2-VL before forwarding descriptions to the main model).

Proposed Solution

Introduce a preprocessing hook or plugin slot that runs before images are injected into the model's input:

// Example concept (not actual API)
api.registerHook('before_image_inject', async ({ images }) => {
  // images: Array<{ type: 'image', data: base64string, mimeType: string }>
  // Plugin can replace with text description from a local vision model
  return { images: [...] }
})

Or alternatively, a model-side fallback where images are automatically processed through a secondary vision-capable model and the text description is prepended to the prompt when the primary model doesn't support vision.
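
A minimal sketch of the model-side fallback, to make the idea concrete. Every name here (`applyVisionFallback`, `primarySupportsVision`, `describeImage`) is hypothetical and not part of OpenClaw; `describeImage` stands in for whatever call reaches the secondary vision model.

```javascript
// Hypothetical sketch: none of these names exist in OpenClaw.
async function applyVisionFallback({ prompt, images, primarySupportsVision, describeImage }) {
  // A vision-capable primary model receives the images unchanged.
  if (primarySupportsVision || images.length === 0) {
    return { prompt, images };
  }
  // Otherwise describe each image with the secondary model...
  const descriptions = await Promise.all(images.map((image) => describeImage(image)));
  // ...and prepend the numbered descriptions to the prompt as plain text.
  const preamble = descriptions
    .map((text, index) => `[Image ${index + 1}] ${text}`)
    .join('\n');
  return { prompt: `${preamble}\n\n${prompt}`, images: [] };
}
```

With a text-only primary model, the caller would forward the returned `prompt` and an empty `images` array, so the primary model never sees raw image data.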

Use Case

A user with a powerful text model (like MiniMax-M2.7) paired with a local vision model (like qwen3.5-2b running on LM Studio at 127.0.0.1:1234) could have images automatically described by the local model, then feed those descriptions to the main text model — all without changing their primary model choice.
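
As a rough illustration of this use case, the sketch below asks a local model for an image description. It assumes LM Studio's OpenAI-compatible server is reachable at the address given in the issue and that the loaded model answers to the id `qwen3.5-2b`; the real model id depends on what is loaded locally. The function names and the prompt text are illustrative only.

```javascript
// Assumption: LM Studio's OpenAI-compatible server listens on 127.0.0.1:1234.
const LOCAL_VISION_URL = 'http://127.0.0.1:1234/v1/chat/completions';

// Build an OpenAI-style chat request that carries one image as a data URI.
// `image` follows the shape from the issue: { data: base64string, mimeType: string }.
function buildVisionRequest(image, model = 'qwen3.5-2b') {
  return {
    model, // assumed model id; adjust to the model actually loaded
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Describe this image in detail.' },
          { type: 'image_url', image_url: { url: `data:${image.mimeType};base64,${image.data}` } },
        ],
      },
    ],
  };
}

// Send the request and return the description text (uses the global fetch of Node 18+).
async function describeImage(image) {
  const response = await fetch(LOCAL_VISION_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildVisionRequest(image)),
  });
  if (!response.ok) {
    throw new Error(`Local vision model returned HTTP ${response.status}`);
  }
  const payload = await response.json();
  return payload.choices[0].message.content;
}
```

The returned string is what would be handed to the main text model in place of the image.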

Workaround (Current)

The current workaround is manual:

  1. User saves the image to a known path
  2. Agent explicitly calls a local vision model via exec/tool
  3. Agent uses the description in its response

This works but breaks the natural conversational flow.


Labels: enhancement, vision, images, plugin-system

extent analysis

Fix Plan

To address the issue, we will introduce a preprocessing hook that allows for image interception and processing.

Here are the steps:

  • Add a hook registration method to the plugin API: api.registerHook('before_image_inject', callback).
  • Modify the image upload handler to call the before_image_inject hook before forwarding images to the model.
  • Implement a plugin mechanism to allow users to register custom image processing callbacks.

Example Code

// api.js
class API {
  constructor() {
    this.hooks = {};
  }

  registerHook(name, callback) {
    if (!this.hooks[name]) {
      this.hooks[name] = [];
    }
    this.hooks[name].push(callback);
  }

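  // Runs every callback registered for `name` concurrently and resolves to an
  // array of their results (an empty array when nothing is registered)
  async callHook(name, args) {
    const callbacks = this.hooks[name] || [];
    return Promise.all(callbacks.map(callback => callback(args)));
  }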
}

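// Shared instance used by the handler and the plugin below
const api = new API();

// image-upload-handler.js
class ImageUploadHandler {
  async handleImageUpload(images) {
    const hookResults = await api.callHook('before_image_inject', { images });
    // One possible policy: use the last hook result that returned images,
    // and fall back to the original images when no plugin replaced them
    const replaced = hookResults.filter(result => result && result.images).pop();
    return replaced ? replaced.images : images;
  }
}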

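// plugin-example.js
// processImagesWithVisionModel is a placeholder the plugin author must supply;
// it should resolve to one text description per image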
api.registerHook('before_image_inject', async ({ images }) => {
  // Process images using a local vision model
  const descriptions = await processImagesWithVisionModel(images);
  return { images: descriptions.map(description => ({ type: 'text', data: description })) };
});

Verification

To verify that the fix worked, test the following scenarios:

  • Upload an image and verify that the before_image_inject hook is called.
  • Register a custom plugin and verify that it is executed correctly.
  • Configure a text-only primary model together with a local vision model, upload an image in chat, and confirm that the reply reflects the image content without any manual steps.
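
The first two checks could be automated along these lines. This is only a sketch: `HookRegistry` is a stand-in that mirrors the `API` class from the example code, not OpenClaw's real plugin interface, and `check` is a small local helper rather than a test framework.

```javascript
// Stand-in registry mirroring the example API class (hypothetical, not OpenClaw's).
class HookRegistry {
  constructor() {
    this.hooks = {};
  }
  registerHook(name, callback) {
    if (!this.hooks[name]) {
      this.hooks[name] = [];
    }
    this.hooks[name].push(callback);
  }
  async callHook(name, args) {
    return Promise.all((this.hooks[name] || []).map((callback) => callback(args)));
  }
}

// Tiny assertion helper so the sketch needs no test framework.
function check(condition, message) {
  if (!condition) throw new Error(message);
}

async function verifyHookIsCalled() {
  const registry = new HookRegistry();
  const seen = []; // records how many images each hook invocation received

  // Stub plugin: replaces every image with a fixed text description.
  registry.registerHook('before_image_inject', async ({ images }) => {
    seen.push(images.length);
    return { images: images.map(() => ({ type: 'text', data: 'stub description' })) };
  });

  const upload = [{ type: 'image', data: 'AAAA', mimeType: 'image/png' }];
  const results = await registry.callHook('before_image_inject', { images: upload });

  check(seen.length === 1, 'hook should run exactly once per upload');
  check(results[0].images[0].type === 'text', 'plugin should replace the image with text');
  return results;
}
```

Calling `verifyHookIsCalled()` rejects with an error if either check fails.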

Extra Tips

  • Document the new hook registration method and plugin mechanism so that users can take advantage of the feature.
  • Provide example plugins to demonstrate the usage of the before_image_inject hook.
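  • The example registry already accepts several callbacks per hook but runs them concurrently; consider running them in sequence, each receiving the previous result, to allow multi-stage image processing pipelines.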
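
One way to support multi-stage pipelines is to run the callbacks in registration order and pass each result to the next. The helper below is a hypothetical sketch of that policy; `runHookChain` is not an existing OpenClaw function.

```javascript
// Hypothetical helper: runs hook callbacks in order as a pipeline.
async function runHookChain(callbacks, initial) {
  let current = initial;
  for (const callback of callbacks) {
    const result = await callback(current);
    // A callback that returns nothing leaves the payload untouched.
    if (result !== undefined) {
      current = result;
    }
  }
  return current;
}
```

For example, a first callback could downscale the images and a second could replace them with text descriptions; the second then receives the downscaled images, not the originals.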
