openclaw: Telegram inbound images use describer-only path; never attached as native vision blocks (model receives `<media:image>` placeholder text)

Issue: openclaw/openclaw#62292 (fetched 2026-04-08 03:06:32)

Summary

When a user sends an image to a Claude-backed agent via Telegram, the model never receives the image as a native vision content block. Instead, the prompt contains either:

  1. The literal placeholder string <media:image> (when tools.media.image is disabled or fails), or
  2. A text caption produced by the image-describer (when tools.media.image.enabled: true).

The same image sent via the webchat / Control UI works perfectly — the model sees and analyzes the actual image bytes.

Result: Telegram users on a vision-capable model effectively cannot send images, even though the gateway successfully downloads and resizes them.

Environment

  • OpenClaw 2026.4.5 (commit 3e72c03)
  • Node v22.22.1
  • OS: Linux x86_64
  • Channel: Telegram (single bot, DM)
  • Model: claude-cli/claude-opus-4-6 (vision-capable)
  • Gateway: loopback, systemd
  • Config: tools.media.image.enabled: true

Reproduction

  1. Configure an OpenClaw agent with a vision-capable Claude model and Telegram channel.
  2. Send an image to the bot via Telegram DM with the caption "what do you see?"
  3. Observe in logs: gateway downloads the image successfully, resizes it, sets MediaPaths on the inbound context (agents/tool-images processing visible).
  4. Model reply describes the placeholder text or returns generic confusion ("I can't see any image"), confirming it never received the image as a content block.
  5. Send the same image to the same agent via webchat (http://127.0.0.1:18789/).
  6. Model accurately describes the image. ✅

Root Cause

OpenClaw has two separate inbound image pipelines, and Telegram is wired to the wrong one:

| Pipeline | Channel(s) using it | What it produces |
| --- | --- | --- |
| A. Native vision attachment | Webchat / Control UI (via parseMessageWithAttachments in attachment-normalize-CnmzeVAm.js) | Native {type: "image", data, mimeType} content block delivered to the Anthropic API. Model sees the image. |
| B. Describe-as-text | Telegram inbound (via describeImageWithModel in runner-Bo7fJw79.js) | A separate model call generates a TEXT description of the image, which is injected into the prompt. Model never sees the image bytes. |

Pipeline trace (Telegram, broken)

  1. In extensions/telegram/src/bot-message-context.ts, buildTelegramInboundContextPayload() collects media into contextMedia (bot-message-context-ID6QScNo.js, ~line 343).
  2. The payload is built with MediaPath, MediaPaths, MediaTypes, MediaUrls set to local file paths from the Telegram download (~lines 383–390).
  3. Initial bodyText is set to the literal placeholder <media:image> when there is no user-supplied text (bot-message-context-ID6QScNo.js:136).
  4. The agent runner reads ctx.MediaPaths via normalizeAttachments() in runner-Bo7fJw79.js (line 51), classifies as image via isImageAttachment() (line 121).
  5. Runner imports describeImageWithModel from image-runtime-C6QbxTR_.js (line 17) and uses it to generate a text caption.
  6. No code path on the Telegram side ever calls parseMessageWithAttachments or constructs a native vision content block.
  7. The model receives the prompt text with either the describer's caption or the raw <media:image> placeholder. Image bytes are never sent.
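
To make the failure mode concrete, here is a minimal sketch of what the describer-only path can hand to the model. The `TelegramInboundContext` shape, the `buildDescriberOnlyPrompt` helper, and the caption formatting are hypothetical and for illustration only; the `MediaPaths` field, `bodyText`, and the `<media:image>` placeholder are the only parts taken from the trace above.

```typescript
// Hypothetical, simplified view of the inbound context. Only MediaPaths,
// bodyText, and the "<media:image>" placeholder come from the issue.
interface TelegramInboundContext {
  bodyText: string; // user caption, or the placeholder when no text was sent
  MediaPaths: string[]; // local file paths of the downloaded images
}

const IMAGE_PLACEHOLDER = "<media:image>";

// Models the describer-only path: whatever happens, the result is plain text.
// describerCaption is undefined when the describer is disabled or fails.
function buildDescriberOnlyPrompt(
  ctx: TelegramInboundContext,
  describerCaption?: string,
): string {
  if (describerCaption === undefined) {
    // Nothing replaces the placeholder, so the model may see only "<media:image>".
    return ctx.bodyText;
  }
  // Illustrative formatting; the real injection format is not shown in the issue.
  return `${ctx.bodyText}\n[Image description: ${describerCaption}]`;
}
```

In both branches the image bytes referenced by `MediaPaths` are never read, which matches the symptom described in step 7.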

Pipeline trace (Webchat, working)

  1. server-Cv5hzFG4.js (gateway server) calls parseMessageWithAttachments(message, normalizedAttachments, ...) at lines 10759 and 12932.
  2. parseMessageWithAttachments (in attachment-normalize-CnmzeVAm.js, ~line 893) returns { images: [{type: "image", data, mimeType}], ... }.
  3. Those images are passed downstream to the model adapter as native content blocks. ✅
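
As an illustration of the last step, the sketch below maps the internal `{type: "image", data, mimeType}` block to the image block form used by the Anthropic Messages API (a base64 `source` with `media_type` and `data`). The `toAnthropicImageBlock` name and the assumption that the adapter performs exactly this mapping are hypothetical; the issue does not show the adapter code.

```typescript
// Internal block shape as reported in the issue.
interface NativeImageBlock {
  type: "image";
  data: string; // base64-encoded image bytes
  mimeType: string; // e.g. "image/png"
}

// Image content block in the Anthropic Messages API (base64 source variant).
interface AnthropicImageBlock {
  type: "image";
  source: { type: "base64"; media_type: string; data: string };
}

// Hypothetical adapter step: convert the internal block to the wire format.
function toAnthropicImageBlock(block: NativeImageBlock): AnthropicImageBlock {
  return {
    type: "image",
    source: { type: "base64", media_type: block.mimeType, data: block.data },
  };
}
```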

Why the runner-side path exists

The describer pipeline appears to predate native vision support and is what tools.media.image was originally designed for: it enables a separate lightweight model to caption an image so that text-only or non-vision models can still get some signal from images. Useful as a fallback, but the wrong default for vision-capable models.

Expected Behavior

When the configured agent model supports native vision (modelSupportsImages(model) === true), Telegram inbound images should be loaded from ctx.MediaPaths, base64-encoded, and attached to the model call as native {type: "image", source: {...}} content blocks — the same way webchat does it via parseMessageWithAttachments.

The describer pipeline should remain available as a fallback for text-only models, or as an opt-in.
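
The expected selection policy can be written down as a small pure function. This is a sketch under assumptions: `selectImagePipeline`, the `PipelineOptions` shape, and the `preferDescriber` opt-in flag are hypothetical names, while `modelSupportsImages` and `tools.media.image.enabled` are the inputs named in the issue.

```typescript
type ImagePipeline = "native" | "describe" | "placeholder";

interface PipelineOptions {
  modelSupportsImages: boolean; // result of modelSupportsImages(model)
  describerEnabled: boolean; // tools.media.image.enabled
  preferDescriber?: boolean; // hypothetical opt-in to keep using the describer
}

// Native vision is the default for vision-capable models; the describer is
// a fallback for text-only models or an explicit opt-in.
function selectImagePipeline(opts: PipelineOptions): ImagePipeline {
  const optedIntoDescriber = opts.preferDescriber === true && opts.describerEnabled;
  if (opts.modelSupportsImages && !optedIntoDescriber) {
    return "native";
  }
  if (opts.describerEnabled) {
    return "describe";
  }
  // Text-only model with no describer: only the placeholder text remains.
  return "placeholder";
}
```

Keeping the decision separate from the I/O makes it easy to unit-test each channel against the same policy.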

Suggested Fix

In the agent runner (TypeScript source corresponding to runner-Bo7fJw79.js), where image attachments are detected:

// Pseudocode
const attachments = normalizeAttachments(ctx);
const imageAttachments = attachments.filter(isImageAttachment);

if (imageAttachments.length > 0) {
  if (modelSupportsImages(model)) {
    // NEW: native vision path — read files, base64, attach as content blocks
    const imageBlocks = await Promise.all(imageAttachments.map(loadAsNativeImageBlock));
    appendNativeImageBlocksToPrompt(imageBlocks);
  } else {
    // EXISTING: describer fallback for text-only models
    const captions = await Promise.all(imageAttachments.map(describeImageWithModel));
    appendCaptionsToPromptText(captions);
  }
}

loadAsNativeImageBlock would mirror the behavior of parseMessageWithAttachments for the offload-or-inline decision (OFFLOAD_THRESHOLD_BYTES, SUPPORTED_OFFLOAD_MIMES), but starting from a file path instead of a base64 buffer.
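
A minimal sketch of what `loadAsNativeImageBlock` might look like when starting from a file path. Everything beyond the function name and the block shape is an assumption: the extension-based MIME lookup, the `encodeImageBlock` helper, the `null` return for unsupported or oversized files, and the `ASSUMED_INLINE_LIMIT_BYTES` value are all hypothetical. The real `OFFLOAD_THRESHOLD_BYTES` and `SUPPORTED_OFFLOAD_MIMES` values are not given in the issue, and the offload branch itself is not modeled here.

```typescript
import { readFile } from "node:fs/promises";
import { extname } from "node:path";

interface NativeImageBlock {
  type: "image";
  data: string; // base64-encoded image bytes
  mimeType: string;
}

// Image formats the Anthropic Messages API documents as supported.
const MIME_BY_EXTENSION: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

// Placeholder limit for illustration; NOT the real OFFLOAD_THRESHOLD_BYTES.
const ASSUMED_INLINE_LIMIT_BYTES = 5 * 1024 * 1024;

// Pure step: returns null when the file should not be inlined, so the caller
// can fall back to offloading or to the describer.
function encodeImageBlock(
  bytes: Uint8Array,
  filePath: string,
  inlineLimit: number = ASSUMED_INLINE_LIMIT_BYTES,
): NativeImageBlock | null {
  const mimeType = MIME_BY_EXTENSION[extname(filePath).toLowerCase()];
  if (mimeType === undefined || bytes.byteLength > inlineLimit) {
    return null;
  }
  return { type: "image", data: Buffer.from(bytes).toString("base64"), mimeType };
}

// I/O wrapper: reads the downloaded file from ctx.MediaPaths and encodes it.
async function loadAsNativeImageBlock(filePath: string): Promise<NativeImageBlock | null> {
  return encodeImageBlock(await readFile(filePath), filePath);
}
```

Splitting the pure encoding step from the file read keeps the size and MIME checks testable without touching the filesystem.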

Ideally, the two pipelines would be unified into a single inbound media handler shared by all channels (webchat, Telegram, WhatsApp/Watson, Signal, etc.) so this class of bug doesn't recur per-channel.
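
One way such a shared handler could be shaped is sketched below. It assumes each channel can describe its images as either inline base64 (as webchat does) or a local file path (as Telegram does); the `InboundImage` union, `MediaDeps`, `PreparedMedia`, and `prepareInboundImages` are hypothetical names, and the loader and describer are injected rather than tied to any real OpenClaw function.

```typescript
interface ImageBlock {
  type: "image";
  data: string; // base64-encoded image bytes
  mimeType: string;
}

// Channel-agnostic input: webchat supplies base64, Telegram a local path.
type InboundImage =
  | { kind: "base64"; data: string; mimeType: string }
  | { kind: "path"; path: string };

interface MediaDeps {
  modelSupportsImages: boolean;
  loadFromPath: (path: string) => Promise<ImageBlock>;
  describe: (image: InboundImage) => Promise<string>;
}

interface PreparedMedia {
  imageBlocks: ImageBlock[]; // native vision blocks for vision-capable models
  captions: string[]; // describer output for text-only models
}

// Single entry point shared by all channels.
async function prepareInboundImages(
  images: InboundImage[],
  deps: MediaDeps,
): Promise<PreparedMedia> {
  if (!deps.modelSupportsImages) {
    const captions = await Promise.all(images.map((img) => deps.describe(img)));
    return { imageBlocks: [], captions };
  }
  const imageBlocks = await Promise.all(
    images.map(async (img): Promise<ImageBlock> => {
      if (img.kind === "base64") {
        return { type: "image", data: img.data, mimeType: img.mimeType };
      }
      return deps.loadFromPath(img.path);
    }),
  );
  return { imageBlocks, captions: [] };
}
```

With this shape, adding a new channel only requires producing `InboundImage` values; the native-versus-describer decision lives in one place.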

Workaround

Currently the only workaround for vision use is to ask the user to send the image via webchat instead of Telegram, OR have them upload the file to the workspace and use the image tool from there.

The describer (tools.media.image.enabled: true) does provide some understanding of Telegram-attached images, but only as third-party text — the model cannot perform any actual vision tasks (OCR, layout reasoning, comparing images, fine detail) because it never sees the bytes.

Related

  • File: dist/runner-Bo7fJw79.js (agent runner, media-understanding/attachments.normalize.ts namespace)
  • File: dist/bot-message-context-ID6QScNo.js (Telegram inbound context builder)
  • File: dist/attachment-normalize-CnmzeVAm.js (working webchat path, parseMessageWithAttachments)
  • File: dist/server-Cv5hzFG4.js (calls webchat path at lines 10759, 12932)

If a similar gap exists for other inbound channels (Signal, WhatsApp via Watson, etc.), it would be worth auditing them in the same fix.


Reporter context

Found while debugging Dani (a personal OpenClaw agent on 2026.4.5). Telegram users are sending images to the bot expecting the model to see them; the model receives only text. Webchat works for the same agent and model.

Happy to test a fix or PR, and can provide gateway logs from a repro session if helpful.

extent analysis

TL;DR

The most likely fix is to modify the agent runner to use the native vision pipeline for Telegram inbound images when the model supports native vision.

Guidance

  • Identify the agent runner code corresponding to runner-Bo7fJw79.js and modify it to use the native vision pipeline for image attachments when modelSupportsImages(model) returns true.
  • Implement the loadAsNativeImageBlock function to read image files, base64-encode them, and attach them to the model call as native content blocks.
  • Test the modified agent runner with Telegram inbound images to verify that the model receives the image bytes correctly.
  • Consider unifying the inbound media handlers for all channels to prevent similar bugs in the future.

Example

const imageBlocks = await Promise.all(imageAttachments.map(loadAsNativeImageBlock));
appendNativeImageBlocksToPrompt(imageBlocks);

This code snippet illustrates the proposed modification to the agent runner, where loadAsNativeImageBlock is a new function that loads image files and attaches them to the model call as native content blocks.

Notes

The current workaround is to ask users to send images via webchat instead of Telegram, but this is not a scalable solution. The proposed fix requires modifying the agent runner code and may involve additional testing and validation to ensure correct functionality.

Recommendation

Apply the workaround of sending images via webchat instead of Telegram until the agent runner code can be modified to support native vision for Telegram inbound images. This will allow users to utilize the model's vision capabilities while a more permanent fix is implemented.
