openclaw - 💡(How to fix) Fix Telegram inbound images use describer-only path; never attached as native vision blocks (model receives `<media:image>` placeholder text) [1 participants]

gaydarov · 2026-04-07T04:21:55Z



openclaw/openclaw#62292•Fetched 2026-04-08 03:06:32



Workaround

Until this is fixed, there are two workarounds for vision use: ask the user to send the image via webchat instead of Telegram, or have them upload the file to the workspace and use the `image` tool from there.

The describer (`tools.media.image.enabled: true`) does provide some understanding of Telegram-attached images, but only as a secondhand text description. The model cannot perform actual vision tasks (OCR, layout reasoning, comparing images, fine detail) because it never sees the bytes.


Summary

When a user sends an image to a Claude-backed agent via Telegram, the model never receives the image as a native vision content block. Instead, the prompt contains either:

1. The literal placeholder string `<media:image>` (when `tools.media.image` is disabled or fails), or
2. A text caption produced by the image-describer (when `tools.media.image.enabled: true`).

The same image sent via the webchat / Control UI works perfectly — the model sees and analyzes the actual image bytes.

Result: Telegram users on a vision-capable model effectively cannot send images, even though the gateway successfully downloads and resizes them.

Environment

- OpenClaw `2026.4.5` (commit `3e72c03`)
- Node `v22.22.1`
- OS: Linux x86_64
- Channel: Telegram (single bot, DM)
- Model: `claude-cli/claude-opus-4-6` (vision-capable)
- Gateway: loopback, systemd
- Config: `tools.media.image.enabled: true`

Reproduction

1. Configure an OpenClaw agent with a vision-capable Claude model and Telegram channel.
2. Send an image to the bot via Telegram DM with the caption "what do you see?"
3. Observe in logs: gateway downloads the image successfully, resizes it, sets `MediaPaths` on the inbound context (`agents/tool-images` processing visible).
4. Model reply describes the placeholder text or returns generic confusion ("I can't see any image"), confirming it never received the image as a content block.
5. Send the same image to the same agent via webchat (`http://127.0.0.1:18789/`).
6. Model accurately describes the image. ✅

Root Cause

OpenClaw has two separate inbound image pipelines, and Telegram is wired to the wrong one:

| Pipeline | Channel(s) using it | What it produces |
|---|---|---|
| A. Native vision attachment | Webchat / Control UI (via `parseMessageWithAttachments` in `attachment-normalize-CnmzeVAm.js`) | Native `{type: "image", data, mimeType}` content block delivered to the Anthropic API. Model sees the image. |
| B. Describe-as-text | Telegram inbound (via `describeImageWithModel` in `runner-Bo7fJw79.js`) | A separate model call generates a TEXT description of the image, which is injected into the prompt. Model never sees the image bytes. |

Pipeline trace (Telegram, broken)

1. `extensions/telegram/src/bot-message-context.ts` → `buildTelegramInboundContextPayload()` collects media into `contextMedia` (`bot-message-context-ID6QScNo.js`, ~line 343).
2. The payload is built with `MediaPath`, `MediaPaths`, `MediaTypes`, `MediaUrls` set to local file paths from the Telegram download (~lines 383–390).
3. Initial `bodyText` is set to the literal placeholder `<media:image>` when there is no user-supplied text (`bot-message-context-ID6QScNo.js:136`).
4. The agent runner reads `ctx.MediaPaths` via `normalizeAttachments()` in `runner-Bo7fJw79.js` (line 51), classifies as image via `isImageAttachment()` (line 121).
5. Runner imports `describeImageWithModel` from `image-runtime-C6QbxTR_.js` (line 17) and uses it to generate a text caption.
6. No code path on the Telegram side ever calls `parseMessageWithAttachments` or constructs a native vision content block.
7. The model receives the prompt text with either the describer's caption or the raw `<media:image>` placeholder. Image bytes are never sent.
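
To make the trace above concrete, here is a minimal TypeScript sketch of what the describer-only path ends up handing to the model. The names `InboundContext`, `buildInboundContext`, and `buildTextOnlyPrompt` are hypothetical and are not OpenClaw's actual API; only the `MediaPaths`/`MediaTypes` field names and the `<media:image>` placeholder come from the report. How OpenClaw really joins the caption with the user text is not stated, so the joining format here is an assumption.

```typescript
// Hypothetical shapes for illustration; not OpenClaw's real types.
interface InboundContext {
  bodyText: string;     // user text, or the placeholder when there is none
  MediaPaths: string[]; // local file paths from the Telegram download
  MediaTypes: string[]; // MIME types, parallel to MediaPaths
}

const IMAGE_PLACEHOLDER = "<media:image>";

// The body falls back to the literal placeholder when the user sent no text.
function buildInboundContext(
  userText: string | undefined,
  paths: string[],
  types: string[],
): InboundContext {
  const trimmed = userText?.trim();
  return {
    bodyText: trimmed ? trimmed : IMAGE_PLACEHOLDER,
    MediaPaths: paths,
    MediaTypes: types,
  };
}

// The only thing that reaches the model is a string. `captions` stands in for
// the describer output and is empty when the describer is disabled or fails.
function buildTextOnlyPrompt(ctx: InboundContext, captions: string[]): string {
  if (captions.length === 0) return ctx.bodyText;
  const described = captions
    .map((caption, i) => `[Image ${i + 1} description] ${caption}`)
    .join("\n");
  return `${ctx.bodyText}\n${described}`;
}
```

Whatever the exact format, the return type is the point: the function yields a `string`, so `MediaPaths` never influences the model input beyond the caption text.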

Pipeline trace (Webchat, working)

1. `server-Cv5hzFG4.js` (gateway server) calls `parseMessageWithAttachments(message, normalizedAttachments, ...)` at lines 10759 and 12932.
2. `parseMessageWithAttachments` (in `attachment-normalize-CnmzeVAm.js`, ~line 893) returns `{ images: [{type: "image", data, mimeType}], ... }`.
3. Those `images` are passed downstream to the model adapter as native content blocks. ✅
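
For comparison, the working path can be sketched as a mapping from the reported `{type: "image", data, mimeType}` shape to Anthropic Messages API content blocks. The `ParsedImage` type and the `toContentBlocks` function are hypothetical names for illustration; the actual model adapter code is not shown in this report, and the image-before-text ordering is an assumption.

```typescript
// Shape reported for the webchat path: {type: "image", data, mimeType}.
interface ParsedImage {
  type: "image";
  data: string;     // base64-encoded image bytes
  mimeType: string; // e.g. "image/png"
}

// Anthropic Messages API content blocks (base64 image source and text).
type ContentBlock =
  | { type: "image"; source: { type: "base64"; media_type: string; data: string } }
  | { type: "text"; text: string };

// Hypothetical adapter step: emit image blocks first, then the user's text.
function toContentBlocks(text: string, images: ParsedImage[]): ContentBlock[] {
  const blocks: ContentBlock[] = images.map(
    (img): ContentBlock => ({
      type: "image",
      source: { type: "base64", media_type: img.mimeType, data: img.data },
    }),
  );
  if (text.length > 0) blocks.push({ type: "text", text });
  return blocks;
}
```

Here the image bytes travel as structured content next to the text, which is the property the Telegram path lacks.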

Why the runner-side path exists

The describer pipeline appears to predate native vision support and is what tools.media.image was originally designed for: it enables a separate lightweight model to caption an image so that text-only or non-vision models can still get some signal from images. Useful as a fallback, but the wrong default for vision-capable models.

Expected Behavior

When the configured agent model supports native vision (modelSupportsImages(model) === true), Telegram inbound images should be loaded from ctx.MediaPaths, base64-encoded, and attached to the model call as native {type: "image", source: {...}} content blocks — the same way webchat does it via parseMessageWithAttachments.

The describer pipeline should remain available as a fallback for text-only models, or as an opt-in.
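
The expected routing can be summarized as a small decision function. This is a sketch under assumptions: `selectImagePipeline` and the `preferDescriber` opt-in flag are hypothetical and do not correspond to existing OpenClaw options; `describerEnabled` stands for `tools.media.image.enabled`.

```typescript
type ImagePipeline = "native" | "describe" | "placeholder";

interface ImageRoutingInput {
  modelSupportsImages: boolean; // result of modelSupportsImages(model)
  describerEnabled: boolean;    // tools.media.image.enabled
  preferDescriber?: boolean;    // hypothetical opt-in, not an existing option
}

function selectImagePipeline(input: ImageRoutingInput): ImagePipeline {
  // Vision-capable models get the image bytes unless the describer was opted into.
  if (input.modelSupportsImages && !input.preferDescriber) return "native";
  // Text-only models (or an explicit opt-in) fall back to captions when available.
  if (input.describerEnabled) return "describe";
  // Describer unavailable: send bytes if the model can take them, else the placeholder.
  return input.modelSupportsImages ? "native" : "placeholder";
}
```

With the configuration from this report (`modelSupportsImages: true`, `describerEnabled: true`, no opt-in) the function returns `"native"`, whereas the current Telegram behavior corresponds to `"describe"`.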

Suggested Fix

In the agent runner (TypeScript source corresponding to runner-Bo7fJw79.js), where image attachments are detected:

// Pseudocode
const attachments = normalizeAttachments(ctx);
const imageAttachments = attachments.filter(isImageAttachment);

if (imageAttachments.length > 0) {
  if (modelSupportsImages(model)) {
    // NEW: native vision path — read files, base64, attach as content blocks
    const imageBlocks = await Promise.all(imageAttachments.map(loadAsNativeImageBlock));
    appendNativeImageBlocksToPrompt(imageBlocks);
  } else {
    // EXISTING: describer fallback for text-only models
    const captions = await Promise.all(imageAttachments.map(describeImageWithModel));
    appendCaptionsToPromptText(captions);
  }
}

loadAsNativeImageBlock would mirror the behavior of parseMessageWithAttachments for the offload-or-inline decision (OFFLOAD_THRESHOLD_BYTES, SUPPORTED_OFFLOAD_MIMES), but starting from a file path instead of a base64 buffer.
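
A minimal sketch of such a loader follows, assuming Node's built-in `fs/promises` and `path` modules. The threshold value, the MIME set, the extension table, and the `LoadedImage` result type are all assumptions made for illustration; the real `OFFLOAD_THRESHOLD_BYTES` and `SUPPORTED_OFFLOAD_MIMES` values and the meaning of "offload" are defined in OpenClaw's attachment-normalize module and may differ. Unlike the pseudocode above, this version takes a file path rather than an attachment object.

```typescript
import { readFile, stat } from "node:fs/promises";
import { extname } from "node:path";

// Assumed values for illustration only; the real constants may differ.
const OFFLOAD_THRESHOLD_BYTES = 2_000_000;
const SUPPORTED_OFFLOAD_MIMES = new Set(["image/jpeg", "image/png", "image/webp"]);

// Hypothetical extension-to-MIME table for the common image formats.
const MIME_BY_EXT: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

type LoadedImage =
  | { kind: "inline"; block: { type: "image"; data: string; mimeType: string } }
  | { kind: "offload"; path: string; mimeType: string };

function mimeFromPath(path: string): string | undefined {
  return MIME_BY_EXT[extname(path).toLowerCase()];
}

// Large files in an offload-capable format are passed by reference, not inlined.
function shouldOffload(sizeBytes: number, mimeType: string): boolean {
  return sizeBytes > OFFLOAD_THRESHOLD_BYTES && SUPPORTED_OFFLOAD_MIMES.has(mimeType);
}

async function loadAsNativeImageBlock(path: string): Promise<LoadedImage> {
  const mimeType = mimeFromPath(path);
  if (!mimeType) throw new Error(`Unsupported image extension: ${path}`);
  const { size } = await stat(path);
  if (shouldOffload(size, mimeType)) return { kind: "offload", path, mimeType };
  // Small enough to inline: read the bytes and base64-encode them.
  const data = (await readFile(path)).toString("base64");
  return { kind: "inline", block: { type: "image", data, mimeType } };
}
```

The inline result uses the same `{type: "image", data, mimeType}` shape reported for the webchat path, so downstream code could treat both channels alike.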

Ideally, the two pipelines would be unified into a single inbound media handler shared by all channels (webchat, Telegram, WhatsApp/Watson, Signal, etc.) so this class of bug doesn't recur per-channel.
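
One possible shape for a shared handler is sketched below. Every name here (`InboundMedia`, `InboundMessage`, `MediaPlan`, `planInboundMedia`) is hypothetical; this is not OpenClaw's design, only an illustration of routing all channels through a single decision point.

```typescript
// All names are hypothetical; this is one possible shape, not OpenClaw's design.
interface InboundMedia {
  path: string;     // local file path after the channel's own download step
  mimeType: string;
}

interface InboundMessage {
  channel: "webchat" | "telegram" | "whatsapp" | "signal";
  text: string;
  media: InboundMedia[];
}

type MediaPlan =
  | { mode: "native"; images: InboundMedia[] }
  | { mode: "describe"; images: InboundMedia[] }
  | { mode: "none" };

// One decision point for every channel: `msg.channel` is deliberately not consulted.
function planInboundMedia(msg: InboundMessage, modelSupportsImages: boolean): MediaPlan {
  const images = msg.media.filter((m) => m.mimeType.startsWith("image/"));
  if (images.length === 0) return { mode: "none" };
  return modelSupportsImages
    ? { mode: "native", images }
    : { mode: "describe", images };
}
```

Because the plan depends only on the media and the model capability, the same image yields the same plan whether it arrived via Telegram or webchat.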

Related

File: dist/runner-Bo7fJw79.js (agent runner, media-understanding/attachments.normalize.ts namespace)
File: dist/bot-message-context-ID6QScNo.js (Telegram inbound context builder)
File: dist/attachment-normalize-CnmzeVAm.js (working webchat path, parseMessageWithAttachments)
File: dist/server-Cv5hzFG4.js (calls webchat path at lines 10759, 12932)

If a similar gap exists for other inbound channels (Signal, WhatsApp via Watson, etc.), it would be worth auditing them in the same fix.

Reporter context

Found while debugging Dani (a personal OpenClaw agent on 2026.4.5). Telegram users are sending images to the bot expecting the model to see them; the model receives only text. Webchat works for the same agent and model.

Happy to test a fix or PR, and can provide gateway logs from a repro session if helpful.

extent analysis

TL;DR

The most likely fix is to modify the agent runner to use the native vision pipeline for Telegram inbound images when the model supports native vision.

Guidance

Identify the agent runner code corresponding to runner-Bo7fJw79.js and modify it to use the native vision pipeline for image attachments when modelSupportsImages(model) returns true.
Implement the loadAsNativeImageBlock function to read image files, base64-encode them, and attach them to the model call as native content blocks.
Test the modified agent runner with Telegram inbound images to verify that the model receives the image bytes correctly.
Consider unifying the inbound media handlers for all channels to prevent similar bugs in the future.

Example

const imageBlocks = await Promise.all(imageAttachments.map(loadAsNativeImageBlock));
appendNativeImageBlocksToPrompt(imageBlocks);

This code snippet illustrates the proposed modification to the agent runner, where loadAsNativeImageBlock is a new function that loads image files and attaches them to the model call as native content blocks.

Notes

The current workaround is to ask users to send images via webchat instead of Telegram, but this is not a scalable solution. The proposed fix requires modifying the agent runner code and may involve additional testing and validation to ensure correct functionality.

Recommendation

Apply the workaround of sending images via webchat instead of Telegram until the agent runner code can be modified to support native vision for Telegram inbound images. This will allow users to utilize the model's vision capabilities while a more permanent fix is implemented.
