openclaw - 💡(How to fix) Fix Telegram image attachments stored but not passed to LLM as vision input [1 comments, 2 participants]

openclaw/openclaw#69808 (fetched 2026-04-22 07:48:02)


Summary

When a Telegram user sends an image to a bot running on OpenClaw, the image file is downloaded and written to /sandbox/.openclaw-data/media/inbound/ correctly, but the agent loop does not include the image content in the LLM request. The model receives only the text portion of the message, and responds with generic disclaimers like "I can't analyze images directly" or "I don't have image recognition or computer vision abilities built in."

Environment

  • OpenClaw: 2026.4.2 (d74a122)
  • Node.js: v22.22.2
  • npm: 10.9.7
  • Running inside NVIDIA NemoClaw sandbox (v0.0.18) on OpenShell CLI 0.0.26
  • Host: Brev (Linux, Ubuntu)
  • Channel: Telegram

Reproduction Steps

  1. Configure OpenClaw with Telegram enabled and a running bot (any standard onboarding).
  2. From Telegram, send the bot a JPEG or PNG image.
  3. Wait up to 30 seconds for the reply.
  4. Inspect the media path:
    ls /sandbox/.openclaw-data/media/inbound/

Actual Result

  • Agent text reply: "I can't analyze images directly" / "I don't have image recognition or computer vision abilities built in."
  • Image files are present on disk:
    file_0---d2f7df5c-bade-41cd-a3f8-83456fb7e98f.jpg
    file_1---1981cc25-ae97-422e-a38f-971126e0b69b.jpg
  • No MediaFetchError or network-policy errors in logs (I confirmed the sandbox's egress policy allows /file/bot*/** so download works).

Expected Result

If a vision-capable inference model is configured, the agent should include the downloaded image as a vision content block in the chat completion request — e.g.:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "..."},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
  ]
}

If the active model is text-only, the agent should return a clear "the current model doesn't support vision input" error rather than hallucinating generic disclaimers.
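The expected request shape above can be sketched as follows. This is a minimal illustration, not OpenClaw's actual code: the function name `build_vision_message` is hypothetical, and it assumes the provider accepts OpenAI-style `image_url` content blocks carrying a base64 data URL.

```python
import base64
import mimetypes
from pathlib import Path


def build_vision_message(text: str, image_path: str) -> dict:
    """Build a user message with one text block and one image block.

    Hypothetical helper: it assumes an OpenAI-style chat completion
    format, matching the JSON shown in the issue.
    """
    path = Path(image_path)
    # Guess the MIME type from the file extension (e.g. .jpg -> image/jpeg).
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path.name}")
    # Inline the file contents as a base64 data URL.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

Called with the user's caption and one of the stored files under /sandbox/.openclaw-data/media/inbound/, this returns a message in the same shape as the JSON above.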

Boundary analysis (why this is an OpenClaw issue, not NemoClaw)

I work on NVIDIA NemoClaw. This was originally filed against us at NVIDIA/NemoClaw#2009. I traced the chain:

Step | Location | Status
Sandbox receives Telegram event | OpenClaw channel code | ✓ works
Download attachment via api.telegram.org/file/bot*/** | OpenClaw → host auth proxy | ✓ works (NemoClaw's telegram.yaml policy allows it)
Write to /sandbox/.openclaw-data/media/inbound/ | OpenClaw | ✓ works (NemoClaw configures writable dir)
Pass image to LLM as vision input | OpenClaw agent loop | ✗ missing

I grep'd NemoClaw and confirmed there is zero code in NemoClaw that touches image/vision/multimodal handling. All the failing logic lives in OpenClaw's agent pipeline.

Tracking

  • Upstream: NVIDIA/NemoClaw#2009 (closing there with a pointer here)

Extent analysis

TL;DR

The OpenClaw agent loop likely needs to be updated to include the downloaded image content in the LLM request as a vision content block.

Guidance

  • Verify that the image files are being downloaded and written to the correct directory (/sandbox/.openclaw-data/media/inbound/) and that the file paths are correctly formatted.
  • Check the OpenClaw agent loop code to ensure it is properly handling the image files and including them in the LLM request as vision content blocks.
  • Review the configuration of the vision-capable inference model to ensure it is correctly set up and enabled.
  • Test the agent loop with a text-only model to verify that it returns a clear error message indicating that the model does not support vision input.
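One way to carry out the second check is to capture the outgoing request body and count the image blocks in it. The sketch below assumes an OpenAI-style `messages` list; the helper name `count_image_blocks` is hypothetical, and how the request is captured (debug logging, a proxy, and so on) is left open because the issue does not say what OpenClaw exposes.

```python
def count_image_blocks(messages: list[dict]) -> int:
    """Count image_url content blocks across a list of chat messages."""
    count = 0
    for message in messages:
        content = message.get("content")
        # Plain-string content is text-only and cannot carry an image.
        if isinstance(content, str):
            continue
        for block in content or []:
            if isinstance(block, dict) and block.get("type") == "image_url":
                count += 1
    return count
```

A result of 0 for a turn in which the user sent a photo would match the behaviour reported here: the file is on disk, but it never reaches the model.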


Notes

The issue seems to be specific to the OpenClaw agent loop and not related to NemoClaw. The fact that the image files are being downloaded and written to the correct directory suggests that the issue is with how the agent loop is handling the image files.

Recommendation

Update the OpenClaw agent loop to include the image content in the LLM request. This is a fix in OpenClaw itself rather than a workaround, and it will likely involve modifying the agent loop to load the stored image files and attach them to the request as vision content blocks.
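The two behaviours requested in the issue (attach images for a vision-capable model, fail clearly for a text-only one) could be combined in a single gate. Everything in this sketch is an assumption made for illustration: `VisionNotSupportedError`, `attach_images`, and the `model_supports_vision` flag are hypothetical names, and the issue does not describe how OpenClaw determines a model's capabilities.

```python
class VisionNotSupportedError(Exception):
    """Raised when images are present but the active model is text-only."""


def attach_images(text: str, image_urls: list[str], model_supports_vision: bool) -> dict:
    """Return a user message, attaching images only when the model can read them."""
    if not image_urls:
        # No attachments: send a plain text message.
        return {"role": "user", "content": text}
    if not model_supports_vision:
        # Fail loudly instead of silently dropping the image.
        raise VisionNotSupportedError("the current model doesn't support vision input")
    blocks = [{"type": "text", "text": text}]
    for url in image_urls:
        blocks.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": blocks}
```

Raising a dedicated error lets the channel layer reply to the Telegram user with the explicit message, rather than leaving the model to improvise a disclaimer.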
