openclaw: [Feature] Expose image-tool to codex kernel so ChatGPT Plus OAuth agents can read images

StepCodex · 2026-05-21T09:59:08Z

Fix / Workaround

Current workaround in use: an external shell wrapper that reads ~/.openclaw/agents/main/agent/auth-profiles.json, writes the OAuth token to ~/.codex/auth.json, and then runs the following command once per image (the trailing - is part of the command): codex exec --skip-git-repo-check --model gpt-5.4 --image <path> -
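To make the per-image cost concrete, here is a minimal TypeScript sketch of how such a wrapper might assemble and run its command. Only the flags quoted above come from the report. The names buildCodexExecArgs and describeImage are hypothetical, and reading the trailing - as "take the prompt from stdin" is an assumption. The token-copy step is left as a comment because the report does not describe the layout of auth-profiles.json.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: builds the argv for one `codex exec` call.
// The flags are the ones quoted in the report; nothing else is added.
function buildCodexExecArgs(imagePath: string, model = "gpt-5.4"): string[] {
  return ["exec", "--skip-git-repo-check", "--model", model, "--image", imagePath, "-"];
}

// Hypothetical driver: spawns one codex subprocess per image.
// Assumption: the trailing "-" makes codex read the prompt from stdin.
function describeImage(imagePath: string, prompt: string): string {
  // Step omitted on purpose: copy the OAuth token from auth-profiles.json
  // into ~/.codex/auth.json (the file layout is not given in the report).
  const result = spawnSync("codex", buildCodexExecArgs(imagePath), {
    input: prompt,
    encoding: "utf8",
  });
  if (result.error) throw result.error;
  if (result.status !== 0) {
    throw new Error(`codex exited with status ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}

// Show the command line without actually launching codex.
console.log(["codex", ...buildCodexExecArgs("shots/listing-01.png")].join(" "));
```

Because every image goes through a fresh subprocess, nothing from the agent's existing session is reused, which is where the per-call overhead described in this report comes from.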

Summary

The codex kernel does not expose the image tool that the pi kernel has, so agents running under ChatGPT Plus OAuth cannot natively read screenshots even when the underlying model supports vision.

Problem to solve

createImageTool (in src/agents/tools/image-tool.ts) is registered via createOpenClawTools and consumed by pi kernel (src/agents/pi-tools.ts) plus the auto-reply and gateway HTTP paths. The codex kernel (used when agents.defaults.model.primary = openai-codex/*) never receives it. The same kernel also bypasses detectAndLoadPromptImages auto-injection in src/agents/pi-embedded-runner/run/attempt.ts.

Result: ChatGPT Plus OAuth agents cannot read PNGs from the workspace from inside the agent loop. Models like gpt-5.4 / gpt-5.5 clearly support vision, but the kernel does not forward image content as multimodal input. Users have to write external shell wrappers that re-auth and call codex exec --image per image, paying ~75 s overhead per call (codex subprocess spawn + 78 k-token system prompt reload each time).
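The gap can be pictured with a toy model. The sketch below is an illustration only: AgentTool, createSharedTools, buildCodexTools, and withSharedTools are hypothetical stand-ins, not OpenClaw's real interfaces, and the shell tool is an invented placeholder for whatever the codex kernel actually registers. It shows one kernel tool set built without the shared factory, and what merging the shared tools in might look like.

```typescript
// Hypothetical shape: not OpenClaw's real tool interface.
interface AgentTool {
  name: string;
  description: string;
}

// Stand-in for the shared factory the report calls createOpenClawTools.
function createSharedTools(): AgentTool[] {
  return [
    { name: "image", description: "Read an image file and pass it to the model" },
    { name: "pdf", description: "Read a PDF file and pass it to the model" },
  ];
}

// Stand-in for a kernel tool set that is assembled without the shared factory.
// The "shell" entry is a placeholder, not a claim about the real codex kernel.
function buildCodexTools(): AgentTool[] {
  return [{ name: "shell", description: "Run a shell command" }];
}

// Merge shared tools into a kernel tool set, skipping names already present.
function withSharedTools(kernelTools: AgentTool[], shared: AgentTool[]): AgentTool[] {
  const seen = new Set(kernelTools.map((tool) => tool.name));
  return [...kernelTools, ...shared.filter((tool) => !seen.has(tool.name))];
}

const toolsBefore = buildCodexTools().map((tool) => tool.name);
const toolsAfter = withSharedTools(buildCodexTools(), createSharedTools()).map((tool) => tool.name);

console.log(toolsBefore); // [ 'shell' ]
console.log(toolsAfter); // [ 'shell', 'image', 'pdf' ]
```

Where the equivalent merge would belong in the real codex kernel is exactly the open question in this report.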

Proposed solution

Any one of the following would solve it:

1. Register imageTool (and pdfTool while at it) when building the codex kernel tool set the way pi kernel does.
2. Enable detectAndLoadPromptImages on the codex kernel input path so bare image paths in agent prompts/tool results get auto-injected as multimodal content.
3. Failing both, document explicitly that ChatGPT Plus OAuth users do not have native vision and recommend a wrapper pattern.

Happy to take a PR shot at option 1 or 2 if a maintainer can sanity-check the right injection point — codex kernel's tool plumbing is less obvious than pi's createOpenClawTools chain.
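For readers unfamiliar with the auto-injection idea, the following is a minimal sketch of what detecting bare image paths in a prompt could look like. It is not the implementation of detectAndLoadPromptImages. The ContentPart shape, the extension list, and the function names findImageParts and toContentParts are all assumptions made for illustration, and the sketch neither checks that files exist nor loads their bytes.

```typescript
import { extname } from "node:path";

// Hypothetical content-part shape, loosely modelled on multimodal chat inputs.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; path: string; mimeType: string };

// Assumed set of supported extensions; the real list may differ.
const MIME_BY_EXT: Record<string, string> = {
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".webp": "image/webp",
};

// Scan whitespace-delimited tokens for ones ending in a known image extension.
// Surrounding quotes, brackets, and trailing punctuation are stripped first.
function findImageParts(prompt: string): ContentPart[] {
  const parts: ContentPart[] = [];
  const seen = new Set<string>();
  for (const token of prompt.split(/\s+/)) {
    const candidate = token.replace(/^["'(]+|["'),.;:]+$/g, "");
    const mimeType = MIME_BY_EXT[extname(candidate).toLowerCase()];
    if (mimeType !== undefined && !seen.has(candidate)) {
      seen.add(candidate);
      parts.push({ type: "image", path: candidate, mimeType });
    }
  }
  return parts;
}

// Keep the original text and append one image part per detected path.
function toContentParts(prompt: string): ContentPart[] {
  return [{ type: "text", text: prompt }, ...findImageParts(prompt)];
}

const exampleParts = toContentParts('Compare shots/a.png with "shots/b.jpg".');
console.log(exampleParts.length); // 3
```

Splitting on whitespace keeps the sketch short, but it means paths containing spaces are missed; a real detector would need a more careful tokenizer and a workspace-boundary check before reading anything from disk.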

Alternatives considered

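The current wrapper workaround (an external shell script that calls codex exec --image once per image) works, but it is slow (~75 s per call) and wasteful: each call reloads 78 k tokens of codex system prompt that should have been part of the existing long session.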

Alternative: tell users to drag images into chat manually — defeats the point of agent automation when there are 30+ screenshots to analyze.

Impact

Affected: Any agent flow that needs to read images from disk (competitor screenshot analysis, OCR of saved PNGs, multi-image visual QA). Specifically affects ChatGPT Plus OAuth users because they cannot switch to a pi-kernel-eligible provider without giving up the OAuth subscription.
Severity: Medium. A workaround exists, but it is roughly 10x slower than a native path should be.
Frequency: Every multi-image analysis turn. In our workflow: every time we run product competitor analysis (3 ASINs × ~15 screenshots each).
Consequence: ~45 minutes per product run on the wrapper path vs an estimated ~5 minutes with native image-tool / auto-injection.

Evidence/examples

Repro:

1. Set agents.defaults.model.primary = openai-codex/gpt-5.4.
2. In chat: "read this PNG and describe it" pointing at a workspace file.
3. Agent has no native tool; it either fakes the answer or shells out via a wrapper.

Token cost per wrapper call (from our logs):

input_token_count ≈ 78 k (cached system prompt re-loaded every time)
output_token_count ≈ 200-300
response time ≈ 70-75 s
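As a rough cross-check, the per-call figures above can be scaled to one product run. The sketch below uses only numbers quoted in this report and assumes exactly 15 screenshots for each of 3 ASINs, although the report gives the screenshot count as approximate.

```typescript
// Figures quoted in the report. The screenshot count is approximate ("~15").
const products = 3;
const screenshotsPerProduct = 15;
const inputTokensPerCall = 78_000;
const secondsPerCall = { low: 70, high: 75 };

// One wrapper call per screenshot.
const calls = products * screenshotsPerProduct;
const totalInputTokens = calls * inputTokensPerCall;
const minutes = {
  low: (calls * secondsPerCall.low) / 60,
  high: (calls * secondsPerCall.high) / 60,
};

console.log(`calls: ${calls}`); // calls: 45
console.log(`input tokens: ${totalInputTokens}`); // input tokens: 3510000
console.log(`wall time: ${minutes.low}-${minutes.high} min`); // wall time: 52.5-56.25 min
```

The result lands somewhat above the ~45 minutes reported under Impact, which is plausible given that both the screenshot count and the per-call time are approximate. Either way, a single run reloads on the order of 3.5 million input tokens of system prompt.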

Additional information

Environment:

OpenClaw image: ghcr.io/openclaw/openclaw:2026.5.18
codex CLI: 0.130.0
Model: openai-codex/gpt-5.4
Auth: ChatGPT Plus OAuth
