openclaw: [Feature]: Add describe_view camera frame support to OpenAI Realtime Talk

Summary

Add a describe_view tool for OpenAI Realtime Talk sessions so the voice model can request the user's current camera frame and respond with visual context during a live voice conversation.

Problem to solve

Realtime Talk can hear the user, but it cannot currently inspect what the user is pointing at or showing on camera. This makes common voice interactions fail or require awkward manual follow-up, for example:

  • "What is this?"
  • "Can you see what's wrong here?"
  • "Describe what I'm looking at."
  • "Help me understand this object/error/screen."

The browser already owns camera permissions and the OpenAI Realtime WebRTC session, so visual context can be supplied without adding server-side camera access or a new gateway media pipeline.

Proposed solution

Add a describe_view realtime tool for OpenAI Realtime Talk sessions.

When the OpenAI Realtime model calls describe_view, the Control UI should:

  1. Capture the current camera frame in the browser from a navigator.mediaDevices.getUserMedia({ video: true }) stream.
  2. Encode the frame as a base64 JPEG image.
  3. Inject it directly into the OpenAI Realtime conversation over the existing WebRTC data channel using a conversation.item.create message containing input_image.
  4. Send the tool result and trigger response.create so the realtime model can describe the view aloud.
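
Steps 3 and 4 can be sketched as a small pure function that builds the three messages to send over the data channel. This is a minimal sketch, not the prototype's code: the helper name buildDescribeViewEvents, the type name, and the tool-result payload are hypothetical, and the event shapes reflect my reading of the OpenAI Realtime client-event format, so they should be checked against the current API reference.

```typescript
// Assumed event shapes for the OpenAI Realtime data channel (verify against the API docs).
type DescribeViewClientEvent =
  | {
      type: "conversation.item.create";
      item:
        | {
            type: "message";
            role: "user";
            content: Array<{ type: "input_image"; image_url: string }>;
          }
        | { type: "function_call_output"; call_id: string; output: string };
    }
  | { type: "response.create" };

// Hypothetical helper: builds the messages for one describe_view call.
function buildDescribeViewEvents(
  callId: string,
  jpegBase64: string,
): DescribeViewClientEvent[] {
  return [
    // Step 3: inject the captured frame as a user message containing input_image.
    {
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "user",
        content: [
          { type: "input_image", image_url: `data:image/jpeg;base64,${jpegBase64}` },
        ],
      },
    },
    // Step 4a: answer the tool call (the payload text here is illustrative only).
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: callId,
        output: JSON.stringify({ ok: true, note: "Camera frame attached." }),
      },
    },
    // Step 4b: ask the model to respond now that the frame is in context.
    { type: "response.create" },
  ];
}
```

In the browser, each returned event would be serialized with JSON.stringify and passed to the data channel's send method, in order.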

The tool should be registered for OpenAI WebRTC realtime sessions alongside the existing openclaw_agent_consult tool.
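
As an illustration of what registering the tool next to openclaw_agent_consult might look like, the sketch below defines a parameterless function tool and appends it to an existing tool list. The interface, the description strings, and the withDescribeView helper are assumptions for this example; the actual definition in src/talk/describe-view-tool.ts may differ.

```typescript
// Assumed function-tool shape for a realtime session (hypothetical interface name).
interface RealtimeFunctionTool {
  type: "function";
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<string, unknown>;
    required: string[];
  };
}

// describe_view takes no arguments: the browser decides what to capture.
const DESCRIBE_VIEW_TOOL: RealtimeFunctionTool = {
  type: "function",
  name: "describe_view",
  description:
    "Capture the user's current camera frame so you can see what they are showing.",
  parameters: { type: "object", properties: {}, required: [] },
};

// Hypothetical helper: add describe_view to the session tools exactly once.
function withDescribeView(
  tools: readonly RealtimeFunctionTool[],
): RealtimeFunctionTool[] {
  const alreadyRegistered = tools.some((t) => t.name === DESCRIBE_VIEW_TOOL.name);
  return alreadyRegistered ? [...tools] : [...tools, DESCRIBE_VIEW_TOOL];
}
```

Keeping the helper idempotent means it can be called from more than one session-creation path without duplicating the tool.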

Non-WebRTC realtime transports should fail gracefully:

  • Google Live should report that describe_view is not supported there.
  • Gateway relay should report that camera capture requires WebRTC/direct browser transport.
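
One way to express this graceful degradation is a single lookup that each transport consults before handling a describe_view call. The transport labels, the result shape, and the error wording below are hypothetical; only the behavior (supported on OpenAI WebRTC, explicit errors elsewhere) comes from the proposal.

```typescript
// Hypothetical transport labels; the real identifiers in the codebase may differ.
type TalkTransport = "openai-webrtc" | "google-live" | "gateway-relay";

interface DescribeViewSupport {
  ok: boolean;
  error?: string;
}

// Returns whether describe_view can run on a transport, with a reason if not.
function describeViewSupport(transport: TalkTransport): DescribeViewSupport {
  switch (transport) {
    case "openai-webrtc":
      return { ok: true };
    case "google-live":
      return {
        ok: false,
        error: "describe_view is not supported on Google Live sessions.",
      };
    case "gateway-relay":
      return {
        ok: false,
        error: "describe_view requires a WebRTC/direct browser transport for camera capture.",
      };
  }
}
```

An unsupported transport would return the error string as the tool result, so the model can tell the user why it cannot see the camera instead of stalling on an unanswered tool call.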

The gateway should stay mostly transport-agnostic. It only needs to expose/register the tool and preserve protocol compatibility; browser-owned camera capture and provider-specific image injection stay in the Control UI transport implementation.

Alternatives considered

Alternative 1: Route the camera frame through openclaw_agent_consult.

This reuses the full OpenClaw agent's existing vision capability, but it adds an extra agent round trip and makes a quick "what am I looking at?" voice interaction slower and more complex.

Alternative 2: Add a new gateway-side vision/media pipeline.

This would require new server-side media infrastructure, frame samplers, provider routing, and possibly extra provider credentials. That is heavier than needed for the first OpenAI Realtime use case, where the browser already has both the camera frame and the active Realtime data channel.

Alternative 3: Do nothing.

Users must manually upload images or describe what they see, which breaks the natural voice workflow.

Impact

Affected users/systems: Users of Control UI Realtime Talk with OpenAI WebRTC sessions.

Severity: Medium. This does not break existing chat or voice, but it blocks a natural multimodal voice workflow.

Frequency: Occurs whenever a user asks the realtime voice assistant about something visible on camera.

Consequence: The assistant cannot answer visually grounded questions during live voice sessions, forcing users to switch modes, upload images manually, or describe the scene in text.

Evidence/examples

Example user flow:

  1. Open Control UI.
  2. Start Video Talk.
  3. Ask: "What is this?" while pointing the camera at an object.
  4. The realtime model calls describe_view.
  5. The browser captures one frame and injects it into the OpenAI Realtime conversation.
  6. The model describes the image aloud.
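
The flow in steps 4 to 6 can be sketched as a handler with the browser-specific parts injected, which also shows a failure path (for example, the user denying camera access). Everything named here is hypothetical: the ToolCall shape is a normalized stand-in rather than the raw Realtime event, and captureFrameAsJpegBase64 is assumed to be backed by getUserMedia plus a canvas in the real UI.

```typescript
// Normalized, hypothetical view of a tool call coming from the realtime model.
interface ToolCall {
  name: string;
  callId: string;
}

// Browser-owned pieces are injected so the flow itself stays easy to test.
interface DescribeViewDeps {
  captureFrameAsJpegBase64: () => Promise<string>;
  send: (event: unknown) => void;
}

// Handles one tool call; returns false when the call is not describe_view.
async function handleDescribeView(
  call: ToolCall,
  deps: DescribeViewDeps,
): Promise<boolean> {
  if (call.name !== "describe_view") return false;

  let output: string;
  try {
    const base64 = await deps.captureFrameAsJpegBase64();
    // Inject the single captured frame into the conversation.
    deps.send({
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "user",
        content: [{ type: "input_image", image_url: `data:image/jpeg;base64,${base64}` }],
      },
    });
    output = JSON.stringify({ ok: true });
  } catch (err) {
    // Capture failed: report it so the model can explain the problem aloud.
    const message = err instanceof Error ? err.message : String(err);
    output = JSON.stringify({ ok: false, error: message });
  }

  // Always answer the tool call, then trigger the spoken response.
  deps.send({
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: call.callId, output },
  });
  deps.send({ type: "response.create" });
  return true;
}
```

Answering the tool call on both paths keeps the conversation moving even when no frame could be captured.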

Prototype implementation notes:

  • Adds src/talk/describe-view-tool.ts.
  • Registers describe_view in talk.client.create / talk.session.create.
  • Adds a Video Talk entry point in Control UI.
  • Captures frames in ui/src/ui/chat/realtime-talk-webrtc.ts.
  • Sends unsupported-tool responses in Google Live and gateway relay transports.
  • Updates the default Permissions-Policy header from camera=() to camera=(self) so the served Control UI can request camera access.
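
The Permissions-Policy change in the last note can be illustrated with a small string rewrite. This is a sketch under assumptions: the helper name is hypothetical, it treats the header as a simple comma-separated list of directives (which would break on quoted origins containing commas), and it only touches a camera directive that is exactly camera=().

```typescript
// Hypothetical helper: relax a fully blocked camera directive to same-origin.
function allowCameraForSelf(policy: string): string {
  return policy
    .split(",")
    .map((directive) => directive.trim())
    .filter((directive) => directive.length > 0)
    .map((directive) =>
      // Only rewrite the exact "blocked everywhere" form; leave other allowlists alone.
      /^camera\s*=\s*\(\s*\)$/.test(directive) ? "camera=(self)" : directive,
    )
    .join(", ");
}

// Example: "camera=(), microphone=(self)" becomes "camera=(self), microphone=(self)".
```

Leaving any other camera allowlist untouched avoids narrowing a policy that an operator has already customized.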

Additional information

Out of scope for the initial implementation:

  • Server-side camera/screen capture.
  • A generic gateway media-call namespace.
  • Frame streaming or continuous video analysis.
  • Provider-independent image injection for every realtime transport.
  • Direct gateway calls to a separate Vision API.

The initial scope should be one-shot frame capture for OpenAI Realtime WebRTC sessions, with explicit graceful degradation for unsupported transports.
