
openclaw - 💡(How to fix) Fix video_generate: expose xAI reference_images mode separately from image-to-video [1 participants]

openclaw · 2026-04-07 05:58:24


openclaw/openclaw#62325•Fetched 2026-04-08 03:05:58


Author

zqchris

Participants

zqchris


Summary

xAI's video API supports a reference_images generation mode that preserves character/style cues without forcing the input image to become the first frame.

OpenClaw currently appears to expose only these shared video modes:

- generate
- imageToVideo
- videoToVideo

As a result, when using video_generate with xAI and an input image, the behavior is effectively first-frame image-to-video, not xAI's dedicated reference-image mode.

Why this matters

For character-consistent video workflows, there is a big difference between:

- image-to-video: input image becomes the starting frame
- reference-images mode: input image guides identity/style/content without locking frame 1

Right now this makes it hard to use xAI for "keep the face/character, but generate a different scene / outfit / shot" workflows.

xAI docs

xAI's docs explicitly describe three generation modes:

- text-to-video: prompt
- image-to-video: prompt + image
- reference images: prompt + reference_images

And they explicitly note that reference_images is different from image-to-video because the source image does not become the starting frame.

Reference: https://docs.x.ai/developers/model-capabilities/video/generation
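The three modes listed above differ only in which image field accompanies the prompt, which can be sketched as a discriminated union. This is a minimal sketch, not xAI's SDK or OpenClaw's code: the names `XaiVideoRequest`, `ImageRef`, and `buildXaiVideoBody` are hypothetical, and the `{ url }` shape of an image entry is an assumption. Only the field names `prompt`, `image`, and `reference_images` come from the issue text.

```typescript
// Hypothetical image descriptor; the real xAI payload shape may differ.
interface ImageRef {
  url: string;
}

// One variant per generation mode described in the xAI docs.
type XaiVideoRequest =
  | { kind: "text-to-video"; prompt: string }
  | { kind: "image-to-video"; prompt: string; image: ImageRef }
  | { kind: "reference-images"; prompt: string; referenceImages: ImageRef[] };

// Builds a request body that carries either `image` or `reference_images`, never both.
function buildXaiVideoBody(req: XaiVideoRequest): Record<string, unknown> {
  switch (req.kind) {
    case "text-to-video":
      return { prompt: req.prompt };
    case "image-to-video":
      // The image becomes the starting frame.
      return { prompt: req.prompt, image: req.image };
    case "reference-images":
      if (req.referenceImages.length === 0) {
        throw new Error("reference-images mode needs at least one image");
      }
      // The images guide identity/style without locking frame 1.
      return { prompt: req.prompt, reference_images: req.referenceImages };
  }
}
```

Keeping the two image fields in separate variants makes it impossible to send a reference image down the first-frame path by accident.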

Current OpenClaw evidence

From local source inspection:

- src/video-generation/types.ts defines only:
  - generate
  - imageToVideo
  - videoToVideo
- src/agents/tools/video-generate-tool.ts gathers image inputs into loadedReferenceImages / inputImages, but the shared abstraction does not seem to distinguish xAI reference_images from normal image-to-video.
- A repo search did not show an obvious reference_images / referenceImages path in the video generation implementation.

This suggests OpenClaw currently has no explicit way to request xAI's reference-image mode.
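The gap can be pictured with a simplified model of a three-mode abstraction. This is an illustration of the behavior described in the issue, not OpenClaw's actual code: `inferMode`, `VideoInputs`, and `inputVideos` are hypothetical names, and only the three mode names and `inputImages` appear in the source inspection above.

```typescript
// The three shared modes reported in src/video-generation/types.ts.
type VideoGenerationMode = "generate" | "imageToVideo" | "videoToVideo";

// Hypothetical input bundle; inputVideos is an assumed field.
interface VideoInputs {
  inputImages: string[];
  inputVideos: string[];
}

// With only three modes, any image input can only resolve to imageToVideo.
function inferMode(inputs: VideoInputs): VideoGenerationMode {
  if (inputs.inputVideos.length > 0) return "videoToVideo";
  if (inputs.inputImages.length > 0) return "imageToVideo";
  return "generate";
}
```

Under this model no combination of inputs can yield a reference-image request, which matches the first-frame behavior reported in the repro.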

Repro

1. Configure xAI video generation (xai/grok-imagine-video)
2. Call video_generate with a prompt plus an input image
3. Try to generate a video where the image should act only as identity reference (new background / framing / outfit / action)
4. Actual behavior looks like first-frame-driven image-to-video rather than reference-guided generation

Expected behavior

One of:

1. Add first-class support for xAI reference_images mode in the shared video abstraction, or
2. Add a provider-specific switch so video_generate can route xAI requests to reference_images instead of standard image-to-video when requested

Possible implementation direction

- Extend VideoGenerationMode / provider capabilities with a reference-image mode
- Or add a request flag such as referenceMode: "first-frame" | "reference-images" (or equivalent) that xAI can honor
- Surface this in video_generate without breaking existing providers
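One way these directions could fit together is a capability check combined with an opt-in flag. The sketch below is hypothetical: `ProviderCapabilities`, `ModeRequest`, and `resolveMode` are illustrative names, not existing OpenClaw APIs, and the capability sets are made up for the example. It assumes the flag defaults to first-frame behavior so existing callers are unaffected.

```typescript
type VideoGenerationMode =
  | "generate"
  | "imageToVideo"
  | "videoToVideo"
  | "referenceImages"; // assumed new mode

type ReferenceMode = "first-frame" | "reference-images";

// Hypothetical capability record a provider could declare.
interface ProviderCapabilities {
  modes: ReadonlySet<VideoGenerationMode>;
}

// Hypothetical, reduced view of a video_generate request.
interface ModeRequest {
  imageCount: number;
  referenceMode?: ReferenceMode; // omitted means "first-frame"
}

function resolveMode(req: ModeRequest, caps: ProviderCapabilities): VideoGenerationMode {
  if (req.imageCount === 0) return "generate";
  if (req.referenceMode === "reference-images") {
    // Fail loudly instead of silently falling back to first-frame behavior.
    if (!caps.modes.has("referenceImages")) {
      throw new Error("provider does not support reference-images mode");
    }
    return "referenceImages";
  }
  return "imageToVideo";
}

// Illustrative capability sets, not taken from OpenClaw.
const xaiCaps: ProviderCapabilities = {
  modes: new Set<VideoGenerationMode>(["generate", "imageToVideo", "referenceImages"]),
};
const otherCaps: ProviderCapabilities = {
  modes: new Set<VideoGenerationMode>(["generate", "imageToVideo"]),
};
```

Throwing for unsupported providers is a design choice in this sketch; a silent fallback to image-to-video would reproduce the confusion the issue describes.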

Notes

This is not a request to change current image-to-video behavior. That path is useful as-is. The missing piece is a way to access xAI's documented non-first-frame reference-image behavior through OpenClaw.

extent analysis

TL;DR

To fix the issue, OpenClaw needs to add support for xAI's reference_images mode, either by extending the shared video abstraction or by adding a provider-specific switch.

Guidance

- Review the xAI documentation to understand how image-to-video and reference_images differ.
- Investigate extending VideoGenerationMode in src/video-generation/types.ts with a referenceImages mode.
- Consider adding a request flag, such as referenceMode, to the video_generate tool so callers can select the mode explicitly.
- Update src/agents/tools/video-generate-tool.ts to handle the new mode and pass reference_images to the xAI API instead of a first-frame image.

Example

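// Sketch of how the shared video modes could be extended
// (illustrative only, not OpenClaw's actual definitions)
enum VideoGenerationMode {
  generate = 'generate',
  imageToVideo = 'imageToVideo',
  videoToVideo = 'videoToVideo',
  referenceImages = 'referenceImages', // New mode for xAI's reference_images
}

// Sketch of an opt-in request flag for video_generate
interface VideoGenerateOptions {
  // Omitting the flag keeps today's first-frame behavior
  referenceMode?: 'first-frame' | 'reference-images';
}

// Returns the mode an image-bearing request would be routed to
function videoGenerate(options: VideoGenerateOptions): VideoGenerationMode {
  if (options.referenceMode === 'reference-images') {
    // Use xAI's reference_images mode
    return VideoGenerationMode.referenceImages;
  }
  // Default: standard first-frame image-to-video
  return VideoGenerationMode.imageToVideo;
}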

Notes

The implementation details will depend on the specific requirements of the OpenClaw project and the xAI API. It's essential to ensure that the new referenceImages mode is properly handled and that the existing imageToVideo mode is not affected.

Recommendation

Extend VideoGenerationMode with a reference-image mode, or add a request flag to video_generate, so that xAI's reference_images mode can be selected explicitly. Keeping the new path opt-in means existing image-to-video behavior stays unchanged.
