openclaw - 💡(How to fix) Fix video_generate: expose xAI reference_images mode separately from image-to-video [1 participants]

GitHub stats
openclaw/openclaw#62325 · Fetched 2026-04-08 03:05:58
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Summary

xAI's video API supports a reference_images generation mode that preserves character/style cues without forcing the input image to become the first frame.

OpenClaw currently appears to expose only these shared video modes:

  • generate
  • imageToVideo
  • videoToVideo

As a result, when using video_generate with xAI and an input image, the behavior is effectively first-frame image-to-video, not xAI's dedicated reference-image mode.

Why this matters

For character-consistent video workflows, there is a big difference between:

  • image-to-video: input image becomes the starting frame
  • reference-images mode: input image guides identity/style/content without locking frame 1

Right now this makes it hard to use xAI for "keep the face/character, but generate a different scene / outfit / shot" workflows.

xAI docs

xAI's docs explicitly describe three generation modes:

  • text-to-video: prompt
  • image-to-video: prompt + image
  • reference images: prompt + reference_images

And they explicitly note that reference_images is different from image-to-video because the source image does not become the starting frame.

Reference: https://docs.x.ai/developers/model-capabilities/video/generation
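
To make the three modes concrete, here is a minimal TypeScript sketch that models them as a discriminated union and classifies a request by which fields it carries. The field names prompt, image, and reference_images come from the description above; the value types (plain strings, a string array) and the names XaiVideoRequest and classifyXaiRequest are assumptions made for illustration, not xAI's or OpenClaw's actual types.

```typescript
// Field names follow the issue text; value types are assumptions.
type TextToVideo = { prompt: string };
type ImageToVideo = { prompt: string; image: string };
type ReferenceImages = { prompt: string; reference_images: string[] };

// Hypothetical union of the three documented request shapes.
type XaiVideoRequest = TextToVideo | ImageToVideo | ReferenceImages;

type XaiMode = "text-to-video" | "image-to-video" | "reference-images";

// Classify a request by the fields it carries.
function classifyXaiRequest(req: XaiVideoRequest): XaiMode {
  if ("reference_images" in req) return "reference-images";
  if ("image" in req) return "image-to-video";
  return "text-to-video";
}
```

The point of the sketch is that the two image-bearing shapes differ in the field they use, so an abstraction with a single "image input" slot cannot express the difference on its own.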

Current OpenClaw evidence

From local source inspection:

  • src/video-generation/types.ts defines only:
    • generate
    • imageToVideo
    • videoToVideo
  • src/agents/tools/video-generate-tool.ts gathers image inputs into loadedReferenceImages / inputImages, but the shared abstraction does not seem to distinguish xAI reference_images from normal image-to-video.
  • A repo search did not show an obvious reference_images / referenceImages path in the video generation implementation.

This suggests OpenClaw currently has no explicit way to request xAI's reference-image mode.

Repro

  1. Configure xAI video generation (xai/grok-imagine-video)
  2. Call video_generate with a prompt plus an input image
  3. Try to generate a video where the image should act only as identity reference (new background / framing / outfit / action)
  4. Actual behavior looks like first-frame-driven image-to-video rather than reference-guided generation

Expected behavior

One of:

  1. Add first-class support for xAI reference_images mode in the shared video abstraction, or
  2. Add a provider-specific switch so video_generate can route xAI requests to reference_images instead of standard image-to-video when requested
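
Option 2 could look roughly like the sketch below: the xAI adapter picks the payload shape from an opt-in flag and keeps first-frame image-to-video as the default. Everything here is hypothetical, including VideoRequest, referenceMode, and buildXaiPayload; only the field names prompt, image, and reference_images are taken from the xAI docs as summarized in this issue.

```typescript
type ReferenceMode = "first-frame" | "reference-images";

// Hypothetical shape of what the tool hands to the xAI provider.
interface VideoRequest {
  prompt: string;
  inputImages: string[]; // assumed: image URLs or data URIs
  referenceMode?: ReferenceMode; // assumed opt-in flag
}

type XaiPayload =
  | { prompt: string }
  | { prompt: string; image: string }
  | { prompt: string; reference_images: string[] };

// Route to reference_images only when explicitly requested.
function buildXaiPayload(req: VideoRequest): XaiPayload {
  const first = req.inputImages[0];
  if (first === undefined) {
    return { prompt: req.prompt }; // no image: text-to-video
  }
  if (req.referenceMode === "reference-images") {
    return { prompt: req.prompt, reference_images: [...req.inputImages] };
  }
  // Default keeps today's behavior: the image becomes the first frame.
  return { prompt: req.prompt, image: first };
}
```

Because the flag is optional and defaults to the first-frame path, existing callers would see no change unless they opt in.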

Possible implementation direction

  • Extend VideoGenerationMode / provider capabilities with a reference-image mode
  • Or add a request flag such as referenceMode: "first-frame" | "reference-images" (or equivalent) that xAI can honor
  • Surface this in video_generate without breaking existing providers
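
One way to combine these directions without breaking existing providers is to gate the new mode on declared provider capabilities. The sketch below is an assumption-heavy illustration: the real shape of VideoGenerationMode in src/video-generation/types.ts is not shown in this issue, and ProviderCapabilities and resolveMode are hypothetical names.

```typescript
// Assumed string-union form of the shared modes plus the proposed new one.
type VideoGenerationMode =
  | "generate"
  | "imageToVideo"
  | "videoToVideo"
  | "referenceImages";

// Hypothetical capability declaration per provider.
interface ProviderCapabilities {
  modes: ReadonlySet<VideoGenerationMode>;
}

// Resolve the mode for a text or image request (video inputs omitted for brevity).
function resolveMode(
  caps: ProviderCapabilities,
  hasImages: boolean,
  requested?: "first-frame" | "reference-images",
): VideoGenerationMode {
  if (!hasImages) return "generate";
  if (requested === "reference-images") {
    if (!caps.modes.has("referenceImages")) {
      // Fail loudly rather than silently falling back to first-frame.
      throw new Error("provider does not support reference-image mode");
    }
    return "referenceImages";
  }
  return "imageToVideo"; // unchanged default
}
```

Throwing on an unsupported request is a design choice in this sketch; a silent fallback to imageToVideo would reproduce the confusion this issue describes.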

Notes

This is not a request to change current image-to-video behavior. That path is useful as-is. The missing piece is a way to access xAI's documented non-first-frame reference-image behavior through OpenClaw.

extent analysis

TL;DR

To fix the issue, OpenClaw needs to add support for xAI's reference_images mode, either by extending the shared video abstraction or by adding a provider-specific switch.

Guidance

  • Review the xAI documentation to understand how its image-to-video mode differs from its reference_images mode.
  • Investigate extending the VideoGenerationMode definition in src/video-generation/types.ts with a referenceImages mode (the issue does not show whether it is an enum or a union type).
  • Consider adding a request flag, such as referenceMode, to the video_generate tool so users can choose the desired mode.
  • Update src/agents/tools/video-generate-tool.ts to handle the new referenceImages mode and pass the corresponding parameters to the xAI API.

Example

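// Example of how the VideoGenerationMode enum could be extended
// (hypothetical; the real definition in src/video-generation/types.ts may differ)
enum VideoGenerationMode {
  generate = 'generate',
  imageToVideo = 'imageToVideo',
  videoToVideo = 'videoToVideo',
  referenceImages = 'referenceImages', // New mode for xAI's reference_images
}

// Example of how the video_generate options could be extended
interface VideoGenerateOptions {
  referenceMode?: 'first-frame' | 'reference-images';
}

// Example of choosing the mode for a request that carries an input image
// (hypothetical helper name)
function resolveImageInputMode(options: VideoGenerateOptions): VideoGenerationMode {
  if (options.referenceMode === 'reference-images') {
    // Use xAI's reference_images mode
    return VideoGenerationMode.referenceImages;
  }
  // Use standard image-to-video mode (the default, so existing calls are unchanged)
  return VideoGenerationMode.imageToVideo;
}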

Notes

The implementation details will depend on the specific requirements of the OpenClaw project and the xAI API. It's essential to ensure that the new referenceImages mode is properly handled and that the existing imageToVideo mode is not affected.

Recommendation

Extend the VideoGenerationMode definition and add an opt-in request flag to the video_generate tool so that xAI requests can use reference_images mode. Keeping image-to-video as the default lets users reach the new behavior without breaking existing functionality.
