litellm - 💡(How to fix) Fix [Feature]: `gemini-embedding-2-preview` multimodal input support (image/video/audio/PDF) in `batchEmbedContents` [2 comments, 3 participants]

Source: BerriAI/litellm#24393 (fetched 2026-04-08 01:18:02)

Feature Description

gemini-embedding-2-preview is Google's first natively multimodal embedding model, supporting text, image, video, audio, and PDF inputs. However, the current LiteLLM implementation handles only text input: every other modality is silently sent as a text string.

Current Behavior

In litellm/llms/vertex_ai/gemini_embeddings/batch_embed_content_transformation.py, the transform_openai_input_gemini_content function wraps all inputs as PartType(text=i):

def transform_openai_input_gemini_content(input, model, optional_params):
    for i in input:
        request = EmbedContentRequest(
            model=gemini_model_name,
            content=ContentType(parts=[PartType(text=i)]),  # ← always text
            **optional_params
        )

This means:

  • Base64-encoded images → sent as text string (garbage embedding)
  • data:image/png;base64,... URIs → sent as text string
  • Video/audio references → sent as text string
  • Mixed text+image inputs → not supported

Expected Behavior

The function should detect input type and construct appropriate PartType:

Input format | Should produce
Plain string | PartType(text=input)
data:image/png;base64,... | PartType(inline_data=BlobType(mime_type="image/png", data=base64_data))
Raw base64 (detected) | PartType(inline_data=BlobType(mime_type="image/png", data=base64_data))
gs://... GCS URI | PartType(file_data=FileDataType(mime_type=..., file_uri=uri))
Dict {"text": ..., "inline_data": ...} | Multi-part content with both text and media
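
The "Raw base64 (detected)" row leaves the detection method open. One possible heuristic, sketched here, is to strictly decode the string and compare its leading bytes against well-known file signatures. The helper name sniff_base64_mime_type and the signature table are illustrative assumptions, not part of LiteLLM.

```python
import base64
import binascii
from typing import Optional

# File signatures (magic bytes) for a few formats; illustrative, not exhaustive.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
)


def sniff_base64_mime_type(value: str) -> Optional[str]:
    """Return a MIME type if `value` is base64 for a known format, else None."""
    try:
        # validate=True rejects any character outside the base64 alphabet
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return None  # not valid base64, so the caller can treat it as text
    for signature, mime_type in _SIGNATURES:
        if raw.startswith(signature):
            return mime_type
    return None  # valid base64 but no known signature
```

A caller could fall back to PartType(text=...) whenever this returns None. Short plain-text strings can happen to be valid base64, which is why the sketch also requires a signature match before treating the input as media.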

Type Definitions Already Support This

The PartType TypedDict already has the necessary fields:

class PartType(TypedDict, total=False):
    text: str
    inline_data: BlobType      # ← for base64 images/audio
    file_data: FileDataType    # ← for GCS URIs

And BlobType:

class BlobType(TypedDict, total=False):
    mime_type: Required[str]
    data: Required[str]  # base64-encoded
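
Because data must be a base64 string, raw media bytes need encoding before they go into a BlobType. A minimal sketch follows, using a local stand-in for BlobType so it runs without LiteLLM installed; the helper name blob_from_bytes is hypothetical.

```python
import base64
from typing import TypedDict


class BlobType(TypedDict):
    # Local stand-in that mirrors the two fields shown above.
    mime_type: str
    data: str


def blob_from_bytes(raw: bytes, mime_type: str) -> BlobType:
    """Base64-encode raw media bytes into an inline_data payload."""
    return BlobType(mime_type=mime_type, data=base64.b64encode(raw).decode("ascii"))
```

For example, blob_from_bytes(open("cat.png", "rb").read(), "image/png") would produce a value suitable for PartType(inline_data=...), assuming the file exists.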

Supported Modalities per Google Docs

Per Google's documentation:

  • Text: up to 8,192 tokens
  • Images: up to 6 per request (PNG, JPEG)
  • Video: max 2 minutes (MP4, MOV)
  • Audio: max 80 seconds (MP3, WAV)
  • PDF: up to 6 pages per file
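
Some of these limits could be checked client-side before a request is sent. The sketch below covers only the file formats and the image count; the format-to-MIME-type mapping, the function name check_media_parts, and the reading of "per request" as "per list of parts" are all assumptions made for illustration. Token, duration, and page limits are not checked.

```python
from typing import Dict, List

# MIME types for the formats listed above. The mapping from format name to
# MIME type is an assumption made for this sketch.
ALLOWED_MIME_TYPES = {
    "image/png", "image/jpeg",
    "video/mp4", "video/quicktime",
    "audio/mpeg", "audio/wav",
    "application/pdf",
}
MAX_IMAGES_PER_REQUEST = 6


def check_media_parts(parts: List[Dict]) -> None:
    """Raise ValueError if the parts break the format or image-count limits."""
    image_count = 0
    for part in parts:
        media = part.get("inline_data") or part.get("file_data")
        if media is None:
            continue  # text-only part, nothing to check here
        mime_type = media["mime_type"]
        if mime_type not in ALLOWED_MIME_TYPES:
            raise ValueError(f"Unsupported MIME type: {mime_type}")
        if mime_type.startswith("image/"):
            image_count += 1
    if image_count > MAX_IMAGES_PER_REQUEST:
        raise ValueError(f"Too many images: {image_count} > {MAX_IMAGES_PER_REQUEST}")
```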

Motivation

Without this, gemini-embedding-2-preview in LiteLLM is effectively a text-only embedding model and loses its primary differentiator, native multimodal support. Users who want image, video, or audio embeddings have to bypass LiteLLM entirely.

Suggested Implementation

  1. Add input-type detection logic in transform_openai_input_gemini_content (similar to what VertexAIMultimodalEmbeddingConfig._process_input_element does for the old multimodalembedding model)
  2. Construct PartType with appropriate fields based on detected input type
  3. Support data:mime_type;base64,... URI format for inline media
  4. Support dict inputs like {"inline_data": {"mime_type": "image/png", "data": "base64..."}} for explicit specification

What part of LiteLLM is this about?

SDK (litellm Python package)

What LiteLLM version are you on ?

main branch

Twitter / LinkedIn details

No response

extent analysis

Fix Plan

To fix the issue, we need to modify the transform_openai_input_gemini_content function to detect the input type and construct the appropriate PartType. Here are the steps:

  • Add input-type detection logic
  • Construct PartType with appropriate fields based on detected input type
  • Support data:mime_type;base64,... URI format for inline media
  • Support dict inputs like {"inline_data": {"mime_type": "image/png", "data": "base64..."}} for explicit specification

Here's an example of how the modified function could look:

import mimetypes

# Import path is an assumption; the issue shows these types but not their module.
from litellm.types.llms.vertex_ai import (
    BlobType,
    ContentType,
    EmbedContentRequest,
    FileDataType,
    PartType,
)


def transform_openai_input_gemini_content(input, model, optional_params):
    # Assumption: the API expects the "models/<name>" form. The snippet in the
    # issue uses gemini_model_name without showing where it is defined.
    gemini_model_name = f"models/{model}"

    def build_parts(i):
        """Return the list of parts for one input element."""
        if isinstance(i, str):
            if i.startswith("data:"):
                # data:<mime_type>;base64,<payload>
                header, data = i.split(",", 1)
                mime_type = header[len("data:"):].split(";")[0]
                return [PartType(inline_data=BlobType(mime_type=mime_type, data=data))]
            if i.startswith("gs://"):
                # Guess the MIME type from the file extension, with a generic fallback
                mime_type = mimetypes.guess_type(i)[0] or "application/octet-stream"
                return [PartType(file_data=FileDataType(mime_type=mime_type, file_uri=i))]
            return [PartType(text=i)]
        if isinstance(i, dict):
            # Text and media go into separate parts, giving multi-part content
            parts = []
            if "text" in i:
                parts.append(PartType(text=i["text"]))
            if "inline_data" in i:
                inline_data = i["inline_data"]
                if isinstance(inline_data, str):
                    # Bare base64 string: fall back to the default MIME type
                    mime_type, data = "image/png", inline_data
                elif isinstance(inline_data, dict):
                    mime_type = inline_data.get("mime_type", "image/png")
                    data = inline_data.get("data", "")
                else:
                    raise ValueError(f"Unsupported inline_data type: {type(inline_data)}")
                parts.append(PartType(inline_data=BlobType(mime_type=mime_type, data=data)))
            if not parts:
                raise ValueError("Dict input needs a 'text' or 'inline_data' key")
            return parts
        raise ValueError(f"Unknown input type: {type(i)}")

    # A single string is one input, not a sequence of characters
    if isinstance(input, str):
        input = [input]

    requests = []
    for i in input:
        requests.append(
            EmbedContentRequest(
                model=gemini_model_name,
                content=ContentType(parts=build_parts(i)),
                **optional_params,
            )
        )
    return requests

Verification

To verify that the fix worked, you can test the transform_openai_input_gemini_content function with different input types, such as plain text, data URI, GCS URI, and dict input. Check that the constructed PartType is correct for each input type.
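
One way to run that check is to inspect which field each part populates. The sketch below assumes the requests are plain dicts shaped like {"model": ..., "content": {"parts": [...]}}, which follows from the constructor calls in the issue but is not confirmed there. The helper part_kinds and the sample requests are hypothetical stand-ins; in practice the list would come from calling transform_openai_input_gemini_content.

```python
def part_kinds(request):
    """List the populated field names of each part in one request."""
    return [sorted(part.keys()) for part in request["content"]["parts"]]


# Hand-written stand-ins for the requests the transformed inputs might produce.
sample_requests = [
    {"model": "models/example", "content": {"parts": [{"text": "a red bicycle"}]}},
    {
        "model": "models/example",
        "content": {
            "parts": [{"inline_data": {"mime_type": "image/png", "data": "iVBORw0KGgo="}}]
        },
    },
    {
        "model": "models/example",
        "content": {
            "parts": [{"file_data": {"mime_type": "video/mp4", "file_uri": "gs://bucket/clip.mp4"}}]
        },
    },
]

# Plain text, data URI, and GCS URI inputs should each populate a different field.
assert [part_kinds(r) for r in sample_requests] == [
    [["text"]],
    [["inline_data"]],
    [["file_data"]],
]
```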
