litellm - 💡(How to fix) Fix [Feature]: `gemini-embedding-2-preview` multimodal input support (image/video/audio/PDF) in `batchEmbedContents` [2 comments, 3 participants]

Feature Description

gemini-embedding-2-preview is Google's first natively multimodal embedding model, supporting text, images, video, audio, and PDF inputs. However, the current LiteLLM implementation only handles text-only input — all other modalities are silently treated as text strings.

Current Behavior

In litellm/llms/vertex_ai/gemini_embeddings/batch_embed_content_transformation.py, the transform_openai_input_gemini_content function wraps all inputs as PartType(text=i):

def transform_openai_input_gemini_content(input, model, optional_params):
    for i in input:
        request = EmbedContentRequest(
            model=gemini_model_name,
            content=ContentType(parts=[PartType(text=i)]),  # ← always text
            **optional_params
        )

This means:

Base64-encoded images → sent as text string (garbage embedding)
data:image/png;base64,... URIs → sent as text string
Video/audio references → sent as text string
Mixed text+image inputs → not supported

Expected Behavior

The function should detect input type and construct appropriate PartType:

Input format	Should produce
Plain string	`PartType(text=input)`
`data:image/png;base64,...`	`PartType(inline_data=BlobType(mime_type="image/png", data=base64_data))`
Raw base64 (detected)	`PartType(inline_data=BlobType(mime_type="image/png", data=base64_data))`
`gs://...` GCS URI	`PartType(file_data=FileDataType(mime_type=..., file_uri=uri))`
Dict `{"text": ..., "inline_data": ...}`	Multi-part content with both text and media
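The "Raw base64 (detected)" row leaves the detection method open and assumes `image/png`. One possible approach, sketched here under assumptions, is to decode the string strictly and look at the leading magic bytes. The helper name `sniff_base64_mime` and the prefix table are hypothetical and are not part of LiteLLM.

```python
import base64
import binascii
from typing import Optional

# Leading "magic" bytes for a few of the formats the issue lists.
_MAGIC_PREFIXES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
)


def sniff_base64_mime(value: str) -> Optional[str]:
    """Return a MIME type if `value` is base64 for a known format, else None.

    Hypothetical helper for illustration only.
    """
    try:
        # validate=True rejects characters outside the base64 alphabet,
        # so ordinary sentences with spaces or punctuation fail here.
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return None
    for prefix, mime_type in _MAGIC_PREFIXES:
        if raw.startswith(prefix):
            return mime_type
    return None  # valid base64, but no recognised media header
```

Short words such as `test` are also valid base64, which is why the sketch requires a known header before treating a string as media instead of relying on decoding alone.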

Type Definitions Already Support This

The PartType TypedDict already has the necessary fields:

class PartType(TypedDict, total=False):
    text: str
    inline_data: BlobType      # ← for base64 images/audio
    file_data: FileDataType    # ← for GCS URIs

And BlobType:

class BlobType(TypedDict, total=False):
    mime_type: Required[str]
    data: Required[str]  # base64-encoded
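To illustrate that these fields are enough to describe text, inline media, and file references, here is a minimal sketch. It uses simplified stand-in TypedDicts so that it runs on its own. The shape of `FileDataType` is inferred from the Expected Behavior table, and the bucket name is made up.

```python
import base64
from typing import List, TypedDict


# Simplified stand-ins for the LiteLLM types quoted above (assumption:
# the real definitions are imported from LiteLLM in the actual module).
class BlobType(TypedDict):
    mime_type: str
    data: str  # base64-encoded


class FileDataType(TypedDict):
    mime_type: str
    file_uri: str


class PartType(TypedDict, total=False):
    text: str
    inline_data: BlobType
    file_data: FileDataType


# A few placeholder bytes, not a real image.
payload = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")

parts: List[PartType] = [
    PartType(text="a photo of a cat"),
    PartType(inline_data=BlobType(mime_type="image/png", data=payload)),
    PartType(
        file_data=FileDataType(
            mime_type="video/mp4", file_uri="gs://example-bucket/clip.mp4"
        )
    ),
]

# TypedDicts are plain dicts at runtime, so fields are read by key.
for part in parts:
    print(sorted(part.keys()))
```

Because a TypedDict instance is an ordinary dict, fields have to be set with `part["inline_data"] = ...`; attribute assignment such as `part.inline_data = ...` raises `AttributeError`.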

Supported Modalities per Google Docs

Per Google's documentation:

Text: up to 8,192 tokens
Images: up to 6 per request (PNG, JPEG)
Video: max 2 minutes (MP4, MOV)
Audio: max 80 seconds (MP3, WAV)
PDF: up to 6 pages per file
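The formats in this list correspond to standard MIME types, which is what `BlobType.mime_type` carries. The lookup below is a hedged sketch of how an implementation might reject unlisted types early. The table and the name `modality_for_mime` are hypothetical, and the size and duration limits above are not enforced.

```python
from typing import Optional

# Standard MIME types for the file formats listed above.
# Hypothetical table for illustration; not taken from LiteLLM.
SUPPORTED_MIME_TYPES = {
    "image/png": "image",
    "image/jpeg": "image",
    "video/mp4": "video",
    "video/quicktime": "video",  # MOV
    "audio/mpeg": "audio",       # MP3
    "audio/wav": "audio",
    "application/pdf": "pdf",
}


def modality_for_mime(mime_type: str) -> Optional[str]:
    """Return the modality for a MIME type, or None if it is not listed."""
    return SUPPORTED_MIME_TYPES.get(mime_type.strip().lower())
```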

Motivation

Without this, gemini-embedding-2-preview in LiteLLM is essentially a text-only embedding model, missing its primary differentiator — native multimodal support. Users who want image/video/audio embeddings have to bypass LiteLLM entirely.

Suggested Implementation

Add input-type detection logic in transform_openai_input_gemini_content (similar to what VertexAIMultimodalEmbeddingConfig._process_input_element does for the old multimodalembedding model)
Construct PartType with appropriate fields based on detected input type
Support data:mime_type;base64,... URI format for inline media
Support dict inputs like {"inline_data": {"mime_type": "image/png", "data": "base64..."}} for explicit specification

What part of LiteLLM is this about?

SDK (litellm Python package)

What LiteLLM version are you on ?

main branch

extent analysis

Fix Plan

To fix the issue, we need to modify the transform_openai_input_gemini_content function to detect the input type and construct the appropriate PartType. Here are the steps:

Add input-type detection logic
Construct PartType with appropriate fields based on detected input type
Support data:mime_type;base64,... URI format for inline media
Support dict inputs like {"inline_data": {"mime_type": "image/png", "data": "base64..."}} for explicit specification

Here's an example of how the modified function could look:

import mimetypes

# PartType, BlobType, FileDataType, ContentType and EmbedContentRequest are
# assumed to be in scope, as they are in the module being patched.

def _to_parts(i):
    """Map one embedding input to a list of Gemini parts."""
    if isinstance(i, str):
        if i.startswith('data:'):
            # Data URI: data:<mime_type>;base64,<payload>
            header, data = i.split(',', 1)
            mime_type = header[len('data:'):].split(';')[0]
            return [PartType(inline_data=BlobType(mime_type=mime_type, data=data))]
        if i.startswith('gs://'):
            # GCS URI: guess the MIME type from the file extension
            mime_type = mimetypes.guess_type(i)[0] or 'application/octet-stream'
            return [PartType(file_data=FileDataType(mime_type=mime_type, file_uri=i))]
        return [PartType(text=i)]  # plain text
    if isinstance(i, dict):
        # A Gemini part holds one kind of data, so text and media
        # become separate parts of the same content
        parts = []
        if 'text' in i:
            parts.append(PartType(text=i['text']))
        if 'inline_data' in i:
            inline_data = i['inline_data']
            if isinstance(inline_data, str):
                # Bare base64 string: the MIME type is unknown, image/png is a guess
                blob = BlobType(mime_type='image/png', data=inline_data)
            elif isinstance(inline_data, dict):
                blob = BlobType(
                    mime_type=inline_data.get('mime_type', 'image/png'),
                    data=inline_data.get('data', ''),
                )
            else:
                raise ValueError(f'Unsupported inline_data type: {type(inline_data)}')
            parts.append(PartType(inline_data=blob))
        if not parts:
            raise ValueError('Dict input needs a "text" or "inline_data" key')
        return parts
    raise ValueError(f'Unknown input type: {type(i)}')


def transform_openai_input_gemini_content(input, model, optional_params):
    # Assumption: the existing function derives the model name this way
    gemini_model_name = 'models/' + model
    if isinstance(input, (str, dict)):
        input = [input]  # a single input would otherwise be iterated piecewise
    requests = []
    for i in input:
        request = EmbedContentRequest(
            model=gemini_model_name,
            content=ContentType(parts=_to_parts(i)),
            **optional_params
        )
        requests.append(request)
    return requests

Verification

To verify that the fix worked, you can test the transform_openai_input_gemini_content function with different input types, such as plain text, data URI, GCS URI, and dict input. Check that the constructed PartType is correct for each input type.
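As a rough illustration of such checks, the sketch below tests a small stand-in, `to_part`, which is hypothetical and mirrors only the string-handling cases described in the Fix Plan. In a real test suite the assertions would instead call `transform_openai_input_gemini_content` and inspect the parts of each returned request.

```python
import mimetypes


def to_part(value: str) -> dict:
    """Hypothetical stand-in for the string branch of the proposed detection."""
    if value.startswith("data:"):
        header, data = value.split(",", 1)
        mime_type = header[len("data:"):].split(";")[0]
        return {"inline_data": {"mime_type": mime_type, "data": data}}
    if value.startswith("gs://"):
        mime_type = mimetypes.guess_type(value)[0] or "application/octet-stream"
        return {"file_data": {"mime_type": mime_type, "file_uri": value}}
    return {"text": value}


def test_plain_text():
    assert to_part("hello") == {"text": "hello"}


def test_data_uri():
    part = to_part("data:image/png;base64,iVBORw0KGgo=")
    assert part == {
        "inline_data": {"mime_type": "image/png", "data": "iVBORw0KGgo="}
    }


def test_gcs_uri():
    part = to_part("gs://example-bucket/cat.png")
    assert part["file_data"]["file_uri"] == "gs://example-bucket/cat.png"
    assert part["file_data"]["mime_type"] == "image/png"


if __name__ == "__main__":
    test_plain_text()
    test_data_uri()
    test_gcs_uri()
    print("all checks passed")
```

The functions follow the `test_*` naming convention, so a runner such as pytest would collect them without changes.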
