litellm - [Bug]: Anthropic input_audio base64 leaks into Gemini text tokens, causing inflated input token counts

litellm, 2026-05-28 10:18:34


Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

Description

When using LiteLLM to proxy requests from Anthropic format to Gemini Flash (via Google AI Studio), we intermittently see a massive spike in input text tokens. The base64-encoded audio string from the input_audio block appears to leak into the text content of the request sent to Gemini, so it is tokenized as text in addition to being correctly processed as audio.

What happened?

We pass audio as inline base64 in the Anthropic input_audio block. LiteLLM translates this to Gemini's native format. Intermittently (~25% of calls on peak days), the translation fails and the raw base64 string leaks into the text content of the Gemini request. Gemini then charges both audio tokens AND text tokens for the same audio data.

Expected text tokens per call: ~1,750 (system prompt only)
Observed text tokens on spiked calls: ~8,500–10,000
Spurious extra tokens per call: ~6,800–8,500
Output quality is NOT affected: extraction results remain accurate
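For reference, the intended translation described above could look roughly like the sketch below. This is not LiteLLM's actual implementation: `to_gemini_part` is a hypothetical helper, and the output shape follows the `inline_data` block named in this report, with the `mime_type` and `data` field names assumed from Gemini's inline-data format. The only point it illustrates is that the base64 string should land in `inline_data` and never in a `text` part.

```python
from typing import Any


def to_gemini_part(block: dict[str, Any]) -> dict[str, Any]:
    """Map one input_audio content block to a Gemini-style inline_data part.

    Hypothetical helper for illustration; not taken from LiteLLM.
    """
    if block.get("type") != "input_audio":
        raise ValueError(f"unsupported block type: {block.get('type')!r}")
    source = block["source"]
    if source.get("type") != "base64":
        raise ValueError("only inline base64 sources are handled in this sketch")
    # The base64 payload belongs in inline_data, never in a text part.
    return {
        "inline_data": {
            "mime_type": source["media_type"],
            "data": source["data"],
        }
    }


block = {
    "type": "input_audio",
    "source": {"type": "base64", "media_type": "audio/mp3", "data": "QUJD"},
}
part = to_gemini_part(block)
```

If a translation step ever returned a part with a `text` key carrying `source["data"]`, that would match the leak described in this report.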

Steps to Reproduce

Use LiteLLM v1.84.0 with a Gemini Flash model (gemini/gemini-flash-3 or similar)
Send a request in Anthropic format with audio passed as inline base64 in the input_audio block:

{
  "model": "gemini/gemini-flash-3",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "source": {
            "type": "base64",
            "media_type": "audio/mp3",
            "data": "<base64_encoded_audio>"
          }
        }
      ]
    }
  ]
}

Check the usage.prompt_tokens / input_tokens breakdown in the response
Observe text token count inflated to ~8,500–10,000 on affected calls
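A minimal sketch for building the request body from the steps above out of raw audio bytes. `build_payload` is a hypothetical helper, and the three-byte input stands in for a real mp3 file. Sending the request through the LiteLLM proxy is left out, since the endpoint and credentials depend on the deployment.

```python
import base64
from typing import Any


def build_payload(audio_bytes: bytes, model: str = "gemini/gemini-flash-3") -> dict[str, Any]:
    """Build the request body from this report with the audio inlined as base64."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "source": {
                            "type": "base64",
                            "media_type": "audio/mp3",
                            "data": encoded,
                        },
                    }
                ],
            }
        ],
    }


# Placeholder bytes; in practice read the mp3 file with open(path, "rb").read().
payload = build_payload(b"\x00\x01\x02")
```

Because the report says identical payloads show both behaviors, sending the same `payload` repeatedly and comparing the usage breakdown per call is one way to observe the intermittency.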

Expected Behavior

Text tokens should only reflect the system prompt (~1,750 tokens). Audio should be counted separately as audio tokens only. Base64 audio data should be correctly translated into Gemini's inline_data block, NOT leaked into text content.

Actual Behavior

On ~25% of calls, the base64 audio string is passed as text content to Gemini, resulting in ~6,800–8,500 spurious text tokens per call. The issue is non-deterministic: the same payload sometimes works correctly and sometimes spikes.
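To quantify the behavior described above across many calls, a sketch like the following could flag spiked calls and estimate the extra text tokens. The `CallUsage` fields are assumed names for a per-modality token breakdown, the 50% tolerance is an arbitrary choice, and the sample calls are made up. Only the ~1,750 baseline comes from this report.

```python
from dataclasses import dataclass


@dataclass
class CallUsage:
    text_tokens: int   # text-modality input tokens reported for the call
    audio_tokens: int  # audio-modality input tokens reported for the call


def is_spiked(usage: CallUsage, baseline: int = 1750, tolerance: float = 0.5) -> bool:
    """True when text tokens exceed the baseline by more than `tolerance`."""
    return usage.text_tokens > baseline * (1 + tolerance)


def spike_report(calls: list[CallUsage], baseline: int = 1750) -> dict[str, float]:
    """Summarize how often calls spike and by how many text tokens on average."""
    spiked = [c for c in calls if is_spiked(c, baseline)]
    extra = [c.text_tokens - baseline for c in spiked]
    return {
        "spike_rate": len(spiked) / len(calls) if calls else 0.0,
        "mean_extra_text_tokens": sum(extra) / len(extra) if extra else 0.0,
    }


# Made-up sample: three normal calls and one spiked call.
calls = [
    CallUsage(1750, 900),
    CallUsage(1760, 900),
    CallUsage(9000, 900),
    CallUsage(1740, 900),
]
report = spike_report(calls)
```

Comparing against a fixed baseline works here because the text portion of each request is the same system prompt. With varying prompts, the baseline would need to be computed per request.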

Confirmation / Isolation

Tested the same calls directly on Google AI Studio (bypassing LiteLLM): text tokens were consistently normal (~1,750)
The issue is not related to call duration, audio file size, mp3 format, retries, or payload construction (identical payloads show both normal and spiked behavior)
Using the Gemini File API / Vertex API directly shows no inflation, confirming the issue is specific to the inline base64 path through LiteLLM
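One further isolation step would be to inspect the request body LiteLLM actually sends to Gemini, captured however is convenient (for example from debug logging), and check whether the base64 string appears in any text part. The sketch below assumes the body uses Gemini's `contents` / `parts` layout with `text` and `inline_data` keys. `find_leaked_text_parts` is a hypothetical helper, and both sample bodies are constructed by hand for illustration.

```python
from typing import Any


def find_leaked_text_parts(
    request_body: dict[str, Any], audio_b64: str, probe_len: int = 64
) -> list[tuple[int, int]]:
    """Return (content_index, part_index) for text parts containing the audio."""
    probe = audio_b64[:probe_len]  # a prefix is enough to identify the payload
    hits: list[tuple[int, int]] = []
    if not probe:
        return hits
    for ci, content in enumerate(request_body.get("contents", [])):
        for pi, part in enumerate(content.get("parts", [])):
            text = part.get("text")
            if isinstance(text, str) and probe in text:
                hits.append((ci, pi))
    return hits


audio_b64 = "QUJD" * 40  # stand-in for the real base64 audio string
inline = {"inline_data": {"mime_type": "audio/mp3", "data": audio_b64}}

# Hand-built bodies: one correct, one with the audio duplicated into a text part.
clean_body = {"contents": [{"role": "user", "parts": [inline]}]}
leaked_body = {"contents": [{"role": "user", "parts": [{"text": audio_b64}, inline]}]}

clean_hits = find_leaked_text_parts(clean_body, audio_b64)
leaked_hits = find_leaked_text_parts(leaked_body, audio_b64)
```

This only scans `contents`. If the system prompt travels in a separate field, that field would need the same check.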

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

1.84.0

Twitter / LinkedIn details

No response
