litellm - 💡(How to fix) Fix [Feature]: Cache Token Cost & Tracking for Custom Pricing and OpenAI-Compatible Providers [1 comments, 2 participants]

litellm, 2026-05-05 13:21:21

GitHub stats

BerriAI/litellm#27191 (fetched 2026-05-06 06:15:32)

Author: escon1004

Participants: escon1004, XidaoApi

Timeline (top): labeled ×3, commented ×1


The Feature

Summary

Two related but independent bugs prevented LiteLLM from correctly handling prompt-cache tokens when using:

Custom pricing (custom_cost_per_token with cache_read_input_token_cost / cache_creation_input_token_cost) — cache tokens were billed at the full input_cost_per_token rate instead of the configured cache rate.
OpenAI-compatible providers (moonshotai, openai, deepseek, kimi-k2, etc.) that report cache info via usage.prompt_tokens_details — daily spend aggregation always recorded Cache Read Tokens = 0 and Cache Write Tokens = 0 in the dashboard.

This patch fixes both, while preserving existing behavior when no custom cache pricing is configured (cache tokens fall back to input_cost_per_token).

Bug 1: `custom_cost_per_token` Ignores Cache Token Pricing

Symptom

When a user configures custom pricing with cache discount rates:

litellm.completion_cost(
    completion_response=response,
    model="openai/gpt-5.4",
    custom_llm_provider="openai",
    custom_cost_per_token={
        "input_cost_per_token": 0.0000025,
        "output_cost_per_token": 0.000015,
        "cache_read_input_token_cost": 0.00000025,
    },
)

…the cache_read_input_token_cost was silently ignored. All prompt tokens (including cached ones) were billed at input_cost_per_token.

For a request with prompt_tokens=6074, cached_tokens=3456, completion_tokens=285:

Reported cost: $0.01946 (cached billed at full rate)
Correct cost: $0.011684 (cached at 10% rate)
Overcharge: 67%
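
The three figures above follow from plain arithmetic on the quoted token counts and rates. A minimal sketch, using only numbers from this issue; the variable names are illustrative and not part of LiteLLM:

```python
# Token counts and custom rates quoted in the issue.
prompt_tokens = 6074
cached_tokens = 3456
completion_tokens = 285

input_rate = 0.0000025
output_rate = 0.000015
cache_read_rate = 0.00000025

output_cost = completion_tokens * output_rate

# Before the fix: every prompt token is billed at the full input rate.
reported = prompt_tokens * input_rate + output_cost

# After the fix: cached tokens are billed at the cache-read rate instead.
correct = (
    (prompt_tokens - cached_tokens) * input_rate
    + cached_tokens * cache_read_rate
    + output_cost
)

# Relative overcharge of the old calculation, in percent.
overcharge_pct = (reported / correct - 1) * 100

print(f"reported=${reported:.5f} correct=${correct:.6f} overcharge={overcharge_pct:.0f}%")
```

The overcharge is relative to the correct cost, which is why it comes out larger than the share of cached tokens in the prompt might suggest.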

Root Cause

litellm/cost_calculator.py::_cost_per_token_custom_pricing_helper read only input_cost_per_token and output_cost_per_token from the CostPerToken TypedDict. It had no awareness of cache pricing, and the CostPerToken type itself did not declare cache fields.

Additionally, provider conventions differ:

Provider class	`prompt_tokens` semantics	Cache token location
OpenAI-compatible	Includes cached_tokens	`prompt_tokens_details.cached_tokens`
Anthropic	Excludes cache tokens	`cache_read_input_tokens` (top-level)

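A naive prompt_tokens - cached_tokens would be wrong for Anthropic: it subtracts cache tokens that were never included in prompt_tokens, so regular input tokens would be undercounted.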

Fix

File 1: `litellm/types/utils.py`

Extend the CostPerToken TypedDict to include optional cache pricing fields.

Before:

class CostPerToken(TypedDict):
    input_cost_per_token: float
    output_cost_per_token: float

After:

class CostPerToken(TypedDict, total=False):
    input_cost_per_token: float
    output_cost_per_token: float
    cache_read_input_token_cost: float
    cache_creation_input_token_cost: float

total=False makes all keys optional, preserving backward compatibility.
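
One side effect of total=False is that type checkers also treat input_cost_per_token and output_cost_per_token as optional, even though the helper reads both with subscript access. A possible alternative keeps the two base rates required by splitting the type in two. This is only a sketch and not the issue's patch; the class names _RequiredRates and CostPerTokenSketch are hypothetical.

```python
from typing import TypedDict


class _RequiredRates(TypedDict):
    # Hypothetical base class: the two rates every caller must supply.
    input_cost_per_token: float
    output_cost_per_token: float


class CostPerTokenSketch(_RequiredRates, total=False):
    # Optional cache rates; only keys declared here are non-required.
    cache_read_input_token_cost: float
    cache_creation_input_token_cost: float


# A caller that sets one cache rate and omits the other still type-checks.
pricing: CostPerTokenSketch = {
    "input_cost_per_token": 0.0000025,
    "output_cost_per_token": 0.000015,
    "cache_read_input_token_cost": 0.00000025,
}

# Introspection attributes available on TypedDict classes (Python 3.9+).
required_keys = CostPerTokenSketch.__required_keys__
optional_keys = CostPerTokenSketch.__optional_keys__
```

Either form accepts existing callers that pass only the two base rates. The difference is whether a type checker flags a dict that omits them.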

File 2: `litellm/cost_calculator.py` — `_cost_per_token_custom_pricing_helper`

Accept cached_tokens and cache_creation_tokens parameters; compute regular input cost as prompt_tokens - cached_tokens - cache_creation_tokens; apply cache rates separately. Fall back to input_cost_per_token when cache rates are absent (preserves existing behavior).

Before:

def _cost_per_token_custom_pricing_helper(
    prompt_tokens: float = 0,
    completion_tokens: float = 0,
    response_time_ms: Optional[float] = 0.0,
    custom_cost_per_token: Optional[CostPerToken] = None,
    custom_cost_per_second: Optional[float] = None,
) -> Optional[Tuple[float, float]]:
    if custom_cost_per_token is None and custom_cost_per_second is None:
        return None
    if custom_cost_per_token is not None:
        input_cost = custom_cost_per_token["input_cost_per_token"] * prompt_tokens
        output_cost = custom_cost_per_token["output_cost_per_token"] * completion_tokens
        return input_cost, output_cost
    elif custom_cost_per_second is not None:
        output_cost = custom_cost_per_second * response_time_ms / 1000
        return 0, output_cost
    return None

After:

def _cost_per_token_custom_pricing_helper(
    prompt_tokens: float = 0,
    completion_tokens: float = 0,
    response_time_ms: Optional[float] = 0.0,
    cached_tokens: float = 0,
    cache_creation_tokens: float = 0,
    custom_cost_per_token: Optional[CostPerToken] = None,
    custom_cost_per_second: Optional[float] = None,
) -> Optional[Tuple[float, float]]:
    """Internal helper function for calculating cost, if custom pricing given.

    prompt_tokens is assumed to include both cached_tokens and cache_creation_tokens
    (OpenAI-compatible convention). Anthropic-style usage where prompt_tokens excludes
    cache tokens is handled at the caller (cost_per_token) before invoking this helper.
    """
    if custom_cost_per_token is None and custom_cost_per_second is None:
        return None

    if custom_cost_per_token is not None:
        input_cost_per_token = custom_cost_per_token["input_cost_per_token"]
        output_cost_per_token = custom_cost_per_token["output_cost_per_token"]

        cache_read_input_token_cost = custom_cost_per_token.get(
            "cache_read_input_token_cost",
            input_cost_per_token,
        )
        cache_creation_input_token_cost = custom_cost_per_token.get(
            "cache_creation_input_token_cost",
            input_cost_per_token,
        )

        regular_prompt_tokens = max(
            prompt_tokens - cached_tokens - cache_creation_tokens,
            0,
        )

        input_cost = (
            regular_prompt_tokens * input_cost_per_token
            + cached_tokens * cache_read_input_token_cost
            + cache_creation_tokens * cache_creation_input_token_cost
        )
        output_cost = completion_tokens * output_cost_per_token
        return input_cost, output_cost
    elif custom_cost_per_second is not None:
        output_cost = custom_cost_per_second * response_time_ms / 1000
        return 0, output_cost

    return None

When cache_read_input_token_cost is not provided in custom_cost_per_token, it defaults to input_cost_per_token, so the formula collapses to prompt_tokens * input_cost_per_token — exactly the previous behavior.
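
The fallback can be checked in isolation. The sketch below restates the per-token branch as a standalone function over a plain dict. custom_token_cost_sketch is a hypothetical name, not LiteLLM's function, and the per-second branch is left out.

```python
from typing import Dict, Tuple


def custom_token_cost_sketch(
    pricing: Dict[str, float],
    prompt_tokens: float,
    completion_tokens: float,
    cached_tokens: float = 0,
    cache_creation_tokens: float = 0,
) -> Tuple[float, float]:
    """Standalone restatement of the per-token branch (illustrative only)."""
    input_rate = pricing["input_cost_per_token"]
    # Missing cache rates fall back to the regular input rate.
    read_rate = pricing.get("cache_read_input_token_cost", input_rate)
    creation_rate = pricing.get("cache_creation_input_token_cost", input_rate)

    # prompt_tokens is assumed to already include both kinds of cache tokens.
    regular = max(prompt_tokens - cached_tokens - cache_creation_tokens, 0)
    input_cost = (
        regular * input_rate
        + cached_tokens * read_rate
        + cache_creation_tokens * creation_rate
    )
    return input_cost, completion_tokens * pricing["output_cost_per_token"]


base = {"input_cost_per_token": 0.0000025, "output_cost_per_token": 0.000015}
with_cache_rate = {**base, "cache_read_input_token_cost": 0.00000025}

# Same usage, priced without and with a cache-read rate.
legacy_input, _ = custom_token_cost_sketch(base, 6074, 285, cached_tokens=3456)
discounted_input, _ = custom_token_cost_sketch(
    with_cache_rate, 6074, 285, cached_tokens=3456
)
```

The collapse to prompt_tokens * input_cost_per_token relies on the cache counts not exceeding prompt_tokens. If they did, the max(..., 0) clamp would change the total.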

File 3: `litellm/cost_calculator.py` — `cost_per_token` (call site)

Normalize cache token counts across the two provider conventions, then pass both cached_tokens and cache_creation_tokens to the helper. For Anthropic-style usage, adjust prompt_tokens upward so the helper's invariant ("prompt_tokens includes cache tokens") holds for both conventions.

Before:

## CUSTOM PRICING ##
response_cost = _cost_per_token_custom_pricing_helper(
    prompt_tokens=prompt_tokens,
    completion_tokens=completion_tokens,
    response_time_ms=response_time_ms,
    custom_cost_per_second=custom_cost_per_second,
    custom_cost_per_token=custom_cost_per_token,
)

After:

## CUSTOM PRICING ##
# Normalize cache token counts across providers:
#   - OpenAI-compatible: usage.prompt_tokens_details.cached_tokens
#     (prompt_tokens already INCLUDES cached_tokens)
#   - Anthropic: usage.cache_read_input_tokens / cache_creation_input_tokens
#     (prompt_tokens does NOT include these — adjust before calling helper)
_cache_read_tokens: float = 0
_cache_creation_tokens: float = 0
_is_anthropic_style = False

if usage_object is not None:
    _pt_details = getattr(usage_object, "prompt_tokens_details", None)
    if _pt_details is not None:
        _cache_read_tokens = float(
            getattr(_pt_details, "cached_tokens", 0) or 0
        )
        _cache_creation_tokens = float(
            getattr(_pt_details, "cache_creation_tokens", 0) or 0
        )

    _anthropic_read = getattr(usage_object, "cache_read_input_tokens", None)
    _anthropic_create = getattr(usage_object, "cache_creation_input_tokens", None)
    if _anthropic_read or _anthropic_create:
        _is_anthropic_style = True
        if _anthropic_read:
            _cache_read_tokens = float(_anthropic_read)
        if _anthropic_create:
            _cache_creation_tokens = float(_anthropic_create)

if not _cache_read_tokens and cache_read_input_tokens:
    _cache_read_tokens = float(cache_read_input_tokens)
    _is_anthropic_style = True
if not _cache_creation_tokens and cache_creation_input_tokens:
    _cache_creation_tokens = float(cache_creation_input_tokens)
    _is_anthropic_style = True

# Anthropic reports prompt_tokens as input_tokens (excluding cache tokens).
# Adjust so the helper's "prompt_tokens includes cache tokens" invariant holds.
_normalized_prompt_tokens = float(prompt_tokens)
if _is_anthropic_style:
    _normalized_prompt_tokens += _cache_read_tokens + _cache_creation_tokens

response_cost = _cost_per_token_custom_pricing_helper(
    prompt_tokens=_normalized_prompt_tokens,
    completion_tokens=completion_tokens,
    response_time_ms=response_time_ms,
    cached_tokens=_cache_read_tokens,
    cache_creation_tokens=_cache_creation_tokens,
    custom_cost_per_second=custom_cost_per_second,
    custom_cost_per_token=custom_cost_per_token,
)

Effect

Custom pricing now respects cache_read_input_token_cost and cache_creation_input_token_cost for both Anthropic-style and OpenAI-compatible-style usage objects.
When cache pricing keys are omitted, behavior is identical to before (cache tokens billed at input_cost_per_token).
The Anthropic trap (prompt_tokens excludes cache tokens, so subtracting them again would undercount regular input) is handled centrally at the call site, so the helper stays provider-agnostic.
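
The call-site normalization can be sketched on plain dicts. normalize_prompt_and_cache_sketch is a hypothetical helper; the actual patch uses attribute access on a Usage object and also consults the cache_read_input_tokens / cache_creation_input_tokens arguments of cost_per_token, which are omitted here. The sketch assumes, as the issue does, that Anthropic-style prompt_tokens excludes cache tokens.

```python
from typing import Any, Dict, Tuple


def normalize_prompt_and_cache_sketch(
    usage: Dict[str, Any],
) -> Tuple[float, float, float]:
    """Return (prompt_tokens_including_cache, cache_read, cache_creation)."""
    details = usage.get("prompt_tokens_details") or {}
    cache_read = float(details.get("cached_tokens", 0) or 0)
    cache_creation = float(details.get("cache_creation_tokens", 0) or 0)
    prompt_tokens = float(usage.get("prompt_tokens", 0) or 0)

    top_read = usage.get("cache_read_input_tokens") or 0
    top_creation = usage.get("cache_creation_input_tokens") or 0
    if top_read or top_creation:
        # Anthropic-style: top-level counts take precedence, and
        # prompt_tokens is assumed to exclude them, so add them back.
        if top_read:
            cache_read = float(top_read)
        if top_creation:
            cache_creation = float(top_creation)
        prompt_tokens += cache_read + cache_creation

    return prompt_tokens, cache_read, cache_creation


# The same request expressed in both conventions (values from the issue's test).
openai_style = {"prompt_tokens": 6074, "prompt_tokens_details": {"cached_tokens": 3456}}
anthropic_style = {"prompt_tokens": 2618, "cache_read_input_tokens": 3456}

openai_result = normalize_prompt_and_cache_sketch(openai_style)
anthropic_result = normalize_prompt_and_cache_sketch(anthropic_style)
```

Both inputs normalize to the same triple, which is what lets the pricing helper assume a single convention.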

Test

A new test file is added that reproduces the issue and validates the fix.

File 4 (new): `tests/litellm/test_cost_calculator.py`

import os
import sys

import pytest

sys.path.insert(0, os.path.abspath("../.."))

import litellm
from litellm.types.utils import ModelResponse, PromptTokensDetailsWrapper, Usage


def test_custom_pricing_applies_cache_read_input_cost():
    usage = Usage(
        prompt_tokens=6074,
        completion_tokens=285,
        total_tokens=6359,
        prompt_tokens_details=PromptTokensDetailsWrapper(
            cached_tokens=3456,
            audio_tokens=0,
        ),
    )

    response = ModelResponse(
        id="test-id",
        created=1234567890,
        model="openai/gpt-5.4",
        object="chat.completion",
        choices=[],
        usage=usage,
    )

    cost = litellm.completion_cost(
        completion_response=response,
        model="openai/gpt-5.4",
        custom_llm_provider="openai",
        custom_cost_per_token={
            "input_cost_per_token": 0.0000025,
            "output_cost_per_token": 0.000015,
            "cache_read_input_token_cost": 0.00000025,
        },
    )

    expected = (
        (6074 - 3456) * 0.0000025
        + 3456 * 0.00000025
        + 285 * 0.000015
    )

    assert cost == pytest.approx(expected)

Run with:

uv run pytest tests/litellm/test_cost_calculator.py -v

Bug 2: Dashboard Shows `Cache Read Tokens = 0` / `Cache Write Tokens = 0` for OpenAI-Compatible Providers

Symptom

After serving thousands of requests through the LiteLLM proxy to OpenAI-compatible providers (kimi-k2, moonshotai, deepseek, etc.) that return cache tokens in usage.prompt_tokens_details.cached_tokens, the proxy dashboard's Usage Metrics panel reports:

Total Tokens: 348,633,697
Input Tokens: 338,690,721
Output Tokens:   9,942,976
Cache Read Tokens:        0   ← always 0
Cache Write Tokens:       0   ← always 0

…even though individual response payloads clearly contain non-zero values (e.g. prompt_tokens_details.cached_tokens: 22016).

Root Cause

The daily spend aggregator reads cache tokens from the serialized metadata.usage_object dict using only the Anthropic field names:

# litellm/proxy/db/db_spend_update_writer.py (before)
cache_read_input_tokens=usage_obj.get("cache_read_input_tokens", 0) or 0,
cache_creation_input_tokens=usage_obj.get("cache_creation_input_tokens", 0) or 0,

But for OpenAI-compatible providers:

Usage.__init__ stores cached tokens at prompt_tokens_details.cached_tokens and never sets the top-level cache_read_input_tokens field (that field is Anthropic-specific).

Usage.model_dump() therefore produces a dict like:

{
  "prompt_tokens": 22583,
  "prompt_tokens_details": { "cached_tokens": 22016 }
}

usage_obj.get("cache_read_input_tokens", 0) returns 0.
BaseDailySpendTransaction(cache_read_input_tokens=0) is queued and incremented into the daily aggregate as 0.

The same applies to cache_creation_input_tokens, which OpenAI-compatible providers report as prompt_tokens_details.cache_write_tokens (or cache_creation_tokens).
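
The miss is easy to reproduce with the serialized dict quoted above. A minimal sketch using plain dict lookups; the variable names are illustrative:

```python
# Serialized usage dict from the issue (OpenAI-compatible shape).
usage_obj = {
    "prompt_tokens": 22583,
    "prompt_tokens_details": {"cached_tokens": 22016},
}

# Lookup used by the aggregator before the fix: top-level key only.
cache_read_before = usage_obj.get("cache_read_input_tokens", 0) or 0

# The count the provider actually reported sits one level down.
details = usage_obj.get("prompt_tokens_details") or {}
cache_read_nested = details.get("cached_tokens", 0) or 0
```

Because the top-level key is absent, the first lookup yields 0 for every such request, so the daily aggregate never increases.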

Fix

File 5: `litellm/proxy/db/db_spend_update_writer.py`

Add two module-level helpers that normalize across both provider conventions, and use them in _common_add_spend_log_transaction_to_daily_transaction.

Add at module level (between imports and class DBSpendUpdateWriter):

def _extract_cache_read_tokens(usage_obj: dict) -> int:
    """
    Anthropic: top-level cache_read_input_tokens field.
    OpenAI-compatible (moonshotai, openai, deepseek, etc.): prompt_tokens_details.cached_tokens.
    """
    explicit = usage_obj.get("cache_read_input_tokens", 0) or 0
    if explicit:
        return int(explicit)
    details = usage_obj.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens", 0) or 0)


def _extract_cache_creation_tokens(usage_obj: dict) -> int:
    """
    Anthropic: top-level cache_creation_input_tokens field.
    OpenAI-compatible (kimi-k2 etc.): prompt_tokens_details.cache_write_tokens
    or prompt_tokens_details.cache_creation_tokens.
    """
    explicit = usage_obj.get("cache_creation_input_tokens", 0) or 0
    if explicit:
        return int(explicit)
    details = usage_obj.get("prompt_tokens_details") or {}
    return int(
        details.get("cache_write_tokens", 0)
        or details.get("cache_creation_tokens", 0)
        or 0
    )

Update the BaseDailySpendTransaction constructor call inside _common_add_spend_log_transaction_to_daily_transaction:

Before:

cache_read_input_tokens=usage_obj.get("cache_read_input_tokens", 0)
or 0,
cache_creation_input_tokens=usage_obj.get(
    "cache_creation_input_tokens", 0
)
or 0,

After:

cache_read_input_tokens=_extract_cache_read_tokens(usage_obj),
cache_creation_input_tokens=_extract_cache_creation_tokens(usage_obj),

Effect

Daily spend aggregation now correctly increments cache_read_input_tokens and cache_creation_input_tokens for both Anthropic-style and OpenAI-compatible-style usage objects.
The dashboard's "Cache Read Tokens" and "Cache Write Tokens" tiles will accurately reflect cache hits/writes for kimi-k2, moonshotai, openai, deepseek, and similar providers.
No change to existing Anthropic behavior — when cache_read_input_tokens is present at the top level, it takes precedence (explicit value short-circuits the fallback).

Independence of the Two Bugs

Bug 1 and Bug 2 sit on completely separate code paths and must both be fixed:

	Bug 1	Bug 2
Path	`litellm.completion_cost` → cost calculator	proxy spend log → daily aggregation
File	`cost_calculator.py`, `types/utils.py`	`proxy/db/db_spend_update_writer.py`
Trigger	Caller passes `custom_cost_per_token`	Any request through the proxy
Impact	Per-request cost field is wrong	Dashboard cache token totals are 0

Fixing one does not address the other.

Files Changed

#	File	Change
1	`litellm/types/utils.py`	`CostPerToken` TypedDict extended with optional cache cost fields (`total=False`)
2	`litellm/cost_calculator.py`	`_cost_per_token_custom_pricing_helper`: added `cached_tokens` / `cache_creation_tokens` params, separate cache-rate calculation with fallback to input rate
3	`litellm/cost_calculator.py`	`cost_per_token`: normalize Anthropic vs OpenAI-compatible `prompt_tokens` and cache token sources before calling helper
4	`tests/litellm/test_cost_calculator.py`	New test reproducing Bug 1 and validating the fix
5	`litellm/proxy/db/db_spend_update_writer.py`	Added `_extract_cache_read_tokens` / `_extract_cache_creation_tokens` helpers; replaced direct `usage_obj.get(...)` calls in `_common_add_spend_log_transaction_to_daily_transaction`

Backward Compatibility

Callers that pass custom_cost_per_token with only input_cost_per_token / output_cost_per_token keep their previous billing behavior.
Anthropic providers continue to populate cache_read_input_tokens / cache_creation_input_tokens at the top level, which the helpers prefer over the prompt_tokens_details fallback.
No schema, public API, or config-file changes required.

Verification

uv run pytest tests/litellm/test_cost_calculator.py -v
# tests/litellm/test_cost_calculator.py::test_custom_pricing_applies_cache_read_input_cost PASSED

Bug 2 has no automated test here and is verified manually: send a request through the proxy to an OpenAI-compatible provider that returns prompt_tokens_details.cached_tokens, then check the dashboard's Cache Read / Cache Write Tokens tiles.

Motivation, pitch

After this monkey patch, the cache token count started working correctly.

Based on the Kimi-k2.6 model pricing table and the usage history, the calculated result is accurate.

Hooray!

What part of LiteLLM is this about?

Proxy

extent analysis

TL;DR

The fix has three parts: extend the CostPerToken TypedDict with optional cache pricing fields, make _cost_per_token_custom_pricing_helper apply those cache rates, and update the daily spend aggregator to read the cache tokens reported by OpenAI-compatible providers.

Guidance

Update litellm/types/utils.py to extend the CostPerToken TypedDict with optional cache pricing fields.
Modify litellm/cost_calculator.py to make the _cost_per_token_custom_pricing_helper function aware of cache pricing and to handle both Anthropic and OpenAI-compatible provider conventions.
Adjust litellm/proxy/db/db_spend_update_writer.py to correctly extract cache read and creation tokens from usage objects, regardless of the provider convention.
Verify the fix by running the provided test and checking the dashboard's cache token counts.

Example

The code snippets in the issue body show each change: the extended CostPerToken TypedDict, the updated _cost_per_token_custom_pricing_helper, the call-site normalization in cost_per_token, and the two extraction helpers in db_spend_update_writer.py.

Notes

The changes are designed to be backward compatible, preserving the existing behavior for callers that do not pass cache pricing information and for Anthropic providers. The fix requires no schema, public API, or config-file changes.

Recommendation

Apply the workaround by implementing the described changes to litellm/types/utils.py, litellm/cost_calculator.py, and litellm/proxy/db/db_spend_update_writer.py. Together these address both bugs, covering Anthropic-style and OpenAI-compatible usage objects.
