litellm - ✅(Solved) Fix [Bug]: Fireworks AI - cache_read_input_token_cost configured but not used in cost calculation [1 pull request, 1 comment, 2 participants]

litellm · 2026-04-17 13:37:04

GitHub stats

BerriAI/litellm#25950 • Fetched 2026-04-18 05:52:57

Author

fabienbarbaud

Participants

ahb7

fabienbarbaud

Timeline (top)

labeled ×2, commented ×1

PR fix notes

PR #26016: fix: apply cache_read_input_token_cost to cached tokens in Fireworks AI cost calculation

Repository: BerriAI/litellm
Author: VANDRANKI
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/26016

Description (problem / solution / changelog)

What

litellm/llms/fireworks_ai/cost_calculator.py ignores cached_tokens when computing prompt cost. All input tokens are charged at input_cost_per_token even when cache_read_input_token_cost is configured and cache hits are reported in prompt_tokens_details.

Fixes #25950.

Why

The cost_per_token function did:

prompt_cost = usage["prompt_tokens"] * model_info["input_cost_per_token"]

This charges every token at the full input rate, ignoring cached_tokens in prompt_tokens_details and the cache_read_input_token_cost / cache_creation_input_token_cost fields that are correctly loaded into model_info.

Fix

Extract cached_tokens and cache_creation_tokens from usage.prompt_tokens_details, then apply the appropriate rate to each bucket:

non_cached_tokens * input_cost_per_token
+ cached_tokens * cache_read_input_token_cost
+ cache_creation_tokens * cache_creation_input_token_cost

Falls back to 0.0 if the cache cost fields are not set, so standard Fireworks serverless tier pricing is unchanged.
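
The bucketed calculation described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name `prompt_cost_with_cache`, the `cache_creation_tokens` key inside `prompt_tokens_details`, and the assumption that both cache buckets are counted inside `prompt_tokens` are all hypothetical.

```python
from typing import Any, Mapping


def prompt_cost_with_cache(usage: Mapping[str, Any], model_info: Mapping[str, Any]) -> float:
    """Price each prompt-token bucket at its own rate (illustrative helper)."""
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens") or 0
    # Assumption: the key name "cache_creation_tokens" is hypothetical.
    cache_creation_tokens = details.get("cache_creation_tokens") or 0

    # Assumption: both cache buckets are already counted inside prompt_tokens.
    non_cached_tokens = max(
        usage["prompt_tokens"] - cached_tokens - cache_creation_tokens, 0
    )

    input_rate = model_info["input_cost_per_token"]
    # Unset cache rates fall back to 0.0, as the PR description states.
    read_rate = model_info.get("cache_read_input_token_cost") or 0.0
    creation_rate = model_info.get("cache_creation_input_token_cost") or 0.0

    return (
        non_cached_tokens * input_rate
        + cached_tokens * read_rate
        + cache_creation_tokens * creation_rate
    )


# Numbers taken from the issue report.
usage = {"prompt_tokens": 44341, "prompt_tokens_details": {"cached_tokens": 41518}}
model_info = {
    "input_cost_per_token": 1.4e-6,
    "cache_read_input_token_cost": 2.6e-7,
    "cache_creation_input_token_cost": 2.6e-7,
}
print(f"{prompt_cost_with_cache(usage, model_info):.4f}")  # 0.0147
```

With no `prompt_tokens_details` in the usage object, the sketch reduces to `prompt_tokens * input_cost_per_token`, which matches the old behaviour.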

Test

Using the numbers from the issue report:

prompt_tokens = 44341, cached_tokens = 41518
input_cost_per_token = 1.4e-6, cache_read_input_token_cost = 2.6e-7

Before: 44341 * 1.4e-6 = $0.0621 (wrong)
After: 2823 * 1.4e-6 + 41518 * 2.6e-7 = $0.00395 + $0.01079 = $0.0147 (correct)

Changed files

litellm/llms/fireworks_ai/cost_calculator.py (modified, +23/-1)

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

Bug Description

For Fireworks AI models, the cache_read_input_token_cost field is correctly loaded from model_info configuration (visible in model_map_information), but it is not applied in the actual cost calculation. All input tokens are charged at full input_cost_per_token price regardless of cached_tokens count.

Expected Behavior

When cached_tokens > 0, the cost calculation should be:

input_cost = (prompt_tokens - cached_tokens) * input_cost_per_token + cached_tokens * cache_read_input_token_cost

Actual Behavior

The input_cost is calculated as if all tokens were non-cached:

input_cost = prompt_tokens * input_cost_per_token

Steps to Reproduce

1. Configure a Fireworks AI model with prompt caching enabled
2. Set cache_read_input_token_cost in model_info
3. Send a request with cache hits (verified by cached_tokens > 0 in response)
4. Observe cost_breakdown.input_cost ignores cached token pricing

Configuration

model_list:
  - model_name: GLM-5p1
    litellm_params:
      model: fireworks_ai/accounts/fireworks/models/glm-5p1
      api_key: "os.environ/XXX"
    model_info:
      input_cost_per_token: 0.0000014
      output_cost_per_token: 0.0000044
      cache_read_input_token_cost: 2.6e-7      # $0.26/1M
      cache_creation_input_token_cost: 2.6e-7  # $0.26/1M
      supports_prompt_caching: true

Relevant log output

{
  "usage_object": {
    "prompt_tokens": 44341,
    "prompt_tokens_details": {
      "cached_tokens": 41518  // 93.6% cache hit rate
    },
    "completion_tokens": 853
  },
  "cost_breakdown": {
    "input_cost": 0.0620774,  // ❌ Wrong: 44341 * $1.40/1M = $0.062
    "output_cost": 0.003753,
    "total_cost": 0.0658306
  },
  "model_map_information": {
    "model_map_value": {
      "input_cost_per_token": 0.0000014,        // ✅ Loaded
      "output_cost_per_token": 0.0000044,      // ✅ Loaded
      "cache_read_input_token_cost": 2.6e-7,   // ✅ Loaded but NOT used
      "cache_creation_input_token_cost": 2.6e-7, // ✅ Loaded but NOT used
      "supports_prompt_caching": true
    }
  }
}


### Expected vs Actual Cost

| Token Type | Count | Expected Cost | Actual Cost |
|------------|-------|-------------|-------------|
| Non-cached | 2,823 | 2,823 × $1.40/1M = **$0.004** | Included in $0.062 |
| Cached | 41,518 | 41,518 × $0.26/1M = **$0.011** | Charged at $1.40/1M |
| **Total Input** | 44,341 | **$0.015** | **$0.062** ❌ |
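
The comparison in the table can be recomputed from a log record of the shape shown above. The sketch below is only an assumption-laden illustration: `audit_input_cost` is a hypothetical helper, and the record is written as a Python literal because the logged JSON contains `//` comments that a strict JSON parser would reject.

```python
def audit_input_cost(record: dict, tolerance: float = 1e-9) -> dict:
    """Compare the reported input cost with flat-rate and cache-aware pricing."""
    usage = record["usage_object"]
    rates = record["model_map_information"]["model_map_value"]
    reported = record["cost_breakdown"]["input_cost"]

    prompt_tokens = usage["prompt_tokens"]
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)

    # Flat rate: every prompt token billed at input_cost_per_token.
    flat = prompt_tokens * rates["input_cost_per_token"]
    # Cache-aware: cached tokens billed at the cache-read rate (0.0 if unset).
    cache_aware = (
        (prompt_tokens - cached) * rates["input_cost_per_token"]
        + cached * rates.get("cache_read_input_token_cost", 0.0)
    )
    return {
        "reported": reported,
        "flat_rate": flat,
        "cache_aware": cache_aware,
        # True when the two prices differ and the reported cost matches the flat one.
        "ignores_cache_pricing": (
            abs(flat - cache_aware) > tolerance and abs(reported - flat) <= tolerance
        ),
    }


record = {
    "usage_object": {
        "prompt_tokens": 44341,
        "prompt_tokens_details": {"cached_tokens": 41518},
        "completion_tokens": 853,
    },
    "cost_breakdown": {"input_cost": 0.0620774},
    "model_map_information": {
        "model_map_value": {
            "input_cost_per_token": 0.0000014,
            "cache_read_input_token_cost": 2.6e-7,
        }
    },
}

result = audit_input_cost(record)
print(result["ignores_cache_pricing"])   # True
print(f"{result['cache_aware']:.4f}")    # 0.0147
```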

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.3-stable

Twitter / LinkedIn details

No response

Extent analysis

TL;DR

The cost calculation for Fireworks AI models with prompt caching enabled should be updated to apply the cache_read_input_token_cost field.

Guidance

Review the cost calculation logic to ensure it correctly applies the cache_read_input_token_cost when cached_tokens > 0.
Verify that the cache_read_input_token_cost value is being loaded correctly from the model_info configuration.
Update the cost calculation to use the formula: input_cost = (prompt_tokens - cached_tokens) * input_cost_per_token + cached_tokens * cache_read_input_token_cost.
Test the updated cost calculation with a request that has cache hits to ensure the correct cost is being applied.

Example

def calculate_input_cost(prompt_tokens, cached_tokens, input_cost_per_token, cache_read_input_token_cost):
    # Tokens served from the prompt cache are billed at the cache-read rate;
    # the remaining prompt tokens are billed at the full input rate.
    non_cached_tokens = prompt_tokens - cached_tokens
    input_cost = (non_cached_tokens * input_cost_per_token) + (cached_tokens * cache_read_input_token_cost)
    return input_cost
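
The last guidance step (testing with cache hits) could look roughly like the checks below. This is a sketch that restates the example function so it runs on its own; the test names and the two edge cases (no cache hits, fully cached prompt) are illustrative choices, not tests from the LiteLLM repository.

```python
import math


def calculate_input_cost(prompt_tokens, cached_tokens, input_cost_per_token, cache_read_input_token_cost):
    non_cached_tokens = prompt_tokens - cached_tokens
    return non_cached_tokens * input_cost_per_token + cached_tokens * cache_read_input_token_cost


def test_issue_numbers():
    # 2823 * 1.4e-6 + 41518 * 2.6e-7, using the figures from the issue report.
    cost = calculate_input_cost(44341, 41518, 1.4e-6, 2.6e-7)
    assert math.isclose(cost, 0.01474688, rel_tol=1e-9)


def test_no_cache_hits_matches_flat_rate():
    # With zero cached tokens the result equals the old flat-rate formula.
    assert math.isclose(calculate_input_cost(1000, 0, 1.4e-6, 2.6e-7), 1000 * 1.4e-6)


def test_fully_cached_prompt():
    # Every token is billed at the cache-read rate.
    assert math.isclose(calculate_input_cost(1000, 1000, 1.4e-6, 2.6e-7), 1000 * 2.6e-7)


# Run the checks directly; a test runner such as pytest would also collect them.
test_issue_numbers()
test_no_cache_hits_matches_flat_rate()
test_fully_cached_prompt()
print("all checks passed")
```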

Notes

The issue appears to be specific to Fireworks AI models with prompt caching enabled, so the fix belongs in the Fireworks AI cost calculation logic (litellm/llms/fireworks_ai/cost_calculator.py).

Recommendation

Update the cost calculation logic so that cached tokens are priced at cache_read_input_token_cost. This is a code fix rather than a workaround, and it gives accurate cost calculations for Fireworks AI models with prompt caching enabled.
