llamaIndex - ✅(Solved) Fix [Feature Request]: Gemini prompt caching [1 pull requests, 4 comments, 2 participants]

GitHub stats
run-llama/llama_index#20924 · Fetched 2026-04-08 00:30:09
Comments: 4 · Participants: 2 · Timeline events: 15 · Reactions: 0
Timeline (top): commented ×4, mentioned ×4, subscribed ×4, labeled ×2

Fix Action

Fixed

PR fix notes

PR #21081: feat: add cache management methods and token count extraction for Gemini prompt caching

Description (problem / solution / changelog)

Description

Adds cache management methods to the GoogleGenAI class so users can create, read, update, and delete Gemini cached content directly from LlamaIndex - without needing to drop down to the raw Google SDK.

Previously, GoogleGenAI only accepted a pre-made cached_content ID string. Users had to create and manage caches through the Google SDK separately, then copy-paste the ID back into LlamaIndex. This change wraps the full Google caching API (client.caches.create/get/list/update/delete) as methods on the LLM class, and automatically wires up the cached_content field so subsequent LLM calls use the cache.
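
The wrapping pattern described here can be sketched roughly as follows. This is not the PR's code: `CacheAwareLLM`, its method signatures, and the in-memory `FakeCaches` stand-in are hypothetical names used for illustration. Only the `caches.create/get/delete` call names and the `cached_content` field come from the description above.

```python
from types import SimpleNamespace
from typing import Any, Optional


class CacheAwareLLM:
    """Hypothetical wrapper; NOT the GoogleGenAI class from the PR."""

    def __init__(self, client: Any, model: str) -> None:
        self._client = client  # any object exposing a `.caches` API
        self.model = model
        self.cached_content: Optional[str] = None  # cache ID used by later calls

    def _resolve(self, name: Optional[str]) -> str:
        # Fall back to the attached cache when no explicit name is given.
        resolved = name or self.cached_content
        if resolved is None:
            raise ValueError("No cache name given and no cache is attached.")
        return resolved

    def create_cache(self, config: Any) -> Any:
        cache = self._client.caches.create(model=self.model, config=config)
        self.cached_content = cache.name  # wire the new cache into the LLM
        return cache

    def get_cache(self, name: Optional[str] = None) -> Any:
        return self._client.caches.get(name=self._resolve(name))

    def delete_cache(self, name: Optional[str] = None) -> None:
        resolved = self._resolve(name)
        self._client.caches.delete(name=resolved)
        # Clear local state only if the deleted cache was the attached one.
        if resolved == self.cached_content:
            self.cached_content = None


class FakeCaches:
    """In-memory stand-in for an SDK `client.caches` object (offline use)."""

    def __init__(self) -> None:
        self._store: dict = {}
        self._next_id = 1

    def create(self, *, model: str, config: Any) -> Any:
        name = f"cachedContents/{self._next_id}"
        self._next_id += 1
        self._store[name] = SimpleNamespace(name=name, model=model, config=config)
        return self._store[name]

    def get(self, *, name: str) -> Any:
        return self._store[name]

    def delete(self, *, name: str) -> None:
        del self._store[name]


llm = CacheAwareLLM(SimpleNamespace(caches=FakeCaches()), model="example-model")
cache = llm.create_cache({"ttl": "3600s"})
```

Injecting the client keeps the sketch runnable without credentials; with a real SDK client, the same three calls would go over the network.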

Also extracts cached_content_token_count from response usage metadata, so users can see how many tokens were served from cache.
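
A minimal sketch of that extraction, covering both the present and absent cases, might look like the following. The helper name `cached_token_count` is hypothetical, and the assumption that the usage metadata hangs off a `usage_metadata` attribute of the response is mine, not confirmed by the PR text; `SimpleNamespace` objects stand in for real responses.

```python
from types import SimpleNamespace
from typing import Any, Optional


def cached_token_count(response: Any) -> Optional[int]:
    """Return the cached token count, or None when it is not reported.

    Assumes (not confirmed by the PR text) that the response exposes
    `usage_metadata.cached_content_token_count`.
    """
    usage = getattr(response, "usage_metadata", None)
    return getattr(usage, "cached_content_token_count", None)


# Stand-in responses for illustration only.
with_cache = SimpleNamespace(
    usage_metadata=SimpleNamespace(cached_content_token_count=2048)
)
without_cache = SimpleNamespace(usage_metadata=SimpleNamespace())
no_usage = SimpleNamespace()
```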

Fixes #20924

New Package?

  • No

Version Bump?

  • Yes
  • No

Type of Change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • I added new unit tests to cover this change

22 new unit tests covering:

  • create_cache / acreate_cache - with individual params and with pre-built config
  • get_cache / aget_cache - default name fallback, explicit name, error when no name
  • list_caches / alist_caches - SDK delegation
  • update_cache / aupdate_cache - TTL update, config passthrough, error when no name
  • delete_cache / adelete_cache - state clearing, state preservation for other names, error when no name
  • Full create-then-delete lifecycle test
  • cached_content_token_count extraction from response (present and absent cases)

All 41 tests pass (22 new + 19 existing). 42 skipped tests are pre-existing ones that require GOOGLE_API_KEY.

Checklist

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Changed files

  • llama-index-integrations/llms/llama-index-llms-google-genai/llama_index/llms/google_genai/base.py (modified, +173/-0)
  • llama-index-integrations/llms/llama-index-llms-google-genai/llama_index/llms/google_genai/utils.py (modified, +6/-0)
  • llama-index-integrations/llms/llama-index-llms-google-genai/tests/test_llms_google_genai.py (modified, +307/-1)

Feature Description

Integrate Gemini prompt caching to reduce LLM costs.

Reason

Context caching is a paid feature designed to reduce cost. Billing is based on the following factors:

  1. Cache token count: The number of input tokens cached, billed at a reduced rate when included in subsequent prompts.
  2. Storage duration: The amount of time cached tokens are stored (TTL), billed based on the TTL duration of cached token count. There are no minimum or maximum bounds on the TTL.
  3. Other factors: Other charges apply, such as for non-cached input tokens and output tokens.
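
To see how these three factors combine, here is a rough cost comparison. All prices in `Rates` are made-up placeholders, not real Gemini pricing, and the function names are hypothetical. The sketch also ignores the one-time cost of creating the cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in USD per million tokens (placeholders only)."""

    input_rate: float
    cached_input_rate: float
    output_rate: float
    storage_per_hour: float  # per million cached tokens, per hour of TTL


def cost_without_cache(rates, shared_tokens, new_tokens, output_tokens, calls):
    # Every call pays the full input rate for the shared prefix.
    per_call = (shared_tokens + new_tokens) * rates.input_rate
    per_call += output_tokens * rates.output_rate
    return calls * per_call / 1_000_000


def cost_with_cache(rates, shared_tokens, new_tokens, output_tokens, calls, ttl_hours):
    # Factor 1: cached tokens billed at the reduced rate on each call.
    per_call = shared_tokens * rates.cached_input_rate
    # Factor 3: non-cached input tokens and output tokens billed as usual.
    per_call += new_tokens * rates.input_rate + output_tokens * rates.output_rate
    # Factor 2: storage billed for the cached tokens over the TTL.
    storage = shared_tokens * rates.storage_per_hour * ttl_hours
    return (calls * per_call + storage) / 1_000_000


rates = Rates(input_rate=1.0, cached_input_rate=0.25, output_rate=4.0, storage_per_hour=1.0)
baseline = cost_without_cache(rates, 1_000_000, 1_000, 500, calls=10)
cached = cost_with_cache(rates, 1_000_000, 1_000, 500, calls=10, ttl_hours=1)
```

With these placeholder numbers the cached variant is cheaper, but with few calls or a long TTL the storage term can outweigh the savings.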

Value of Feature

By reducing both the cost and latency of processing large datasets, this integration transforms the "Context-Augmented" experience that LlamaIndex is known for.

extent analysis

Fix: Implement Gemini Prompt Caching

Step-by-Step Solution Plan

  1. Install the integration: the fix lives in the llama-index-llms-google-genai package (Python), so upgrade to a release that includes PR #21081.
  2. Create the LLM: instantiate GoogleGenAI as usual. Before this change, the class only accepted the ID of a cache created separately through the Google SDK, passed as cached_content.
  3. Create a cache: call create_cache (or acreate_cache) with the content to cache, either through individual parameters or a pre-built config. Per the PR description, this sets the cached_content field automatically, so subsequent LLM calls use the cache.
  4. Use the cache: send prompts through the same GoogleGenAI instance. Tokens covered by the cache are billed at the reduced cached rate rather than as regular input tokens.
  5. Manage the cache: use get_cache, list_caches, update_cache (for example, to change the TTL), and delete_cache, or their async counterparts aget_cache, alist_caches, aupdate_cache, and adelete_cache.
  6. Monitor cache usage: read cached_content_token_count from the response usage metadata to see how many tokens were served from cache.


#### Verification

* Verify that the cache is in use by checking that cached_content_token_count in the response usage metadata is non-zero on calls made after the cache is created.
* Monitor LLM costs to confirm that they decrease once caching is enabled.

#### Extra Tips

* Make sure to handle cache expiration and refresh cache tokens as needed.
* Monitor cache performance and adjust the TTL duration accordingly.
* Consider implementing a cache invalidation strategy to ensure cache freshness.
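
One way to act on the expiration tip is to check how close a cache is to expiring before each batch of calls. The sketch below only makes that decision. `needs_refresh` and the five-minute default margin are hypothetical, and it assumes the cache's expiry timestamp is available as a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(
    expire_time: datetime,
    now: datetime,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True when the cache expires within `margin` or already has."""
    return expire_time - now <= margin


# Fixed timestamps for illustration only.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
expires_soon = now + timedelta(minutes=2)
expires_later = now + timedelta(hours=1)
already_expired = now - timedelta(minutes=1)
```

When the check returns True, the caller would either extend the TTL or recreate the cache. Which is preferable depends on whether the cached content has changed.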
