Feature Request: Support Native Gemini Context Caching

hermes · 2026-05-21 12:49:55

Root Cause

For Gemini models running natively via GeminiNativeClient (configured via the gemini or google-gemini-cli profiles), the request builder build_gemini_request constructs a payload that resends the entire message history and system instruction on every turn. Gemini context caching requires a separate call to the dedicated cachedContents API endpoint to create a cache resource, and the client never makes that call. As a result, the full prompt is sent uncached on every turn, and the input tokens billed per turn grow linearly with the length of the history.



Feature Description

Support Google Gemini context caching (prompt caching) in the native Gemini adapter (agent/gemini_native_adapter.py) to reduce input token latency and cost for sessions with large system prompts and long multi-turn history.

Current Behavior

Currently, prompt caching is only implemented for Anthropic models. The agent uses the system_and_3 strategy in agent/prompt_caching.py to inject cache_control blocks into the messages payload.
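For readers unfamiliar with the Anthropic path, here is a minimal sketch of what a system_and_3 style strategy might look like, assuming the name means "mark the system prompt plus the last three messages". The function names `apply_system_and_3` and `_mark` and the message shape are hypothetical and are not taken from agent/prompt_caching.py; only the `cache_control` marker with type `ephemeral` follows Anthropic's documented format.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}  # Anthropic's cache breakpoint marker


def _mark(message):
    """Attach a cache_control marker to the last content block of a message."""
    content = message["content"]
    if isinstance(content, str):
        # Promote plain strings to block form so a marker can be attached.
        content = [{"type": "text", "text": content}]
    else:
        content = [dict(block) for block in content]
    if content:
        content[-1]["cache_control"] = dict(EPHEMERAL)
    message["content"] = content
    return message


def apply_system_and_3(messages):
    """Hypothetical strategy: return a copy of `messages` with up to four
    breakpoints, one on a leading system message and one on each of the
    last three non-system messages. The input list is not modified."""
    out = copy.deepcopy(messages)
    if out and out[0].get("role") == "system":
        _mark(out[0])
    non_system = [m for m in out if m.get("role") != "system"]
    for m in non_system[-3:]:
        _mark(m)
    return out
```

Nothing comparable can be done inside a Gemini request body, because Gemini caching is driven by a separate cache resource rather than by inline markers.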

The Gemini path has no equivalent, so large-context agents (e.g., 40k+ tokens of system prompt + history) incur excessive API usage costs.
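To make the cost growth concrete, here is a rough back-of-the-envelope sketch. It counts input tokens only, and it ignores cache creation and storage charges. The function names, the per-turn growth of 500 tokens, and the `cached_fraction` of 0.25 are illustrative assumptions, not published prices or measurements.

```python
def cumulative_input_tokens(prefix_tokens, tokens_per_turn, turns):
    """Total input tokens sent when the whole prompt is resent every turn."""
    total = 0
    for turn in range(1, turns + 1):
        # Turn N resends the stable prefix plus everything added so far.
        total += prefix_tokens + tokens_per_turn * turn
    return total


def cumulative_with_cache(prefix_tokens, tokens_per_turn, turns, cached_fraction):
    """Same total, expressed in full-price token equivalents, when the stable
    prefix is billed at `cached_fraction` of the normal rate (hypothetical)."""
    total = 0.0
    for turn in range(1, turns + 1):
        total += prefix_tokens * cached_fraction + tokens_per_turn * turn
    return total


uncached = cumulative_input_tokens(40_000, 500, 20)      # 905000
cached = cumulative_with_cache(40_000, 500, 20, 0.25)    # 305000.0
```

Under these assumptions, a 20-turn session with a 40k-token prefix sends 905,000 input tokens uncached, of which 800,000 are the same prefix repeated.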

Proposed Implementation

To support Gemini context caching, the Gemini native adapter should implement the following workflow:

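Threshold Check: Gemini context caching only applies above a minimum token count. This proposal assumes 32,768 tokens (for both Google AI Studio and Vertex AI), but the documented minimum varies by model and has been lower for newer models, so the value should be configurable rather than hard-coded. The adapter should check whether the stable prefix (system instructions + stable message history) meets this minimum.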
Cache Resource Creation: If the threshold is met, the client should send a POST request to the v1beta cachedContents endpoint:
```
POST https://generativelanguage.googleapis.com/v1beta/cachedContents
```
The request body specifies the model, the system instruction, and the contents to cache, along with a TTL (e.g., mapped from prompt_caching.cache_ttl in config.yaml, default 300s):
```
{
  "model": "models/gemini-2.5-flash",
  "contents": [...],
  "systemInstruction": {...},
  "ttl": "300s"
}
```
Cache Resource Reuse: The API returns a resource name (e.g., cachedContents/abc123xyz). The client must store this cache reference and attach it to subsequent generation requests through the cachedContent field, sending only the turns that are not already in the cache:
```
{
  "contents": [... new turns ...],
  "cachedContent": "cachedContents/abc123xyz"
}
```
Integration with prompt_caching.py: Align the Gemini cache invalidation logic with the existing prompt_caching.cache_ttl and session management configuration.
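The workflow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the adapter's actual design: `GeminiCacheManager`, `build_request`, and the injected `post` transport are hypothetical names, the caller is assumed to supply the prefix token count, and the history is assumed to be append-only so that a cached prefix stays valid. The endpoint URL, payload fields, and default values are the ones given in this issue.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

CACHE_URL = "https://generativelanguage.googleapis.com/v1beta/cachedContents"


@dataclass
class GeminiCacheManager:
    """Hypothetical helper: create a cache once, reuse it until the TTL lapses."""
    post: Callable[[str, dict], dict]        # injected HTTP transport (assumption)
    model: str = "models/gemini-2.5-flash"
    ttl_seconds: int = 300                   # e.g. from prompt_caching.cache_ttl
    min_tokens: int = 32_768                 # threshold from this issue; configurable
    clock: Callable[[], float] = time.monotonic
    _name: Optional[str] = field(default=None, init=False)
    _expires_at: float = field(default=0.0, init=False)
    _cached_turns: int = field(default=0, init=False)

    def _cache_is_live(self) -> bool:
        # Local approximation of the server-side expiry.
        return self._name is not None and self.clock() < self._expires_at

    def build_request(self, system_instruction, contents, stable_count, prefix_tokens):
        """Return a request payload, creating a cache first when worthwhile."""
        if not self._cache_is_live():
            self._name = None
            # Keep at least the newest turn out of the cache so that the
            # follow-up request still has something in `contents`.
            stable_count = max(0, min(stable_count, len(contents) - 1))
            if prefix_tokens >= self.min_tokens:
                created = self.post(CACHE_URL, {
                    "model": self.model,
                    "contents": contents[:stable_count],
                    "systemInstruction": system_instruction,
                    "ttl": f"{self.ttl_seconds}s",
                })
                self._name = created["name"]
                self._expires_at = self.clock() + self.ttl_seconds
                self._cached_turns = stable_count
        if self._name is None:
            # Below the threshold: fall back to the uncached payload.
            return {"systemInstruction": system_instruction, "contents": contents}
        # Send only the turns that are not already in the cache.
        return {"contents": contents[self._cached_turns:], "cachedContent": self._name}
```

Injecting `post` and `clock` keeps the sketch free of network calls and makes expiry testable. A real implementation would also need to handle a cache that the server has already dropped, for example by clearing the stored name and retrying uncached.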

Additional Context

With the release of Gemini 1.5/2.5/3.5 models and their massive context windows, many users run Hermes Agent with highly detailed system prompts and deep memories. Adding native caching will significantly lower the cost barrier for running Hermes on Google's model suite.

If there is interest in this feature, our team is happy to implement this and submit a PR.
