[hermes] Feature: Reduce cache-read token overhead for DeepSeek providers (configurable cache_ttl, skills snapshot trimming, memory compaction)

StepCodex · 2026-05-28T20:09:11Z



Background

Hermes Agent on DeepSeek v4 Flash (API mode) burns ~60M tokens in 18 hours for a typical user session, of which ~58.7M (98%) are `cache_read_tokens`. The actual cost is under $1 USD thanks to DeepSeek's cheap cache pricing, but on provider-agnostic billing dashboards the raw token count looks wildly inflated.

Root Cause Analysis

Three factors compound:

1. `prompt_caching.cache_ttl: 5m` is too short

The default 5-minute TTL means any natural pause in conversation (user checks notifications, walks away, thinks) invalidates prompt caching. The next request re-caches the full system prompt (~80KB+), generating fresh cache_write + cache_read tokens.

Suggested fix: Bump default to 30-60m. Also make this configurable per-provider (Anthropic may have different TTL semantics).
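A minimal sketch of how a per-provider TTL override could be resolved. Only the `prompt_caching.cache_ttl` key comes from this issue; the `providers` sub-key, the `5m`/`1h` duration format, and both function names are assumptions for illustration, not Hermes's actual config schema.

```python
import re

# Hypothetical config shape: a global default plus optional per-provider overrides.
CONFIG = {
    "prompt_caching": {
        "cache_ttl": "30m",
        "providers": {"anthropic": {"cache_ttl": "5m"}},
    }
}

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_ttl(value: str) -> int:
    """Convert a duration such as '5m' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported TTL format: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def resolve_cache_ttl(config: dict, provider: str) -> int:
    """Return the provider-specific TTL if set, else the global default."""
    caching = config.get("prompt_caching", {})
    override = caching.get("providers", {}).get(provider, {}).get("cache_ttl")
    return parse_ttl(override or caching.get("cache_ttl", "5m"))
```

With this shape, `resolve_cache_ttl(CONFIG, "deepseek")` falls back to the global 30-minute value, while `"anthropic"` keeps its own setting. Whether a given provider honors a client-side TTL at all is provider-specific and would need checking against its API documentation.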

2. Skills prompt snapshot (~60KB) is sent every turn

The `.skills_prompt_snapshot.json` file contains full metadata for all 38 installed skills. It is injected into the system prompt on every API call, whether or not the skills are used. On long sessions with 85-224 API calls, this alone accounts for 5-25M cache-read tokens.

Suggested fixes:

- Store only skill name + single-line description in the snapshot (trim metadata)
- Or switch to lazy-loading: only inject the top-K skills that fuzzy-match the current query
- Or add a `max_skills_snapshot_kb` config to cap the injected size
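The first and third options can be sketched together. This assumes the snapshot is a list of dicts with `name` and `description` fields, which is a guess at the schema; `trim_skill` and `trim_snapshot` are hypothetical helpers, and the `max_kb` argument stands in for the proposed `max_skills_snapshot_kb` setting.

```python
import json
from typing import List, Optional


def trim_skill(skill: dict) -> dict:
    """Keep only the name and the first line of the description."""
    description = str(skill.get("description", "")).strip()
    first_line = description.splitlines()[0] if description else ""
    return {"name": skill.get("name", ""), "description": first_line}


def trim_snapshot(skills: List[dict], max_kb: Optional[float] = None) -> List[dict]:
    """Trim every skill, then drop trailing skills until the JSON fits max_kb."""
    trimmed = [trim_skill(s) for s in skills]
    if max_kb is None:
        return trimmed
    budget = int(max_kb * 1024)  # size cap in bytes
    while trimmed and len(json.dumps(trimmed, ensure_ascii=False).encode("utf-8")) > budget:
        trimmed.pop()  # assumption: later entries are the least important
    return trimmed
```

Dropping from the end is the simplest policy; a real implementation would probably rank skills by recent usage before cutting.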

3. Memory + User Profile grows unbounded

Memory char limit (2200) and user profile char limit (1375) are soft caps — they still consume context on every turn. With verbose brand-voice rules, story references, and user background stored as inline text, ~3.3KB of fixed overhead per call adds up.

Suggested fix: Add a `max_context_tokens` config (listed under Proposed Feature Set as `context.max_system_prompt_tokens`). When the system prompt exceeds this, auto-trim/compress memory and profile entries. Bonus: allow skills to be the container for verbose reference material (already supported by `skill_view()`).
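A rough sketch of the trimming half of that idea, assuming memory is a list of text entries ordered oldest to newest. The 4-characters-per-token estimate is a crude heuristic rather than a real tokenizer, and `estimate_tokens` and `fit_to_budget` are hypothetical names.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (assumption)."""
    return (len(text) + 3) // 4


def fit_to_budget(base_prompt: str, entries: List[str], max_tokens: int) -> List[str]:
    """Keep the newest entries that fit next to base_prompt within max_tokens."""
    remaining = max_tokens - estimate_tokens(base_prompt)
    kept = []
    for entry in reversed(entries):  # walk from newest to oldest
        cost = estimate_tokens(entry)
        if cost > remaining:
            break  # stop at the first entry that no longer fits
        kept.append(entry)
        remaining -= cost
    kept.reverse()  # restore oldest-to-newest order
    return kept
```

This only drops entries. Actual compression (summarizing old entries instead of discarding them) would need a model call and is left out here.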

Data from User Session (60 sessions, 64 days)

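| Component | Raw Tokens | % of Total |
|-----------|-----------|-----------|
| Input tokens | ~994K | 1.7% |
| Output tokens | ~284K | 0.5% |
| Cache read tokens | ~58.7M | **98%** |
| Cache write tokens | ~0 | 0% |
| Reasoning tokens | ~0 | 0% |
| **Total** | **~60M** | 100% |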
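The percentages can be re-derived from the raw counts. This quick check uses the rounded figures from the table, so the results are approximate; the dictionary keys are just labels chosen for the example.

```python
# Rounded token counts copied from the table above.
counts = {
    "input": 994_000,
    "output": 284_000,
    "cache_read": 58_700_000,
    "cache_write": 0,
    "reasoning": 0,
}

total = sum(counts.values())

# Share of each component as a percentage of the total, to one decimal place.
shares = {name: round(100 * value / total, 1) for name, value in counts.items()}

for name, share in shares.items():
    print(f"{name:12s} {share:5.1f}%")
```

Cache reads come out at just under 98% of all tokens, which matches the table.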

Worst offenders:

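- 纱窗网目选择指南 ("Window Screen Mesh Selection Guide") session: 271K input + 25.9M cache read (200 API calls)
- Discord广播适配器 ("Discord Broadcast Adapter") session: 169K input + 21.5M cache read (224 API calls)
- 巴特的人物背景介绍 ("Bart's Character Background") session: 156K input + 6.0M cache read (85 API calls)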
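Dividing each session's cache reads by its call count gives the average cached prefix re-read per API call. A small calculation on the figures above (the English session labels are shorthand for this example only):

```python
# (cache_read_tokens, api_calls) for the three sessions listed above.
sessions = {
    "window-screen-guide": (25_900_000, 200),
    "discord-adapter": (21_500_000, 224),
    "character-background": (6_000_000, 85),
}

# Average cache-read tokens per API call, rounded to the nearest token.
per_call = {name: round(tokens / calls) for name, (tokens, calls) in sessions.items()}

for name, avg in per_call.items():
    print(f"{name:22s} ~{avg / 1000:.0f}K cache-read tokens per call")
```

The averages land between roughly 70K and 130K tokens per call, so the per-call cached prefix (system prompt plus accumulated history) is large in every one of these sessions, not only in the longest one.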

Proposed Feature Set (3 items)

1. `prompt_caching.cache_ttl`: Increase default from 5m to 30m, make per-provider configurable
2. Skills snapshot trimming: Add config option to limit snapshot verbosity or switch to lazy/on-demand skill index injection
3. `context.max_system_prompt_tokens`: Soft cap that triggers auto-compression of memory/user_profile when exceeded

Environment

- Model: deepseek-v4-flash (via api.deepseek.com)
- Provider: deepseek
- Hermes config: default (config_version: 24)
- Platform: Windows 10
- Sessions auto-reset: daily at 04:00
