Hermes feature request: Reduce cache-read token overhead for DeepSeek providers (configurable cache_ttl, skills snapshot trimming, memory compaction)

Background

Hermes Agent on DeepSeek v4 Flash (API mode) burns ~60M tokens in 18 hours for a typical user session, of which ~58.7M (~98%) are cache_read_tokens. The actual cost is under $1 USD thanks to DeepSeek's cheap cache pricing, but on provider-agnostic billing dashboards the raw token count looks wildly inflated.

Root Cause Analysis

Three factors compound:

1. prompt_caching.cache_ttl: 5m is too short

The default 5-minute TTL means any natural pause in conversation (user checks notifications, walks away, thinks) invalidates prompt caching. The next request re-caches the full system prompt (~80KB+), generating fresh cache_write + cache_read tokens.

Suggested fix: Bump default to 30-60m. Also make this configurable per-provider (Anthropic may have different TTL semantics).
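
A minimal sketch of what per-provider TTL resolution could look like. The config shape (a global prompt_caching.cache_ttl plus a providers override map) and the helper names are assumptions for illustration, not Hermes's actual schema. Whether a given provider honors a client-supplied TTL is not something this issue establishes.

```python
import re

# Hypothetical config shape: a global default plus optional per-provider overrides.
CONFIG = {
    "prompt_caching": {
        "cache_ttl": "30m",
        "providers": {"anthropic": {"cache_ttl": "5m"}},
    }
}

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_ttl(value: str) -> int:
    """Convert a duration string such as '5m' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported TTL format: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def resolve_cache_ttl(config: dict, provider: str) -> int:
    """Return the TTL in seconds, preferring a provider override if present."""
    caching = config.get("prompt_caching", {})
    override = caching.get("providers", {}).get(provider, {}).get("cache_ttl")
    return parse_ttl(override or caching.get("cache_ttl", "5m"))


def cache_is_warm(idle_seconds: float, ttl_seconds: int) -> bool:
    """A pause shorter than the TTL leaves the cached prompt usable."""
    return idle_seconds < ttl_seconds


ttl = resolve_cache_ttl(CONFIG, "deepseek")
print(ttl, cache_is_warm(10 * 60, ttl))  # a 10-minute pause against a 30m TTL
```

With a 5m TTL the same 10-minute pause would count as a cold cache, which is the scenario described above.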

2. Skills prompt snapshot (~60KB) is sent every turn

The .skills_prompt_snapshot.json contains full metadata for all 38 installed skills. This is injected into the system prompt every API call, whether or not the skills are used. On long sessions with 85-224 API calls, this alone accounts for 5-25M cache reads.

Suggested fixes:

  • Store only skill name + single-line description in the snapshot (trim metadata)
  • Or switch to lazy-loading: only inject the top-K skills that fuzzy-match the current query
  • Or add a max_skills_snapshot_kb config to cap the injected size
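
The first and third options above can be sketched together. The snapshot entry fields (name, description) and the function names are assumptions, since the issue does not show the real layout of .skills_prompt_snapshot.json. Only max_skills_snapshot_kb comes from the proposal itself.

```python
from __future__ import annotations

import json


def trim_skill(skill: dict) -> dict:
    """Keep only the skill name and the first line of its description."""
    description = (skill.get("description") or "").strip()
    first_line = description.splitlines()[0] if description else ""
    return {"name": skill["name"], "description": first_line}


def trim_snapshot(skills: list[dict], max_skills_snapshot_kb: float | None = None) -> list[dict]:
    """Trim every entry, then stop adding entries once the size cap is reached."""
    budget = None if max_skills_snapshot_kb is None else int(max_skills_snapshot_kb * 1024)
    trimmed: list[dict] = []
    used = 0
    for skill in skills:
        entry = trim_skill(skill)
        size = len(json.dumps(entry, ensure_ascii=False).encode("utf-8"))
        if budget is not None and used + size > budget:
            break  # cap reached: remaining skills are left out of the prompt
        trimmed.append(entry)
        used += size
    return trimmed


skills = [
    {"name": "a", "description": "line one\nline two", "metadata": {"x": 1}},
    {"name": "b", "description": "other", "metadata": {"y": 2}},
]
print(trim_snapshot(skills))
```

Skills dropped by the cap would need another discovery path (for example an on-demand lookup), which this sketch does not cover.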

3. Memory + User Profile grows unbounded

Memory char limit (2200) and user profile char limit (1375) are soft caps — they still consume context on every turn. With verbose brand-voice rules, story references, and user background stored as inline text, ~3.3KB of fixed overhead per call adds up.

Suggested fix: Add a max_context_tokens config (listed under the proposed feature set as context.max_system_prompt_tokens). When the system prompt exceeds this, auto-trim/compress memory and profile entries. Bonus: allow skills to be the container for verbose reference material (already supported by skill_view()).
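
One way the auto-trim step might work is sketched below. It drops the oldest entries first rather than summarizing them, so it is trimming, not true compression. The 4-characters-per-token estimate, the oldest-first ordering of entries, and all function names are assumptions made for illustration.

```python
from __future__ import annotations


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return (len(text) + 3) // 4


def trim_entries(base_prompt: str, entries: list[str], max_system_prompt_tokens: int) -> list[str]:
    """Keep the newest memory/profile entries that fit in the remaining budget.

    `entries` is assumed to be ordered oldest first.
    """
    budget = max(max_system_prompt_tokens - estimate_tokens(base_prompt), 0)
    kept: list[str] = []
    used = 0
    for entry in reversed(entries):  # walk from newest to oldest
        cost = estimate_tokens(entry)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    kept.reverse()  # restore oldest-first order
    return kept


base = "x" * 400                        # ~100 tokens of fixed system prompt
memory = ["a" * 40, "b" * 40, "c" * 40]  # ~10 tokens each
print(len(trim_entries(base, memory, 120)))  # budget leaves room for two entries
```

A real implementation would likely use the provider's tokenizer instead of a character heuristic and might summarize dropped entries instead of discarding them.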

Data from User Session (60 sessions, 64 days)

| Component | Raw Tokens | % of Total |
|---|---|---|
| Input tokens | ~994K | 1.7% |
| Output tokens | ~284K | 0.5% |
| Cache read tokens | ~58.7M | 98% |
| Cache write tokens | ~0 | 0% |
| Reasoning tokens | ~0 | 0% |
| Total | ~60M | 100% |

Worst offenders:

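  • "Window screen mesh selection guide" (纱窗网目选择指南) session: 271K input + 25.9M cache read (200 API calls)
  • "Discord broadcast adapter" (Discord广播适配器) session: 169K input + 21.5M cache read (224 API calls)
  • "Bart's character background introduction" (巴特的人物背景介绍) session: 156K input + 6.0M cache read (85 API calls)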
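
Dividing each session's cache reads by its API call count gives the average cached prefix re-read per request. The short calculation below uses only the figures listed above; the session labels are shortened English translations.

```python
# (label, cache_read_tokens, api_calls) as reported in the issue.
SESSIONS = [
    ("window screen mesh guide", 25_900_000, 200),
    ("Discord broadcast adapter", 21_500_000, 224),
    ("character background", 6_000_000, 85),
]


def cache_read_per_call(cache_read_tokens: int, api_calls: int) -> float:
    """Average number of cache-read tokens billed per API call."""
    return cache_read_tokens / api_calls


for label, cache_read, calls in SESSIONS:
    per_call_k = cache_read_per_call(cache_read, calls) / 1000
    print(f"{label}: ~{per_call_k:.1f}K cache-read tokens per call")
# window screen mesh guide: ~129.5K cache-read tokens per call
# Discord broadcast adapter: ~96.0K cache-read tokens per call
# character background: ~70.6K cache-read tokens per call
```

Averages of roughly 70K to 130K tokens per call suggest that a large cached prefix is re-read on every request in these sessions.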

Proposed Feature Set (3 items)

  1. prompt_caching.cache_ttl — Increase default from 5m to 30m, make per-provider configurable
  2. Skills snapshot trimming — Add config option to limit snapshot verbosity or switch to lazy/on-demand skill index injection
  3. context.max_system_prompt_tokens — Soft cap that triggers auto-compression of memory/user_profile when exceeded

Environment

  • Model: deepseek-v4-flash (via api.deepseek.com)
  • Provider: deepseek
  • Hermes config: default (config_version: 24)
  • Platform: Windows 10
  • Sessions auto-reset: daily at 04:00
