hermes - Fix state.db FTS trigram index bloat: 70% of DB size is full-text indexes [1 pull request]

Fix Action

Fixed

Problem

The state.db file grows rapidly due to dual FTS5 indexes on the messages table. On a moderately used instance (42K messages, 2.4K sessions):

Component                                         Size    % of DB
messages data (content, tool_calls, reasoning)    99MB    19.6%
sessions data (system_prompt)                     45MB    8.9%
FTS indexes (fts + fts_trigram)                   358MB   70.8%
Other (indexes, overhead)                         3MB     0.7%
Total                                             505MB   100%

The messages_fts_trigram table alone consumes 247MB (49% of the entire DB), about 2.2x the size of the primary FTS index (111MB).
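One way to reproduce a breakdown like the one above is to sum the blobs in each FTS5 index's shadow table. The sketch below assumes both indexes are ordinary FTS5 tables, so their inverted-index data lives in a table named `<name>_data`. The helper names `fts_index_bytes` and `size_report` are illustrative and not part of Hermes. Blob lengths ignore page overhead, so the result is a lower bound rather than an exact on-disk figure.

```python
import sqlite3


def fts_index_bytes(conn: sqlite3.Connection, fts_table: str) -> int:
    """Sum the index blobs that FTS5 keeps in the "<fts_table>_data" shadow table."""
    row = conn.execute(
        f'SELECT COALESCE(SUM(LENGTH(block)), 0) FROM "{fts_table}_data"'
    ).fetchone()
    return int(row[0])


def size_report(conn, fts_tables=("messages_fts", "messages_fts_trigram")) -> dict:
    """Return total database bytes plus the approximate bytes held by each FTS index."""
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    report = {"total": page_size * page_count}
    for name in fts_tables:
        report[name] = fts_index_bytes(conn, name)
    return report


# Usage (the path is an assumption; point it at the instance's state.db):
#   conn = sqlite3.connect("state.db")
#   for name, size in size_report(conn).items():
#       print(f"{name}: {size / 1_000_000:.1f}MB")
```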

Root Cause

  1. Dual FTS indexes: Every message is indexed twice:

    • messages_fts (porter tokenizer) — 111MB
    • messages_fts_trigram (trigram tokenizer) — 247MB
  2. Trigram tokenizer is expensive for CJK: Chinese text produces significantly more trigram tokens than English, inflating the index. The trigram index is 2.2x the size of the porter stemmer index (247MB vs 111MB) despite indexing the same data.

  3. system_prompt stored per-session: 2.4K sessions × ~17KB system_prompt = 38.6MB. Many sessions share nearly identical prompts (same model + similar config), but each stores a full copy.
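To make the token inflation in point 2 concrete, the following sketch imitates a trigram sliding window in plain Python. It is an approximation for illustration, not the FTS5 tokenizer itself, and the sample strings are invented. It shows that a trigram index stores roughly one term per character, while whitespace splitting yields one term per word and sees an unsegmented CJK run as a single token. Each CJK trigram also takes 9 bytes in UTF-8 versus 3 bytes for ASCII.

```python
def trigram_tokens(text: str) -> list[str]:
    """Every run of three consecutive characters, as a sliding window."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def token_stats(text: str) -> dict:
    """Compare whitespace-delimited word count with trigram count and term bytes."""
    grams = trigram_tokens(text)
    return {
        "word_tokens": len(text.split()),
        "trigram_tokens": len(grams),
        "trigram_term_bytes": sum(len(g.encode("utf-8")) for g in grams),
    }


# Illustrative inputs only.
english = "search the message history"
chinese = "搜索消息历史记录"

for label, sample in (("english", english), ("chinese", chinese)):
    print(label, token_stats(sample))
```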

Growth Rate

  • Daily: ~150 new sessions + ~2000 messages → +20MB/day
  • At this rate: 600MB/month, 7.3GB/year
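The monthly and yearly figures follow from a straight-line extrapolation of the daily growth. A minimal sketch of that arithmetic, assuming growth stays constant at 20MB/day (the function name is illustrative):

```python
def project_size_mb(current_mb: float, daily_growth_mb: float, days: int) -> float:
    """Linear projection: current size plus constant daily growth."""
    return current_mb + daily_growth_mb * days


# Growth alone, starting from zero, matches the figures above.
monthly_growth = project_size_mb(0, 20, 30)    # MB added in 30 days
yearly_growth = project_size_mb(0, 20, 365)    # MB added in 365 days

# Starting from the reported 505MB database.
size_after_year = project_size_mb(505, 20, 365)
```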

Suggested Fixes

  1. Make trigram FTS optional: The porter stemmer FTS handles most English queries well. The trigram index is only needed for CJK substring search (3+ chars). Consider:

    • Adding a config option to disable trigram indexing
    • Or only building it on-demand when CJK search is used
  2. Normalize system_prompt storage: Store deduplicated prompts in a system_prompts table referenced by a foreign key from sessions, which would eliminate most of the ~38MB of redundant copies.

  3. Add VACUUM/PRAGMA: Consider PRAGMA auto_vacuum = INCREMENTAL or periodic VACUUM to reclaim space after session deletion.

  4. Add a session retention/cleanup mechanism: Currently sessions grow indefinitely. A configurable TTL or max session count would help long-running instances.
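Fixes 1 and 3 can be combined into a one-off maintenance step, sketched below under stated assumptions. The table name comes from the issue; the trigger names are hypothetical, since the issue does not show how Hermes keeps the FTS tables in sync. Dropping the table frees pages inside the file, but the file itself only shrinks after a VACUUM. On an existing database, a changed auto_vacuum mode also only takes effect once VACUUM has run.

```python
import sqlite3


def drop_trigram_index(conn, fts_table="messages_fts_trigram", trigger_names=()):
    """Drop the trigram FTS table and its sync triggers, then compact the file."""
    # Triggers that copy rows into the FTS table have to be dropped as well;
    # otherwise later writes to the source table fail once the FTS table is gone.
    for trigger in trigger_names:
        conn.execute(f'DROP TRIGGER IF EXISTS "{trigger}"')
    conn.execute(f'DROP TABLE IF EXISTS "{fts_table}"')
    conn.commit()  # VACUUM cannot run inside an open transaction
    # Request incremental auto-vacuum; VACUUM rebuilds the file and applies the mode.
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    conn.execute("VACUUM")


def release_free_pages(conn, max_pages=1000):
    """Return up to max_pages free pages to the OS (needs auto_vacuum = INCREMENTAL)."""
    conn.execute(f"PRAGMA incremental_vacuum({int(max_pages)})").fetchall()
    conn.commit()
```

A full VACUUM rewrites the whole database and needs free disk space for the copy, so it fits a maintenance window better than the request path. `release_free_pages` is the cheaper follow-up after sessions are deleted.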
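The system_prompt normalization in fix 2 could look roughly like the sketch below. The schema is an assumption: it supposes a `sessions` table with `id` and a nullable `system_prompt` column, and it introduces a hypothetical `system_prompts` table and `system_prompt_hash` column. Prompts are keyed by SHA-256, so only byte-identical prompts are merged; "nearly identical" prompts would still be stored separately.

```python
import hashlib
import sqlite3


def prompt_hash(text: str) -> str:
    """Content hash used as the key of the shared prompt table."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_system_prompts(conn: sqlite3.Connection) -> int:
    """Move sessions.system_prompt into a shared table; return distinct prompt count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS system_prompts ("
        "hash TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    columns = [row[1] for row in conn.execute("PRAGMA table_info(sessions)")]
    if "system_prompt_hash" not in columns:
        conn.execute(
            "ALTER TABLE sessions ADD COLUMN system_prompt_hash TEXT "
            "REFERENCES system_prompts(hash)"
        )
    rows = conn.execute(
        "SELECT id, system_prompt FROM sessions WHERE system_prompt IS NOT NULL"
    ).fetchall()
    for session_id, body in rows:
        digest = prompt_hash(body)
        # Identical prompts collapse into a single row.
        conn.execute(
            "INSERT OR IGNORE INTO system_prompts (hash, body) VALUES (?, ?)",
            (digest, body),
        )
        # Point the session at the shared row and clear the inline copy.
        conn.execute(
            "UPDATE sessions SET system_prompt_hash = ?, system_prompt = NULL "
            "WHERE id = ?",
            (digest, session_id),
        )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM system_prompts").fetchone()[0]
```

The freed space only shows up in the file size after a VACUUM, and readers of `sessions.system_prompt` would need to join through `system_prompt_hash` instead.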

Environment

  • Hermes Agent v0.13.0 (2026.5.7)
  • macOS, Python 3.11.15
  • state.db: 505MB, 42155 messages, 2375 sessions
  • Primary models: MiniMax-M2.7 (1915 sessions), glm-5-turbo (458 sessions)
