hermes - 💡(How to fix) Fix feat(hindsight): LLM-based retain pre-filter to reduce noise and token cost [1 participants]

GitHub stats
NousResearch/hermes-agent#16834, fetched 2026-04-29 06:38:41
Comments: 0
Participants: 1
Timeline: 3 (labeled ×3)
Reactions: 0

Root Cause

The pre-filter pays for itself because retain is the most expensive operation ($15/1M), and skipping a retain call saves the full extraction cost — including the Hindsight-side LLM extraction that also runs on every retained turn.


Title

feat(hindsight): LLM-based retain pre-filter to reduce noise and token cost

Body

Problem

Hindsight's Hermes plugin sends every completed turn to the retain API with zero content-based filtering. The only controls are auto_retain (on/off) and retain_every_n_turns (batching, not filtering).

This causes two problems:

1. Noise in memory: Tool output, pasted or uploaded documents (scripts, research), SQL query results, debugging sessions, and any content the user is working with (rather than stating as a personal fact) all get retained. The extraction LLM then creates memories that are irrelevant or actively misleading (e.g., arguments from a YouTube script attributed to the user as personal opinions).

2. Wasted token cost: Cloud users pay $15.00/1M tokens for Retain, $3.00/1M tokens for Reflect, and $0.75/1M tokens for Recall. Every turn, regardless of content quality, consumes retain tokens for extraction.

Why heuristic filters aren't enough

Regex patterns, document upload flags, and content-type heuristics all share the same flaw: they're source-based, not content-based. A YouTube script can arrive via:

  • File upload ([The user sent a text document: ...])
  • Pasted inline in CLI (no document marker)
  • Referenced from a file read
  • Typed as part of a brainstorming session

A regex can't distinguish "Chris is telling me about his architecture decision" from "Chris pasted a script about healthcare ROI for me to review." Only an LLM can make that judgment call.
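
To make the limitation concrete, here is a minimal Python sketch of a source-based filter keyed on the upload marker quoted above. The function name `heuristic_skip`, the file name `script.txt`, and the sample sentences are illustrative assumptions, not plugin code.

```python
import re

# Upload marker as quoted in the issue; everything after the colon varies.
DOC_MARKER = re.compile(r"\[The user sent a text document:")


def heuristic_skip(turn_text: str) -> bool:
    """Source-based filter: skip a turn only when the upload marker is present."""
    return bool(DOC_MARKER.search(turn_text))


# Hypothetical turns carrying the same script text via two different routes.
uploaded = "[The user sent a text document: script.txt] Healthcare ROI is the biggest lever."
pasted = "Healthcare ROI is the biggest lever."

print(heuristic_skip(uploaded))  # marker present, so the turn is skipped
print(heuristic_skip(pasted))    # no marker, so the pasted script is retained
```

The same script text is skipped when it arrives as an upload and retained when it is pasted inline, which is the gap the issue describes.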

Proposed solution

Add a retain_pre_filter option that runs a lightweight LLM classification call before sending content to the retain API. If the pre-filter says "skip," the turn is silently dropped — no retain API call, no extraction, no token cost.

Config (~/.hermes/hindsight/config.json)

{
  "retain_pre_filter": {
    "enabled": true,
    "model": "gpt-oss-120b",
    "prompt": "You are a memory gatekeeper. Given the conversation turn below, decide if it contains information worth retaining in long-term memory about the user.\n\nRetain if the turn contains:\n- Personal facts, preferences, decisions, or corrections\n- Technical choices or workflow decisions the user made\n- Durable insights about the user's environment or work\n- Relationships, roles, or project context\n\nSkip if the turn is:\n- Content the user is working with (scripts, documents, research, code) rather than expressing\n- Tool output, debugging logs, or SQL results\n- The assistant explaining, suggesting, or executing tasks\n- Factual claims written for an audience (not personal facts)\n- Ephemeral session state (model switches, connection checks)\n\nRespond with a single JSON object: {\"retain\": true/false, \"reason\": \"one sentence\"}",
    "min_user_chars": 20
  }
}
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable pre-filter (opt-in, zero breaking change) |
| model | string | agent's current model | Model to use for classification. Defaults to the agent's configured model. Can be set to gpt-oss-120b (the same model Hindsight uses for extraction, aggressively cheap) or to a local model via Ollama for zero cost |
| prompt | string | built-in default | Custom classification prompt. Users can tune this to their needs |
| min_user_chars | int | 20 | Skip pre-filter (and retain) for turns where the user message is below this length. These are almost never worth retaining |
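
A minimal Python sketch of how the proposed options and their defaults could be read from the config file. The `PreFilterConfig` dataclass and `parse_pre_filter_config` function are hypothetical names; the issue does not specify how the plugin would load this block. `None` stands in for "agent's current model" and "built-in default".

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreFilterConfig:
    enabled: bool = False          # opt-in: disabled unless set explicitly
    model: Optional[str] = None    # None = use the agent's current model
    prompt: Optional[str] = None   # None = use the built-in default prompt
    min_user_chars: int = 20


def parse_pre_filter_config(raw_json: str) -> PreFilterConfig:
    """Read the retain_pre_filter block, applying defaults for missing keys."""
    block = json.loads(raw_json).get("retain_pre_filter", {})
    return PreFilterConfig(
        enabled=bool(block.get("enabled", False)),
        model=block.get("model"),
        prompt=block.get("prompt"),
        min_user_chars=int(block.get("min_user_chars", 20)),
    )


# A config file without the new block yields the disabled defaults.
print(parse_pre_filter_config("{}"))
```

Because a missing block resolves to `enabled=False`, existing config files would keep their current behavior under this sketch.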

How it works

User turn completes
        │
┌────────────────────┐     no      ┌────────────────────┐
│ pre_filter         │────────────▶│ Normal retain      │
│ enabled?           │             │ (current behavior) │
└────────────────────┘             └────────────────────┘
        │ yes
┌────────────────────┐     no      ┌────────────────────┐
│ User message >=    │────────────▶│ Skip retain        │
│ min_user_chars?    │             │ (no cost)          │
└────────────────────┘             └────────────────────┘
        │ yes
┌────────────────────┐
│ LLM classification │
│ (cheap model)      │
│ ~500-2000 tokens   │
└────────────────────┘
        ├── retain: false ──▶ Skip (no retain API call)
        └── retain: true  ──▶ Send to Hindsight retain API
                               (extraction runs as normal)
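
The decision flow can be sketched in Python as follows. This is a sketch under stated assumptions, not the plugin's implementation: `should_retain` and the `classify` callable are hypothetical names, a disabled filter is assumed to leave retention unchanged (per the backward compatibility section), and messages shorter than `min_user_chars` are assumed to be skipped (per the options table). A reply that cannot be parsed is treated as "retain", in line with the graceful degradation goal.

```python
import json
from typing import Callable


def should_retain(
    user_message: str,
    turn_text: str,
    *,
    enabled: bool,
    min_user_chars: int,
    classify: Callable[[str], str],
) -> bool:
    """Return True when the turn should be sent to the retain API."""
    if not enabled:
        return True   # filter off: behave exactly as today
    if len(user_message) < min_user_chars:
        return False  # too short to be worth a classification call
    reply = classify(turn_text)  # hypothetical call to the cheap model
    try:
        verdict = json.loads(reply)
        # Only an explicit JSON false skips the turn.
        return verdict["retain"] is not False
    except (ValueError, KeyError, TypeError):
        return True   # unparseable reply: fail open and retain


# Stub classifier standing in for the real model call.
def stub(_turn: str) -> str:
    return '{"retain": false, "reason": "tool output"}'


print(should_retain("please run the migration script now", "...",
                    enabled=True, min_user_chars=20, classify=stub))
```

Passing the classifier as a callable keeps the gate independent of any particular model provider, which matches the configurable-model goal of the proposal.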

Cost analysis

Using gpt-oss-120b (the model Hindsight already uses for extraction in its default config — aggressively cheap open-weight model):

| Metric | Value |
|---|---|
| Retain ops (14 days, single user) | 315 |
| Pre-filter input tokens (~1.5K avg per classification) | ~473K |
| Pre-filter cost (gpt-oss-120b) | ~$0.01 (self-hosted: $0) |
| Turns skipped (estimated 50-70%) | ~157-220 |
| Retain tokens saved (at 3,718 avg tokens/op) | ~584K-818K |
| Retain cost saved ($15/1M) | ~$8.76-$12.27 |
| Net savings per 14 days | ~$8.75-$12.26 |
| Projected annual savings | ~$228-$320 |

For self-hosted users running a local 20B model (e.g., via Ollama) for classification, the pre-filter cost is $0 and still catches an estimated 70-80% of garbage turns.

The pre-filter pays for itself because retain is the most expensive operation ($15/1M), and skipping a retain call saves the full extraction cost — including the Hindsight-side LLM extraction that also runs on every retained turn.
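
The figures in the cost table can be reproduced with a few lines of Python. All inputs come from the issue; the function name `net_savings` and the choice to truncate the skipped-turn count to a whole number are assumptions made for this sketch.

```python
RETAIN_PRICE_PER_M = 15.00   # USD per 1M retain tokens (from the issue)
RETAIN_OPS = 315             # retain ops over 14 days (from the issue)
AVG_RETAIN_TOKENS = 3_718    # average tokens per retain op (from the issue)
PREFILTER_COST = 0.01        # USD for ~473K classification tokens (from the issue)


def net_savings(skip_rate: float) -> dict:
    """Estimate 14-day and annual savings for a given share of skipped turns."""
    skipped = int(RETAIN_OPS * skip_rate)   # turns never sent to retain
    tokens_saved = skipped * AVG_RETAIN_TOKENS
    gross = tokens_saved / 1_000_000 * RETAIN_PRICE_PER_M
    net = gross - PREFILTER_COST
    return {
        "skipped": skipped,
        "tokens_saved": tokens_saved,
        "gross_usd": round(gross, 2),
        "net_usd": round(net, 2),
        "annual_usd": round(net * 365 / 14),
    }


# The issue's estimated skip range is 50-70%.
for rate in (0.5, 0.7):
    print(rate, net_savings(rate))
```

At a 50% skip rate this gives 157 skipped turns and about $8.75 net per 14 days; at 70% it gives 220 skipped turns and about $12.26, matching the table.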

Why this is better than retain-mission-only filtering

The retain mission (extraction instructions) is the last line of defense — it tells the extraction LLM what to keep. But:

  1. Tokens are already spent — the retain API call, serialization, and extraction LLM call all run before the retain mission has any effect
  2. Extraction quality degrades with noise — the extraction LLM has to process and discard garbage content, which can confuse entity resolution and fact extraction even for the good parts of the turn
  3. No way to handle pasted content — the retain mission can't distinguish "Chris told me this" from "Chris pasted this document for me to work on" when both arrive as plain text

The pre-filter acts as the first line of defense: it decides whether the turn is even worth sending. The retain mission then handles the nuance of what to extract from the turns that pass through. Both layers serve different purposes.

Key design decisions

1. Opt-in, not opt-out. All existing users see zero behavior change. The filter is disabled by default.

2. Configurable model. Users can use a cheap model for classification (DeepSeek, Qwen Flash, local Ollama) or default to their agent's current model. The prompt is also overridable for customization.

3. Async, non-blocking. The pre-filter classification should run asynchronously (like retain_async already does) so it doesn't add latency to the user's conversation. If the pre-filter is slow, the worst case is a brief delay before the retain batch is queued — the user's response is already delivered.

4. Logging. When a turn is skipped, log the classification result (retain: false, reason) at debug level so users can audit what's being filtered and tune the prompt if needed.

5. Graceful degradation. If the pre-filter model is unavailable or errors, fall through to normal retain (send everything). The filter is a cost optimization, not a gatekeeper — it should never cause data loss.
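
Decisions 3 to 5 can be sketched together with `asyncio`. This is a sketch under stated assumptions: `gate_and_retain`, the `classify` and `retain` coroutines, the logger name, and the 10-second timeout are all hypothetical and do not come from the issue or the plugin.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple

logger = logging.getLogger("hindsight.pre_filter")  # hypothetical logger name


async def gate_and_retain(
    turn_text: str,
    classify: Callable[[str], Awaitable[Tuple[bool, str]]],
    retain: Callable[[str], Awaitable[None]],
    timeout_s: float = 10.0,  # assumed value, not from the issue
) -> bool:
    """Classify off the response path; on any failure, retain as normal."""
    try:
        keep, reason = await asyncio.wait_for(classify(turn_text), timeout=timeout_s)
    except Exception as exc:  # model unavailable, timeout, malformed reply, ...
        logger.warning("pre-filter failed (%s); falling through to retain", exc)
        keep, reason = True, "pre-filter unavailable"
    if not keep:
        # Debug-level audit trail so users can tune the prompt.
        logger.debug("retain skipped: retain=false reason=%s", reason)
        return False
    await retain(turn_text)
    return True


async def _demo() -> None:
    async def failing_classifier(_turn: str) -> Tuple[bool, str]:
        raise RuntimeError("model down")

    async def fake_retain(turn: str) -> None:
        print("retained:", turn)

    # The classifier error does not block retention.
    await gate_and_retain("I prefer tabs over spaces", failing_classifier, fake_retain)


asyncio.run(_demo())
```

In a real integration this coroutine would be scheduled as a background task after the user's response has been delivered, so a slow classifier delays only the retain call.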

Backward compatibility

Zero breaking change. retain_pre_filter.enabled defaults to false. When disabled, behavior is identical to current.

Alternatives considered

| Approach | Problem |
|---|---|
| Regex/heuristic filters | Source-based, can't distinguish user-expressed facts from user-pasted content |
| retain_every_n_turns | Batching, not filtering: same noise, just delayed |
| Retain mission rules only | Tokens already spent by the time extraction runs; can't handle pasted text |
| Disable auto_retain, manual only | Loses automatic retention convenience |
| Post-extraction cleanup | Doesn't save tokens: the expensive extraction already ran |

Environment

  • Hermes Agent: v2026.4.x
  • Hindsight plugin: latest
  • Config: ~/.hermes/hindsight/config.json

extent analysis

TL;DR

The issue proposes a retain_pre_filter option for Hindsight's Hermes plugin: a lightweight LLM classification call that drops irrelevant turns before they reach the retain API, reducing memory noise and retain token cost. The issue describes this as a proposed solution, so the guidance below applies once the option is available.

Guidance

  • Enable the retain_pre_filter option in the ~/.hermes/hindsight/config.json configuration file by setting "enabled": true.
  • Choose a suitable model for the pre-filter, such as gpt-oss-120b, and configure the prompt to customize the classification criteria.
  • Set the min_user_chars threshold so that short user messages skip both the pre-filter and the retain call; the issue's default is 20 characters.
  • Monitor the pre-filter's performance and adjust the configuration as needed to balance filtering effectiveness and token cost savings.

Example

{
  "retain_pre_filter": {
    "enabled": true,
    "model": "gpt-oss-120b",
    "prompt": "You are a memory gatekeeper. Given the conversation turn below, decide if it contains information worth retaining in long-term memory about the user.\n\nRespond with a single JSON object: {\"retain\": true/false, \"reason\": \"one sentence\"}",
    "min_user_chars": 20
  }
}

Notes

The proposed solution is opt-in, introduces no breaking changes, and lets users customize the pre-filter model and prompt. Its effectiveness depends on the quality of the classification model and on how well the prompt fits the specific use case.

Recommendation

Once retain_pre_filter is available, enable it to reduce noise and token cost in Hindsight's Hermes plugin. It is a proposed feature rather than a workaround for an existing option, and it offers a flexible, customizable way to filter out irrelevant content.
