hermes - 💡(How to fix) Fix feat(hindsight): LLM-based retain pre-filter to reduce noise and token cost [1 participants]

hermes2026-04-28 05:17:51


NousResearch/hermes-agent#16834•Fetched 2026-04-29 06:38:41


Author

cannonball-me

Participants

cannonball-me

Timeline (top)

labeled ×3

Root Cause

The pre-filter pays for itself because retain is the most expensive operation ($15 per 1M tokens): every skipped retain call saves the full extraction cost, including the Hindsight-side LLM extraction that otherwise runs on every retained turn.


Title

feat(hindsight): LLM-based retain pre-filter to reduce noise and token cost

Body

Problem

Hindsight's Hermes plugin sends every completed turn to the retain API with zero content-based filtering. The only controls are auto_retain (on/off) and retain_every_n_turns (batching, not filtering).

This causes two problems:

1. Noise in memory: Tool output, pasted or uploaded documents (scripts, research), SQL query results, debugging sessions, and any other content the user is working with (rather than stating as a personal fact) all get retained. The extraction LLM then creates memories that are irrelevant or actively misleading (e.g., arguments from a YouTube script attributed to the user as personal opinions).

2. Wasted token cost: Cloud users pay $15.00/1M tokens for Retain, $3.00/1M tokens for Reflect, and $0.75/1M tokens for Recall. Every turn, regardless of content quality, consumes retain tokens for extraction.

Why heuristic filters aren't enough

Regex patterns, document upload flags, and content-type heuristics all share the same flaw: they're source-based, not content-based. A YouTube script can arrive via:

- File upload ([The user sent a text document: ...])
- Pasted inline in CLI (no document marker)
- Referenced from a file read
- Typed as part of a brainstorming session

A regex can't distinguish "Chris is telling me about his architecture decision" from "Chris pasted a script about healthcare ROI for me to review." Only an LLM can make that judgment call.
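To make the limitation concrete, here is a minimal sketch of a source-based heuristic. The function name `looks_like_upload` and the three sample messages are hypothetical; only the `[The user sent a text document: ...]` marker comes from the list above.

```python
import re

# Marker shown for file uploads in the list above; inline pastes carry no such marker.
UPLOAD_MARKER = re.compile(r"^\[The user sent a text document:")

def looks_like_upload(user_message: str) -> bool:
    """Hypothetical source-based heuristic: flag turns that start with the upload marker."""
    return bool(UPLOAD_MARKER.match(user_message))

# Illustrative messages, not taken from real sessions.
uploaded = "[The user sent a text document: script.txt] Healthcare ROI is overstated..."
pasted = "Healthcare ROI is overstated because hospitals count savings twice..."
personal = "I decided to move our services to a monorepo last week."

results = {
    name: looks_like_upload(text)
    for name, text in [("uploaded", uploaded), ("pasted", pasted), ("personal", personal)]
}
print(results)
```

The heuristic flags only the uploaded copy. The pasted script and the personal decision get the same answer, so the filter cannot separate them without looking at what the text actually says.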

Proposed solution

Add a retain_pre_filter option that runs a lightweight LLM classification call before sending content to the retain API. If the pre-filter says "skip," the turn is silently dropped — no retain API call, no extraction, no token cost.

Config (`~/.hermes/hindsight/config.json`)

{
  "retain_pre_filter": {
    "enabled": true,
    "model": "gpt-oss-120b",
    "prompt": "You are a memory gatekeeper. Given the conversation turn below, decide if it contains information worth retaining in long-term memory about the user.\n\nRetain if the turn contains:\n- Personal facts, preferences, decisions, or corrections\n- Technical choices or workflow decisions the user made\n- Durable insights about the user's environment or work\n- Relationships, roles, or project context\n\nSkip if the turn is:\n- Content the user is working with (scripts, documents, research, code) rather than expressing\n- Tool output, debugging logs, or SQL results\n- The assistant explaining, suggesting, or executing tasks\n- Factual claims written for an audience (not personal facts)\n- Ephemeral session state (model switches, connection checks)\n\nRespond with a single JSON object: {\"retain\": true/false, \"reason\": \"one sentence\"}",
    "min_user_chars": 20
  }
}

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | bool | `false` | Enable pre-filter (opt-in, zero breaking change) |
| `model` | string | agent's current model | Model to use for classification. Defaults to the agent's configured model. Can be set to `gpt-oss-120b` (the same model Hindsight uses for extraction, aggressively cheap) or a local model via Ollama for zero cost |
| `prompt` | string | built-in default | Custom classification prompt. Users can tune this to their needs |
| `min_user_chars` | int | `20` | Skip pre-filter (and retain) for turns where the user message is below this length. These are almost never worth retaining |
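A minimal sketch of how a plugin might read this block and apply the documented defaults. `PreFilterConfig` and `parse_pre_filter_config` are hypothetical names, not part of the existing Hindsight plugin; `None` stands in for "use the agent's current model" and "use the built-in prompt".

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PreFilterConfig:
    enabled: bool = False          # opt-in: absent config means no behavior change
    model: Optional[str] = None    # None = fall back to the agent's current model
    prompt: Optional[str] = None   # None = fall back to the built-in default prompt
    min_user_chars: int = 20

def parse_pre_filter_config(raw_json: str) -> PreFilterConfig:
    """Read the retain_pre_filter block, using defaults for any missing key."""
    section = json.loads(raw_json).get("retain_pre_filter", {})
    return PreFilterConfig(
        enabled=bool(section.get("enabled", False)),
        model=section.get("model"),
        prompt=section.get("prompt"),
        min_user_chars=int(section.get("min_user_chars", 20)),
    )

# A config that sets only two keys, and a config file with no pre-filter block at all.
cfg = parse_pre_filter_config('{"retain_pre_filter": {"enabled": true, "model": "gpt-oss-120b"}}')
legacy = parse_pre_filter_config('{"auto_retain": true}')
print(cfg)
print(legacy)
```

A config file that never mentions `retain_pre_filter` parses to a disabled filter, which is what keeps the option backward compatible.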

How it works

User turn completes
       │
       ▼
┌──────────────────┐     no      ┌─────────────────┐
│ pre_filter       │────────────▶│ Normal retain   │
│ enabled?         │             │ (unchanged)     │
└──────────────────┘             └─────────────────┘
       │ yes
       ▼
┌──────────────────┐     yes     ┌─────────────────┐
│ User message <   │────────────▶│ Skip retain     │
│ min_user_chars?  │             │ (no cost)       │
└──────────────────┘             └─────────────────┘
       │ no
       ▼
┌────────────────────┐
│ LLM classification │
│ (cheap model)      │
│ ~500-2000 tokens   │
└────────────────────┘
       │
       ├── retain: false ──▶ Skip (no retain API call)
       │
       └── retain: true  ──▶ Send to Hindsight retain API
                              (extraction runs as normal)
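The decision flow can be sketched as a single function. This is an illustration under two assumptions: a disabled filter leaves retention untouched (as the backward-compatibility section states), and messages shorter than `min_user_chars` are skipped without calling the classifier. `decide` and `fake_classify` are hypothetical names; a real classifier would call the configured LLM.

```python
from typing import Callable

def decide(user_message: str, enabled: bool, min_user_chars: int,
           classify: Callable[[str], bool]) -> str:
    """Return "retain" or "skip" for one completed turn."""
    if not enabled:
        return "retain"   # filter disabled: behave exactly as before
    if len(user_message) < min_user_chars:
        return "skip"     # too short: no classifier call, no retain call
    return "retain" if classify(user_message) else "skip"

def fake_classify(text: str) -> bool:
    """Stand-in for the LLM call, for illustration only."""
    return "I prefer" in text

print(decide("ok", True, 20, fake_classify))
print(decide("ok", False, 20, fake_classify))
print(decide("I prefer tabs over spaces in Python.", True, 20, fake_classify))
```

Only the last branch costs tokens, so the cheap checks run first.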

Cost analysis

Using gpt-oss-120b (the model Hindsight already uses for extraction in its default config — aggressively cheap open-weight model):

| Metric | Value |
| --- | --- |
| Retain ops (14 days, single user) | 315 |
| Pre-filter input tokens (~1.5K avg per classification) | ~473K |
| Pre-filter cost (gpt-oss-120b) | ~$0.01 (self-hosted: $0) |
| Turns skipped (estimated 50-70%) | ~157-220 |
| Retain tokens saved (at 3,718 avg tokens/op) | ~584K-818K |
| Retain cost saved ($15/1M) | ~$8.76-$12.27 |
| Net savings per 14 days | ~$8.75-$12.26 |
| Projected annual savings | ~$228-$320 |

For self-hosted users running a local 20B model (e.g., via Ollama) for classification, the pre-filter cost is $0 and still catches an estimated 70-80% of garbage turns.
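The figures in the table can be reproduced with a few lines of arithmetic. This sketch takes its constants directly from the table; the function name `savings` and the choice to round skipped turns down to whole turns are assumptions made for illustration.

```python
RETAIN_OPS = 315              # retain ops over 14 days, single user
AVG_RETAIN_TOKENS = 3_718     # average tokens per retain op
RETAIN_PRICE_PER_M = 15.00    # USD per 1M retain tokens
PRE_FILTER_COST = 0.01        # USD for 14 days of classification (approximate)

def savings(skip_rate: float) -> dict:
    """Estimate savings for a given fraction of turns skipped by the pre-filter."""
    skipped = int(RETAIN_OPS * skip_rate)        # whole turns, rounded down
    tokens_saved = skipped * AVG_RETAIN_TOKENS
    gross = tokens_saved * RETAIN_PRICE_PER_M / 1_000_000
    net = gross - PRE_FILTER_COST
    return {
        "skipped": skipped,
        "tokens_saved": tokens_saved,
        "net_14_days": round(net, 2),
        "net_annual": round(net * 365 / 14),     # scale the 14-day window to a year
    }

low, high = savings(0.50), savings(0.70)
print(low)
print(high)
```

With skip rates of 50% and 70%, the net 14-day and annual values match the last two rows of the table. The estimate is sensitive mainly to the skip rate, which is itself an estimate in the proposal.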

Why this is better than retain-mission-only filtering

The retain mission (extraction instructions) is the last line of defense — it tells the extraction LLM what to keep. But:

- Tokens are already spent: the retain API call, serialization, and extraction LLM call all run before the retain mission has any effect
- Extraction quality degrades with noise: the extraction LLM has to process and discard garbage content, which can confuse entity resolution and fact extraction even for the good parts of the turn
- No way to handle pasted content: the retain mission can't distinguish "Chris told me this" from "Chris pasted this document for me to work on" when both arrive as plain text

The pre-filter acts as the first line of defense: it decides whether the turn is even worth sending. The retain mission then handles the nuance of what to extract from the turns that pass through. Both layers serve different purposes.

Key design decisions

1. Opt-in, not opt-out. All existing users see zero behavior change. The filter is disabled by default.

2. Configurable model. Users can use a cheap model for classification (DeepSeek, Qwen Flash, local Ollama) or default to their agent's current model. The prompt is also overridable for customization.

3. Async, non-blocking. The pre-filter classification should run asynchronously (like retain_async already does) so it doesn't add latency to the user's conversation. If the pre-filter is slow, the worst case is a brief delay before the retain batch is queued — the user's response is already delivered.

4. Logging. When a turn is skipped, log the classification result (retain: false, reason) at debug level so users can audit what's being filtered and tune the prompt if needed.

5. Graceful degradation. If the pre-filter model is unavailable or errors, fall through to normal retain (send everything). The filter is a cost optimization, not a gatekeeper — it should never cause data loss.
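Decisions 4 and 5 can be sketched together: parse the classifier's JSON reply, log skips at debug level, and fall through to retain on any failure. The names `parse_verdict`, `should_retain`, and the logger name are hypothetical. The async scheduling from decision 3 is left out to keep the example short.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("hindsight.pre_filter")  # hypothetical logger name

def parse_verdict(raw_response: str) -> bool:
    """Return True to retain, False to skip. Any malformed reply means retain."""
    try:
        verdict = json.loads(raw_response)
        retain = verdict["retain"]
        if not isinstance(retain, bool):
            raise ValueError("'retain' must be a JSON boolean")
    except (KeyError, TypeError, ValueError) as exc:  # JSONDecodeError is a ValueError
        logger.debug("pre-filter reply unusable (%s); falling through to retain", exc)
        return True
    if not retain:
        logger.debug("pre-filter skipped turn: %s", verdict.get("reason", ""))
    return retain

def should_retain(turn_text: str, classify: Callable[[str], str]) -> bool:
    """Call the classifier; if it raises for any reason, retain the turn."""
    try:
        raw = classify(turn_text)
    except Exception as exc:
        logger.debug("pre-filter model unavailable (%s); falling through to retain", exc)
        return True
    return parse_verdict(raw)

def unavailable(_text: str) -> str:
    """Simulates a classifier endpoint that is down."""
    raise TimeoutError("classifier endpoint timed out")

print(parse_verdict('{"retain": false, "reason": "tool output"}'))
print(should_retain("any turn", unavailable))
```

Every failure path returns `True`, so a broken or slow classifier can only cost tokens, never memories.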

Backward compatibility

Zero breaking change. retain_pre_filter.enabled defaults to false. When disabled, behavior is identical to current.

Alternatives considered

| Approach | Problem |
| --- | --- |
| Regex/heuristic filters | Source-based, can't distinguish user-expressed facts from user-pasted content |
| `retain_every_n_turns` | Batching, not filtering: same noise, just delayed |
| Retain mission rules only | Tokens already spent by the time extraction runs; can't handle pasted text |
| Disable `auto_retain`, manual only | Loses automatic retention convenience |
| Post-extraction cleanup | Doesn't save tokens: the expensive extraction already ran |

Environment

Hermes Agent: v2026.4.x
Hindsight plugin: latest
Config: ~/.hermes/hindsight/config.json

extent analysis

TL;DR

To reduce noise and token cost in Hindsight's Hermes plugin, enable the proposed retain_pre_filter option, which uses a lightweight LLM classification call to filter out irrelevant content before sending it to the retain API.

Guidance

- Once the proposed option is available, enable retain_pre_filter in ~/.hermes/hindsight/config.json by setting "enabled": true.
- Choose a model for the pre-filter, such as gpt-oss-120b, and adjust the prompt to customize the classification criteria.
- Set the min_user_chars threshold so that short user messages, which are unlikely to be worth retaining, skip both the pre-filter and retain.
- Review the debug-level skip log and adjust the prompt or threshold as needed to balance filtering effectiveness against token cost savings.

Example

{
  "retain_pre_filter": {
    "enabled": true,
    "model": "gpt-oss-120b",
    "prompt": "You are a memory gatekeeper. Given the conversation turn below, decide if it contains information worth retaining in long-term memory about the user.",
    "min_user_chars": 20
  }
}

Notes

The proposed solution is opt-in, introduces no breaking changes, and lets users customize the pre-filter model and prompt. Its effectiveness depends on the quality of the classification model and on the specific use case.

Recommendation

Adopt the proposed retain_pre_filter option, once implemented, to reduce noise and token cost in Hindsight's Hermes plugin. It is a feature request rather than an existing setting, and it offers a flexible, customizable way to filter out irrelevant content before the retain API is called.


#api #optimization #task chaining #parallel task #integration issue
