openclaw - 💡(How to fix) Fix Feature: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content [1 comments, 2 participants]

GitHub stats
openclaw/openclaw#80218 · Fetched 2026-05-11 03:17:28
Comments: 1 · Participants: 2 · Timeline: 2 (closed ×1, commented ×1) · Reactions: 2

Feature Request: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content

Problem Description

Currently, every time OpenClaw runs an agent turn, all bootstrap files are injected into the context window regardless of whether the current query is relevant:

  • MEMORY.md (~20 KB) — full content every turn
  • TOOLS.md (~20 KB) — full content every turn
  • AGENTS.md, SOUL.md, IDENTITY.md, USER.md (~5 KB combined) — all every turn

Even though the model can technically use memory_search / memory_get on demand, MEMORY.md is still included in the Project Context bootstrap by default. This means ~40–50 KB of fixed token overhead per turn, of which 70–80% is typically irrelevant to the current query.
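As a rough worked estimate of that overhead, the sketch below converts the file sizes listed above into tokens. The 4-bytes-per-token ratio is an assumption (a common rule of thumb for English text, not a figure from this issue); real tokenizers and non-English content will differ.

```javascript
// Rough per-turn overhead estimate for the bootstrap files listed above.
// ASSUMPTION: ~4 bytes per token (rule of thumb; real tokenizers vary).
const BYTES_PER_TOKEN = 4;

// Sizes in KB, taken from the approximate figures in the problem description.
const bootstrapKB = {
  "MEMORY.md": 20,
  "TOOLS.md": 20,
  "AGENTS/SOUL/IDENTITY/USER": 5,
};

function estimateOverhead(filesKB, irrelevantFraction) {
  const totalBytes = Object.values(filesKB).reduce((sum, kb) => sum + kb * 1024, 0);
  const totalTokens = Math.round(totalBytes / BYTES_PER_TOKEN);
  return {
    totalTokens,
    wastedTokens: Math.round(totalTokens * irrelevantFraction),
  };
}

// 0.7 is the lower bound of the "70 to 80% irrelevant" estimate quoted above.
console.log(estimateOverhead(bootstrapKB, 0.7));
```

Under these assumptions, 45 KB of bootstrap content is on the order of 11,500 tokens per turn, of which roughly 8,000 would be irrelevant to a typical query.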

Desired Behavior

At the context assembly stage (assemble()), add a relevance filtering step that:

  1. RAG-filters MEMORY.md: uses the current query to retrieve only the top-K relevant passages from long-term memory, rather than injecting the entire file
  2. Optionally segments and filters TOOLS.md: loads only the tool descriptions relevant to the current task domain
  3. Preserves static sections: keeps SOUL.md, AGENTS.md, and USER.md fully loaded (they are small and cacheable via KV prefix reuse)
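The three steps above can be sketched as a single filtering function. This is a minimal illustration, not OpenClaw code: the keyword-overlap score is a deliberately naive stand-in for real retrieval, and `FILTERED_FILES`, `overlapScore`, and `filterBootstrap` are hypothetical names.

```javascript
// ASSUMPTION: a naive keyword-overlap score stands in for real retrieval.
// Only these files are filtered; everything else is loaded in full.
const FILTERED_FILES = new Set(["MEMORY.md", "TOOLS.md"]);

function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
}

// Counts how many passage terms also appear in the query.
function overlapScore(query, passage) {
  const queryTerms = new Set(tokenize(query));
  return tokenize(passage).filter((term) => queryTerms.has(term)).length;
}

// files: { [fileName]: string[] }, each file already split into passages.
function filterBootstrap(query, files, topK = 3) {
  const result = {};
  for (const [name, passages] of Object.entries(files)) {
    if (!FILTERED_FILES.has(name)) {
      result[name] = passages; // static section: small and cacheable
      continue;
    }
    result[name] = passages
      .map((text) => ({ text, score: overlapScore(query, text) }))
      .filter((p) => p.score > 0) // drop passages with no match at all
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map((p) => p.text);
  }
  return result;
}
```

For example, given a `MEMORY.md` split into three passages and the query "how do I deploy to staging", only the passage mentioning deployment and staging would survive, while `SOUL.md` would pass through unchanged.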

Existing Infrastructure (Reusable)

1. Built-in Memory Search Engine

OpenClaw ships with a SQLite-based memory engine (vector + BM25 hybrid search) that already supports:

  • memory_search(query, corpus, topK) — exact same interface needed for bootstrap RAG
  • memory_get(corpus, path, from, lines) — for fetching specific segments
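The issue does not say how the engine combines its vector and BM25 rankings. One widely used, generic option is reciprocal rank fusion (RRF), sketched below purely to illustrate what "hybrid search" can mean. It is an assumption, not a description of OpenClaw's implementation, and the document IDs are made up.

```javascript
// Reciprocal rank fusion (RRF): a generic way to merge several ranked lists.
// ASSUMPTION: NOT confirmed to be how OpenClaw's memory engine merges
// BM25 and vector results; shown only to illustrate hybrid ranking.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

const bm25Ranking = ["doc-a", "doc-b", "doc-c"];   // hypothetical keyword ranking
const vectorRanking = ["doc-b", "doc-d", "doc-a"]; // hypothetical embedding ranking
console.log(reciprocalRankFusion([bm25Ranking, vectorRanking]));
```

Documents that rank well in both lists (`doc-b`, `doc-a`) rise above those that appear in only one, which is the property a bootstrap filter would want from a hybrid retriever.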

2. Context Engine Plugin API

The pluggable context engine API (plugins.slots.contextEngine) provides clean hooks at exactly the right lifecycle point:

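// Method of a context engine plugin object.
// staticSections, bootstrapHits, and recentMessages are assumed to be prepared elsewhere.
async assemble({ sessionId, messages, tokenBudget }) {
  const query = extractCurrentQuery(messages);
  const topK = 3; // number of memory passages to retrieve
  const memoryHits = await memorySearch(query, topK);
  const assembled = [...staticSections, ...bootstrapHits, ...memoryHits, ...recentMessages];
  return {
    messages: assembled,
    estimatedTokens: countTokens(assembled),
  };
}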

3. Bootstrap Truncation Controls

Existing mechanisms (bootstrapMaxChars 12K, bootstrapTotalMaxChars 60K) provide hard size caps but not semantic filtering.
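To make the distinction concrete, here is a minimal sketch of what a per-file cap plus a total cap does. The function is illustrative rather than OpenClaw's truncation code; only the 12K and 60K defaults come from the figures above.

```javascript
// ASSUMPTION: illustrative only; defaults mirror the 12K / 60K figures above.
function applyHardCaps(files, maxCharsPerFile = 12000, totalMaxChars = 60000) {
  let remaining = totalMaxChars;
  return files.map(({ name, content }) => {
    // Each file gets at most its own cap and whatever total budget is left.
    const limit = Math.max(0, Math.min(maxCharsPerFile, remaining));
    const kept = content.slice(0, limit); // keeps the head, relevant or not
    remaining -= kept.length;
    return { name, content: kept, truncated: kept.length < content.length };
  });
}

const capped = applyHardCaps([
  { name: "MEMORY.md", content: "x".repeat(20000) },
  { name: "TOOLS.md", content: "y".repeat(5000) },
]);
console.log(capped.map((f) => [f.name, f.content.length, f.truncated]));
```

The cap keeps the first 12,000 characters of a 20,000-character `MEMORY.md` regardless of which part matters for the query, which is why size limits alone cannot replace relevance filtering.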

4. KV Cache / Prompt Prefix Reuse

Static sections are already above the prompt cache boundary — reused at no marginal token cost per turn.
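A small sketch of why ordering matters here, assuming a prefix-based prompt cache that only reuses work when the leading content is byte-identical across turns (how a given provider or runtime actually implements this is not covered by the issue). The helper names are hypothetical.

```javascript
// ASSUMPTION: the cache reuses the longest byte-identical prompt prefix.
function buildPrompt(staticSections, dynamicSections) {
  // Static content first, so per-turn changes never disturb the prefix.
  return [...staticSections, ...dynamicSections].join("\n\n");
}

function sharedPrefixLength(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const staticSections = ["SOUL: be concise", "USER: prefers metric units"];
const turn1 = buildPrompt(staticSections, ["memory: deploy notes"]);
const turn2 = buildPrompt(staticSections, ["tools: git helpers"]);

// The shared prefix covers the whole static block even though the
// filtered (dynamic) part differs between the two turns.
console.log(sharedPrefixLength(turn1, turn2) >= staticSections.join("\n\n").length);
```

Placing query-dependent hits after the static block is what lets filtered context vary per turn without invalidating the reusable prefix.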

Proposed Implementation Path

Phase 1: Reference Smart Context Engine Plugin

Build @openclaw/smart-context-engine that wraps the legacy engine and adds MEMORY.md RAG filtering in assemble(), using the existing built-in memory search engine. No core changes required.
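A minimal sketch of the wrapping idea, under several assumptions: the legacy engine, `memorySearch`, and `countTokens` are injected stand-ins, and the `source` field used to spot the MEMORY.md message is hypothetical. The real plugin interface may be shaped differently.

```javascript
// ASSUMPTIONS: `legacyEngine`, `memorySearch`, `countTokens`, and the
// `source` tag on messages are illustrative, not the documented plugin API.
function createSmartContextEngine({ legacyEngine, memorySearch, countTokens, topK = 3 }) {
  return {
    async assemble({ sessionId, messages, tokenBudget }) {
      // Let the legacy engine do its normal assembly first.
      const base = await legacyEngine.assemble({ sessionId, messages, tokenBudget });
      const lastUser = [...messages].reverse().find((m) => m.role === "user");
      if (!lastUser) return base; // no query to filter against

      const hits = await memorySearch(lastUser.content, topK);
      const memoryMessages = hits.map((text) => ({
        role: "system",
        source: "memory-hit",
        content: text,
      }));

      // Swap the full MEMORY.md injection for the retrieved passages,
      // keeping them at the position where MEMORY.md used to sit.
      let replaced = false;
      const filtered = base.messages.flatMap((m) => {
        if (m.source !== "MEMORY.md") return [m];
        if (replaced) return [];
        replaced = true;
        return memoryMessages;
      });
      return { messages: filtered, estimatedTokens: countTokens(filtered) };
    },
  };
}
```

Because the wrapper only post-processes the legacy engine's output, it fits the "no core changes" constraint of Phase 1 and falls back to legacy behavior when there is no user query to search with.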

Phase 2: Core RAG Bootstrap Filtering

Segment MEMORY.md and TOOLS.md at meaningful boundaries (heading/paragraph for MEMORY, tool category for TOOLS). Retrieve top-K relevant segments per query at assemble() time. Can be implemented as a new built-in smart mode alongside legacy.
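Heading-level segmentation could look roughly like the sketch below. It is a simplified illustration with a hypothetical function name: it splits on ATX headings only (lines starting with `#`) and does not special-case headings that appear inside fenced code blocks.

```javascript
// Split a Markdown document into segments, one per heading.
// ASSUMPTION: ATX headings only; "#" lines inside code fences are not handled.
function segmentByHeadings(markdown) {
  const segments = [];
  let current = { heading: "(preamble)", lines: [] };
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      segments.push(current); // close the previous segment
      current = { heading: line.replace(/^#+\s*/, "").trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  segments.push(current);
  return segments
    .map(({ heading, lines }) => ({ heading, text: lines.join("\n").trim() }))
    .filter((segment) => segment.text.length > 0); // skip empty sections
}

const sample = "intro\n# Projects\nAlpha notes\n\n## Deploy\nUse staging first\n# Empty\n";
console.log(segmentByHeadings(sample).map((s) => s.heading));
```

Each resulting segment can then be indexed and retrieved on its own, so a query about deployment pulls in the "Deploy" section without the rest of the file.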

Phase 3: Local Small Model Intent Pre-filtering ⭐ User-Validated

Key insight: Leverage locally available compute (Apple Silicon M-series with 64–128 GB RAM) to run a lightweight local LLM (1–3B parameters, Q4 quantized) for intent classification — at zero API cost, with minimal latency, and no data leaving the machine.

| Option | Cost | Latency | Chinese Support | Privacy |
| --- | --- | --- | --- | --- |
| Local small model (Q4, Ollama) | Free (local RAM/GPU) | ~100–300ms on M5 Max | ✅ Best (Qwen2.5-1.5B) | ✅ Full |
| API small model (GPT-4o-mini) | ~$0.001/turn | ~200–500ms | ✅ Good | ❌ External API |
| No filtering (current) | Token cost | 0ms | | ⚠️ Full context uploaded |

Recommended local models:

| Model | Size | Memory | Best For |
| --- | --- | --- | --- |
| qwen2.5:1.5b (Q4) | ~1 GB | 1 GB | Chinese + English bilingual |
| llama3.2:1b (Q4) | ~700 MB | 700 MB | English-only, fastest |
| phi3:mini (Q4) | ~2 GB | 2 GB | English reasoning |

On Apple M5 Max with 128 GB RAM, running qwen2.5:1.5b Q4 via Ollama with MPS (Metal GPU) backend achieves ~50–100 tokens/second — validated by the user in production.

Phase 3 integration architecture:

User Query
  ↓
[Local LLM: qwen2.5:1.5b Q4 @ Ollama]
  ↓ Intent classification → needed_skills, needed_memory_domains
[Context Engine assemble()]
  → memory_search(domain=X, topK=3)
  → bootstrap_segment_load(skills=Y)
  ↓
Filtered context → Main LLM

Zero API cost · No data leaves machine · Sub-300ms latency
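The classification step in this flow might be sketched as follows. The HTTP call uses Ollama's `/api/generate` endpoint on its default local port with `format: "json"` and `stream: false`. The prompt wording, the allowed-label lists, and the null-on-failure fallback are assumptions made for illustration.

```javascript
// ASSUMPTIONS: prompt wording, label lists, and fallback policy are
// illustrative. Only the endpoint and request fields follow Ollama's API.
const OLLAMA_URL = "http://localhost:11434/api/generate";

function buildIntentPrompt(query, skills, domains) {
  return [
    "Classify the user query. Reply with JSON only, shaped as",
    '{"needed_skills": [...], "needed_memory_domains": [...]}.',
    `Allowed skills: ${skills.join(", ")}`,
    `Allowed memory domains: ${domains.join(", ")}`,
    `Query: ${query}`,
  ].join("\n");
}

// Parse defensively: small models can emit invalid JSON or unknown labels.
function parseIntent(raw, skills, domains) {
  try {
    const parsed = JSON.parse(raw);
    const pick = (value, allowed) =>
      Array.isArray(value) ? value.filter((v) => allowed.includes(v)) : [];
    return {
      needed_skills: pick(parsed.needed_skills, skills),
      needed_memory_domains: pick(parsed.needed_memory_domains, domains),
    };
  } catch {
    return null; // caller should fall back to unfiltered context
  }
}

// fetchImpl is injectable so the function can be exercised without Ollama.
async function classifyIntent(query, skills, domains, fetchImpl = fetch) {
  const response = await fetchImpl(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:1.5b",
      prompt: buildIntentPrompt(query, skills, domains),
      format: "json",
      stream: false,
    }),
  });
  if (!response.ok) return null;
  const { response: text } = await response.json();
  return parseIntent(text, skills, domains);
}
```

Returning `null` on any parse or transport failure gives `assemble()` a clear signal to fall back to the existing unfiltered behavior, so a misbehaving local model degrades cost savings rather than answer quality.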

Reference Research

| Work | Approach | Key Finding |
| --- | --- | --- |
| Anthropic Contextual Retrieval (2024) | BM25 + embedding dual-path retrieval | Compresses context to 4% while retaining 85% accuracy |
| LLM Janitor (2024) | Small model pre-filters context | Reduces context by ~60% with minimal degradation |
| Gorilla (Berkeley, 2024) | Dynamic API/tool routing via retriever | Reduces tool-use failures |
| Ollama + Apple Silicon MPS | Local LLM inference on M-series GPU | 1B models run at 50–100 tok/s on M5 Max |

Related Documentation

  • /concepts/context-engine — Context Engine Plugin API
  • /concepts/system-prompt — Bootstrap injection mechanism
  • /concepts/memory-builtin — Built-in memory search engine
  • /concepts/context — Context assembly overview

Tags

enhancement context-engine memory token-optimization local-llm ollama


Submitted via OpenClaw agent on behalf of a production user. Phase 3 prototype validated on Apple M5 Max (128 GB RAM). Willing to contribute a reference implementation.
