Feature Request: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content

Problem Description

Currently, every time OpenClaw runs an agent turn, all bootstrap files are injected into the context window regardless of whether the current query is relevant:

MEMORY.md (~20 KB) — full content every turn
TOOLS.md (~20 KB) — full content every turn
AGENTS.md, SOUL.md, IDENTITY.md, USER.md (~5 KB combined) — all every turn

Although the model can call memory_search / memory_get on demand, MEMORY.md is still included in the Project Context bootstrap by default. This adds ~40–50 KB of fixed context to every turn, of which an estimated 70–80% is irrelevant to the current query.

Desired Behavior

At the context assembly stage (assemble()), add a relevance filtering step that does the following:

RAG-filter MEMORY.md: Use the current query to retrieve only the top-K relevant passages from long-term memory, rather than injecting the entire file
Optionally segment and filter TOOLS.md: Load only tool descriptions relevant to the current task domain
Preserve static sections: Keep SOUL.md, AGENTS.md, USER.md fully loaded (they are small and cacheable via KV prefix reuse)
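
The filtering step can be sketched as follows. This is a minimal illustration, not OpenClaw's implementation: scorePassage and selectTopK are hypothetical names, the sample passages are made up, and a plain keyword-overlap score stands in for the real hybrid search.

```javascript
// Hypothetical helper: fraction of query terms that appear in the passage.
// A stand-in for real relevance scoring (vector + BM25), for illustration only.
// Note: splitting on \W+ handles ASCII words only, so it would not work for Chinese text.
function scorePassage(query, passage) {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const words = new Set(passage.toLowerCase().split(/\W+/).filter(Boolean));
  const matched = terms.filter((t) => words.has(t)).length;
  return matched / terms.length;
}

// Keep only the k best-scoring passages, dropping those with no overlap at all.
function selectTopK(query, passages, k = 3) {
  return passages
    .map((text) => ({ text, score: scorePassage(query, text) }))
    .filter((p) => p.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((p) => p.text);
}

// Made-up MEMORY.md passages.
const memory = [
  "Deploy notes: the staging server uses docker compose.",
  "User prefers concise answers in English.",
  "Docker registry credentials live in the team vault.",
  "Birthday reminders are sent on the first of each month.",
];

// Only the two docker-related passages are selected; the rest are left out.
console.log(selectTopK("how do I deploy with docker", memory, 2));
```

Passages that share no terms with the query are dropped rather than padded up to k, so an unrelated query injects nothing from memory.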

Existing Infrastructure (Reusable)

1. Built-in Memory Search Engine

OpenClaw ships with a SQLite-based memory engine (vector + BM25 hybrid search) that already supports:

memory_search(query, corpus, topK) — exact same interface needed for bootstrap RAG
memory_get(corpus, path, from, lines) — for fetching specific segments
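
To illustrate what "hybrid" retrieval means, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). How the built-in engine actually combines the two signals is not stated in this issue. RRF is a common technique used here only for illustration, and fuseRankings and the passage ids are hypothetical.

```javascript
// Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per document.
// k = 60 is the constant commonly used with RRF; it is an assumption here.
function fuseRankings(rankings, topK = 3, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => id);
}

// Made-up result lists, best match first.
const bm25Ranking = ["mem-12", "mem-03", "mem-44"];
const vectorRanking = ["mem-03", "mem-27", "mem-12"];

// Passages ranked well by both paths rise to the top.
console.log(fuseRankings([bm25Ranking, vectorRanking], 3));
```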

2. Context Engine Plugin API

The pluggable context engine API (plugins.slots.contextEngine) provides clean hooks at exactly the right lifecycle point:

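// Method on the context engine plugin object. extractCurrentQuery, memorySearch,
// countTokens, staticSections, bootstrapHits, and recentMessages are placeholders
// that are not defined here.
async assemble({ sessionId, messages, tokenBudget }) {
  const query = extractCurrentQuery(messages);
  const topK = 3; // number of memory passages to retrieve
  const memoryHits = await memorySearch(query, topK);
  const assembled = [...staticSections, ...bootstrapHits, ...memoryHits, ...recentMessages];
  return {
    messages: assembled,
    estimatedTokens: countTokens(assembled),
  };
}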

3. Bootstrap Truncation Controls

Existing mechanisms (bootstrapMaxChars 12K, bootstrapTotalMaxChars 60K) provide hard size caps but not semantic filtering.
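
The difference between a size cap and semantic filtering can be seen in a short sketch. The semantics assumed here (one cap per file, one cap on the running total) are inferred from the option names only, and applyCaps, the file names, and the file sizes are hypothetical.

```javascript
// Assumed semantics: maxChars caps each file, totalMaxChars caps the sum.
function applyCaps(files, maxChars = 12000, totalMaxChars = 60000) {
  let remaining = totalMaxChars;
  return files.map(({ name, content }) => {
    const allowed = Math.max(0, Math.min(maxChars, remaining));
    const kept = content.slice(0, allowed); // cut by position, not by relevance
    remaining -= kept.length;
    return { name, content: kept, truncated: kept.length < content.length };
  });
}

// Made-up files and sizes.
const files = [
  { name: "notes-a.md", content: "a".repeat(15000) },
  { name: "notes-b.md", content: "b".repeat(15000) },
  { name: "notes-c.md", content: "c".repeat(1000) },
];

for (const f of applyCaps(files)) {
  console.log(f.name, f.content.length, f.truncated);
}
```

A cap like this keeps whatever comes first in each file, so a relevant passage near the end is dropped while irrelevant text at the top survives.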

4. KV Cache / Prompt Prefix Reuse

Static sections already sit above the prompt cache boundary, so they are reused across turns at little or no marginal token cost.
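
Why ordering matters for prefix reuse can be shown with a small sketch: if static sections come first, two consecutive turns share at least that prefix byte for byte. buildPrompt and commonPrefixLength are hypothetical helpers and the section contents are made up.

```javascript
// Length of the shared leading run of two strings.
function commonPrefixLength(a, b) {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Static sections go first so the cacheable prefix is identical across turns.
function buildPrompt(staticSections, dynamicSections) {
  return [...staticSections, ...dynamicSections].join("\n\n");
}

const staticSections = ["SOUL: be helpful.", "AGENTS: one main agent."];
const turn1 = buildPrompt(staticSections, ["memory: docker notes", "user: deploy?"]);
const turn2 = buildPrompt(staticSections, ["memory: birthday list", "user: who is next?"]);

const staticLength = staticSections.join("\n\n").length;
// The shared prefix covers at least the whole static block.
console.log(commonPrefixLength(turn1, turn2) >= staticLength);
```

Filtered memory hits change from turn to turn, so placing them after the static block keeps the cacheable part stable.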

Proposed Implementation Path

Phase 1: Reference Smart Context Engine Plugin

Build @openclaw/smart-context-engine that wraps the legacy engine and adds MEMORY.md RAG filtering in assemble(), using the existing built-in memory search engine. No core changes required.
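
A rough sketch of the wrapping idea follows. The plugin registration API is not shown in this issue, so the sketch covers only the delegation logic. createSmartEngine, the message shape ({ role, source, content }), the stub engine, and the stub search function are all hypothetical.

```javascript
// Rough token estimate (about 4 characters per token for English text).
// This heuristic is an assumption; a real engine would use its tokenizer.
function estimateTokens(messages) {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Hypothetical wrapper: delegates to an existing engine, then swaps the full
// MEMORY.md message for the top-K passages returned by a search function.
function createSmartEngine(legacyEngine, memorySearch, topK = 3) {
  return {
    async assemble(params) {
      const base = await legacyEngine.assemble(params);
      // Use the most recent user message as the retrieval query.
      const lastUser = [...params.messages].reverse().find((m) => m.role === "user");
      if (!lastUser) return base; // nothing to search with: keep legacy output
      const hits = await memorySearch(lastUser.content, topK);
      const messages = base.messages.flatMap((m) =>
        m.source === "MEMORY.md"
          ? hits.map((content) => ({ role: "system", source: "memory-hit", content }))
          : [m]
      );
      return { ...base, messages, estimatedTokens: estimateTokens(messages) };
    },
  };
}

// Stubs standing in for the legacy engine and the built-in memory search.
const stubLegacyEngine = {
  async assemble() {
    const messages = [
      { role: "system", source: "SOUL.md", content: "Be kind." },
      { role: "system", source: "MEMORY.md", content: "x".repeat(20000) },
      { role: "user", source: "chat", content: "how do I deploy?" },
    ];
    return { messages, estimatedTokens: estimateTokens(messages) };
  },
};
const stubMemorySearch = async (query, k) =>
  ["Deploy uses docker compose.", "Staging lives on host A.", "Unrelated."].slice(0, k);

createSmartEngine(stubLegacyEngine, stubMemorySearch, 2)
  .assemble({ sessionId: "s1", messages: [{ role: "user", content: "how do I deploy?" }], tokenBudget: 8000 })
  .then((out) => console.log(out.messages.map((m) => m.source), out.estimatedTokens));
```

Because the wrapper only post-processes the legacy result, it can fall back to the unmodified output whenever there is no query to search with.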

Phase 2: Core RAG Bootstrap Filtering

Segment MEMORY.md and TOOLS.md at meaningful boundaries (heading/paragraph for MEMORY, tool category for TOOLS). Retrieve top-K relevant segments per query at assemble() time. Can be implemented as a new built-in smart mode alongside legacy.
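
Heading-based segmentation could look roughly like this. It is a sketch under simple assumptions: segmentByHeading is a hypothetical name, headings are ATX-style lines starting with one to six # characters, and fenced code inside the file is not handled (a # comment inside a fence would be mistaken for a heading).

```javascript
// Split markdown into segments, starting a new one at each heading line.
function segmentByHeading(markdown) {
  const segments = [];
  let current = null;
  for (const line of markdown.split("\n")) {
    const heading = /^#{1,6}\s+(.*)$/.exec(line);
    if (heading) {
      if (current) segments.push(current);
      current = { title: heading[1].trim(), lines: [] };
    } else {
      // Text before the first heading goes into a preamble segment.
      if (!current) current = { title: "(preamble)", lines: [] };
      current.lines.push(line);
    }
  }
  if (current) segments.push(current);
  return segments.map((s) => ({ title: s.title, body: s.lines.join("\n").trim() }));
}

// Made-up MEMORY.md content.
const sample = [
  "intro line",
  "## Deploy",
  "Use docker compose.",
  "",
  "## People",
  "Alice likes tea.",
].join("\n");

console.log(segmentByHeading(sample));
```

Each segment can then be indexed separately, so retrieval at assemble() time returns whole sections rather than arbitrary character windows.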

Phase 3: Local Small Model Intent Pre-filtering ⭐ User-Validated

Key insight: Leverage locally available compute (Apple Silicon M-series with 64–128 GB RAM) to run a lightweight local LLM (1–3B parameters, Q4 quantized) for intent classification — at zero API cost, with minimal latency, and no data leaving the machine.

Option	Cost	Latency	Chinese Support	Privacy
Local small model (Q4, Ollama)	Free (local RAM/GPU)	~100–300ms on M5 Max	✅ Best (Qwen2.5-1.5B)	✅ Full
API small model (GPT-4o-mini)	~$0.001/turn	~200–500ms	✅ Good	❌ External API
No filtering (current)	Token cost	0ms	✅	⚠️ Full context uploaded

Recommended local models:

Model	Size	Memory	Best For
qwen2.5:1.5b (Q4)	~1 GB	1 GB	Chinese + English bilingual
llama3.2:1b (Q4)	~700 MB	700 MB	English-only, fastest
phi3:mini (Q4)	~2 GB	2 GB	English reasoning

On an Apple M5 Max with 128 GB RAM, qwen2.5:1.5b (Q4) running via Ollama with Metal GPU acceleration reaches ~50–100 tokens/second, a figure the user reports having validated in production.

Phase 3 integration architecture:

User Query
    ↓
[Local LLM: qwen2.5:1.5b Q4 @ Ollama]
Intent classification → needed_skills, needed_memory_domains
    ↓
[Context Engine assemble()]
→ memory_search(domain=X, topK=3)
→ bootstrap_segment_load(skills=Y)
    ↓
Filtered context → Main LLM

Zero API cost · No data leaves machine · Sub-300ms latency
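
The classification step could be sketched as below, using Ollama's local HTTP endpoint (POST /api/generate with stream: false and format: "json"). The label sets, the prompt wording, and the function names are assumptions made for illustration. The fetch function is injectable so the logic can be exercised without a running Ollama instance.

```javascript
// Assumed label sets; a real deployment would derive these from its own files.
const SKILLS = ["coding", "calendar", "email"];
const DOMAINS = ["projects", "people", "preferences"];

// Hypothetical prompt asking the small model for a JSON object only.
function buildIntentPrompt(query) {
  return [
    "Classify the user query. Reply with JSON only, in the form",
    '{"needed_skills": [...], "needed_memory_domains": [...]}.',
    `Allowed skills: ${SKILLS.join(", ")}.`,
    `Allowed memory domains: ${DOMAINS.join(", ")}.`,
    `Query: ${query}`,
  ].join("\n");
}

// Parse the model reply, keeping only known labels.
// Malformed output falls back to empty lists instead of throwing.
function parseIntent(raw) {
  try {
    const parsed = JSON.parse(raw);
    const pick = (value, allowed) =>
      Array.isArray(value) ? value.filter((v) => allowed.includes(v)) : [];
    return {
      needed_skills: pick(parsed.needed_skills, SKILLS),
      needed_memory_domains: pick(parsed.needed_memory_domains, DOMAINS),
    };
  } catch {
    return { needed_skills: [], needed_memory_domains: [] };
  }
}

// Calls a local Ollama server on its default port (11434).
async function classifyIntent(query, fetchImpl = fetch) {
  const res = await fetchImpl("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:1.5b",
      prompt: buildIntentPrompt(query),
      stream: false, // return one JSON object instead of a stream
      format: "json", // ask Ollama to constrain the output to valid JSON
    }),
  });
  if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);
  const data = await res.json();
  return parseIntent(data.response);
}
```

When the classifier returns empty lists, whether from malformed output or an unrecognized query, one reasonable choice is to fall back to the unfiltered context rather than risk dropping something the main model needs.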

Reference Research

Work	Approach	Key Finding
Anthropic Contextual Retrieval (2024)	BM25 + embedding dual-path retrieval	Compresses context to 4% while retaining 85% accuracy
LLM Janitor (2024)	Small model pre-filters context	Reduces context by ~60% with minimal degradation
Gorilla (Berkeley, 2024)	Dynamic API/tool routing via retriever	Reduces tool-use failures
Ollama + Apple Silicon (Metal)	Local LLM inference on M-series GPU	1B models run at 50–100 tok/s on M5 Max
