openclaw - ✅(Solved) Fix Embedding context size is hardcoded — make local memorySearch.contextSize configurable [1 pull requests, 1 participants]

GitHub stats: openclaw/openclaw#69667, fetched 2026-04-22 07:49:36
Comments: 0 · Participants: 1 · Timeline: 2 (cross-referenced ×2) · Reactions: 0

When the local embeddings provider creates the node-llama-cpp EmbeddingContext, it now passes { contextSize: 4096 } (seen in dist/engine-embeddings-Bk3B82BS.js), but the value is hardcoded. For large embedding models (e.g. Qwen3-Embedding-8B, 36-layer decoder-only) the KV cache + compute buffers dominate the gateway's GPU footprint, and different deployments want different tradeoffs. contextSize should be configurable via openclaw.json.

Root Cause

For the same embedding model (Qwen3-Embedding-8B-Q8_0.gguf, 4096-dim) we observed roughly linear scaling of non-weight VRAM with context size:

contextSize                          Total gateway GPU footprint
4096 (current hardcoded default)     ~8.8 GB
32768 (node-llama-cpp "auto")        ~32 GB

Net difference: ~24 GB of VRAM between the two settings. On a single-GPU box where the gateway shares VRAM with a large LLM server (SGLang, vLLM, Ollama, etc.), "auto" can push the machine into OOM territory, while 4096 still leaves ample headroom for typical memory-search chunks (128–512 tokens).
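To get a rough feel for intermediate settings, the two measurements above can be interpolated. This is only a back-of-the-envelope sketch: it assumes the footprint is linear in contextSize (two data points cannot confirm that), and `estimateFootprintGb` is a hypothetical helper, not part of OpenClaw or node-llama-cpp.

```typescript
// Two measured points from the table above: contextSize and total footprint in GB.
const measured = {
  low: { ctx: 4096, gb: 8.8 },
  high: { ctx: 32768, gb: 32 },
};

// Linear interpolation/extrapolation through the two measurements.
// Assumption: footprint grows linearly with contextSize.
function estimateFootprintGb(contextSize: number): number {
  const { low, high } = measured;
  const gbPerToken = (high.gb - low.gb) / (high.ctx - low.ctx);
  return low.gb + gbPerToken * (contextSize - low.ctx);
}

for (const ctx of [1024, 2048, 8192, 16384]) {
  console.log(`${ctx} tokens -> ~${estimateFootprintGb(ctx).toFixed(1)} GB (estimate)`);
}
```

Treat the numbers as a planning aid only; actual usage depends on the model, quantization, batch size, and backend.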

Different users will want different values:

  • Memory-search / RAG ingestion on short chunks → 2048–4096 is plenty
  • Embedding long documents end-to-end → may want 8192+
  • Resource-constrained hosts → as low as 1024

4096 is a reasonable default, but hardcoding it removes the ability to tune.

Fix Action

Fixed

PR fix notes

PR #69680: fix(memory): use sqlite-vec KNN for searchVector (~190× speedup)

Description (problem / solution / changelog)

Summary

Replace searchVector's full-table-scan SQL with sqlite-vec's native KNN operator. Keeps vec_distance_cosine() in the SELECT so the returned score stays in the expected cosine [0, 1] range.

Fixes #69666.

Benchmark

Measured on a real 10,827-chunk workspace (4096-dim Qwen3-Embedding-8B):

Pattern                                                            Time per query
Before (vec_distance_cosine(...) AS dist + ORDER BY dist LIMIT)    ~8,490 ms
Naive KNN (v.distance AS dist + MATCH ? AND k)                     ~48 ms (but returns 0 results; see below)
After (this PR: vec_distance_cosine + MATCH ? AND k)               ~50 ms

~190× speedup, same result set.

Why the naive fix doesn't work

sqlite-vec creates chunks_vec with L2 distance by default, not cosine:

CREATE VIRTUAL TABLE chunks_vec USING vec0(id TEXT PRIMARY KEY, embedding FLOAT[4096])

So v.distance is the squared L2 distance, which can exceed 1. score = 1 - dist then goes negative for any non-trivial query, and the downstream minScore filter drops every result.
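A small numeric illustration of the sign flip, taking the PR's description (v.distance as squared L2) at face value. The vectors are made up and unit-length; for unit vectors squared L2 equals twice the cosine distance, so `1 - dist` goes negative as soon as cosine similarity drops below 0.5. Unnormalized embeddings can push it further.

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Cosine distance: 1 - cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Squared Euclidean (L2) distance.
function squaredL2(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

const queryVec = [1, 0];
const chunkVec = [0.28, 0.96]; // unit length, cosine similarity 0.28 with queryVec

const scoreFromCosine = 1 - cosineDistance(queryVec, chunkVec); // ≈ 0.28
const scoreFromL2 = 1 - squaredL2(queryVec, chunkVec);          // ≈ -0.44

console.log({ scoreFromCosine, scoreFromL2 });
```

With a non-negative minScore threshold, the second score would be filtered out even though the chunk is a legitimate (if weak) match.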

The correct fix uses MATCH ? AND k = ? only for candidate selection (this is where the speedup lives — sqlite-vec's vec0 index walks the shards), and keeps vec_distance_cosine() in the SELECT for the score, matching the existing semantics.

Implementation notes

  • The query vector is bound twice now: once for vec_distance_cosine(v.embedding, ?) and once for MATCH ?.
  • LIMIT ? is removed; AND k = ? caps the KNN candidate pool to the same count.
  • ORDER BY dist ASC still sorts by cosine distance — sqlite-vec's KNN ordering (L2) is only used for candidate pruning; final ordering is unchanged.
  • No change to the fallback path (listChunks(...).map(cosineSimilarity)) when sqlite-vec isn't available.
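The notes above suggest a query of roughly the following shape. This is a sketch reconstructed from the description, not the code in manager-search.ts: `buildKnnQuery` is a hypothetical helper, the real query may join other tables or select more columns, and only the chunks_vec table with its id and embedding columns comes from the PR text.

```typescript
interface KnnQuery {
  sql: string;
  // Bind order: cosine-score vector, MATCH vector, k.
  params: [Uint8Array, Uint8Array, number];
}

function buildKnnQuery(queryVector: Float32Array, k: number): KnnQuery {
  // sqlite-vec accepts a vector as a BLOB of packed float32 values.
  const blob = new Uint8Array(
    queryVector.buffer,
    queryVector.byteOffset,
    queryVector.byteLength,
  );
  const sql = [
    "SELECT v.id, vec_distance_cosine(v.embedding, ?) AS dist", // score stays cosine
    "FROM chunks_vec v",
    "WHERE v.embedding MATCH ? AND k = ?", // KNN candidate selection, replaces LIMIT
    "ORDER BY dist ASC", // final ordering by cosine distance
  ].join("\n");
  return { sql, params: [blob, blob, k] };
}

const example = buildKnnQuery(new Float32Array([0.1, 0.2, 0.3]), 10);
console.log(example.sql);
```

The same blob is passed twice because the statement has two separate placeholders for the query vector.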

Testing

  • Local gateway running against a 10,827-chunk store returns identical top-K ids to the previous implementation for all test queries (spot-checked across semantic, keyword-heavy, and low-overlap queries).
  • Search latency dropped from 8-30s (observed with multiple concurrent tool calls queuing) to ~2s end-to-end; the remaining ~2s is merge/MMR/decay, not the vector SQL (separate optimization opportunity, out of scope for this PR).

Related

  • Filed #69667 (configurable contextSize for local embedding provider) in the same debug session. Independent change; will send a follow-up PR.

Alternative considered

Creating chunks_vec with distance_metric=cosine at schema time would let us use v.distance directly. That's a cleaner long-term shape but requires a migration for existing installs, so I opted for the source-compatible SELECT-side cosine which needs zero schema change and no reindex.

Changed files

  • extensions/memory-core/src/memory/manager-search.ts (modified, +13/-5)

RAW_BUFFER

Bug type

Configuration / resource usage

Current code location

dist/engine-embeddings-Bk3B82BS.js, inside the local embedding provider's ensureContext() closure:

if (!embeddingContext)
    embeddingContext = await embeddingModel.createEmbeddingContext({ contextSize: 4096 });

There's no path to override this from config.

Proposed fix

Expose contextSize (and ideally a few related tunables) under the existing local-provider config block. Something like:

// openclaw.json
{
  "agents": {
    "defaults": {
      "memorySearch": {
        "provider": "local",
        "local": {
          "contextSize": 4096,          // new: optional, defaults to current 4096
          "gpuLayers": "max",           // new: optional, defaults to node-llama-cpp's auto
          "batchSize": 512              // new: optional, passthrough
        }
      }
    }
  }
}

Implementation sketch:

// resolve from config, fall back to 4096
const contextSize = cfg?.local?.contextSize ?? 4096;
const gpuLayers   = cfg?.local?.gpuLayers;   // undefined -> node-llama-cpp default
const batchSize   = cfg?.local?.batchSize;

if (!embeddingModel) {
    embeddingModel = await llama.loadModel({
        modelPath: resolved,
        ...(gpuLayers !== undefined ? { gpuLayers } : {}),
    });
}
if (!embeddingContext) {
    embeddingContext = await embeddingModel.createEmbeddingContext({
        contextSize,
        ...(batchSize !== undefined ? { batchSize } : {}),
    });
}

Bounds-check contextSize against model.trainContextSize; log a warning and clamp if higher.
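One possible shape for that bounds check, as a sketch. `clampContextSize` is a hypothetical helper; in the provider it would presumably be called with the loaded model's trainContextSize, and the numbers in the usage line are illustrative only.

```typescript
// Clamp a requested context size to the model's training context size.
// Throws on values that cannot be a valid context size at all.
function clampContextSize(
  requested: number,
  trainContextSize: number,
  warn: (msg: string) => void = (msg) => console.warn(msg),
): number {
  if (!Number.isInteger(requested) || requested < 1) {
    throw new RangeError(`contextSize must be a positive integer, got ${requested}`);
  }
  if (requested > trainContextSize) {
    warn(
      `contextSize ${requested} exceeds the model's train context size ` +
        `${trainContextSize}; clamping to ${trainContextSize}`,
    );
    return trainContextSize;
  }
  return requested;
}

// Illustrative values: the request is above the limit, so it warns and clamps.
const effectiveContextSize = clampContextSize(65536, 32768);
console.log(effectiveContextSize);
```

Clamping rather than throwing keeps an over-ambitious config from taking the gateway down at startup, at the cost of silently running with a smaller context unless the warning is noticed.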

Environment

  • OpenClaw: 2026.4.15 (041266a)
  • node-llama-cpp: 3.18.1
  • Host: Linux 6.17 aarch64 / NVIDIA GB10 (128 GB unified memory)
  • Model: hf:Qwen/Qwen3-Embedding-8B-GGUF/Qwen3-Embedding-8B-Q8_0.gguf

How I'd like to help

Happy to open a PR: add the config field to the zod schema under memorySearch.local, wire it through to engine-embeddings, and include a short note in the config docs. Keeping the current 4096 default means no behavior change for existing installs.


Related: I'm filing a separate issue for the searchVector full-table-scan SQL pattern (unrelated but discovered in the same debug session).

extent analysis

TL;DR

Make contextSize configurable via openclaw.json so each deployment can choose its own tradeoff between VRAM use and context length.

Guidance

  • Expose contextSize under the existing local-provider config block in openclaw.json with a default value of 4096.
  • Update the ensureContext() closure to resolve contextSize from the config and fall back to 4096 if not specified.
  • Bounds-check contextSize against model.trainContextSize; log a warning and clamp if it is higher.
  • Consider adding other related tunables like gpuLayers and batchSize to the config.

Example

// openclaw.json
{
  "agents": {
    "defaults": {
      "memorySearch": {
        "provider": "local",
        "local": {
          "contextSize": 4096,
          "gpuLayers": "max",
          "batchSize": 512
        }
      }
    }
  }
}

Notes

The proposed fix requires updating the engine-embeddings code to read the contextSize from the config and use it when creating the EmbeddingContext. Additionally, the zod schema under memorySearch.local needs to be updated to include the new config field.
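As an illustration of the validation such a schema would perform, here is a dependency-free sketch. The project reportedly uses zod, so the real change would be a schema edit rather than this code. `parseLocalOptions` and the type names are hypothetical, and the accepted gpuLayers values (a non-negative integer, "auto", or "max") are a simplifying assumption.

```typescript
type GpuLayers = number | "auto" | "max";

interface LocalEmbeddingOptions {
  contextSize: number;
  gpuLayers?: GpuLayers;
  batchSize?: number;
}

const DEFAULT_CONTEXT_SIZE = 4096; // keeps current behavior when the field is absent

function optionalPositiveInt(value: unknown, name: string): number | undefined {
  if (value === undefined) return undefined;
  if (typeof value === "number" && Number.isInteger(value) && value > 0) return value;
  throw new TypeError(`${name} must be a positive integer`);
}

function optionalGpuLayers(value: unknown): GpuLayers | undefined {
  if (value === undefined) return undefined;
  if (value === "auto" || value === "max") return value;
  if (typeof value === "number" && Number.isInteger(value) && value >= 0) return value;
  throw new TypeError('gpuLayers must be a non-negative integer, "auto", or "max"');
}

// Validate the memorySearch.local block and apply the default contextSize.
function parseLocalOptions(raw: unknown): LocalEmbeddingOptions {
  const cfg = (raw ?? {}) as Record<string, unknown>;
  const gpuLayers = optionalGpuLayers(cfg.gpuLayers);
  const batchSize = optionalPositiveInt(cfg.batchSize, "batchSize");
  return {
    contextSize: optionalPositiveInt(cfg.contextSize, "contextSize") ?? DEFAULT_CONTEXT_SIZE,
    ...(gpuLayers !== undefined ? { gpuLayers } : {}),
    ...(batchSize !== undefined ? { batchSize } : {}),
  };
}

console.log(parseLocalOptions({ contextSize: 2048, gpuLayers: "max" }));
```

Omitted optional fields are left out of the result so the downstream calls can fall through to node-llama-cpp's own defaults.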

Recommendation

Apply the proposed fix to make contextSize configurable, allowing for more flexibility in deployments and preventing potential OOM issues.
