openclaw - ✅(Solved) Fix Embedding context size is hardcoded — make local memorySearch.contextSize configurable [1 pull requests, 1 participants]

aalekh-sarvam · 2026-04-21T09:55:59Z



openclaw/openclaw#69667•Fetched 2026-04-22 07:49:36


When the local embeddings provider creates the node-llama-cpp EmbeddingContext, it now passes { contextSize: 4096 } (seen in dist/engine-embeddings-Bk3B82BS.js), but the value is hardcoded. For large embedding models (e.g. Qwen3-Embedding-8B, 36-layer decoder-only) the KV cache + compute buffers dominate the gateway's GPU footprint, and different deployments want different tradeoffs. contextSize should be configurable via openclaw.json.

Root Cause

For the same embedding model (Qwen3-Embedding-8B-Q8_0.gguf, 4096-dim) we observed roughly linear scaling of non-weight VRAM with context size:

| `contextSize` | Total gateway GPU footprint |
|---|---|
| 4096 (current hardcoded default) | ~8.8 GB |
| 32768 (node-llama-cpp `"auto"`) | ~32 GB |

The two settings differ by roughly 23 to 24 GB of VRAM. On a single-GPU box where the gateway shares VRAM with a large LLM (SGLang, vLLM, Ollama, etc.), "auto" can push the machine into OOM territory, while 4096 tokens still leaves ample room for typical memory-search chunks (128–512 tokens).

Different users will want different values:

Memory-search / RAG ingestion on short chunks → 2048–4096 is plenty
Embedding long documents end-to-end → may want 8192+
Resource-constrained hosts → as low as 1024

Hardcoding 4096 is a reasonable default but removes the ability to tune.
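
To make the tradeoff above concrete, here is a minimal sketch that draws a straight line through the two measurements in the table. The linear fit is an assumption based on the "roughly linear" observation, and `estimateFootprintGb` is a hypothetical helper, not part of OpenClaw or node-llama-cpp. Values outside the measured 4096 to 32768 range are extrapolations and should be treated with extra caution.

```typescript
// The two approximate measurements quoted in the table above (GB).
const LOW = { contextSize: 4096, footprintGb: 8.8 };
const HIGH = { contextSize: 32768, footprintGb: 32 };

// Straight-line fit through the two points. Linearity is an assumption
// taken from the "roughly linear" observation, not a measured model.
function estimateFootprintGb(contextSize: number): number {
  const slope =
    (HIGH.footprintGb - LOW.footprintGb) /
    (HIGH.contextSize - LOW.contextSize);
  return LOW.footprintGb + slope * (contextSize - LOW.contextSize);
}

// Print rough estimates for a few candidate settings.
for (const size of [4096, 8192, 16384, 32768]) {
  console.log(size, estimateFootprintGb(size).toFixed(1) + " GB (estimate)");
}
```

Under this fit, doubling the context from 4096 to 8192 adds a little over 3 GB, which is the kind of budget question a configurable `contextSize` lets each deployment answer for itself.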

Fix Action

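Not yet fixed

Linked PR: fix(memory): use sqlite-vec KNN for searchVector (~190× speedup) (https://github.com/openclaw/openclaw/pull/69680). That PR fixes #69666 (the searchVector full-table scan) and only cross-references this issue; it is still open and unmerged, and its author plans a separate follow-up PR for contextSize.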

PR fix notes

PR #69680: fix(memory): use sqlite-vec KNN for searchVector (~190× speedup)

Repository: openclaw/openclaw
Author: aalekh-sarvam
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/69680

Description (problem / solution / changelog)

Summary

Replace searchVector's full-table-scan SQL with sqlite-vec's native KNN operator. Keeps vec_distance_cosine() in the SELECT so the returned score stays in the expected cosine [0, 1] range.

Fixes #69666.

Benchmark

Measured on a real 10,827-chunk workspace (4096-dim Qwen3-Embedding-8B):

| Pattern | Time per query |
|---|---|
| Before (`vec_distance_cosine(...) AS dist` + `ORDER BY dist LIMIT`) | ~8,490 ms |
| Naive KNN (`v.distance AS dist` + `MATCH ? AND k`) | ~48 ms (but returns 0 results, see below) |
| After (this PR: `vec_distance_cosine` + `MATCH ? AND k`) | ~50 ms |

~190× speedup, same result set.

Why the naive fix doesn't work

sqlite-vec creates chunks_vec with L2 distance by default, not cosine:

CREATE VIRTUAL TABLE chunks_vec USING vec0(id TEXT PRIMARY KEY, embedding FLOAT[4096])

So v.distance is the squared L2 distance, which can exceed 1. score = 1 - dist then goes negative for any non-trivial query, and the downstream minScore filter drops every result.

The correct fix uses MATCH ? AND k = ? only for candidate selection (this is where the speedup lives — sqlite-vec's vec0 index walks the shards), and keeps vec_distance_cosine() in the SELECT for the score, matching the existing semantics.
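
The negative scores described above are easy to reproduce by hand. The sketch below is purely illustrative: the vectors are made up, the helper names are hypothetical, and it takes the PR's description of `v.distance` as squared L2 at face value. It shows two vectors that are very similar by cosine yet produce a negative `1 - dist` when `dist` is an L2-style distance.

```typescript
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Cosine distance: 1 - cos(angle between a and b).
function cosineDistance(a: number[], b: number[]): number {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Squared Euclidean (L2) distance, unbounded above.
function squaredL2(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Made-up, unnormalized vectors pointing in nearly the same direction.
const query = [3, 4];
const chunk = [4, 3];

const cosineScore = 1 - cosineDistance(query, chunk); // close to 0.96
const l2Score = 1 - squaredL2(query, chunk);          // 1 - 2 = -1

console.log({ cosineScore, l2Score });
```

Any `minScore` threshold at or above zero would keep this chunk under the cosine score and drop it under the L2-based one, which matches the "0 results" row in the benchmark.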

Implementation notes

The query vector is bound twice now: once for vec_distance_cosine(v.embedding, ?) and once for MATCH ?.
LIMIT ? is removed; AND k = ? caps the KNN candidate pool to the same count.
ORDER BY dist ASC still sorts by cosine distance — sqlite-vec's KNN ordering (L2) is only used for candidate pruning; final ordering is unchanged.
No change to the fallback path (listChunks(...).map(cosineSimilarity)) when sqlite-vec isn't available.
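
The notes above can be sketched as a small query builder. The actual SQL in `manager-search.ts` is not shown in the PR description, so the statement below is a simplified guess at its shape: it uses only the `chunks_vec` table and its `id` and `embedding` columns from the schema quoted earlier, and `buildKnnQuery` and `toFloat32Blob` are hypothetical names. The real query presumably joins against chunk metadata as well.

```typescript
interface KnnQuery {
  sql: string;
  params: Array<Uint8Array | number>;
}

// Pack the query embedding as raw float32 bytes (native byte order,
// little-endian on common platforms), a vector input format that
// sqlite-vec accepts.
function toFloat32Blob(vec: number[]): Uint8Array {
  const f32 = new Float32Array(vec);
  return new Uint8Array(f32.buffer, f32.byteOffset, f32.byteLength);
}

function buildKnnQuery(queryVec: number[], limit: number): KnnQuery {
  const blob = toFloat32Blob(queryVec);
  const sql = [
    // Score: cosine distance, computed only for the selected candidates.
    "SELECT v.id, vec_distance_cosine(v.embedding, ?) AS dist",
    "FROM chunks_vec v",
    // Selection: the KNN operator picks the k nearest candidates.
    "WHERE v.embedding MATCH ? AND k = ?",
    // Final ordering stays on cosine distance; no LIMIT clause.
    "ORDER BY dist ASC",
  ].join("\n");
  // The query vector is bound twice, then k.
  return { sql, params: [blob, blob, limit] };
}

const knn = buildKnnQuery([0.1, 0.2, 0.3], 10);
console.log(knn.sql);
```

The parameter order matters: the first placeholder feeds `vec_distance_cosine`, the second feeds `MATCH`, and the third is `k`.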

Testing

Local gateway running against a 10,827-chunk store returns identical top-K ids to the previous implementation for all test queries (spot-checked across semantic, keyword-heavy, and low-overlap queries).
Search latency dropped from 8-30s (observed with multiple concurrent tool calls queuing) to ~2s end-to-end; the remaining ~2s is merge/MMR/decay, not the vector SQL (separate optimization opportunity, out of scope for this PR).

Related

Filed #69667 (configurable contextSize for local embedding provider) in the same debug session. Independent change; will send a follow-up PR.

Alternative considered

Creating chunks_vec with distance_metric=cosine at schema time would let us use v.distance directly. That's a cleaner long-term shape but requires a migration for existing installs, so I opted for the source-compatible SELECT-side cosine which needs zero schema change and no reindex.

Changed files

extensions/memory-core/src/memory/manager-search.ts (modified, +13/-5)


Bug type

Configuration / resource usage

Current code location

dist/engine-embeddings-Bk3B82BS.js, inside the local embedding provider's ensureContext() closure:

if (!embeddingContext)
    embeddingContext = await embeddingModel.createEmbeddingContext({ contextSize: 4096 });

There's no path to override this from config.

Proposed fix

Expose contextSize (and ideally a few related tunables) under the existing local-provider config block. Something like:

// openclaw.json
{
  "agents": {
    "defaults": {
      "memorySearch": {
        "provider": "local",
        "local": {
          "contextSize": 4096,          // new: optional, defaults to current 4096
          "gpuLayers": "max",           // new: optional, defaults to node-llama-cpp's auto
          "batchSize": 512              // new: optional, passthrough
        }
      }
    }
  }
}

Implementation sketch:

// resolve from config, fall back to 4096
const contextSize = cfg?.local?.contextSize ?? 4096;
const gpuLayers   = cfg?.local?.gpuLayers;   // undefined -> node-llama-cpp default
const batchSize   = cfg?.local?.batchSize;

if (!embeddingModel) {
    embeddingModel = await llama.loadModel({
        modelPath: resolved,
        ...(gpuLayers !== undefined ? { gpuLayers } : {}),
    });
}
if (!embeddingContext) {
    embeddingContext = await embeddingModel.createEmbeddingContext({
        contextSize,
        ...(batchSize !== undefined ? { batchSize } : {}),
    });
}

Bounds-check contextSize against model.trainContextSize; log a warning and clamp if higher.
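
A minimal sketch of that bounds check, assuming the caller passes in the model's `trainContextSize` as a plain number. `clampContextSize` and its return shape are hypothetical; the issue does not specify how the warning should be surfaced, so this version returns the message and leaves logging to the caller.

```typescript
interface ClampResult {
  contextSize: number;
  warning?: string;
}

// Clamp a requested context size to the model's training context size.
// Rejects values that are not positive integers instead of guessing.
function clampContextSize(
  requested: number,
  trainContextSize: number
): ClampResult {
  if (!(requested > 0) || Math.floor(requested) !== requested) {
    throw new RangeError(
      "contextSize must be a positive integer, got " + requested
    );
  }
  if (requested > trainContextSize) {
    return {
      contextSize: trainContextSize,
      warning:
        "contextSize " + requested +
        " exceeds the model's trainContextSize " + trainContextSize +
        "; clamping",
    };
  }
  return { contextSize: requested };
}

// Usage: log the warning, if any, and pass the clamped value on.
const clamped = clampContextSize(65536, 32768);
if (clamped.warning) console.warn(clamped.warning);
```

Returning the warning rather than logging inside the helper keeps it free of any particular logger, which may or may not suit the gateway's conventions.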

Related

node-llama-cpp LlamaContextOptions: https://node-llama-cpp.withcat.ai/api/type-aliases/LlamaContextOptions
node-llama-cpp issue #435 (Limit default context size in the node template): same underlying concern (auto-sizing surprises on big models)

Environment

OpenClaw: 2026.4.15 (041266a)
node-llama-cpp: 3.18.1
Host: Linux 6.17 aarch64 / NVIDIA GB10 (128 GB unified memory)
Model: hf:Qwen/Qwen3-Embedding-8B-GGUF/Qwen3-Embedding-8B-Q8_0.gguf

How I'd like to help

Happy to open a PR: add the config field to the zod schema under memorySearch.local, wire it through to engine-embeddings, and include a short note in the config docs. Keeping the current 4096 default means no behavior change for existing installs.

Related: I'm filing a separate issue for the searchVector full-table-scan SQL pattern (unrelated but discovered in the same debug session).

extent analysis

TL;DR

Make the contextSize configurable via openclaw.json to allow for different tradeoffs in deployments.

Guidance

Expose contextSize under the existing local-provider config block in openclaw.json with a default value of 4096.
Update the ensureContext() closure to resolve contextSize from the config and fall back to 4096 if not specified.
Bounds-check contextSize against model.trainContextSize; log a warning and clamp if it's higher.
Consider adding other related tunables like gpuLayers and batchSize to the config.

Example

// openclaw.json
{
  "agents": {
    "defaults": {
      "memorySearch": {
        "provider": "local",
        "local": {
          "contextSize": 4096,
          "gpuLayers": "max",
          "batchSize": 512
        }
      }
    }
  }
}

Notes

The proposed fix requires updating the engine-embeddings code to read the contextSize from the config and use it when creating the EmbeddingContext. Additionally, the zod schema under memorySearch.local needs to be updated to include the new config field.
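
As a rough illustration of what that schema change needs to enforce, the sketch below validates the `local` block in plain TypeScript rather than zod, so it runs without dependencies. Everything here is an assumption: `parseLocalOptions` is a hypothetical name, the accepted `gpuLayers` values are simplified to a non-negative integer, `"auto"`, or `"max"`, and the real schema may differ.

```typescript
interface LocalEmbeddingOptions {
  contextSize: number;
  gpuLayers?: number | "auto" | "max";
  batchSize?: number;
}

const DEFAULT_CONTEXT_SIZE = 4096; // keeps current behavior when unset

function isPositiveInt(v: unknown): v is number {
  return typeof v === "number" && isFinite(v) && v > 0 && Math.floor(v) === v;
}

// Validate the raw `memorySearch.local` object and apply the default.
function parseLocalOptions(raw: unknown): LocalEmbeddingOptions {
  const cfg = (typeof raw === "object" && raw !== null ? raw : {}) as
    Record<string, unknown>;
  const out: LocalEmbeddingOptions = { contextSize: DEFAULT_CONTEXT_SIZE };

  const contextSize = cfg["contextSize"];
  if (contextSize !== undefined) {
    if (!isPositiveInt(contextSize)) {
      throw new TypeError("local.contextSize must be a positive integer");
    }
    out.contextSize = contextSize;
  }

  const gpuLayers = cfg["gpuLayers"];
  if (gpuLayers !== undefined) {
    if (typeof gpuLayers === "number" && isFinite(gpuLayers) &&
        gpuLayers >= 0 && Math.floor(gpuLayers) === gpuLayers) {
      out.gpuLayers = gpuLayers;
    } else if (typeof gpuLayers === "string" &&
        (gpuLayers === "auto" || gpuLayers === "max")) {
      out.gpuLayers = gpuLayers;
    } else {
      throw new TypeError('local.gpuLayers must be an integer, "auto" or "max"');
    }
  }

  const batchSize = cfg["batchSize"];
  if (batchSize !== undefined) {
    if (!isPositiveInt(batchSize)) {
      throw new TypeError("local.batchSize must be a positive integer");
    }
    out.batchSize = batchSize;
  }
  return out;
}
```

With no `local` block at all, `parseLocalOptions(undefined)` yields `{ contextSize: 4096 }`, which is the "no behavior change for existing installs" property the issue asks for.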

Recommendation

Apply the proposed fix to make contextSize configurable, allowing for more flexibility in deployments and preventing potential OOM issues.
