openclaw - 💡(How to fix) Fix [Bug]: Memory indexer leaves files with Dirty: no after transient /v1/embeddings 500 — vector store silently partial [1 participants]

openclaw, 2026-04-23 10:43:19

GitHub issue: openclaw/openclaw#70567 (fetched 2026-04-24 05:56:19)

Author: chestercs
Participants: chestercs

A single transient 500 Internal Server Error on /v1/embeddings during memory indexing left 8 of 10 workspace files unindexed while openclaw memory status reported Dirty: no, hiding the partial state until openclaw memory index --force recovered full coverage.

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

1. Run OpenClaw 2026.4.15 with memorySearch.remote.baseUrl pointing at a vLLM /v1/embeddings endpoint serving BAAI/bge-m3 (live config excerpt attached below).
2. Keep 10 markdown files in ~/.openclaw/workspace/memory/ that normally chunk to 34 chunks (confirmed post-recovery count).
3. Observe a single transient 500 on one /v1/embeddings call while indexing is in progress. In this environment the vLLM embedder died with torch.AcceleratorError: CUDA error: operation not permitted (cudaErrorNotPermitted) → EngineDeadError and was auto-restarted by Docker (restart: unless-stopped, recovered in ~10 s). Exactly one POST /v1/embeddings 500 Internal Server Error appears in the embedder log between boot and the post-restart 200 OKs (log excerpt below).
4. After the embedder is healthy again, run openclaw memory status.

Expected behavior

After a /v1/embeddings call failed during indexing, the files whose chunks could not be embedded should either (a) leave the indexer in a non-zero Dirty state, so the next incremental openclaw memory index re-processes them, or (b) be surfaced in memory status as missing. Dirty: no while the vector store does not yet cover all source files is the state that misleads operators.
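Option (a) can be sketched as follows. This is a minimal illustration, not OpenClaw's code: `EmbeddingError`, `IndexState`, `index_files` and `status` are hypothetical names, and it assumes one embed call per file. The idea is to record a file as indexed only after every one of its chunks has been embedded, and to derive `Dirty` from what is missing rather than from whether the run finished.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


class EmbeddingError(Exception):
    """Hypothetical: raised by `embed` when /v1/embeddings returns a non-2xx status."""


@dataclass
class IndexState:
    # path -> content hash whose chunks are ALL present in the vector store
    indexed: Dict[str, str] = field(default_factory=dict)
    # paths whose most recent attempt left at least one chunk unembedded
    dirty: Set[str] = field(default_factory=set)


def index_files(
    files: Dict[str, str],
    chunk: Callable[[str], List[str]],
    embed: Callable[[List[str]], List[List[float]]],
    hash_fn: Callable[[str], str],
    state: IndexState,
) -> IndexState:
    for path, text in files.items():
        digest = hash_fn(text)
        if state.indexed.get(path) == digest and path not in state.dirty:
            continue  # unchanged and fully embedded: an incremental run can skip it
        chunks = chunk(text)
        try:
            vectors = embed(chunks)
        except EmbeddingError:
            state.dirty.add(path)  # leave it dirty so the next incremental run retries
            continue
        if len(vectors) != len(chunks):
            state.dirty.add(path)  # a partial response counts as a failure too
            continue
        # ... write `vectors` to the vector store here ...
        state.indexed[path] = digest  # only now is the file considered indexed
        state.dirty.discard(path)
    return state


def status(files: Dict[str, str], hash_fn: Callable[[str], str], state: IndexState) -> Tuple[int, int, bool]:
    """Return (indexed_files, total_files, dirty), derived from coverage."""
    fresh = sum(
        1 for path, text in files.items()
        if state.indexed.get(path) == hash_fn(text) and path not in state.dirty
    )
    return fresh, len(files), fresh < len(files)
```

With this bookkeeping a single 500 leaves the affected file dirty, so `status` reports a gap and the next incremental run re-embeds only that file.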

Actual behavior

openclaw memory status returned:

Indexed: 2/10 files · 14 chunks
Dirty: no
By source:
  memory · 2/10 files · 14 chunks
Vector: ready
Vector dims: 1024

No indication that 8 of 10 files were unindexed. A subsequent incremental openclaw memory index was a no-op (consistent with Dirty: no). Only openclaw memory index --force recovered full coverage (Indexed: 10/10 files · 34 chunks, all files present in the workspace directory).
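Until the indexer flags this itself, an operator-side check can compare the indexed and total file counts instead of trusting `Dirty`. A minimal sketch, assuming the plain-text layout shown above stays stable. It is human-oriented output, not a documented machine-readable interface, so treat the regexes and the `MemoryStatus`/`parse_memory_status`/`coverage_gap` names as hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class MemoryStatus:
    indexed_files: int
    total_files: int
    chunks: int
    dirty: bool


# Matches the summary line, e.g. "Indexed: 2/10 files · 14 chunks".
_INDEXED = re.compile(r"^\s*Indexed:\s*(\d+)/(\d+)\s+files\D*?(\d+)\s+chunks", re.MULTILINE)
_DIRTY = re.compile(r"^\s*Dirty:\s*(\w+)", re.MULTILINE)


def parse_memory_status(text: str) -> MemoryStatus:
    indexed = _INDEXED.search(text)
    dirty = _DIRTY.search(text)
    if indexed is None or dirty is None:
        raise ValueError("unrecognised `openclaw memory status` output")
    files_done, files_total, chunks = (int(g) for g in indexed.groups())
    return MemoryStatus(files_done, files_total, chunks, dirty.group(1).lower() != "no")


def coverage_gap(status: MemoryStatus) -> bool:
    """True when some source files are not indexed, regardless of the Dirty flag."""
    return status.indexed_files < status.total_files
```

For the status output above, `coverage_gap` returns True even though `dirty` parses as False.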

OpenClaw version

2026.4.15 (image ghcr.io/openclaw/openclaw:latest)

Operating system

Ubuntu-based DietPi on NVIDIA DGX Spark / ASUS Ascent (GB10, aarch64)

Install method

docker

Model

Embedding: BAAI/bge-m3 served by vLLM v0.18.2rc1.dev73+gdb7a17ecc; LLM: nvidia/Gemma-4-31B-IT-NVFP4 (irrelevant to this bug, listed for completeness).

Provider / routing chain

openclaw -> http://vllm-embedding:8005/v1 (vLLM, bge-m3)

Additional provider/model setup details

agents.defaults.memorySearch from the live openclaw.json (redacted apiKey):

{
  "enabled": true,
  "provider": "openai",
  "model": "BAAI/bge-m3",
  "remote": {
    "baseUrl": "http://vllm-embedding:8005/v1/"
  },
  "query": {
    "hybrid": {
      "enabled": true,
      "vectorWeight": 0.7,
      "textWeight": 0.3,
      "candidateMultiplier": 3,
      "mmr": { "enabled": true, "lambda": 0.7 }
    }
  }
}
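Before re-running the indexer after an embedder restart, it can help to confirm that the endpoint in `remote.baseUrl` answers with 1024-dim vectors, matching the `Vector dims: 1024` reported above. A sketch against the OpenAI-compatible embeddings API that vLLM serves. The function names are hypothetical, and the optional bearer key is an assumption for deployments that configure one.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import urljoin


def embeddings_url(base_url: str) -> str:
    # Treat "http://host:8005/v1" and "http://host:8005/v1/" the same way.
    return urljoin(base_url.rstrip("/") + "/", "embeddings")


def check_embeddings_payload(payload: dict, n_inputs: int, expected_dims: int) -> None:
    """Raise ValueError unless the response holds n_inputs vectors of expected_dims."""
    data = payload.get("data")
    if not isinstance(data, list) or len(data) != n_inputs:
        raise ValueError(f"expected {n_inputs} embeddings in 'data'")
    for item in data:
        vec = item.get("embedding") if isinstance(item, dict) else None
        if not isinstance(vec, list) or len(vec) != expected_dims:
            raise ValueError(f"expected {expected_dims}-dim embedding")


def probe(base_url: str, model: str, expected_dims: int = 1024,
          api_key: Optional[str] = None, timeout: float = 10.0) -> None:
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "input": ["healthcheck"]}).encode("utf-8")
    req = urllib.request.Request(embeddings_url(base_url), data=body, headers=headers, method="POST")
    # urlopen raises urllib.error.HTTPError on a 5xx such as the 500 in this issue.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        check_embeddings_payload(json.load(resp), n_inputs=1, expected_dims=expected_dims)
```

For example, `probe("http://vllm-embedding:8005/v1/", "BAAI/bge-m3")` from a container on the same Docker network either returns quietly or raises.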

Stack definition, including the idempotent patcher that writes this config deterministically on every docker compose up: https://github.com/chestercs/dgx-openclaw-stack

Logs, screenshots, and evidence

vLLM embedder log — the single 500 sandwiched between initial boot and the Docker-driven restart:

[04-22 22:23:25] Starting vLLM server on http://0.0.0.0:8005
[04-23 10:09:32] ERROR (core.py:1110) EngineCore encountered a fatal error.
[04-23 10:09:32] ERROR torch.AcceleratorError: CUDA error: operation not permitted
[04-23 10:09:32] ERROR vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
172.26.0.8:56888 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
[04-23 10:09:39] Shutting down
[04-23 10:09:53] Starting vLLM server on http://0.0.0.0:8005
172.26.0.8:52636 - "POST /v1/embeddings HTTP/1.1" 200 OK
172.26.0.8:53496 - "POST /v1/embeddings HTTP/1.1" 200 OK
... subsequent requests all 200 OK ...
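To spot this kind of event without reading the log by hand, the access lines can be filtered for 5xx responses to POST /v1/embeddings. A small sketch, assuming the line format in the excerpt above. The helper name is hypothetical.

```python
import re
from typing import Iterable, List

# Access line as in the excerpt, e.g.
# 172.26.0.8:56888 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
_EMBED_5XX = re.compile(r'"POST /v1/embeddings HTTP/[\d.]+" 5\d\d\b')


def embedding_5xx_lines(lines: Iterable[str]) -> List[str]:
    """Return the access-log lines where POST /v1/embeddings answered with a 5xx status."""
    return [line.rstrip("\n") for line in lines if _EMBED_5XX.search(line)]
```

The input can be the output of `docker logs vllm-embedding 2>&1` split into lines. The access lines carry no timestamp, so a match only says that a 5xx happened, not which indexing run it hit; any match is a reason to check `memory status` coverage.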

Docker restart tracking:

$ docker inspect vllm-embedding --format "RestartCount: {{.RestartCount}} | Health: {{.State.Health.Status}}"
RestartCount: 1 | Health: healthy

Before / after openclaw memory status:

Before --force:
  Indexed: 2/10 files · 14 chunks
  Dirty: no

After --force:
  Indexed: 10/10 files · 34 chunks
  Dirty: no

Impact and severity

Affected: any OpenClaw deployment whose memorySearch embedding backend can transiently 5xx (crash + auto-restart, network blip, upstream provider hiccup).
Severity: data correctness of memory_search until an operator notices the indexed-count mismatch by hand; the agent loses semantic recall on the unindexed chunks silently, while hybrid search still has FTS coverage so the failure is not total.
Frequency: observed once in ~13 h uptime on GB10 in this environment; any instance of this class of transient failure appears to produce the silent gap.
Consequence: operators cannot trust Dirty: no as a health signal after any embedding-provider hiccup. Current workaround is to run openclaw memory index --force after any embedder restart.

Additional information

Related issue: #26772 describes an adjacent but distinct reliability gap: an indexer process crash loses all progress because of the atomic temp-file swap. That issue is about the indexer itself crashing; this one is about the indexer completing without flagging files whose per-chunk embed step returned a 5xx.

extent analysis

TL;DR

Mitigate by running openclaw memory index --force after any embedder restart. An incremental openclaw memory index does not help here, because the indexer reports Dirty: no and skips the missing files.

Guidance

Check the embedder log for POST /v1/embeddings requests that returned 500 Internal Server Error; a 5xx during indexing can leave files missing from the vector store.
Check the openclaw memory status output and compare the indexed file count with the total (for example 2/10 files); a gap means files are missing even when Dirty reports no.
Run openclaw memory index --force after any embedder restart to recover full coverage.
Consider a periodic check that runs openclaw memory index --force when memory status shows fewer indexed files than the total, rather than relying on Dirty: no, which stays no in this failure mode.
Review the related issue #26772, which covers an adjacent indexer reliability gap.
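The periodic check could look like the following watchdog pass. It is a sketch under several assumptions: `openclaw` is on PATH where this runs (in a compose stack it may live inside a container, so the command prefix may need adjusting), the plain-text status layout shown earlier is stable, and scheduling (cron, systemd timer) and persisting the last restart count are left out. Only `docker inspect --format`, `openclaw memory status` and `openclaw memory index --force` come from the issue; the helper names are hypothetical.

```python
import re
import subprocess
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], str]


def run(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    return subprocess.run(list(cmd), check=True, capture_output=True, text=True).stdout


def restart_count(container: str, runner: Runner = run) -> int:
    out = runner(["docker", "inspect", container, "--format", "{{.RestartCount}}"])
    return int(out.strip())


def index_gap(runner: Runner = run) -> bool:
    """True when `openclaw memory status` shows fewer indexed files than the total."""
    out = runner(["openclaw", "memory", "status"])
    m = re.search(r"Indexed:\s*(\d+)/(\d+)\s+files", out)
    if m is None:
        raise ValueError("no 'Indexed: X/Y files' line in memory status output")
    return int(m.group(1)) < int(m.group(2))


def check_once(container: str, last_restarts: Optional[int], runner: Runner = run) -> int:
    """One watchdog pass. Returns the current restart count to pass in next time."""
    restarts = restart_count(container, runner)
    restarted = last_restarts is not None and restarts > last_restarts
    # Force a full reindex after an embedder restart, or whenever coverage is partial.
    if restarted or index_gap(runner):
        runner(["openclaw", "memory", "index", "--force"])
    return restarts
```

Triggering on either signal covers both cases in the guidance: a restart that may have interrupted indexing, and a coverage gap that `Dirty: no` hides. For a workspace of 10 files a forced reindex is cheap; larger workspaces may prefer to act only on the coverage gap.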

Example

No code snippet is provided: the issue lies in the interaction between OpenClaw and the vLLM embedder, and the workaround is a single command that recovers from the error.

Notes

The --force reindex is a workaround. A permanent fix likely requires changes in OpenClaw so that transient embedder errors are handled, for example by retrying 5xx responses or by keeping files that failed to embed marked dirty, as the expected behavior describes. How often this occurs and how much it matters will vary with the environment and usage.
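Retrying transient failures inside the indexer would have absorbed the single 500 seen here. A sketch of jittered exponential backoff, assuming a hypothetical `TransientError` that an embed client raises for HTTP 5xx or refused connections; it is not OpenClaw's API. Retries complement option (a) rather than replace it: when the budget runs out, the caller must still leave the file dirty. Size the budget to the embedder's restart time. In the log above the fatal error (10:09:32) and the server starting again (10:09:53) are about 21 s apart, before the model finishes loading. With the defaults below the five sleeps add up to between 15.5 s and 31 s (jitter scales each 1, 2, 4, 8, 16 s step by 0.5 to 1.0), so raise `attempts` or `base_delay` if restarts take longer.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical: what an embed client would raise for HTTP 5xx or a refused connection."""


def with_retries(
    call: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call()`, retrying TransientError with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: the caller must keep this file dirty
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting.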

Recommendation

Apply the workaround by running openclaw memory index --force after any embedder restart, as this ensures that all files are properly indexed and prevents silent gaps in indexing.
