openclaw - ✅(Solved) Fix memorySearch: embedding reindex fails with 'TypeError: fetch failed' after indexing ~40K chunks [1 pull requests, 1 participants]

dionysoslin615 · 2026-03-29T05:14:08Z

Memory search embedding reindex consistently fails with `TypeError: fetch failed` after successfully indexing a significant number of chunks (~41K out of estimated ~45K). The `.tmp` file is deleted on failure (runSafeReindex rollback), so all progress is lost and the next attempt starts from scratch, creating an infinite failure loop.

# PR #56879: fix(memory): retry transient embedding transport failures

- Repository: openclaw/openclaw
- Author: GaosCode
- State: open | merged: False
- Link: https://github.com/openclaw/openclaw/pull/56879

## Description (problem / solution / changelog)

## Summary

- Problem: memory indexing/search treated transient embedding transport failures like `fetch failed`, `ECONNRESET`, and `ENOTFOUND` as terminal errors.
- Why it matters: a single network blip could abort memory indexing, while logs often only showed a flattened `fetch failed` message.
- What changed: memory embedding retries now recognize transient transport failures, remote POST errors preserve prefix/cause detail, and async memory sync warnings use `formatUncaughtError(...)`.
- Scope boundary: this does not add checkpoint/resume for atomic reindexing or change rollback semantics.

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor required for the fix
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Memory / storage

## Linked Issue/PR

- Related #56815
- [x] This PR fixes a bug or regression

## Root Cause / Regression History (if applicable)

- Root cause: memory retry logic only matched rate-limit / 5xx-style messages, so transient transport-layer failures were not retried.
- Missing detection / guardrail: the memory path did not inspect nested `cause`/`errno`/undici-style transient network errors, and transport failures from `postJson` were not wrapped with provider context.
- Prior context (`git blame`, prior PR, issue, or refactor if known): the memory engine moved behind the plugin/host split, but the retry heuristic remained message-string based.
- Why this regressed now: remote embedding failures can surface as transport exceptions instead of only HTTP status errors.

## Regression Test Plan (if applicable)

- Coverage level that should have caught this:
  - [x] Unit test
  - [x] Seam / integration test
- Target test or file:
  - `packages/memory-host-sdk/src/host/post-json.test.ts`
  - `packages/memory-host-sdk/src/host/embeddings-remote-fetch.test.ts`
  - `extensions/memory-core/src/memory/manager.embedding-batches.test.ts`
- Scenario the test should lock in: retry a transient `TypeError("fetch failed")` with `cause.code = "ECONNRESET"` and preserve wrapped error detail.

## User-visible / Behavior Changes

- Memory embeddings now retry on transient transport/network failures such as `fetch failed`, `ECONNRESET`, `ENOTFOUND`, and undici timeout/socket errors.
- Memory sync warnings for `session-start` and `search` now include stack/cause detail instead of flattening to `String(err)`.

## Security Impact (required)

- New permissions/capabilities? (`No`)
- Secrets/tokens handling changed? (`No`)
- New/changed network calls? (`No`)
- Command/tool execution surface changed? (`No`)
- Data access scope changed? (`No`)

## Repro + Verification

### Environment

- OS: macOS
- Runtime/container: local Node 22 + pnpm
- Model/provider: `openai` memory embeddings against a local fake `/v1/embeddings` endpoint
- Relevant config (redacted):
  - `agents.defaults.memorySearch.provider = "openai"`
  - `agents.defaults.memorySearch.model = "text-embedding-3-small"`
  - `agents.defaults.memorySearch.remote.baseUrl = "http://127.0.0.1:8765/v1"`
  - `agents.defaults.memorySearch.remote.apiKey = "dummy-key"`

### Steps

1. Start a fake embeddings server that drops the first `/v1/embeddings` request and succeeds on the second.
2. Run `openclaw memory index --force --verbose`.
3. Run `openclaw memory search "retry embeddings" --json`.

### Expected

- The first transient transport failure is retried automatically.
- Memory indexing completes successfully.
- Memory search returns the indexed `MEMORY.md` content.

### Actual

- Observed `memory embeddings transient failure; retrying in 555ms`.
- Fake server received `request #1` and `request #2`.
- `memory index --force --verbose` completed successfully.
- `memory search "retry embeddings" --json` returned the expected `MEMORY.md` result.

## Evidence

- [x] Failing test/log before + passing after
- [x] Trace/log snippets

## Human Verification (required)

- Verified scenarios:
  - Manual smoke with a fake embeddings server that intentionally drops the first request and succeeds on the second.
  - `memory index --force --verbose` completed after the retry.
  - `memory search --json` returned the indexed `MEMORY.md` snippet.
- Edge cases checked:
  - Existing retry behavior for rate-limit / 5xx errors s
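The root cause described in the PR (a retry heuristic that matched only message strings and ignored nested `cause` codes) can be illustrated with a minimal sketch. The names `TRANSIENT_CODES`, `isTransientTransportError`, and `withTransientRetry` are hypothetical and the list of codes is an assumption; this is not the code from PR #56879, only one way such a check might look.

```javascript
// Error codes treated as transient in this sketch. The list is an assumption
// for illustration, not the set used by the PR.
const TRANSIENT_CODES = new Set([
  "ECONNRESET",
  "ENOTFOUND",
  "ETIMEDOUT",
  "EAI_AGAIN",
  "UND_ERR_SOCKET",
  "UND_ERR_CONNECT_TIMEOUT",
]);

// Walk err -> err.cause -> ... and report whether any link looks transient.
function isTransientTransportError(err, maxDepth = 5) {
  let current = err;
  for (let depth = 0; current && depth < maxDepth; depth++) {
    if (typeof current.code === "string" && TRANSIENT_CODES.has(current.code)) {
      return true;
    }
    if (current instanceof TypeError && current.message === "fetch failed") {
      return true;
    }
    current = current.cause;
  }
  return false;
}

// Retry fn() on transient failures; rethrow anything else immediately.
async function withTransientRetry(fn, { attempts = 3, delayMs = 500 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts || !isTransientTransportError(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Checking the `cause` chain rather than only the top-level message means a wrapped error (for example a provider-prefixed POST failure) is still recognized as retryable.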

openclaw/openclaw#56815•Fetched 2026-04-08 01:47:32

Memory search embedding reindex consistently fails with TypeError: fetch failed after successfully indexing a significant number of chunks (~41K out of estimated ~45K). The .tmp file is deleted on failure (runSafeReindex rollback), so all progress is lost and the next attempt starts from scratch — creating an infinite failure loop.

Description

Environment

OpenClaw version: v2026.3.24 (cff6dc9)
Node.js: v25.8.2
OS: Linux 6.8.0-88-generic (x64), 123GB RAM
Embedding provider: SiliconFlow API (Pro/BAAI/bge-m3, 1024d, OpenAI-compatible endpoint at https://api.siliconflow.cn/v1/)
Files: 272 .md files (~187MB) under workspace memory/ directory
Config: memorySearch.remote.batch.concurrency: 2, default retry settings (3 attempts, 500ms/8000ms backoff)
Control case: the main agent with the same SiliconFlow config successfully indexed 4 chunks, so the issue appears specific to large-scale reindexing
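The "3 attempts, 500ms/8000ms backoff" setting above can be read as exponential backoff from a 500ms base with an 8000ms cap. The logged delays (`retrying in 530ms`) would be consistent with a small random jitter on top of the base, but the actual formula OpenClaw uses is not shown in this issue. The sketch below is therefore an assumption, and `backoffDelayMs` and its options are hypothetical names.

```javascript
// Delay before retry number `attempt` (1-based): exponential growth from
// baseMs with up to `jitter` (a fraction) of random extra, capped at maxMs.
// The formula is an assumption; OpenClaw's real computation may differ.
function backoffDelayMs(
  attempt,
  { baseMs = 500, maxMs = 8000, jitter = 0.2, random = Math.random } = {}
) {
  const exponential = baseMs * 2 ** (attempt - 1);
  const withJitter = Math.round(exponential * (1 + jitter * random()));
  return Math.min(maxMs, withJitter);
}

// With three attempts there are at most two waits (before attempts 2 and 3).
for (let attempt = 1; attempt <= 2; attempt++) {
  console.log(`wait before retry ${attempt}: ${backoffDelayMs(attempt)}ms`);
}
```

With these defaults the total wait across a full retry cycle is under two seconds, so any network interruption lasting longer than that would exhaust the retries and abort the run.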

Reproduction Steps

Configure an agent with memorySearch.enabled: true
Place ~270 large .md files (100KB-1MB each) in the workspace memory/ directory
Use a remote embedding provider (SiliconFlow, OpenAI-compatible)
Trigger memory_search which initiates runSafeReindex
Observe: tmp file grows to ~2GB, ~41K chunks indexed
After ~1 hour: memory sync failed: TypeError: fetch failed
tmp is deleted, sqlite remains empty → next trigger restarts from scratch

Error Log

{"subsystem":"memory","level":"warn","msg":"memory embeddings rate limited; retrying in 530ms"}  // once during indexing
{"subsystem":"memory","level":"warn","msg":"memory sync failed (session-start): TypeError: fetch failed"}
{"subsystem":"memory","level":"warn","msg":"memory sync failed (search): TypeError: fetch failed"}

No stack trace is included — TypeError: fetch failed is logged without the underlying cause (DNS, timeout, connection reset, etc.).
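In Node.js, the `TypeError: fetch failed` thrown by the built-in `fetch` normally carries the underlying error in its `cause` property, which is lost when the error is logged via `String(err)`. A minimal sketch of a formatter that keeps that detail follows. `describeErrorChain` is a hypothetical helper written for illustration and is not OpenClaw's `formatUncaughtError`.

```javascript
// Flatten an error and its nested causes into one log-friendly string.
// Hypothetical helper for illustration; not part of OpenClaw.
function describeErrorChain(err, maxDepth = 5) {
  const parts = [];
  let current = err;
  for (let depth = 0; current != null && depth < maxDepth; depth++) {
    if (current instanceof Error) {
      const code = typeof current.code === "string" ? ` [${current.code}]` : "";
      parts.push(`${current.name}: ${current.message}${code}`);
      current = current.cause;
    } else {
      // Non-Error causes (strings, plain objects) end the chain.
      parts.push(String(current));
      break;
    }
  }
  return parts.join(" <- caused by: ");
}

// Example with a simulated connection reset.
const cause = Object.assign(new Error("read ECONNRESET"), { code: "ECONNRESET" });
const failure = new TypeError("fetch failed", { cause });
console.log(describeErrorChain(failure));
// prints "TypeError: fetch failed <- caused by: Error: read ECONNRESET [ECONNRESET]"
```

A log line in this shape would distinguish a DNS failure (`ENOTFOUND`) from a reset or timeout, which is the information the report says is missing.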

Observations

The embedding API itself is stable — manual test with 10 concurrent requests to SiliconFlow: 0 failures, ~300-400ms each
Not a 429/rate-limit issue — only one rate-limit warning in the entire run
Not an OOM issue — 123GB RAM, no swap pressure
Not concurrency-dependent — fails with both concurrency=2 and concurrency=4
Not specific to this provider — same failure pattern occurred with Alibaba DashScope (text-embedding-v4) before switching to SiliconFlow
Progress loss is the critical issue — runSafeReindex deletes the .tmp on any failure, meaning ~1 hour of API calls is wasted every time
No stack trace makes it impossible to determine if the root cause is: undici connection pool reuse of dead connections, TLS session timeout, DNS resolution failure, or something else

Suggested Improvements

Include full stack trace in the TypeError: fetch failed log so the root cause can be identified
Partial progress preservation — instead of deleting .tmp on failure, consider checkpointing or resuming from the last successful batch
Connection health checks — validate embedding API connectivity before starting a long reindex, or periodically during the process
Graceful degradation — if one batch fails, skip it and continue instead of aborting the entire reindex

extent analysis

Fix Plan

The following steps are proposed to address the memory search embedding reindex failing with TypeError: fetch failed. They go beyond PR #56879, which retries transient transport failures but explicitly does not add checkpoint/resume or change rollback semantics:

Checkpointing: Instead of deleting the .tmp file on failure, record progress so the reindex can resume from the last successful batch.
Connection Health Checks: Validate embedding API connectivity before starting a long reindex and periodically during the process.
Graceful Degradation: If one batch fails, skip it and continue instead of aborting the entire reindex.

Example Code

Here's an example of how you can implement checkpointing and connection health checks in your code:

// Import required modules
const fs = require('fs');
const axios = require('axios');

// Define constants
const CHECKPOINT_FILE = 'checkpoint.json';
const EMBEDDING_API_URL = 'https://api.siliconflow.cn/v1/';
const HEALTH_CHECK_INTERVAL = 1000; // batches between connectivity checks

// Validate embedding API connectivity. Any HTTP response (even 401 or 404)
// proves the transport works; only a network-level error counts as failure.
async function validateApiConnectivity() {
    try {
        await axios.get(EMBEDDING_API_URL, {
            timeout: 5000,
            validateStatus: () => true,
        });
        return true;
    } catch (error) {
        console.error('Error validating API connectivity:', error);
        return false;
    }
}

// Record the index of the next batch to process
function checkpointProgress(nextBatch) {
    try {
        fs.writeFileSync(CHECKPOINT_FILE, JSON.stringify({ nextBatch }));
    } catch (error) {
        console.error('Error checkpointing progress:', error);
    }
}

// Return the batch index to resume from, or 0 if no valid checkpoint exists
function resumeFromCheckpoint() {
    try {
        const checkpointData = JSON.parse(fs.readFileSync(CHECKPOINT_FILE, 'utf8'));
        return Number.isInteger(checkpointData.nextBatch) ? checkpointData.nextBatch : 0;
    } catch (error) {
        // A missing file just means this is the first run
        if (error.code !== 'ENOENT') {
            console.error('Error resuming from checkpoint:', error);
        }
        return 0;
    }
}

// Main reindex function. processBatch(i) is a placeholder for the caller's
// own code that embeds and stores batch i; totalBatches is the batch count.
async function reindex(processBatch, totalBatches) {
    // Validate API connectivity before starting
    if (!(await validateApiConnectivity())) {
        console.error('API connectivity failed. Aborting reindex.');
        return;
    }

    // Resume from last checkpoint
    const startBatch = resumeFromCheckpoint();
    const failedBatches = [];

    // Loop through batches
    for (let i = startBatch; i < totalBatches; i++) {
        // Validate API connectivity periodically. Stop rather than skip, so
        // the checkpoint still points at the first unprocessed batch.
        if (i > startBatch && i % HEALTH_CHECK_INTERVAL === 0) {
            if (!(await validateApiConnectivity())) {
                console.error('API connectivity failed during reindex. Stopping.');
                break;
            }
        }

        try {
            // Process batch
            await processBatch(i);
        } catch (error) {
            console.error(`Error processing batch ${i}:`, error);
            // Skip failed batch and continue; remember it for a later retry
            failedBatches.push(i);
        }

        // Checkpoint progress
        checkpointProgress(i + 1);
    }

    if (failedBatches.length > 0) {
        console.error(`Skipped ${failedBatches.length} failed batches:`, failedBatches);
    }
}
