openclaw - ✅(Solved) Fix memorySearch: embedding reindex fails with 'TypeError: fetch failed' after indexing ~40K chunks [1 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#56815Fetched 2026-04-08 01:47:32
View on GitHub
Comments
0
Participants
1
Timeline
2
Reactions
0
Participants
Timeline (top)
cross-referenced ×2

Memory search embedding reindex consistently fails with TypeError: fetch failed after successfully indexing a significant number of chunks (~41K out of estimated ~45K). The .tmp file is deleted on failure (runSafeReindex rollback), so all progress is lost and the next attempt starts from scratch — creating an infinite failure loop.

Error Message

{"subsystem":"memory","level":"warn","msg":"memory embeddings rate limited; retrying in 530ms"} // once during indexing {"subsystem":"memory","level":"warn","msg":"memory sync failed (session-start): TypeError: fetch failed"} {"subsystem":"memory","level":"warn","msg":"memory sync failed (search): TypeError: fetch failed"}

Root Cause

  1. The embedding API itself is stable — manual test with 10 concurrent requests to SiliconFlow: 0 failures, ~300-400ms each
  2. Not a 429/rate-limit issue — only one rate-limit warning in the entire run
  3. Not an OOM issue — 123GB RAM, no swap pressure
  4. Not concurrency-dependent — fails with both concurrency=2 and concurrency=4
  5. Not specific to this provider — same failure pattern occurred with Alibaba DashScope (text-embedding-v4) before switching to SiliconFlow
  6. Progress loss is the critical issuerunSafeReindex deletes the .tmp on any failure, meaning ~1 hour of API calls is wasted every time
  7. No stack trace makes it impossible to determine if the root cause is: undici connection pool reuse of dead connections, TLS session timeout, DNS resolution failure, or something else

Fix Action

Fixed

PR fix notes

PR #56879: fix(memory): retry transient embedding transport failures

Description (problem / solution / changelog)

Summary

  • Problem: memory indexing/search treated transient embedding transport failures like fetch failed, ECONNRESET, and ENOTFOUND as terminal errors.
  • Why it matters: a single network blip could abort memory indexing, while logs often only showed a flattened fetch failed message.
  • What changed: memory embedding retries now recognize transient transport failures, remote POST errors preserve prefix/cause detail, and async memory sync warnings use formatUncaughtError(...).
  • Scope boundary: this does not add checkpoint/resume for atomic reindexing or change rollback semantics.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Memory / storage

Linked Issue/PR

  • Related #56815
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

  • Root cause: memory retry logic only matched rate-limit / 5xx-style messages, so transient transport-layer failures were not retried.
  • Missing detection / guardrail: the memory path did not inspect nested cause/errno/undici-style transient network errors, and transport failures from postJson were not wrapped with provider context.
  • Prior context (git blame, prior PR, issue, or refactor if known): the memory engine moved behind the plugin/host split, but the retry heuristic remained message-string based.
  • Why this regressed now: remote embedding failures can surface as transport exceptions instead of only HTTP status errors.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
  • Target test or file:
    • packages/memory-host-sdk/src/host/post-json.test.ts
    • packages/memory-host-sdk/src/host/embeddings-remote-fetch.test.ts
    • extensions/memory-core/src/memory/manager.embedding-batches.test.ts
  • Scenario the test should lock in: retry a transient TypeError("fetch failed") with cause.code = "ECONNRESET" and preserve wrapped error detail.

User-visible / Behavior Changes

  • Memory embeddings now retry on transient transport/network failures such as fetch failed, ECONNRESET, ENOTFOUND, and undici timeout/socket errors.
  • Memory sync warnings for session-start and search now include stack/cause detail instead of flattening to String(err).

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local Node 22 + pnpm
  • Model/provider: openai memory embeddings against a local fake /v1/embeddings endpoint
  • Relevant config (redacted):
    • agents.defaults.memorySearch.provider = "openai"
    • agents.defaults.memorySearch.model = "text-embedding-3-small"
    • agents.defaults.memorySearch.remote.baseUrl = "http://127.0.0.1:8765/v1"
    • agents.defaults.memorySearch.remote.apiKey = "dummy-key"

Steps

  1. Start a fake embeddings server that drops the first /v1/embeddings request and succeeds on the second.
  2. Run openclaw memory index --force --verbose.
  3. Run openclaw memory search "retry embeddings" --json.

Expected

  • The first transient transport failure is retried automatically.
  • Memory indexing completes successfully.
  • Memory search returns the indexed MEMORY.md content.

Actual

  • Observed memory embeddings transient failure; retrying in 555ms.
  • Fake server received request #1 and request #2.
  • memory index --force --verbose completed successfully.
  • memory search "retry embeddings" --json returned the expected MEMORY.md result.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets

Human Verification (required)

  • Verified scenarios:
    • Manual smoke with a fake embeddings server that intentionally drops the first request and succeeds on the second.
    • memory index --force --verbose completed after the retry.
    • memory search --json returned the indexed MEMORY.md snippet.
  • Edge cases checked:
    • Existing retry behavior for rate-limit / 5xx errors still passes.
    • Existing remote fetch mapping still passes.
  • What you did not verify:
    • Full long-running reindex against a large real remote provider.
    • Checkpoint/resume behavior after fatal non-retryable failures.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)

Risks and Mitigations

  • Risk: some transport-shaped errors may now retry before failing.
    • Mitigation: retries are still bounded by the existing backoff / max-attempt limits.
  • Risk: async memory sync warnings are now more verbose.
    • Mitigation: the change is limited to session-start / search warning paths, and secret redaction still applies.

Changed files

  • extensions/memory-core/src/memory/manager-embedding-ops.ts (modified, +131/-6)
  • extensions/memory-core/src/memory/manager.embedding-batches.test.ts (modified, +22/-0)
  • extensions/memory-core/src/memory/manager.ts (modified, +3/-2)
  • packages/memory-host-sdk/src/host/post-json.test.ts (modified, +22/-0)
  • packages/memory-host-sdk/src/host/post-json.ts (modified, +52/-21)

Code Example

{"subsystem":"memory","level":"warn","msg":"memory embeddings rate limited; retrying in 530ms"}  // once during indexing
{"subsystem":"memory","level":"warn","msg":"memory sync failed (session-start): TypeError: fetch failed"}
{"subsystem":"memory","level":"warn","msg":"memory sync failed (search): TypeError: fetch failed"}
RAW_BUFFERClick to expand / collapse

Description

Memory search embedding reindex consistently fails with TypeError: fetch failed after successfully indexing a significant number of chunks (~41K out of estimated ~45K). The .tmp file is deleted on failure (runSafeReindex rollback), so all progress is lost and the next attempt starts from scratch — creating an infinite failure loop.

Environment

  • OpenClaw version: v2026.3.24 (cff6dc9)
  • Node.js: v25.8.2
  • OS: Linux 6.8.0-88-generic (x64), 123GB RAM
  • Embedding provider: SiliconFlow API (Pro/BAAI/bge-m3, 1024d, OpenAI-compatible endpoint at https://api.siliconflow.cn/v1/)
  • Files: 272 .md files (~187MB) under workspace memory/ directory
  • Config: memorySearch.remote.batch.concurrency: 2, default retry settings (3 attempts, 500ms/8000ms backoff)
  • main agent with the same SiliconFlow config successfully indexed 4 chunks — issue is specific to large-scale reindex

Reproduction Steps

  1. Configure an agent with memorySearch.enabled: true
  2. Place ~270 large .md files (100KB-1MB each) in the workspace memory/ directory
  3. Use a remote embedding provider (SiliconFlow, OpenAI-compatible)
  4. Trigger memory_search which initiates runSafeReindex
  5. Observe: tmp file grows to ~2GB, ~41K chunks indexed
  6. After ~1 hour: memory sync failed: TypeError: fetch failed
  7. tmp is deleted, sqlite remains empty → next trigger restarts from scratch

Error Log

{"subsystem":"memory","level":"warn","msg":"memory embeddings rate limited; retrying in 530ms"}  // once during indexing
{"subsystem":"memory","level":"warn","msg":"memory sync failed (session-start): TypeError: fetch failed"}
{"subsystem":"memory","level":"warn","msg":"memory sync failed (search): TypeError: fetch failed"}

No stack trace is included — TypeError: fetch failed is logged without the underlying cause (DNS, timeout, connection reset, etc.).

Observations

  1. The embedding API itself is stable — manual test with 10 concurrent requests to SiliconFlow: 0 failures, ~300-400ms each
  2. Not a 429/rate-limit issue — only one rate-limit warning in the entire run
  3. Not an OOM issue — 123GB RAM, no swap pressure
  4. Not concurrency-dependent — fails with both concurrency=2 and concurrency=4
  5. Not specific to this provider — same failure pattern occurred with Alibaba DashScope (text-embedding-v4) before switching to SiliconFlow
  6. Progress loss is the critical issuerunSafeReindex deletes the .tmp on any failure, meaning ~1 hour of API calls is wasted every time
  7. No stack trace makes it impossible to determine if the root cause is: undici connection pool reuse of dead connections, TLS session timeout, DNS resolution failure, or something else

Suggested Improvements

  1. Include full stack trace in the TypeError: fetch failed log so the root cause can be identified
  2. Partial progress preservation — instead of deleting .tmp on failure, consider checkpointing or resuming from the last successful batch
  3. Connection health checks — validate embedding API connectivity before starting a long reindex, or periodically during the process
  4. Graceful degradation — if one batch fails, skip it and continue instead of aborting the entire reindex

extent analysis

Fix Plan

To address the issue of memory search embedding reindex consistently failing with TypeError: fetch failed, we will implement the following steps:

  • Checkpointing: Instead of deleting the .tmp file on failure, we will implement checkpointing to resume from the last successful batch.
  • Connection Health Checks: We will add connection health checks to validate embedding API connectivity before starting a long reindex and periodically during the process.
  • Graceful Degradation: If one batch fails, we will skip it and continue instead of aborting the entire reindex.

Example Code

Here's an example of how you can implement checkpointing and connection health checks in your code:

// Import required modules
const fs = require('fs');
const axios = require('axios');

// Define constants
const CHECKPOINT_FILE = 'checkpoint.json';
const EMBEDDING_API_URL = 'https://api.siliconflow.cn/v1/';

// Function to validate embedding API connectivity
async function validateApiConnectivity() {
    try {
        const response = await axios.get(EMBEDDING_API_URL);
        return response.status === 200;
    } catch (error) {
        console.error('Error validating API connectivity:', error);
        return false;
    }
}

// Function to checkpoint progress
async function checkpointProgress(batchNumber) {
    try {
        const checkpointData = { batchNumber };
        fs.writeFileSync(CHECKPOINT_FILE, JSON.stringify(checkpointData));
    } catch (error) {
        console.error('Error checkpointing progress:', error);
    }
}

// Function to resume from last checkpoint
async function resumeFromCheckpoint() {
    try {
        const checkpointData = fs.readFileSync(CHECKPOINT_FILE, 'utf8');
        const batchNumber = JSON.parse(checkpointData).batchNumber;
        return batchNumber;
    } catch (error) {
        console.error('Error resuming from checkpoint:', error);
        return 0;
    }
}

// Main reindex function
async function reindex() {
    // Validate API connectivity before starting
    if (!(await validateApiConnectivity())) {
        console.error('API connectivity failed. Aborting reindex.');
        return;
    }

    // Resume from last checkpoint
    let batchNumber = await resumeFromCheckpoint();

    // Loop through batches
    for (let i = batchNumber; i < 45000; i++) {
        try {
            // Process batch
            await processBatch(i);

            // Checkpoint progress
            await checkpointProgress(i);

            // Validate API connectivity periodically
            if (i % 1000 === 0) {
                if (!(await validateApiConnectivity())) {
                    console.error('API connectivity failed during reindex. Skipping batch.');
                    continue;
                }
            }
        } catch (error) {
            console.error(`Error processing batch ${i}:`, error);
            // Skip failed batch and continue

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix memorySearch: embedding reindex fails with 'TypeError: fetch failed' after indexing ~40K chunks [1 pull requests, 1 participants]