openclaw - 💡(How to fix) Fix Bug: memory search live embedding fails ~20–40% with `fetch failed | other side closed` (provider-agnostic; upstream healthy) [1 participants]

GitHub stats: openclaw/openclaw#71784, fetched 2026-04-26 05:08:23
Comments: 0 · Participants: 1 · Timeline: 1 (cross-referenced ×1) · Reactions: 0


Bug: Memory search transient fetch failed | other side closed / Client network socket disconnected before secure TLS connection was established for live embedding queries (provider-agnostic)

Summary

Live memory search queries fail intermittently (~20–40% of calls) with one of two transient TLS/socket errors when using any remote embedding provider (OpenAI, Gemini). The same endpoint works perfectly via curl and via a plain Node.js fetch() from the same host, so the upstream API is healthy. The failure originates inside OpenClaw's internal SSRF-guarded fetch path.

Bulk reindex via the batch endpoint is not affected. Only the per-query single-embed path used by openclaw memory search (and presumably the in-conversation memory recall path) shows the issue.

This makes semantic memory recall unreliable in interactive sessions even though openclaw memory status reports Embeddings: ready.


Environment

| Item | Value |
| --- | --- |
| OpenClaw version | 2026.4.24 (cbcfdf6) |
| Node.js | v24.14.1 |
| OS | Ubuntu 24.04 LTS, kernel 6.8.0-110-generic (x86_64) |
| Network | direct outbound, no proxy, IPv4+IPv6 both working |
| memory.backend | builtin |
| sqlite-vec | enabled, vec0.so loaded, Vector dims: 3072, FTS: ready |

Reproduced on two different remote embedding providers configured via openclaw config set:

  • openai / text-embedding-3-large (3072-dim, ~30 KB response)
  • gemini / gemini-embedding-2-preview (3072-dim, ~60 KB response)

Both fail with the same intermittent socket error. The Gemini case fails more often, consistent with a payload-size correlation, but OpenAI also fails repeatably.


Repro

1. Configure a remote embedding provider

openclaw config set memory.backend builtin
openclaw config set agents.defaults.memorySearch.provider openai
openclaw config set agents.defaults.memorySearch.model text-embedding-3-large
openclaw config set models.providers.openai \
  '{"baseUrl":"https://api.openai.com/v1","apiKey":"sk-...","models":[]}' --strict-json
openclaw gateway restart

2. Reindex (works fine, uses batch endpoint)

openclaw memory index --force --agent main
# → Memory index updated (main).

openclaw memory status --deep --agent main then reports:

Provider: openai (requested: openai)
Model: text-embedding-3-large
Vector: ready
Vector dims: 3072
FTS: ready
Embeddings: ready

3. Run live queries (fails ~20–40% of the time)

for i in 1 2 3 4 5 6 7 8 9 10; do
  result=$(openclaw memory search "pool stress test query $i" --agent main 2>&1 | tail -3)
  if echo "$result" | grep -qE "fetch failed|other side closed|socket disconnected"; then
    echo "Q$i: FAIL"
  else
    echo "Q$i: OK"
  fi
done

Observed output (idle gateway):

Q1: OK
Q2: OK
Q3: OK
Q4: OK
Q5: FAIL
Q6: FAIL
Q7: OK
Q8: OK
Q9: OK
Q10: OK
→ OK: 8 / FAIL: 2

Under concurrent load (background reindex of other agents running):

→ OK: 6 / FAIL: 4

4. Two distinct error messages observed

From the gateway log (/tmp/openclaw/openclaw-<date>.log):

ERROR Memory search failed: fetch failed | other side closed
ERROR Memory search failed: fetch failed | Client network socket disconnected before secure TLS connection was established

Both originate from dist/subsystem-CWI_MDy_.js:161 (search subsystem) wrapping a lower-level error from dist/engine-embeddings-DVkdyn0v.js → withRemoteHttpResponse → fetchWithSsrFGuard → undici dispatcher.

The two strings correspond to undici error causes:

  • other side closed → server closed the keep-alive socket between requests, request reused a dead socket.
  • Client network socket disconnected before secure TLS connection was established → TLS handshake aborted on a fresh socket (typical for pinned-DNS + Agent reuse with broken keep-alive).

Both are classic symptoms of a misconfigured / overly aggressive HTTP keep-alive pool.
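
The mapping above can be sketched as a small classifier. This is a minimal illustration, not OpenClaw code: classifyFetchError and its category labels are hypothetical names. It assumes the usual shape of a Node fetch() failure, where the top-level error says "fetch failed" and the socket-level error sits in error.cause.

```javascript
// Hypothetical helper, not part of OpenClaw. It walks the error.cause chain
// because fetch() reports "fetch failed" and keeps the socket error in `cause`.
const TRANSIENT_CODES = new Set(['UND_ERR_SOCKET', 'ECONNRESET', 'EPIPE']);

function classifyFetchError(error) {
  // Bound the walk so a cyclic cause chain cannot loop forever.
  for (let e = error, depth = 0; e && depth < 5; e = e.cause, depth++) {
    const message = String(e.message || '');
    if (message.includes('other side closed')) return 'stale-keepalive-socket';
    if (message.includes('before secure TLS connection was established')) {
      return 'tls-handshake-aborted';
    }
    if (TRANSIENT_CODES.has(e.code)) return 'connection-reset';
  }
  return 'other';
}

// Usage: build an error shaped like the one fetch() throws.
const sample = new TypeError('fetch failed', {
  cause: Object.assign(new Error('other side closed'), { code: 'UND_ERR_SOCKET' }),
});
console.log(classifyFetchError(sample)); // prints "stale-keepalive-socket"
```

Matching on the message first keeps the two observed failure modes distinct even when both carry the same error code.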


Why this is not the upstream API

Same host, same network, same time:

# Direct curl to OpenAI: 100% success
curl -sS -o /dev/null -w "HTTP %{http_code} time=%{time_total}\n" -X POST \
  https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer sk-..." -H "Content-Type: application/json" \
  -d '{"input":"transient pool test","model":"text-embedding-3-large"}'
# → HTTP 200 time=0.477410

# Native Node.js fetch to Gemini: 100% success, full 3072-dim payload returned
node -e "
fetch('https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-2-preview:embedContent', {
  method:'POST',
  headers:{'Content-Type':'application/json','x-goog-api-key':'***'},
  body:JSON.stringify({content:{parts:[{text:'test'}]},taskType:'RETRIEVAL_QUERY',outputDimensionality:3072})
}).then(r=>r.json()).then(j=>console.log('OK dims=',(j.embedding?.values||[]).length))
  .catch(e=>console.error('FAIL:', e.message, e.cause?.message));
"
# → OK dims= 3072

Repeated curl and Node fetch runs against both endpoints from the same machine never reproduce the disconnect. The failure is specific to OpenClaw's internal fetch path.
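
A repeated probe of this kind could look like the sketch below. It is an illustration under assumptions rather than the script used for the report: probe and requestOnce are hypothetical names, and the stub stands in for a real call such as () => fetch(url, init).then((r) => r.json()).

```javascript
// Sketch: call a request function N times and tally failures by cause.
// `requestOnce` is a placeholder for any function that returns a promise.
async function probe(requestOnce, attempts) {
  const tally = { ok: 0, fail: 0, causes: {} };
  for (let i = 0; i < attempts; i++) {
    try {
      await requestOnce();
      tally.ok++;
    } catch (error) {
      tally.fail++;
      // Prefer the socket-level message that fetch() stores in error.cause.
      const cause = (error.cause && error.cause.message) || error.message;
      tally.causes[cause] = (tally.causes[cause] || 0) + 1;
    }
  }
  return tally;
}

// Usage with a stub that fails on the 5th and 6th calls.
let calls = 0;
const flaky = async () => {
  calls++;
  if (calls === 5 || calls === 6) {
    throw new TypeError('fetch failed', { cause: new Error('other side closed') });
  }
};
probe(flaky, 10).then((t) => console.log(`OK: ${t.ok} / FAIL: ${t.fail}`, t.causes));
// prints: OK: 8 / FAIL: 2 { 'other side closed': 2 }
```

Running the requests sequentially matters here: each call reuses whatever connection state the previous one left behind, which is the condition the bug depends on.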


Why it is not provider-specific

| Provider | Model | Response size | Failure rate observed |
| --- | --- | --- | --- |
| OpenAI | text-embedding-3-large (3072-dim) | ~30 KB | ~20–40% |
| Google | gemini-embedding-2-preview (3072-dim) | ~60 KB | ~80–100% |

Same host, same gateway version, same code path (withRemoteHttpResponse → fetchWithSsrFGuard). Switching provider does not eliminate the bug, only changes its frequency. Larger response bodies / longer-held sockets correlate with higher failure rates, which strongly suggests a connection-pool / keep-alive issue rather than a per-provider authentication or URL bug.


Suspected root cause

Looking at the bundled code paths in 2026.4.24 (cbcfdf6):

  • dist/extensions/google/embedding-provider.js and the corresponding OpenAI path both call withRemoteHttpResponse({ url, ssrfPolicy, init }).
  • dist/engine-embeddings-DVkdyn0v.js defines withRemoteHttpResponse → fetchWithSsrFGuard.
  • dist/fetch-guard-DKbwHPzH.js instantiates per-call undici dispatchers via:
    • createPolicyDispatcherWithoutPinnedDns(...) for direct mode, or
    • createPinnedDispatcher(await resolvePinnedHostnameWithPolicy(...)) for the SSRF-pinned path,
  • backed by createHttp1Agent / createHttp1EnvHttpProxyAgent / createHttp1ProxyAgent from dist/undici-runtime-x3fQiq5e.js, with a global stream timeout from dist/undici-global-dispatcher-KzKcGOUY.js.

The user-visible error patterns (other side closed, Client network socket disconnected before secure TLS connection was established) are the classic undici socket-reuse-on-dead-keepalive failure mode. The per-call dispatcher / pinned-DNS approach appears to either:

  1. share a connection pool across calls without reliably retiring sockets that the upstream has already half-closed,
  2. or interact badly with undici keep-alive defaults (keepAliveTimeout, keepAliveMaxTimeout, pipelining) for high-latency TLS endpoints like api.openai.com and generativelanguage.googleapis.com,
  3. or close/release the dispatcher (release(dispatcher) → closeDispatcher) in a way that leaves an in-flight socket reusable for the next call.

A single retry on UND_ERR_SOCKET / ECONNRESET / TLS-handshake-aborted errors at the withRemoteHttpResponse layer would mask this for users, but the underlying pool behavior likely deserves a fix.


Impact

  • Semantic recall is unreliable in interactive sessions despite Embeddings: ready reporting healthy.
  • Users see "no matches" results or hard Memory search failed: fetch failed | … errors at a measurable rate (~20–40% in this environment, higher under concurrent load).
  • Active-memory plugin recall similarly degrades.
  • openclaw doctor does not surface this — memory status reports the provider as ready because the readiness probe happens to pass.

Workarounds tried

| Action | Result |
| --- | --- |
| Switch provider OpenAI ↔ Gemini | Same bug, different frequency. |
| Use gemini-embedding-001 instead of 2-preview | Same bug. |
| Reduce outputDimensionality (3072 → default) | Helps slightly (smaller payload) but does not eliminate. |
| gateway restart | No effect; reproduces immediately. |
| Direct curl / native Node fetch from same host | Always succeeds, which confirms this is not a network/upstream issue. |

No workaround at the user-config level reliably eliminates the failures.


Suggested fixes

  1. Add a bounded retry (e.g. 1–2 retries with short backoff) around withRemoteHttpResponse for embedding calls, scoped to undici/TLS connection-reset error classes (UND_ERR_SOCKET, ECONNRESET, EPIPE, Client network socket disconnected before secure TLS connection was established, other side closed). This alone would make the user-visible behavior reliable.
  2. Tune the undici dispatcher for the embedding pool: explicit keepAliveTimeout / keepAliveMaxTimeout lower than the typical Google/OpenAI server-side keep-alive idle (e.g. 4 s), and pipelining: 0. Right now the symptoms are fully consistent with reusing a socket the server has already half-closed.
  3. Surface the error class better in openclaw memory status --deep so users can distinguish "auth misconfig" vs. "transient socket pool failure". Currently both look the same as Embeddings error: fetch failed | other side closed.
  4. Optional: a memorySearch.remote.retry config knob ({enabled: true, maxAttempts: 2, backoffMs: 250}) so users can opt in/out without code changes.
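
The dispatcher tuning in item 2 could be sketched as an options builder. This is a hedged illustration: buildEmbeddingAgentOptions, serverIdleMs, reuseSockets and the "half the server idle time" heuristic are all assumptions, not OpenClaw or undici names. The returned keys (keepAliveTimeout, keepAliveMaxTimeout, pipelining) are options of undici's Agent; undici documents pipelining: 0 as disabling keep-alive connections, so the two branches are alternatives rather than settings to combine.

```javascript
// Sketch only: builds an options object intended for undici's Agent constructor.
// `serverIdleMs` is an assumed figure; the issue does not state the real
// server-side idle timeout for either provider.
function buildEmbeddingAgentOptions({ serverIdleMs, reuseSockets = true }) {
  if (!reuseSockets) {
    // Per the undici docs, pipelining: 0 disables keep-alive connections,
    // so every request opens a fresh socket.
    return { pipelining: 0 };
  }
  // Assumed heuristic: retire idle sockets well before the server would.
  const keepAliveTimeout = Math.max(1000, Math.floor(serverIdleMs / 2));
  return {
    keepAliveTimeout,
    // Cap server-sent keep-alive hints at the same value.
    keepAliveMaxTimeout: keepAliveTimeout,
    pipelining: 1,
  };
}

console.log(buildEmbeddingAgentOptions({ serverIdleMs: 8000 }));
// prints: { keepAliveTimeout: 4000, keepAliveMaxTimeout: 4000, pipelining: 1 }

// With the undici package installed, the object would be passed along as:
//   const { Agent } = require('undici');
//   const agent = new Agent(buildEmbeddingAgentOptions({ serverIdleMs: 8000 }));
```

Disabling reuse trades an extra TLS handshake per query for immunity to stale sockets, which may be acceptable for a low-volume interactive path.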

Additional notes

  • This affects memory.backend: builtin. QMD-backed workspaces are not affected because they do not exercise the same per-query fetch path.
  • The bundled embedding-provider.js already uses executeWithApiKeyRotation, but rotation only kicks in for API-key-level failures, not for transient network/socket errors, so it does not help here.
  • Happy to provide more detailed undici-level logs if a debug flag is available — please point me at the right env var (NODE_DEBUG=undici was tried but the gateway buffers its own logger).

TL;DR

openclaw memory search (live single-embed path) fails ~20–40% of the time with fetch failed | other side closed or … socket disconnected before secure TLS connection was established, while the exact same upstream endpoint works 100% via curl and native Node fetch from the same host. Affects all remote embedding providers, gets worse with bigger responses and concurrent load. Looks like a keep-alive / pool-reuse bug in the SSRF-guarded fetch path; a retry layer + dispatcher tuning should fix the user-visible symptom.

extent analysis

TL;DR

Implement a retry mechanism with a bounded number of attempts and a short backoff for withRemoteHttpResponse to handle transient socket errors.

Guidance

  • Identify the specific error classes (UND_ERR_SOCKET, ECONNRESET, Client network socket disconnected before secure TLS connection was established, other side closed) that should trigger a retry.
  • Add a retry layer around withRemoteHttpResponse with a limited number of attempts (e.g., 1-2 retries) and a short backoff (e.g., 250ms).
  • Consider tuning the undici dispatcher settings, such as keepAliveTimeout and keepAliveMaxTimeout, to prevent socket reuse issues.
  • Surface the error class in openclaw memory status --deep to distinguish between authentication misconfigurations and transient socket pool failures.

Example

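const maxRetries = 2; // retries allowed after the first attempt
const backoffMs = 250;

// Codes and messages taken from the error classes listed in the issue.
const TRANSIENT_CODES = ['UND_ERR_SOCKET', 'ECONNRESET', 'EPIPE'];
const TRANSIENT_MESSAGES = [
  'other side closed',
  'Client network socket disconnected before secure TLS connection was established',
];

function isTransientError(error) {
  // fetch() throws "fetch failed" and carries the socket error in error.cause
  const e = (error && error.cause) || error;
  return Boolean(e) && (
    TRANSIENT_CODES.includes(e.code) ||
    TRANSIENT_MESSAGES.some((m) => String(e.message).includes(m))
  );
}

function withRetry(fn) {
  let retries = 0;
  function attempt() {
    return fn().catch((error) => {
      if (retries < maxRetries && isTransientError(error)) {
        retries++;
        // Wait for the backoff, then retry; chaining returns the retry's result
        return new Promise((resolve) => setTimeout(resolve, backoffMs)).then(attempt);
      }
      throw error;
    });
  }
  return attempt();
}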

Notes

The provided example is a simplified illustration of a retry mechanism and may need to be adapted to the specific requirements of the withRemoteHttpResponse function.

Recommendation

Apply a retry workaround to mask the symptom, as the underlying pool behavior likely requires a more extensive fix. This will make the user-visible behavior more reliable while the root cause is being investigated.
