openclaw - 💡(How to fix) Fix `doctor.memory.status` blocks 6.7–105 s on live embedding probe — should cache or strict-budget by default [1 participants]

The doctor.memory.status request handler runs a live embedding probe via manager.probeEmbeddingAvailability() and inherits the embedding timeouts, which can be up to 600 s (batch) for local providers and 60–120 s (query/batch) for remote. On openclaw 2026.4.23 I observed five real doctor.memory.status calls over a 4-hour window, ranging from 6.7 s to 105 s, with the 105 s call blocking the gateway WS thread during the stall window. This is not a Discord-specific issue: it surfaces in any openclaw doctor / openclaw status / health-monitor cycle. It does, however, compound #38596 and #71546 by adding latency to control-plane operations during exactly the windows when Discord channel-state introspection happens.

Summary

Environment


| Component | Value |
|---|---|
| OpenClaw | 2026.4.23 (`a979721`) |
| OS | macOS 15.6.1 (arm64) |
| Node (gateway runtime) | 22.22.2 |
| Memory backend | default (per `Memory: 2 files · 3 chunks · sources memory · plugin memory-core · vector ready · fts ready · cache on (4)` from `openclaw status`) |
| Provider | local (per `EMBEDDING_BATCH_TIMEOUT_LOCAL_MS` reachability) |

Evidence

Today's gateway.log, latency events for doctor.memory.status (the [ws] ⇄ res ✓ doctor.memory.status <ms>ms lines):

1 doctor.memory.status   6,767 ms
1 doctor.memory.status  34,859 ms
1 doctor.memory.status  47,039 ms
1 doctor.memory.status  68,144 ms
1 doctor.memory.status 105,097 ms

Median ≈ 47 s. Worst case ≈ 105 s, and for that entire time the caller was blocked on a status probe.
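
The median and worst case quoted above can be re-derived from the five durations in a few lines of JavaScript. This is only a sanity check on the arithmetic; the `median` helper is an illustrative name, not something from openclaw.

```javascript
// Durations (ms) of the five doctor.memory.status calls listed above.
const durationsMs = [6767, 34859, 47039, 68144, 105097];

// Median of a numeric array; averages the two middle values for even lengths.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

const medianMs = median(durationsMs);
const worstMs = Math.max(...durationsMs);

// prints "median 47.0 s, worst 105.1 s"
console.log(`median ${(medianMs / 1000).toFixed(1)} s, worst ${(worstMs / 1000).toFixed(1)} s`);
```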

Code path

dist/server-plugin-bootstrap-CxnqPNN-.js:4572   "doctor.memory.status" handler
dist/server-plugin-bootstrap-CxnqPNN-.js:4592     await manager.probeEmbeddingAvailability()
dist/manager-D84yq6oM.js:1824                   const EMBEDDING_QUERY_TIMEOUT_REMOTE_MS  =  60_000
dist/manager-D84yq6oM.js:1825                   const EMBEDDING_QUERY_TIMEOUT_LOCAL_MS   = 300_000
dist/manager-D84yq6oM.js:1826                   const EMBEDDING_BATCH_TIMEOUT_REMOTE_MS  = 120_000
dist/manager-D84yq6oM.js:1827                   const EMBEDDING_BATCH_TIMEOUT_LOCAL_MS   = 600_000

For local providers, the batch timeout is 10 minutes. The status handler can hang for the full timeout if the embedding backend is slow or stalled, with no early bail-out for "we just need to know if memory is enabled, not actually re-probe the model."
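
One way to get the early bail-out described above is to race the probe against a short timer. The following is a minimal sketch under stated assumptions, not the openclaw implementation: `withBudget`, the budget values, and the `slowProbe` stand-in are hypothetical names chosen for illustration. Only `probeEmbeddingAvailability()` appears in the code path above.

```javascript
// Hypothetical helper: resolve with the task's result, or with `fallback`
// once `budgetMs` elapses, whichever comes first.
function withBudget(task, budgetMs, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), budgetMs);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for manager.probeEmbeddingAvailability() on a slow backend.
function slowProbe(delayMs) {
  return new Promise((resolve) => setTimeout(() => resolve({ ok: true }), delayMs));
}

async function demo() {
  // The probe would take 200 ms here; the status path only waits 50 ms.
  const result = await withBudget(slowProbe(200), 50, { ok: false, timedOut: true });
  console.log(result); // { ok: false, timedOut: true }
  return result;
}
```

Note that the race only stops waiting. The underlying probe keeps running in the background unless the provider call itself supports cancellation, so a budget like this bounds caller latency but not backend load.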

Why this matters beyond `doctor`

The health-monitor (mentioned in #38596) calls into status during its check cycle. A 60–105 s stall in status means the health monitor's circuit breakers can fire on what was actually a healthy Discord channel — false positive restart.
Every openclaw status / openclaw doctor invoked while debugging Discord issues (as we did today) eats minutes of the operator's time and contributes to the slow channels.status 3787 ms and node.list <multi-second> events that show up as control-plane noise in the same logs.
Per-process: while the embedding probe is in flight, other handlers in the same WS connection's request lane appear queued behind it from the operator-tool perspective, even if they execute on a separate worker.

Reproduction

Single-account openclaw 2026.4.23, default memory backend.
While idle, run openclaw doctor or openclaw status --deep repeatedly for 30 min.
Observe gateway.log for [ws] ⇄ res ✓ doctor.memory.status <N>ms events. Most will be 6 s; a fraction will spike to 30–105 s when the embedding probe path is exercised under load or when the backend is cold.

Asks

Three concrete options, in order of effort:

Cache result for N seconds (simplest). The probe's yes/no answer doesn't change on every call. A 30 s time-based (TTL) cache would let cache hits return in well under 100 ms and re-probe only when the entry is stale.
Strict short-budget for the default path; deep probe behind a flag. Default doctor.memory.status returns enabled: true|false, lastProbeAt, lastProbeOk from cache; --deep triggers a real probeEmbeddingAvailability() with the long timeout. This matches how other CLIs split fast-status from deep-checks.
Decouple from health-monitor's path. Even if the user-facing CLI keeps its current behavior, the health-monitor should not invoke the slow probe during its 5-minute check cycle. A separate memory.healthLite op that returns provider-readiness from the cached state, without a re-probe, would prevent the false-positive Discord restarts described in #38596.
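
The second option above (fast default path, deep probe behind a flag) could look roughly like the sketch below. It is a hedged illustration only: `createMemoryStatus`, the `deep` option, and the `fakeManager` stub are hypothetical, and the real handler's signature and the probe's return shape are not shown in this report. The returned fields (`enabled`, `lastProbeAt`, `lastProbeOk`) follow the names proposed in that option.

```javascript
// Hypothetical factory: wraps a manager exposing probeEmbeddingAvailability()
// and serves status from the last recorded probe unless `deep` is requested.
function createMemoryStatus(manager, { enabled = true } = {}) {
  let lastProbeAt = null; // epoch ms of the last completed probe, or null
  let lastProbeOk = null; // boolean outcome of that probe, or null if never run

  return async function memoryStatus({ deep = false } = {}) {
    if (deep) {
      try {
        // Assumption: the probe resolves to something truthy on success.
        lastProbeOk = Boolean(await manager.probeEmbeddingAvailability());
      } catch {
        lastProbeOk = false;
      }
      lastProbeAt = Date.now();
    }
    // Fast path: no provider round-trip, just the cached snapshot.
    return { enabled, lastProbeAt, lastProbeOk };
  };
}

// Stub manager that counts how often the expensive probe actually runs.
const fakeManager = {
  calls: 0,
  async probeEmbeddingAvailability() {
    this.calls += 1;
    return true;
  },
};
```

With this shape, a health-monitor style caller would use `memoryStatus()` and never pay for a probe, while an operator explicitly asking for a deep check would call `memoryStatus({ deep: true })`.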

Related

#38596 OPEN: health monitor restart loop. The memory-probe stall described here can fire the consecutiveHelloStalls / reconnectStallWatchdog paths from a control-plane stall, not a real Discord stall.
#71546 OPEN: Discord ingest-lag report. Independent surface, but compounds the operator-debugging-loop overhead.

Logs available

/Users/x./.openclaw/logs/gateway.log — the five doctor.memory.status <N>ms lines.
/tmp/openclaw/openclaw-2026-04-25.log — full ndjson trace.

extent analysis

TL;DR

Implement a caching mechanism for the doctor.memory.status handler to reduce the latency caused by the embedding probe.

Guidance

Consider caching the result of the probeEmbeddingAvailability() call for a short period, such as 30 seconds, to reduce the number of times the slow probe is executed.
Evaluate the feasibility of implementing a strict short-budget for the default path and moving the deep probe behind a flag, as suggested in the "Asks" section.
Investigate decoupling the health-monitor's path from the slow probe to prevent false-positive Discord restarts.
Review the related issues (#38596 and #71546) to ensure that the solution does not introduce new problems or regressions.

Example

// Simple caching example. `manager` is passed in rather than assumed global.
const CACHE_TTL_MS = 30 * 1000; // 30 seconds
let cachedProbe = null; // { value, timestamp } of the last probe

async function cachedProbeEmbeddingAvailability(manager) {
  // Serve the cached answer while it is still fresh.
  if (cachedProbe && Date.now() - cachedProbe.timestamp < CACHE_TTL_MS) {
    return cachedProbe.value;
  }
  // Otherwise run the real (potentially slow) probe and remember the result.
  const value = await manager.probeEmbeddingAvailability();
  cachedProbe = { value, timestamp: Date.now() };
  return value;
}

Notes

The provided example is a simplified illustration of caching and may require modifications to fit the actual implementation. It is essential to consider the trade-offs between cache freshness and performance when implementing a caching solution.
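
One modification worth considering is sharing a single in-flight probe among concurrent callers, so that several status requests arriving while the cache is stale do not each start their own slow probe. The sketch below is an assumption-laden illustration: `createCachedProbe` and its parameters are hypothetical names, and `probe` stands in for a call such as `manager.probeEmbeddingAvailability()`.

```javascript
// Hypothetical wrapper: caches the probe result for ttlMs and shares one
// in-flight probe among concurrent callers (single-flight).
function createCachedProbe(probe, ttlMs = 30_000, now = Date.now) {
  let cached = null;   // { value, at } of the last successful probe
  let inFlight = null; // promise of the probe currently running, if any

  return function cachedProbe() {
    if (cached && now() - cached.at < ttlMs) {
      return Promise.resolve(cached.value);
    }
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(probe)
        .then((value) => {
          cached = { value, at: now() };
          return value;
        })
        .finally(() => {
          inFlight = null; // allow a retry after success or failure
        });
    }
    return inFlight;
  };
}
```

Failures are deliberately not cached here, so the next caller after a failed probe retries immediately. Whether that is desirable depends on how expensive a failing probe is.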

Recommendation

Apply a caching workaround to reduce the latency caused by the embedding probe, as it is a relatively simple and low-risk solution that can provide significant performance improvements.
