openclaw: `doctor.memory.status` blocks 6.7–105 s on live embedding probe; should cache or strict-budget by default

GitHub stats
openclaw/openclaw#71568 · Fetched 2026-04-26 05:11:24
Comments: 0 · Participants: 1 · Timeline: 1 (cross-referenced ×1) · Reactions: 0

Summary

The doctor.memory.status request handler runs a live embedding probe via manager.probeEmbeddingAvailability() and inherits the embedding-batch timeout, which can be up to 600 s for local providers and 60–120 s for remote. On openclaw 2026.4.23 I observed five real doctor.memory.status calls over a 4-hour window, taking between 6.7 s and 105 s, with the 105 s call blocking the gateway WS thread during the stall window. This is not a Discord-specific issue: it surfaces in any openclaw doctor / openclaw status / health-monitor cycle. It does compound #38596 and #71546, because it adds latency to control-plane operations during exactly the windows when Discord channel-state introspection happens.

Environment

OpenClaw: 2026.4.23 (a979721)
OS: macOS 15.6.1 (arm64)
Node (gateway runtime): 22.22.2
Memory backend: default (per "Memory: 2 files · 3 chunks · sources memory · plugin memory-core · vector ready · fts ready · cache on (4)" from openclaw status)
Provider: local (per EMBEDDING_BATCH_TIMEOUT_LOCAL_MS reachability)

Evidence

Today's gateway.log, latency events for doctor.memory.status (the [ws] ⇄ res ✓ doctor.memory.status <ms>ms lines):

1 doctor.memory.status   6,767 ms
1 doctor.memory.status  34,859 ms
1 doctor.memory.status  47,039 ms
1 doctor.memory.status  68,144 ms
1 doctor.memory.status 105,097 ms

Median ≈ 47 s. Worst case 105 s: for that call, the entire 105-second user-visible window was spent blocked on a status probe.
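
The summary figures can be re-derived from the five samples. A minimal sketch that uses only the latencies listed above (the helper name median is my own):

```javascript
// Latencies (ms) of the five doctor.memory.status calls listed above.
const latenciesMs = [6767, 34859, 47039, 68144, 105097];

// Median of a numeric array (average of the two middle values when the
// length is even).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

const summary = {
  medianMs: median(latenciesMs),
  maxMs: Math.max(...latenciesMs),
  meanMs: latenciesMs.reduce((sum, v) => sum + v, 0) / latenciesMs.length,
};

console.log(summary);
```

The mean comes out slightly above the median (about 52.4 s), so the 105 s call is not the only thing pulling the numbers up: four of the five samples are above 30 s.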

Code path

dist/server-plugin-bootstrap-CxnqPNN-.js:4572   "doctor.memory.status" handler
dist/server-plugin-bootstrap-CxnqPNN-.js:4592     await manager.probeEmbeddingAvailability()
dist/manager-D84yq6oM.js:1824                   const EMBEDDING_QUERY_TIMEOUT_REMOTE_MS  =  60_000
dist/manager-D84yq6oM.js:1825                   const EMBEDDING_QUERY_TIMEOUT_LOCAL_MS   = 300_000
dist/manager-D84yq6oM.js:1826                   const EMBEDDING_BATCH_TIMEOUT_REMOTE_MS  = 120_000
dist/manager-D84yq6oM.js:1827                   const EMBEDDING_BATCH_TIMEOUT_LOCAL_MS   = 600_000

For local providers, the batch timeout is 10 minutes. The status handler can hang for the full timeout if the embedding backend is slow or stalled, with no early bail-out for "we just need to know if memory is enabled, not actually re-probe the model."
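
To illustrate what an early bail-out could look like, here is a rough sketch of a budget wrapper around the probe. Everything except the name probeEmbeddingAvailability is hypothetical: withBudget, fakeProbe, the fallback object, and the { ok: true } result shape are stand-ins, since I have not confirmed what the real probe returns.

```javascript
// Resolve with the probe result, or with `fallback` once `budgetMs` elapses.
// Note: this only stops *waiting*; it does not cancel the underlying probe.
function withBudget(promise, budgetMs, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), budgetMs);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical stand-in for manager.probeEmbeddingAvailability().
function fakeProbe(delayMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ ok: true }), delayMs);
  });
}

async function demo() {
  const fallback = { ok: false, timedOut: true };
  const fast = await withBudget(fakeProbe(10), 200, fallback); // probe wins
  const slow = await withBudget(fakeProbe(500), 50, fallback); // budget wins
  return { fast, slow };
}
```

With a wrapper like this, the status handler's worst case is bounded by the budget instead of by EMBEDDING_BATCH_TIMEOUT_LOCAL_MS, although the slow probe itself keeps running in the background until it finishes or hits its own timeout.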

Why this matters beyond doctor

  • The health-monitor (mentioned in #38596) calls into status during its check cycle. A 60–105 s stall in status means the health monitor's circuit breakers can fire on what was actually a healthy Discord channel, causing a false-positive restart.
  • Every openclaw status / openclaw doctor invoked while debugging Discord issues (as we did today) eats minutes of the operator's time and contributes to the slow channels.status 3787 ms and node.list <multi-second> events that show up as control-plane noise in the same logs.
  • Per-process: while the embedding probe is in flight, other handlers in the same WS connection's request lane appear queued behind it from the operator-tool perspective, even if they execute on a separate worker.

Reproduction

  1. Single-account openclaw 2026.4.23, default memory backend.
  2. While idle, run openclaw doctor or openclaw status --deep repeatedly for 30 min.
  3. Observe gateway.log for [ws] ⇄ res ✓ doctor.memory.status <N>ms events. Expect a baseline of roughly 6–7 s, with spikes to 30–105 s when the embedding probe path is exercised under load or when the backend is cold.
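
For step 3, a small sketch of pulling the latencies out of the log text. The regex is built from the line pattern quoted above and assumes the latency is a plain integer followed by "ms". The sample lines are constructed for illustration, not copied from a real log, and real lines may carry timestamps or other fields.

```javascript
// Matches the "doctor.memory.status <N>ms" tail of a gateway.log response line.
// Assumption: the latency is a plain integer directly followed by "ms".
const STATUS_LATENCY = /doctor\.memory\.status\s+(\d+)\s*ms/;

// Return the latency (ms) of every doctor.memory.status line in `logText`.
function extractLatencies(logText) {
  const latencies = [];
  for (const line of logText.split('\n')) {
    const match = STATUS_LATENCY.exec(line);
    if (match) latencies.push(Number(match[1]));
  }
  return latencies;
}

// Illustrative input only: constructed from the quoted pattern.
const sample = [
  '[ws] ⇄ res ✓ doctor.memory.status 6767ms',
  '[ws] ⇄ res ✓ channels.status 3787ms',
  '[ws] ⇄ res ✓ doctor.memory.status 105097ms',
].join('\n');

const latencies = extractLatencies(sample);
const slowCalls = latencies.filter((ms) => ms >= 30_000);

console.log(latencies, slowCalls.length);
```

Against a real run, the same function could be fed the contents of gateway.log read as UTF-8 text.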

Asks

Three concrete options, in order of effort:

  1. Cache result for N seconds (simplest). The probe yes/no answer doesn't change every call. A 30 s TTL cache would answer cache hits in well under 100 ms and only re-probe when the entry is stale.
  2. Strict short-budget for the default path; deep probe behind a flag. Default doctor.memory.status returns enabled: true|false, lastProbeAt, lastProbeOk from cache; --deep triggers a real probeEmbeddingAvailability() with the long timeout. This matches how other CLIs split fast-status from deep-checks.
  3. Decouple from health-monitor's path. Even if the user-facing CLI keeps its current behavior, the health-monitor should not invoke the slow probe during its 5-minute check cycle. A separate memory.healthLite op that returns provider-readiness from the cached state, without a re-probe, would prevent the false-positive Discord restarts described in #38596.
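
A rough sketch of how options 2 and 3 could fit together. It is not taken from the openclaw source: createMemoryStatus and the deep option are hypothetical names, and it assumes that a probe which resolves without throwing counts as "ok".

```javascript
// Hypothetical sketch: the default path answers from cached state; only an
// explicit deep request runs the live probe.
function createMemoryStatus(manager, { enabled = true } = {}) {
  let lastProbeAt = null; // epoch ms of the last completed probe
  let lastProbeOk = null; // null until a probe has completed

  async function deepProbe() {
    try {
      // Assumption: resolving without throwing means the provider is usable.
      await manager.probeEmbeddingAvailability();
      lastProbeOk = true;
    } catch {
      lastProbeOk = false;
    }
    lastProbeAt = Date.now();
  }

  // Fast by default; pass { deep: true } to pay for a real probe.
  return async function status({ deep = false } = {}) {
    if (deep) await deepProbe();
    return { enabled, lastProbeAt, lastProbeOk };
  };
}
```

In this shape the health-monitor (or a memory.healthLite op) would call status() with no arguments and never wait on the embedding backend, while openclaw status --deep would call status({ deep: true }).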

Related

  • #38596 OPEN — health monitor restart loop. The memory-probe stall described here can fire the consecutiveHelloStalls / reconnectStallWatchdog paths from a control-plane stall, not a real Discord stall.
  • #71546 OPEN — Discord ingest-lag report. Independent surface, but compounds the operator-debugging-loop overhead.

Logs available

  • /Users/x./.openclaw/logs/gateway.log — the five doctor.memory.status <N>ms lines.
  • /tmp/openclaw/openclaw-2026-04-25.log — full ndjson trace.

extent analysis

TL;DR

Implement a caching mechanism for the doctor.memory.status handler to reduce the latency caused by the embedding probe.

Guidance

  • Consider caching the result of the probeEmbeddingAvailability() call for a short period, such as 30 seconds, to reduce the number of times the slow probe is executed.
  • Evaluate the feasibility of implementing a strict short-budget for the default path and moving the deep probe behind a flag, as suggested in the "Asks" section.
  • Investigate decoupling the health-monitor's path from the slow probe to prevent false-positive Discord restarts.
  • Review the related issues (#38596 and #71546) to ensure that the solution does not introduce new problems or regressions.

Example

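// Simple caching sketch: wraps the slow probe in a 30 s TTL cache.
// `manager` is assumed to be the memory manager the handler already uses.
const CACHE_TTL_MS = 30_000; // 30 seconds
let cached = null;   // { value, timestamp } of the last successful probe
let inFlight = null; // shared promise so concurrent callers trigger one probe

async function probeEmbeddingAvailabilityCached(manager) {
  if (cached && Date.now() - cached.timestamp < CACHE_TTL_MS) {
    return cached.value; // fresh enough: skip the live probe
  }
  if (!inFlight) {
    inFlight = manager
      .probeEmbeddingAvailability()
      .then((value) => {
        cached = { value, timestamp: Date.now() };
        return value;
      })
      .finally(() => {
        inFlight = null; // allow a new probe once this one settles
      });
  }
  return inFlight;
}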

Notes

The provided example is a simplified illustration of caching and may require modifications to fit the actual implementation. It is essential to consider the trade-offs between cache freshness and performance when implementing a caching solution.

Recommendation

Apply a caching workaround to reduce the latency caused by the embedding probe, as it is a relatively simple and low-risk solution that can provide significant performance improvements.
