openclaw - ✅(Solved) Fix [Bug]: Chat-turn latency 30-60s+ on every turn since v2026.4.23: per-turn plugin runtime re-evaluation (cross-platform BLOCKER) [2 pull requests, 10 comments, 8 participants]

openclaw/openclaw#75512, fetched 2026-05-02 05:33:42

Since approximately v2026.4.5, the embedded runner's per-chat-turn prep
phase re-evaluates process-lifetime data (plugin discovery, manifest registry, installed-plugin index, registry snapshot, package root,
bundled plugins dir, auth-profile store) on every turn. On fast x86
dev hardware this manifests as a 3-8 second prep; on ARM SBCs it's
30-60 seconds, with the LLM call itself being only 2-3 seconds of that.

This is likely the same root cause underlying #60528, #62051, #71938
and the chokidar-loader regression in #73176. Filing as a separate
issue because I have concrete trace evidence pinpointing the eight
hot points and a reference patch that recovers ~85% of the latency.

Environment

  • OpenClaw: 2026.4.27 (adc20fe), reproduces on every release since ~2026.4.5
  • Hardware: orangepi4pro (ARM)
  • Provider: minimax
  • Channel: telegram
  • Node: 24.15.0

Evidence

[trace:embedded-run] prep stages — single chat turn after warm startup

Unpatched (current main):

totalMs=53561 stages=
workspace-sandbox:21ms
skills:1ms
core-plugin-tools:14976ms <-- 15s
bootstrap-context:51ms
bundle-tools:3438ms
system-prompt:15783ms <-- 16s
session-resource-loader:3445ms
agent-session:2ms
stream-setup:15844ms <-- 16s

Three independent stages all ~15s with similar shape — characteristic signature of three call sites each re-walking the same shared
dependency graph (plugin registry / auth profile resolution).

With the cache patch (linked below):

totalMs=8257 stages=
core-plugin-tools:2580ms (-83%)
bundle-tools:495ms (-86%)
system-prompt:2326ms (-85%)
session-resource-loader:545ms (-84%)
stream-setup:2290ms (-86%)

End-to-end chat turn including LLM call drops from ~76s to ~11s.
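
The per-stage reductions quoted above can be re-derived from the two traces. The following is only an arithmetic check; the stage values are copied from the traces, and the helper name `reductionPct` is illustrative rather than anything in OpenClaw:

```typescript
// Stage timings in milliseconds, copied from the unpatched and patched traces.
interface StageDelta {
  stage: string;
  before: number;
  after: number;
}

const stageDeltas: StageDelta[] = [
  { stage: "core-plugin-tools", before: 14976, after: 2580 },
  { stage: "bundle-tools", before: 3438, after: 495 },
  { stage: "system-prompt", before: 15783, after: 2326 },
  { stage: "session-resource-loader", before: 3445, after: 545 },
  { stage: "stream-setup", before: 15844, after: 2290 },
];

// Hypothetical helper: percentage saved going from `before` to `after`, rounded.
function reductionPct(before: number, after: number): number {
  return Math.round(((before - after) / before) * 100);
}

const reductions = stageDeltas.map((d) => ({
  stage: d.stage,
  pct: reductionPct(d.before, d.after),
}));

for (const r of reductions) {
  console.log(`${r.stage}: -${r.pct}%`);
}
// Overall prep total, from totalMs=53561 and totalMs=8257.
console.log(`total: -${reductionPct(53561, 8257)}%`);
```

The overall figure computed from the two `totalMs` values rounds to 85%, which is where the "~85% of the latency" claim comes from.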

node --prof baseline (unpatched)

[Summary]
ticks total nonlib name
5388 8.6% 21.7% JavaScript
19464 31.2% 78.3% C++
3377 5.4% 13.6% GC
37468 60.1% Shared libraries

[C++ entry points] top
3233 39.6% syscall@@GLIBC
1208 14.8% access@@GLIBC
936 11.5% __open@@GLIBC
771 9.4% __read@@GLIBC

[Bottom up (heavy) profile] top JS leaves
3391 ticks (~5.4% of total) findPackageRootSync
<- resolveOpenClawPackageRootSync
<- resolveBundledPluginsDir
<- resolvePluginSourceRoots
<- discoverOpenClawPlugins (78.6%)
<- resolvePluginCacheInputs (21.4%)

discoverOpenClawPlugins is invoked 700+ times per chat turn with
identical inputs (verified with stack-trace probe).

Where each second goes (instrumented and confirmed)

  1. hasAuthForProvider (model-config.helpers.ts:37) re-reads
    auth-profiles.json from disk per provider. Called ~30 times per createPdfTool / createImageTool / createImageGenerateTool
    construction.
  2. resolvePluginToolRegistry (tools.ts:99) → resolveRuntimePluginRegistry
    per turn; falls through to loadOpenClawPlugins on cacheKey miss.
  3. discoverOpenClawPlugins (discovery.ts:845) runs synchronous
    readdirSync + readFileSync on every plugin manifest each call.
  4. resolveBundledPluginsDir → resolveOpenClawPackageRootSync →
    findPackageRootSync walks fs from cwd looking for package.json
    on every call.
  5. Legacy doctor migrations (legacy-config-compat.ts) call into
    plugin manifest registry on every config read.
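
To illustrate hot point 4, here is a minimal sketch of why a repeated package-root walk is expensive and how a process-lifetime memo collapses it to a single walk. Both functions are hypothetical stand-ins written for this example; they are not the real `findPackageRootSync`, and the project's own guidance discourages persistent caches at this layer, so treat this as an illustration of the cost rather than a proposed patch:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical stand-in for the upward walk: climb from `startDir` until a
// directory containing package.json is found, or the filesystem root is hit.
function findPackageRootUncached(startDir: string): string | null {
  let dir = path.resolve(startDir);
  for (;;) {
    if (fs.existsSync(path.join(dir, "package.json"))) return dir;
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}

// Process-lifetime memo keyed by the resolved start directory.
const packageRootCache = new Map<string, string | null>();
let uncachedWalks = 0; // counter kept only for illustration

function findPackageRootCached(startDir: string): string | null {
  const key = path.resolve(startDir);
  if (packageRootCache.has(key)) return packageRootCache.get(key) ?? null;
  uncachedWalks += 1;
  const root = findPackageRootUncached(key);
  packageRootCache.set(key, root);
  return root;
}

// 700 lookups with identical input trigger a single filesystem walk.
for (let i = 0; i < 700; i++) findPackageRootCached(process.cwd());
console.log(`filesystem walks performed: ${uncachedWalks}`);
```

Each uncached walk issues one existence check per directory level, which is consistent with the `access` and `open` syscalls dominating the profile above.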

Root cause (best guess)

src/plugins/CLAUDE.md documents the intended pattern:

Cache concept: metadata stays fresh unless a caller owns an explicit
PluginMetadataSnapshot, PluginLookUpTable, or manifest registry for the current flow. Do not add persistent metadata caches for
discovery, manifest registries, installed-index reconstruction...

The seam parameters (manifestRegistry, index, candidates) exist
on most call sites — loadAgentToolResultMiddlewaresForRuntime, loadPluginManifestRegistry, loadInstalledPluginIndex,
loadPluginRegistrySnapshot, loadOpenClawPlugins — but no caller
in the agent runtime threads a snapshot through. Each per-turn entry
point re-derives the registry from disk.

Why this isn't visible to maintainers

On M-series Macs / fast x86, the re-derivation is fast enough that
it reads as 3-8s prep, which sits below the EMBEDDED_RUN_STAGE_WARN_TOTAL_MS = 10_000 warn threshold and
EMBEDDED_RUN_STAGE_WARN_STAGE_MS = 5_000 per-stage threshold in
attempt-stage-timing.ts. No CI assertion runs on slow hardware.
Tests follow the agents/CLAUDE.md guideline of using lightweight
artifacts instead of cold-loading plugin runtime, so the test suite
never exercises the per-turn cold path. The trace infrastructure exists
for debugging, not monitoring, so the regression goes unnoticed
across daily releases.

Mitigation (env-gated, no behavior change when unset)

I maintain a survival branch with eight process-lifetime caches at the
hot points above, opt-in via OPENCLAW_DIRTY_DISCOVERY_CACHE=1:

https://github.com/sergeyksv/openclaw/tree/survival/plugin-cache

The commit body is the canonical reference for the eight cache layers,
per-layer rationale, and per-stage measurements. This is not offered as an upstream PR — src/plugins/CLAUDE.md explicitly forbids
persistent metadata caches at this layer. It exists so that other users
hitting this regression on slow hardware can apply a working patch
while waiting for the proper fix.
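
As a rough picture of what "env-gated, no behavior change when unset" means, the sketch below wraps an expensive function so that it is memoized only when the opt-in variable is set. The variable name `OPENCLAW_DIRTY_DISCOVERY_CACHE` comes from the issue; the wrapper `envGatedMemo` and the sample discovery function are assumptions made for this example and do not reproduce the survival branch:

```typescript
// Hypothetical wrapper: memoize `compute` only while the gate reports "on".
// The default gate reads the opt-in variable named in the issue.
function envGatedMemo<T>(
  compute: () => T,
  isEnabled: () => boolean = () =>
    process.env.OPENCLAW_DIRTY_DISCOVERY_CACHE === "1",
): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (!isEnabled()) return compute(); // gate off: original behavior, no caching
    if (cached === undefined) cached = { value: compute() };
    return cached.value;
  };
}

let discoveryRuns = 0;

// Stands in for an expensive disk walk; the returned IDs are sample data.
const discoverPluginsSketch = envGatedMemo(() => {
  discoveryRuns += 1;
  return ["brave", "duckduckgo"];
});

discoverPluginsSketch();
discoverPluginsSketch();
console.log(`discovery ran ${discoveryRuns} time(s)`);
```

Because the gate is checked on every call, leaving the variable unset keeps the uncached path byte-for-byte unchanged, which is the property the mitigation relies on.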

Proper fix

Plumb a PluginRegistrySnapshot (or PluginMetadataSnapshot) through
the per-turn call chain, owned by the gateway / agent runtime for the process lifetime:

  • loadAgentToolResultMiddlewaresForRuntime({manifestRegistry})
    caller already accepts this; nothing passes it
  • resolvePluginTools({snapshot}) and
    resolveRuntimePluginRegistry({snapshot}) similar
  • createOpenClawCodingTools should not call hasAuthForProvider per
    provider during construction; either move the enumeration into
    execute() (lazy) or use static capability descriptors per the
    agents/tools/CLAUDE.md guideline
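
The shape of the proposed plumbing can be sketched as follows. Every name here (`PluginSnapshotSketch`, `resolveToolsSketch`, `AgentRuntimeSketch`) is hypothetical, and the real snapshot type has invariants that this example does not attempt to model; the point is only that an optional seam parameter helps once some owner actually passes it:

```typescript
// Illustrative snapshot shape; not the real PluginRegistrySnapshot.
interface PluginSnapshotSketch {
  readonly pluginIds: readonly string[];
  readonly builtAt: number;
}

let diskDerivations = 0;

// Stands in for the expensive path: discovery, manifest parsing, index rebuild.
function buildSnapshotFromDisk(): PluginSnapshotSketch {
  diskDerivations += 1;
  return { pluginIds: ["telegram", "minimax"], builtAt: Date.now() };
}

// A per-turn entry point with an optional seam: use the caller's snapshot if
// one is passed, otherwise fall back to re-deriving everything from disk.
function resolveToolsSketch(
  opts: { snapshot?: PluginSnapshotSketch } = {},
): string[] {
  const snapshot = opts.snapshot ?? buildSnapshotFromDisk();
  return snapshot.pluginIds.map((id) => `${id}:tool`);
}

// The runtime builds one snapshot and threads it through every turn.
class AgentRuntimeSketch {
  private readonly snapshot: PluginSnapshotSketch = buildSnapshotFromDisk();

  runTurn(): string[] {
    return resolveToolsSketch({ snapshot: this.snapshot });
  }
}
```

Calling `resolveToolsSketch()` with no arguments on each turn reproduces the regression in miniature (one disk derivation per call), while `AgentRuntimeSketch.runTurn()` performs the derivation once at construction.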

This is a 200+ line PR across ~15 files. I am not in a position to
write that PR (small fork, no insight into invariants the snapshot
must satisfy), but I'm happy to share more diagnostic detail or
help reproduce.

Asks

  1. Confirm whether this duplicates #60528 or should stay separate.
  2. If a perf gate on a CPU-budget runner with a real chat turn could be added to CI, the next regression of this shape would be caught
    automatically. Currently nothing surfaces it.
  3. The trace warn thresholds (10000ms total / 5000ms stage) seem high
    given the LLM portion is ~3s. Lowering to e.g. 3000ms / 1500ms
    would make the trace useful for proactive monitoring rather than
    only post-hoc investigation.
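
To make ask 3 concrete, the sketch below applies both threshold pairs to the patched trace from the Evidence section. The threshold values are the ones quoted in this issue; the function `checkStages` and the strict `>` comparison are assumptions, since the actual logic in attempt-stage-timing.ts is not shown here:

```typescript
interface StageThresholds {
  totalMs: number;
  stageMs: number;
}

interface WarnResult {
  totalExceeded: boolean;
  slowStages: string[];
}

// Current constants as quoted in the issue, and the lower values from ask 3.
const CURRENT_THRESHOLDS: StageThresholds = { totalMs: 10_000, stageMs: 5_000 };
const PROPOSED_THRESHOLDS: StageThresholds = { totalMs: 3_000, stageMs: 1_500 };

// Hypothetical check; a strict ">" comparison is an assumption.
function checkStages(
  stages: ReadonlyArray<[string, number]>,
  totalMs: number,
  thresholds: StageThresholds,
): WarnResult {
  return {
    totalExceeded: totalMs > thresholds.totalMs,
    slowStages: stages
      .filter(([, ms]) => ms > thresholds.stageMs)
      .map(([name]) => name),
  };
}

// Patched trace from the Evidence section (totalMs=8257).
const patchedStages: ReadonlyArray<[string, number]> = [
  ["core-plugin-tools", 2580],
  ["bundle-tools", 495],
  ["system-prompt", 2326],
  ["session-resource-loader", 545],
  ["stream-setup", 2290],
];
const patchedTotalMs = 8257;

console.log(checkStages(patchedStages, patchedTotalMs, CURRENT_THRESHOLDS));
console.log(checkStages(patchedStages, patchedTotalMs, PROPOSED_THRESHOLDS));
```

Under the current thresholds the 8.3s patched prep produces no warning at all, while the proposed values would flag the total and the three stages above 1500ms.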

Error Message

No user-visible error is raised — the system functions correctly, just very slowly. The 30-60s wait per chat turn is the only externally visible symptom.

Root Cause

  1. Embedded runner stage trace — emitted because total exceeds the EMBEDDED_RUN_STAGE_WARN_TOTAL_MS = 10_000 / EMBEDDED_RUN_STAGE_WARN_STAGE_MS = 5_000
    thresholds in src/agents/pi-embedded-runner/run/attempt-stage-timing.ts (the trace itself is reproduced under "Actual behavior").

Fix Action

Fix / Workaround

A separate startup-phase trace adds another ~22.8s (model-resolution 6.5s, auth 7.0s, attempt-dispatch 9.3s) on the first turn after gateway start.

Who is affected: Everyone. Confirmed slow on ARM SBC (orangepi4pro), Intel i7, and Apple M2. Earlier in this debugging cycle, before per-plugin disabling was
applied as a mitigation, turn time was on the order of minutes; selectively disabling unused bundled plugins is currently the only thing keeping turns finite for most users.

Practical consequence: OpenClaw cannot be used for its primary function on any reasonable hardware as of v2026.4.5+. Workarounds in the wild (rollback to
v2026.4.23 — see piunikaweb 2026-04-29, or manually disabling every bundled plugin not strictly needed) sacrifice features and ship daily breakage risk. A working opt-in cache mitigation (sergeyksv/openclaw survival/plugin-cache) recovers ~85% of the latency without provider, network, or LLM-side changes — proving the cost is purely repeated work and the regression is recoverable.

PR fix notes

PR #75022: fix(infer): load model catalog metadata-only for list/inspect/providers

Description (problem / solution / changelog)

Summary

  • Problem: openclaw infer model list, openclaw infer model inspect, and openclaw infer model providers hang indefinitely on 2026.4.27. The grandchild Node process spins at 100% CPU, opens zero TCP connections, and writes nothing to stdout/stderr until the timeout fires. Reported in #74986; --version, --help, and gateway status work.
  • Root cause (catalog-inspection slice): All three handlers funnel into loadModelCatalog(...) in src/agents/model-catalog.ts. Even on the read-only path, the function unconditionally calls augmentModelCatalogWithProviderPlugins(...) at the call site formerly on line 194, which threads through src/plugins/provider-runtime.ts:resolveProviderPluginsForCatalogHooks → src/plugins/provider-hook-runtime.ts:resolveProviderPluginsForHooks → src/plugins/providers.runtime.ts:resolvePluginProviders → src/plugins/loader.ts:resolveRuntimePluginRegistry. On a CLI cold start with no active registry, resolveRuntimePluginRegistry falls through to loadOpenClawPlugins(...), which since b7a1bfd2 ("fix(plugins): cache installed manifest registry") builds an installed-manifest cache key via safeFileSignature (fs.statSync per plugin) + hashJson(...) over the index — exactly the synchronous CPU work the reporter's lsof/ps evidence points at (100% CPU, zero TCP, zero stdout). The agents list 2026.4.29 fixes (8fe449c8, d5eae0d9) addressed the same regression class on a different code path by routing channel queries through listReadOnlyChannelPluginsForConfig — i.e. "no plugin runtime on read-only metadata paths."
  • Fix: Add an opt-in skipProviderPluginAugmentation?: boolean option to loadModelCatalog. When true, the function returns the catalog assembled from PI SDK static rows + manifest static rows + cfg.models.providers configured rows, and skips the augmentModelCatalogWithProviderPlugins(...) call entirely. The three CLI inspection commands — infer model list, infer model inspect, and infer model providers (buildModelProviders) — pass readOnly: true and skipProviderPluginAugmentation: true. The flag is opt-in so existing readOnly: true callers (the only one outside this PR is appendCatalogSupplementRows in src/commands/models/list.rows.ts:347, used by models list --all) keep their dynamic plugin-derived rows.
  • What changed:
    • src/agents/model-catalog.ts:
      • Add skipProviderPluginAugmentation?: boolean to loadModelCatalog's param type with a doc comment that names the contract.
      • Gate the augmentModelCatalogWithProviderPlugins(...) call behind the new flag; emit plugin-models-skipped instead of plugin-models-merged when bypassed.
      • Add a parallel readOnlyModelCatalogPromise cache slot for readOnly: true callers that do want augmentation, so long-running hosts (appendCatalogSupplementRows) don't rebuild from scratch on every call. Skip-augmentation callers deliberately stay uncached: their result is a strict subset of rows and must not be served the with-augmentation cache. useCache: false and resetModelCatalogCache() symmetrically invalidate both slots; the empty-result branch and the catch handler null the matching slot to avoid cache poisoning.
    • src/cli/capability-cli.ts: buildModelProviders (used by infer model providers), infer model list, and infer model inspect pass readOnly: true, skipProviderPluginAugmentation: true. Command --description text now points users to openclaw models list --all for live provider-discovered models.
    • src/cli/capability-cli.test.ts: three tests assert the flag combination per command.
    • src/agents/model-catalog.test.ts: five tests lock the new contract — augmentation actually skipped when the flag is set, read-only cache reuse for non-skip callers, metadata-only result must not be served as the with-augmentation cache, useCache: false paired with readOnly: true invalidates the read-only slot, and useCache: false without readOnly also invalidates the read-only slot (cross-slot freshness from the write-path caller in src/commands/auth-choice.model-check.ts).
  • What did NOT change (scope boundary):
    • CHANGELOG.md — left untouched; release-note wording is the maintainer's call.
    • Default behavior of loadModelCatalog: when skipProviderPluginAugmentation is omitted/false, the augmentation step still runs exactly as before, so models list --all (appendCatalogSupplementRows) and every other current readOnly: true caller keeps the same catalog contents.
    • ensureOpenClawModelsJson, buildShouldSuppressBuiltInModel, the manifest planner, and the model-catalog cache for the non-read-only slot: untouched.
    • infer model run (local + gateway), infer model auth, image/audio/tts/embedding subcommands: out of scope; they are write/run paths and do not go through the read-only catalog read.
    • No new exports, no plugin-SDK / public-surface contract changes, no any introduced.
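
The flag and cache-slot contract described in the Summary above can be modeled roughly as follows. The option names (`readOnly`, `useCache`, `skipProviderPluginAugmentation`) come from the PR text; the function bodies, the row type, and the sample row `ollama/live-model` are hypothetical, and error-path slot clearing is omitted for brevity:

```typescript
type CatalogRow = { id: string; source: "static" | "plugin" };

interface LoadOptionsSketch {
  readOnly?: boolean;
  useCache?: boolean;
  skipProviderPluginAugmentation?: boolean;
}

let augmentationCalls = 0;
let readOnlyCatalogPromise: Promise<CatalogRow[]> | null = null;

// Stands in for the static sources (SDK rows, manifest rows, configured rows).
async function staticRows(): Promise<CatalogRow[]> {
  return [{ id: "openai/gpt-5.4", source: "static" }];
}

// Stands in for augmentModelCatalogWithProviderPlugins, the expensive step.
async function augmentWithPlugins(rows: CatalogRow[]): Promise<CatalogRow[]> {
  augmentationCalls += 1;
  return [...rows, { id: "ollama/live-model", source: "plugin" }];
}

async function loadModelCatalogSketch(
  opts: LoadOptionsSketch = {},
): Promise<CatalogRow[]> {
  // useCache: false invalidates the read-only slot regardless of readOnly.
  if (opts.useCache === false) readOnlyCatalogPromise = null;

  if (opts.skipProviderPluginAugmentation === true) {
    // Metadata-only path: deliberately uncached, so a subset result can never
    // be served to a caller that expects plugin-derived rows.
    return staticRows();
  }

  if (opts.readOnly === true) {
    const existing = readOnlyCatalogPromise;
    if (existing !== null) return existing;
    const pending = staticRows().then(augmentWithPlugins);
    readOnlyCatalogPromise = pending;
    return pending;
  }

  return staticRows().then(augmentWithPlugins);
}
```

In this model the three inspection commands correspond to calls with both `readOnly: true` and `skipProviderPluginAugmentation: true`, which never reach `augmentWithPlugins`, while a `models list --all` style caller passes only `readOnly: true` and shares the cached augmented result.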

Cross-reference — already fixed upstream (no overlap with this PR)

Issue #74986 reported four hang commands. Three are addressed on main or by this PR via separate code paths; the fourth (infer model run via the gateway path) is tracked separately under "Out of scope". This PR has no textual conflict with any of the upstream fixes:

Command | Status | Files / functions touched
agents list | Fixed on main by 8fe449c8, d5eae0d9 (2026-04-26) | src/commands/agents.commands.list.ts, src/commands/agents.providers.ts, src/commands/health-format.ts, src/commands/message-format.ts
infer model run --local | Fixed on main by 12ee7f69 (2026-04-29) | src/agents/pi-embedded-runner/model.ts, src/agents/simple-completion-runtime.ts, src/cli/capability-cli.ts:656 (one-line cfg, add inside runModelRun, far from this PR's buildModelProviders / registerCapabilityCli call sites)
infer model list / inspect / providers | Fixed by this PR | src/agents/model-catalog.ts, src/cli/capability-cli.ts (buildModelProviders, model.command("list"|"inspect"|"providers"))
infer model run (gateway path) | Not yet fixed | Tracked under "Out of scope" — needs a focused follow-up issue with profile evidence

Reproduction

On 2026.4.27 (or current main before this PR), with a ~/.openclaw/config.yaml similar to the reporter's:

agents:
  defaults:
    llm: { idleTimeoutSeconds: 600 }
    model: { primary: ollama/qwen3.5:397b-cloud }
models:
  providers:
    ollama:
      baseUrl: http://winhost:11434
      apiKey: ollama-local
      api: ollama

openclaw gateway status                                  # works
openclaw infer model list                                # before fix: hangs at 100% CPU until timeout
                                                          # after fix: returns the catalog and exits
openclaw infer model inspect --model openai/gpt-5.4      # same
openclaw infer model providers --json                    # same

The hung process can be confirmed with ps -o pcpu,etimes,wchan,comm -p <pid> (CPU pegged at ~100, no progress) and lsof -p <pid> (only the std{out,err} pipes, zero TCP — i.e. work is happening before any provider network probe).

Risk / Mitigation

  • Risk 1 — different output for catalog list: Skipping augmentModelCatalogWithProviderPlugins means infer model list / inspect / providers no longer surface dynamic plugin-discovered models (e.g. live Ollama models from /api/tags). The output is now: PI SDK static rows + manifest-declared rows + cfg.models.providers configured rows.
    • Mitigation: For inspection commands this is the right trade-off — the user wants "what does the catalog know about" to return promptly, not "what does the live Ollama daemon currently expose"; the latter is what models scan / models list --all are for, both of which still go through the dynamic path (their loadModelCatalog({ readOnly: true }) call site does not pass the new flag). The hang the reporter sees is a strictly worse failure mode than slightly-less-fresh output. The new flag is opt-in, so no other call site changes. The updated command --description strings point users to models list --all for live discovery.
  • Risk 2 — test coverage: Need to lock the new metadata-only contract so a future refactor doesn't silently regress.
    • Mitigation: Three CLI tests assert the flag combination per command; five model-catalog tests verify (a) augmentation is genuinely not called when the flag is set, (b) the with-augmentation read-only cache is reused on repeat calls, (c) a metadata-only result is not served to a later non-skip caller, (d) useCache: false paired with readOnly: true invalidates the read-only slot, and (e) useCache: false without readOnly (the write-path direction used by src/commands/auth-choice.model-check.ts) also invalidates the read-only slot — locking the symmetric cross-slot freshness contract so a future revert of the guard relaxation cannot silently leave a stale read-only cache visible to inspection callers.
  • Risk 3 — typing/security: No any introduced; only one new optional parameter is added (skipProviderPluginAugmentation?: boolean) and consulted via a strict === true check. No change to data flow, secrets handling, plugin trust boundary, or external surface.

Out of scope (tracked separately)

This PR intentionally does not address:

  • loadOpenClawPlugins synchronous hot-path cost — the safeFileSignature (per-plugin fs.statSync) + hashJson(...) cache-key build introduced by b7a1bfd2. This is the underlying engine that any non-skip catalog refresh still hits, and it has separate user-visible symptoms beyond the catalog inspection commands. Tracked in #75512 (per-turn re-evaluation, BLOCKER), #75069 (synchronous mirror walk blocking gateway main thread), and #75513 (ARM64 redundant calls on every request).
  • infer model run hang via the gateway path (prepareSimpleCompletionModelForAgent → resolveModelAsync → prepareProviderRuntimeAuth). The --local variant is already fixed on main by 12ee7f69; the gateway variant warrants a focused issue with profile evidence.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • CLI
  • Agents/models
  • Tests

Linked Issue/PR

Refs #74986. Of the four hang commands reported in that issue:

  • agents list — already fixed on main by 8fe449c8 / d5eae0d9 (2026-04-26).
  • infer model run --local — already fixed on main by 12ee7f69 (2026-04-29).
  • infer model list — fixed by this PR. The same fix is preventively extended to infer model inspect and infer model providers, which share the read-only catalog code path but were not individually exercised by the reporter.
  • infer model run via the gateway path — still open, intentionally out of scope here (see "Out of scope" section); needs a focused follow-up issue with profile evidence before it can be closed.

This PR therefore does not by itself fully address #74986; the issue should remain open until the gateway-run path is tracked and a separate fix lands.

Changed files

  • src/agents/model-catalog.test.ts (modified, +168/-0)
  • src/agents/model-catalog.ts (modified, +51/-13)
  • src/cli/capability-cli.test.ts (modified, +54/-0)
  • src/cli/capability-cli.ts (modified, +27/-6)

PR #75521: fix(plugins): reuse active plugin registry for tool resolution on every run

Description (problem / solution / changelog)

Summary

Fixes a performance regression where core-plugin-tools takes 20-32s on every embedded message run because the plugin registry is reloaded from scratch — including re-staging bundled runtime deps (e.g. acpx with 31 npm specs). Total prep stages reach 40-70s per message.

Root Cause

The resolvePluginToolRegistry() function in src/plugins/tools.ts only preferred the already-active plugin registry for gateway-bindable subagent scenarios. For regular embedded messages, it fell through to resolveRuntimePluginRegistry(), which compares cache keys.

The gateway loads plugins at startup with onlyPluginIds: startupPluginIds (scoped), but the embedded runner requests tools without scope. The cache key includes serializePluginIdScope():

  • Gateway: JSON.stringify(["acpx", "telegram", ...])
  • Embedded runner: "__unscoped__"

The keys mismatch → getCompatibleActivePluginRegistry() returns undefined → loadOpenClawPlugins() runs fresh, triggering bundled runtime dependency staging for every bundled plugin on every message.

Fix

Move the compatibility relaxation into getCompatibleActivePluginRegistry() in src/plugins/loader.ts, preserving the full loader compatibility contract (workspace, config, activation metadata, runtime mode). When the active registry was loaded with a specific plugin scope and the current request is unscoped, we build a cache key using the active registry's successfully-loaded plugin IDs as the scope. If that scoped key matches the active cache key, the registry is compatible and is reused.

This keeps all existing compatibility checks intact while safely allowing the scoped gateway-startup registry to satisfy unscoped embedded-tool requests.
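
The mismatch and the relaxed check can be reduced to a few lines. This sketch models only the plugin-scope component of the cache key; the real key also covers workspace, config, activation metadata and runtime mode, and all function names and the sorting of IDs are assumptions made for the example:

```typescript
// Hypothetical reduction of scope serialization: unscoped requests get a
// sentinel, scoped ones a JSON list (sorted here so order does not matter).
function serializeScopeSketch(onlyPluginIds?: readonly string[]): string {
  if (onlyPluginIds === undefined) return "__unscoped__";
  return JSON.stringify([...onlyPluginIds].sort());
}

interface ActiveRegistrySketch {
  cacheKey: string;
  loadedPluginIds: readonly string[];
}

// Before the fix: the request key must equal the active key exactly.
function isCompatibleStrict(
  active: ActiveRegistrySketch,
  requestScope?: readonly string[],
): boolean {
  return active.cacheKey === serializeScopeSketch(requestScope);
}

// After the fix, as described: an unscoped request may reuse a scoped active
// registry when re-keying with its successfully loaded plugin IDs matches.
function isCompatibleRelaxed(
  active: ActiveRegistrySketch,
  requestScope?: readonly string[],
): boolean {
  if (isCompatibleStrict(active, requestScope)) return true;
  if (requestScope !== undefined) return false;
  return active.cacheKey === serializeScopeSketch(active.loadedPluginIds);
}

// Gateway startup loaded a scoped registry; the embedded runner asks unscoped.
const gatewayRegistry: ActiveRegistrySketch = {
  cacheKey: serializeScopeSketch(["acpx", "telegram"]),
  loadedPluginIds: ["acpx", "telegram"],
};

console.log(isCompatibleStrict(gatewayRegistry, undefined));
console.log(isCompatibleRelaxed(gatewayRegistry, undefined));
```

If a scoped plugin failed to load, the re-derived key no longer matches the active key and the relaxed check falls back to a fresh load, which is one way to read "successfully-loaded plugin IDs" in the description above.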

Changes

  • src/plugins/loader.ts: +38 lines in getCompatibleActivePluginRegistry() — add scoped-active → unscoped-request reuse path
  • src/plugins/tools.ts: reverted — restore original behavior
  • CHANGELOG.md: add entry for the fix

Verification

  • pnpm test src/plugins/tools.optional.test.ts src/plugins/loader.runtime-registry.test.ts — 27/27 pass
  • pnpm lint:core — clean

Closes #75520

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/plugins/loader.runtime-registry.test.ts (modified, +184/-0)
  • src/plugins/loader.ts (modified, +23/-4)
  • src/plugins/registry-empty.ts (modified, +1/-0)
  • src/plugins/registry-types.ts (modified, +1/-0)
  • src/plugins/registry.ts (modified, +8/-4)
  • src/plugins/tools.optional.test.ts (modified, +40/-3)
  • src/plugins/tools.ts (modified, +32/-20)

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

After upgrading from v2026.4.23 to v2026.4.26, every chat turn's embedded-runner prep phase takes ~50+ seconds (LLM call ~3s; total ~76s) on ARM hardware. Before the upgrade, a turn took about 15s.

Steps to reproduce

  1. On an ARM SBC (orangepi4pro, Node 24.15.0), install OpenClaw v2026.4.26 (also reproduces on v2026.4.27).
  2. Configure a single LLM provider (minimax) and a single channel (telegram); leave bundled plugins
    (brave/duckduckgo/exa/firecrawl/google/minimax/moonshot/ollama/perplexity/searxng/tavily/xai) at their default enabled-by-default state.
  3. Start the gateway: pnpm openclaw gateway.
  4. Wait for startup to settle, then send one chat message via Telegram.
  5. Observe: the gateway logs [plugins] loading <id> from .../dist/extensions/<id>/index.js for every bundled plugin on every turn, the [trace:embedded-run] prep stages line reports totalMs ≈ 53000 with core-plugin-tools, system-prompt, and stream-setup each ≈ 15000ms, and the user-perceived turn takes ~75-80 seconds
    end-to-end while the LLM call portion is ~3 seconds.
  6. Repeat on the same gateway process (no restart): subsequent turns show the same ~50s prep, confirming the work is per-turn rather than one-time startup.

Expected behavior

On the same hardware running v2026.4.23 (the last release widely reported as stable, per community threads referenced in #60528 and the piunikaweb 2026-04-29
article), each chat turn completed in roughly 5-10 seconds end-to-end with the LLM call being the dominant portion. The intended caching pattern is also
documented in src/plugins/CLAUDE.md:

▎ Cache concept: metadata stays fresh unless a caller owns an explicit PluginMetadataSnapshot, PluginLookUpTable, or manifest registry for the current flow.
▎ Do not add persistent metadata caches for discovery, manifest registries, installed-index reconstruction, owner lookup, model suppression, provider policy,
▎ public-artifact metadata, or similar control-plane answers.

Per that guidance, the embedded-runner per-turn entry points (loadAgentToolResultMiddlewaresForRuntime, resolvePluginTools, resolveRuntimePluginRegistry,
createOpenClawCodingTools) should be invoked with a caller-owned manifestRegistry / PluginRegistrySnapshot that is built once per session (or per process) and reused across turns, instead of each call re-deriving discovery, manifest registry, installed-plugin index, package-root walks, and per-provider
auth-profile-store reads from disk on every turn. The expected [trace:embedded-run] prep stages totalMs on this hardware is in the 5000-8000ms range, comparable to what an opt-in process-wide cache produces today (validated at 8257ms on this same hardware: see sergeyksv/openclaw survival/plugin-cache).

Actual behavior

Each chat turn takes ~75-80 seconds end-to-end on the orangepi4pro, with the LLM call accounting for only ~3 seconds. Direct evidence:

  1. Embedded runner stage trace — emitted because total exceeds the EMBEDDED_RUN_STAGE_WARN_TOTAL_MS = 10_000 / EMBEDDED_RUN_STAGE_WARN_STAGE_MS = 5_000
    thresholds in src/agents/pi-embedded-runner/run/attempt-stage-timing.ts:

[agent/embedded] [trace:embedded-run] prep stages:
phase=stream-ready totalMs=53561 stages= workspace-sandbox:21ms
skills:1ms
core-plugin-tools:14976ms
bootstrap-context:51ms
bundle-tools:3438ms
system-prompt:15783ms
session-resource-loader:3445ms
agent-session:2ms
stream-setup:15844ms

A separate startup-phase trace adds another ~22.8s (model-resolution 6.5s, auth 7.0s, attempt-dispatch 9.3s) on the first turn after gateway start.

  2. Per-turn plugin re-import — every chat turn the gateway logs:

[plugins] loading brave from /home/sergey/openclaw/dist/extensions/brave/index.js
[plugins] loading duckduckgo from /home/sergey/openclaw/dist/extensions/duckduckgo/index.js
... (all 12 bundled extensions) ...
[plugins] loaded 12 plugin(s) (12 attempted) in 189.9ms

This fires on every turn, not only at startup.

  3. strace evidence — the same plugin manifest files are re-opened hundreds of times per turn:

openat .../dist-runtime/extensions/brave/package.json
openat .../dist-runtime/extensions/brave/openclaw.plugin.json
openat .../dist-runtime/extensions/brave/index.js
openat .../dist-runtime/extensions/browser/package.json
... [cycles through every plugin, then repeats] ...

  4. node --prof summary (single chat turn):

[Summary]
ticks total nonlib name
5388 8.6% 21.7% JavaScript
19464 31.2% 78.3% C++
3377 5.4% 13.6% GC
37468 60.1% Shared libraries

[C++ entry points] (top)
3233 39.6% syscall@@GLIBC
1208 14.8% access@@GLIBC
936 11.5% __open@@GLIBC
771 9.4% __read@@GLIBC

[Bottom up (heavy) profile] (hottest JS leaf)
3391 ticks findPackageRootSync (~5.4% of total CPU)
<- resolveOpenClawPackageRootSync
<- resolveBundledPluginsDir
<- resolvePluginSourceRoots
<- discoverOpenClawPlugins (78.6%)
<- resolvePluginCacheInputs (21.4%)

discoverOpenClawPlugins is invoked 700+ times per chat turn with identical inputs (verified by adding a process-wide stack-trace probe to the function entry).
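
A probe of the kind mentioned above might look like the following. This is a guess at the technique, not the reporter's actual instrumentation: `withCallProbe` and `discoverSketch` are hypothetical names, and the caller line is taken from the V8 stack string, whose exact format is engine-specific:

```typescript
interface ProbeStats {
  calls: number;
  callers: Set<string>;
}

// Hypothetical probe: wrap a function, count calls per distinct argument list,
// and remember which stack frames the calls came from.
function withCallProbe<A extends unknown[], R>(
  fn: (...args: A) => R,
  stats: Map<string, ProbeStats>,
): (...args: A) => R {
  return (...args: A): R => {
    const key = JSON.stringify(args);
    const entry = stats.get(key) ?? { calls: 0, callers: new Set<string>() };
    entry.calls += 1;
    // Stack line 0 is "Error", line 1 is this wrapper, line 2 is the caller.
    const callerLine = (new Error().stack ?? "").split("\n")[2];
    entry.callers.add(callerLine ? callerLine.trim() : "<unknown>");
    stats.set(key, entry);
    return fn(...args);
  };
}

// Stand-in for a discovery function; the return value is sample data.
function discoverSketch(root: string): string[] {
  return [`${root}/brave`, `${root}/duckduckgo`];
}

const probeStats = new Map<string, ProbeStats>();
const probedDiscover = withCallProbe(discoverSketch, probeStats);
```

After a chat turn, a single map entry with a call count in the hundreds indicates repeated work on identical inputs, and the recorded caller lines show which call sites are responsible.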

  5. Per-factory breakdown of core-plugin-tools (instrumented):

[openclaw-tools-timing]
session-agent=0ms image-tool=612ms image-generate=1263ms
video-generate=1547ms music-generate=1730ms pdf-tool=2665ms
web-search=2841ms web-fetch=2842ms message-tool=2852ms
nodes-tool=2852ms tools-array=2858ms plugin-tools=2919ms

createPdfTool ~935ms, createImageGenerateTool ~651ms, createImageTool ~612ms — all dominated by hasAuthForProvider (src/agents/tools/model-config.helpers.ts:37) which reads auth-profiles.json from disk per provider, called dozens of times per tool factory.
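
One way to picture the cost, and the obvious remedy, is to read the auth-profile store once per tool-construction pass and hand the parsed result to each per-provider check. The sketch below assumes a simple `{ provider: { apiKey } }` layout, which is a guess; the real auth-profiles.json schema and the real `hasAuthForProvider` signature are not shown in this issue:

```typescript
import * as fs from "node:fs";

// Assumed store layout, for illustration only.
type AuthProfilesSketch = Record<string, { apiKey?: string }>;

let storeReads = 0;

// Reads and parses the store; a missing or unreadable file means "no profiles".
function readAuthProfilesSketch(file: string): AuthProfilesSketch {
  storeReads += 1;
  try {
    return JSON.parse(fs.readFileSync(file, "utf8")) as AuthProfilesSketch;
  } catch {
    return {};
  }
}

// Per-provider check that takes an already-loaded store instead of re-reading.
function hasAuthForProviderSketch(
  provider: string,
  profiles: AuthProfilesSketch,
): boolean {
  return typeof profiles[provider]?.apiKey === "string";
}

// One disk read per pass, reused for every provider in the list.
function providersWithAuthSketch(
  file: string,
  providers: readonly string[],
): string[] {
  const profiles = readAuthProfilesSketch(file);
  return providers.filter((p) => hasAuthForProviderSketch(p, profiles));
}
```

With roughly 30 provider checks per tool factory, this turns about 30 synchronous reads into one, at the price of the caller deciding how long a loaded store stays valid.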

No user-visible error is raised — the system functions correctly, just very slowly. The 30-60s wait per chat turn is the only externally visible symptom.

OpenClaw version

2026.4.26

Operating system

Ubuntu 24.04

Install method

No response

Model

minimax

Provider / routing chain

openclaw->minimax

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

Who is affected: Everyone. Confirmed slow on ARM SBC (orangepi4pro), Intel i7, and Apple M2. Earlier in this debugging cycle, before per-plugin disabling was
applied as a mitigation, turn time was on the order of minutes; selectively disabling unused bundled plugins is currently the only thing keeping turns finite for most users.

Severity: BLOCKER. Every chat turn pays ~75-80s of plugin/runtime overhead with the LLM portion only ~3s — the product is functionally unusable for interactive chat unless the user happens to be on hardware so fast it makes the regression appear merely "annoying" rather than "broken."

How often: Every single chat turn, deterministically. No warm-up effect within a process.

Practical consequence: OpenClaw cannot be used for its primary function on any reasonable hardware as of v2026.4.5+. Workarounds in the wild (rollback to
v2026.4.23 — see piunikaweb 2026-04-29, or manually disabling every bundled plugin not strictly needed) sacrifice features and ship daily breakage risk. A working opt-in cache mitigation (sergeyksv/openclaw survival/plugin-cache) recovers ~85% of the latency without provider, network, or LLM-side changes — proving the cost is purely repeated work and the regression is recoverable.

Additional information

Summary

Since approximately v2026.4.5, the embedded runner's per-chat-turn prep
phase re-evaluates process-lifetime data (plugin discovery, manifest registry, installed-plugin index, registry snapshot, package root,
bundled plugins dir, auth-profile store) on every turn. On fast x86
dev hardware this manifests as a 3-8 second prep; on ARM SBCs it's
30-60 seconds, with the LLM call itself being only 2-3 seconds of that.

This is likely the same root cause underlying #60528, #62051, #71938
and the chokidar-loader regression in #73176. Filing as a separate
issue because I have concrete trace evidence pinpointing the eight
hot points and a reference patch that recovers ~85% of the latency.

Environment

  • OpenClaw: 2026.4.27 (adc20fe), reproduces on every release since ~2026.4.5
  • Hardware: orangepi4pro (ARM)
  • Provider: minimax
  • Channel: telegram
  • Node: 24.15.0

Evidence

[trace:embedded-run] prep stages — single chat turn after warm startup

Unpatched (current main):

totalMs=53561 stages=
workspace-sandbox:21ms
skills:1ms
core-plugin-tools:14976ms <-- 15s
bootstrap-context:51ms
bundle-tools:3438ms
system-prompt:15783ms <-- 16s
session-resource-loader:3445ms
agent-session:2ms
stream-setup:15844ms <-- 16s

Three independent stages all ~15s with similar shape — characteristic signature of three call sites each re-walking the same shared
dependency graph (plugin registry / auth profile resolution).

With the cache patch (linked below):

totalMs=8257 stages=
core-plugin-tools:2580ms (-83%) bundle-tools:495ms (-86%)
system-prompt:2326ms (-85%)
session-resource-loader:545ms (-84%)
stream-setup:2290ms (-86%)

End-to-end chat turn including LLM call drops from ~76s to ~11s.

node --prof baseline (unpatched)

[Summary]
  ticks  total  nonlib  name
   5388   8.6%   21.7%  JavaScript
  19464  31.2%   78.3%  C++
   3377   5.4%   13.6%  GC
  37468  60.1%          Shared libraries

[C++ entry points] top
   3233  39.6%  syscall@@GLIBC
   1208  14.8%  access@@GLIBC
    936  11.5%  __open@@GLIBC
    771   9.4%  __read@@GLIBC

[Bottom up (heavy) profile] top JS leaves
3391 ticks (~5.4% of total) findPackageRootSync
  <- resolveOpenClawPackageRootSync
    <- resolveBundledPluginsDir
      <- resolvePluginSourceRoots
        <- discoverOpenClawPlugins (78.6%)
        <- resolvePluginCacheInputs (21.4%)

discoverOpenClawPlugins is invoked 700+ times per chat turn with
identical inputs (verified with stack-trace probe).
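
For anyone who wants to repeat the measurement, a call-count probe along these lines is one way to do it. This is a sketch, not the probe used for the numbers above: `withCallProbe` and `fakeDiscover` are hypothetical names, and the real signature of discoverOpenClawPlugins is not shown in this issue, so a stand-in function is wrapped instead.

```typescript
// Hypothetical probe: wraps a function and counts calls per distinct argument list.
// It also keeps the stack of the first call so the entry point can be inspected.
let firstStack: string | undefined;

function withCallProbe<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => R,
  counts: Map<string, number>,
): (...args: A) => R {
  return (...args: A): R => {
    const key = `${name}:${JSON.stringify(args)}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
    if (firstStack === undefined) firstStack = new Error().stack;
    return fn(...args);
  };
}

// Stand-in for the real discovery function (assumed shape, for illustration only).
const fakeDiscover = (opts: { workspaceDir: string }): string[] => [
  `${opts.workspaceDir}/plugin-a`,
];

const counts = new Map<string, number>();
const probedDiscover = withCallProbe("discover", fakeDiscover, counts);

// Simulate one chat turn that calls discovery 700 times with identical inputs.
for (let i = 0; i < 700; i++) probedDiscover({ workspaceDir: "/tmp/ws" });

console.log([...counts.entries()]);
```

A single map entry with a count in the hundreds is the signal that the inputs never change within a turn, so the work is repeatable rather than necessary.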

Where each second goes (instrumented and confirmed)

  1. hasAuthForProvider (model-config.helpers.ts:37) re-reads
    auth-profiles.json from disk per provider. Called ~30 times per createPdfTool / createImageTool / createImageGenerateTool
    construction.
  2. resolvePluginToolRegistry (tools.ts:99) → resolveRuntimePluginRegistry
    per turn; falls through to loadOpenClawPlugins on cacheKey miss.
  3. discoverOpenClawPlugins (discovery.ts:845) runs synchronous
    readdirSync + readFileSync on every plugin manifest each call.
  4. resolveBundledPluginsDir → resolveOpenClawPackageRootSync →
    findPackageRootSync walks the filesystem from cwd looking for
    package.json on every call.
  5. Legacy doctor migrations (legacy-config-compat.ts) call into
    plugin manifest registry on every config read.
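
Item 1 is the easiest of these to illustrate in isolation. The sketch below contrasts a per-provider check that re-reads the store each time with a lookup that reads it once. Everything here is assumed for illustration: the `AuthStore` shape is invented (the real auth-profiles.json schema is not shown in this issue), and `hasAuthNaive` / `buildAuthLookup` are hypothetical names, not OpenClaw functions.

```typescript
// Assumed store shape, for illustration only.
type AuthStore = { profiles: Record<string, { provider: string }> };

let storeReads = 0;
// Stand-in for reading and parsing auth-profiles.json from disk.
function readStore(): AuthStore {
  storeReads++;
  return { profiles: { default: { provider: "minimax" } } };
}

// Naive pattern: one store read per provider check.
function hasAuthNaive(provider: string): boolean {
  return Object.values(readStore().profiles).some((p) => p.provider === provider);
}

// Read-once pattern: build a set of providers, then answer every check from memory.
function buildAuthLookup(): (provider: string) => boolean {
  const providers = new Set(Object.values(readStore().profiles).map((p) => p.provider));
  return (provider) => providers.has(provider);
}

// ~30 checks per tool construction, as reported in item 1 (provider names are placeholders).
const providerNames = Array.from({ length: 30 }, (_, i) => `provider-${i}`);

providerNames.forEach((name) => hasAuthNaive(name));
const naiveReads = storeReads;

storeReads = 0;
const hasAuth = buildAuthLookup();
providerNames.forEach((name) => hasAuth(name));
const lookupReads = storeReads;

console.log({ naiveReads, lookupReads });
```

The answers are identical in both cases; only the number of store reads changes, from one per check to one per construction.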

Root cause (best guess)

src/plugins/CLAUDE.md documents the intended pattern:

Cache concept: metadata stays fresh unless a caller owns an explicit
PluginMetadataSnapshot, PluginLookUpTable, or manifest registry for the current flow. Do not add persistent metadata caches for
discovery, manifest registries, installed-index reconstruction...

The seam parameters (manifestRegistry, index, candidates) exist
on most call sites — loadAgentToolResultMiddlewaresForRuntime, loadPluginManifestRegistry, loadInstalledPluginIndex,
loadPluginRegistrySnapshot, loadOpenClawPlugins — but no caller
in the agent runtime threads a snapshot through. Each per-turn entry
point re-derives the registry from disk.
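
The difference between using the seam and ignoring it can be sketched as follows. The types and functions are stand-ins (`ManifestRegistry`, `loadRegistryFromDisk`, and `resolveTools` are hypothetical and much simpler than the real OpenClaw code); the only point is the optional-parameter fallback.

```typescript
// Hypothetical registry shape; the real type is not shown in this issue.
type ManifestRegistry = { pluginIds: string[] };

let diskLoads = 0;
// Stand-in for the readdirSync/readFileSync work done during discovery.
function loadRegistryFromDisk(): ManifestRegistry {
  diskLoads++;
  return { pluginIds: ["telegram", "minimax"] };
}

// A seam: use the caller's registry when one is passed, otherwise re-derive it.
function resolveTools(opts: { manifestRegistry?: ManifestRegistry } = {}): string[] {
  const registry = opts.manifestRegistry ?? loadRegistryFromDisk();
  return registry.pluginIds.map((id) => `${id}:tool`);
}

// Nothing threads a registry through: every turn falls back to disk.
for (let turn = 0; turn < 3; turn++) resolveTools();
const loadsWithoutSnapshot = diskLoads;

// The runtime owns one registry and passes it down on every turn.
diskLoads = 0;
const ownedRegistry = loadRegistryFromDisk();
for (let turn = 0; turn < 3; turn++) resolveTools({ manifestRegistry: ownedRegistry });
const loadsWithSnapshot = diskLoads;

console.log({ loadsWithoutSnapshot, loadsWithSnapshot });
```

This keeps the callee free of any persistent cache, which is what the quoted guideline asks for; ownership of the registry sits with the caller.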

Why this isn't visible to maintainers

On M-series Macs / fast x86, the re-derivation is fast enough that
it reads as 3-8s prep, which sits below the EMBEDDED_RUN_STAGE_WARN_TOTAL_MS = 10_000 warn threshold and
EMBEDDED_RUN_STAGE_WARN_STAGE_MS = 5_000 per-stage threshold in
attempt-stage-timing.ts. No CI assertion runs on slow hardware.
Tests follow the agents/CLAUDE.md guideline of using lightweight
artifacts instead of cold-loading plugin runtime, so the test suite
never exercises the per-turn cold path. The trace infrastructure exists
for debugging, not monitoring, so the regression goes unnoticed
across daily releases.

Mitigation (env-gated, no behavior change when unset)

I maintain a survival branch with eight process-lifetime caches at the
hot points above, opt-in via OPENCLAW_DIRTY_DISCOVERY_CACHE=1:

https://github.com/sergeyksv/openclaw/tree/survival/plugin-cache

The commit body is the canonical reference for the eight cache layers,
per-layer rationale, and per-stage measurements. This is not offered as an upstream PR — src/plugins/CLAUDE.md explicitly forbids
persistent metadata caches at this layer. It exists so that other users
hitting this regression on slow hardware can apply a working patch
while waiting for the proper fix.
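
For readers who only want the general shape of an env-gated cache, here is a minimal sketch. It is not the code on the survival branch: `envGatedMemo` and `slowResolve` are hypothetical, and the environment is passed in as a plain object (in Node one would pass `process.env`) so the example stays self-contained.

```typescript
// Returns fn unchanged unless the opt-in flag is "1"; otherwise memoizes by string key.
function envGatedMemo<R>(
  fn: (key: string) => R,
  env: Record<string, string | undefined>,
  flag = "OPENCLAW_DIRTY_DISCOVERY_CACHE",
): (key: string) => R {
  if (env[flag] !== "1") return fn; // flag unset: behavior is unchanged
  const cache = new Map<string, R>();
  return (key) => {
    if (!cache.has(key)) cache.set(key, fn(key));
    return cache.get(key) as R;
  };
}

let walks = 0;
// Stand-in for an expensive lookup such as the package-root walk.
const slowResolve = (cwd: string): string => {
  walks++;
  return `root-of:${cwd}`;
};

// Flag unset: both calls do the work.
const uncached = envGatedMemo(slowResolve, {});
uncached("/srv/bot");
uncached("/srv/bot");
const walksWhenOff = walks;

// Flag set: the second call is served from the process-lifetime map.
walks = 0;
const cached = envGatedMemo(slowResolve, { OPENCLAW_DIRTY_DISCOVERY_CACHE: "1" });
cached("/srv/bot");
cached("/srv/bot");
const walksWhenOn = walks;

console.log({ walksWhenOff, walksWhenOn });
```

The trade-off is the one the guideline warns about: a process-lifetime map never notices plugins installed or removed while the process is running, which is why this shape is only suitable as an opt-in stopgap.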

Proper fix

Plumb a PluginRegistrySnapshot (or PluginMetadataSnapshot) through
the per-turn call chain, owned by the gateway / agent runtime for the process lifetime:

  • loadAgentToolResultMiddlewaresForRuntime({manifestRegistry}):
    the function already accepts this; nothing passes it
  • resolvePluginTools({snapshot}) and
    resolveRuntimePluginRegistry({snapshot}): similar
  • createOpenClawCodingTools should not call hasAuthForProvider per
    provider during construction; either move the enumeration into
    execute() (lazy) or use static capability descriptors per the
    agents/tools/CLAUDE.md guideline
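
The lazy option in the last bullet could look roughly like this. It is a sketch under stated assumptions: the `Tool` shape, `createToolEager`, `createToolLazy`, and `hasAuthStub` are hypothetical, and the provider names other than minimax are placeholders.

```typescript
type Tool = { name: string; execute: (input: string) => string };

let authChecks = 0;
// Stand-in for hasAuthForProvider; the real one reads the auth-profile store.
function hasAuthStub(provider: string): boolean {
  authChecks++;
  return provider === "minimax";
}

const PROVIDERS = ["minimax", "provider-b", "provider-c"]; // placeholder list

// Eager: enumerates providers while the tool is being constructed.
function createToolEager(): Tool {
  const available = PROVIDERS.filter(hasAuthStub);
  return { name: "image", execute: (input) => `${available[0] ?? "none"}:${input}` };
}

// Lazy: defers the enumeration to the first execute() and remembers the result.
function createToolLazy(): Tool {
  let available: string[] | undefined;
  return {
    name: "image",
    execute: (input) => {
      if (available === undefined) available = PROVIDERS.filter(hasAuthStub);
      return `${available[0] ?? "none"}:${input}`;
    },
  };
}

createToolEager();
const checksAfterEagerConstruct = authChecks;

authChecks = 0;
const lazyTool = createToolLazy();
const checksAfterLazyConstruct = authChecks;
const lazyResult = lazyTool.execute("cat.png");
lazyTool.execute("dog.png");
const checksAfterLazyExecute = authChecks;

console.log({ checksAfterEagerConstruct, checksAfterLazyConstruct, checksAfterLazyExecute });
```

With the lazy variant, a turn that never invokes the tool pays nothing for auth enumeration, and a turn that does invoke it pays once.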

This is a 200+ line PR across ~15 files. I am not in a position to
write that PR (small fork, no insight into invariants the snapshot
must satisfy), but I'm happy to share more diagnostic detail or
help reproduce.

Asks

  1. Confirm whether this duplicates #60528 or should stay separate.
  2. If a perf gate on a CPU-budget runner with a real chat turn could be added to CI, the next regression of this shape would be caught
    automatically. Currently nothing surfaces it.
  3. The trace warn thresholds (10000ms total / 5000ms stage) seem high
    given the LLM portion is ~3s. Lowering to e.g. 3000ms / 1500ms
    would make the trace useful for proactive monitoring rather than
    only post-hoc investigation.
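
To make ask 3 concrete, the sketch below applies both threshold pairs to a fast-hardware turn and to the ARM trace. The two constants are quoted from attempt-stage-timing.ts as cited above; `exceedsBudget` is a hypothetical check (the real comparison may differ), and the fast-hardware stage numbers are invented to fall inside the reported 3-8s range.

```typescript
// Current thresholds, as quoted from attempt-stage-timing.ts in this issue.
const EMBEDDED_RUN_STAGE_WARN_TOTAL_MS = 10_000;
const EMBEDDED_RUN_STAGE_WARN_STAGE_MS = 5_000;

type Turn = { totalMs: number; stages: number[] };

// Hypothetical check: warn when the total or any single stage exceeds its limit.
function exceedsBudget(turn: Turn, totalLimit: number, stageLimit: number): boolean {
  return turn.totalMs > totalLimit || turn.stages.some((ms) => ms > stageLimit);
}

// Invented fast-hardware turn inside the reported 3-8s range.
const fastTurn: Turn = { totalMs: 6000, stages: [2000, 500, 2000, 500, 1000] };
// Unpatched ARM trace from the Evidence section (five largest stages).
const armTurn: Turn = { totalMs: 53561, stages: [14976, 3438, 15783, 3445, 15844] };

const fastWarnsToday = exceedsBudget(
  fastTurn,
  EMBEDDED_RUN_STAGE_WARN_TOTAL_MS,
  EMBEDDED_RUN_STAGE_WARN_STAGE_MS,
);
const armWarnsToday = exceedsBudget(
  armTurn,
  EMBEDDED_RUN_STAGE_WARN_TOTAL_MS,
  EMBEDDED_RUN_STAGE_WARN_STAGE_MS,
);
// Proposed thresholds from ask 3.
const fastWarnsProposed = exceedsBudget(fastTurn, 3000, 1500);

console.log({ fastWarnsToday, armWarnsToday, fastWarnsProposed });
```

Under the current limits only the slow-hardware turn warns; under the proposed limits the same regression would also surface on fast development machines.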

extent analysis

TL;DR

The most likely fix is to plumb a PluginRegistrySnapshot through the per-turn call chain to avoid re-deriving the registry from disk on every turn.

Guidance

  • Review the src/plugins/CLAUDE.md documentation to understand the intended caching pattern and how to implement it correctly.
  • Investigate the loadAgentToolResultMiddlewaresForRuntime, resolvePluginTools, and resolveRuntimePluginRegistry functions to determine how to pass a PluginRegistrySnapshot through the call chain.
  • Consider implementing a process-lifetime cache at the hot points identified in the issue to mitigate the performance regression.
  • Evaluate the OPENCLAW_DIRTY_DISCOVERY_CACHE mitigation branch as a temporary workaround.

Example

No code example is provided as the issue requires a deeper understanding of the OpenClaw codebase and the caching pattern.

Notes

The issue is complex and requires a thorough understanding of the OpenClaw architecture and caching mechanisms. The provided mitigation branch may help alleviate the performance regression, but a proper fix requires a more comprehensive solution.

Recommendation

Apply the OPENCLAW_DIRTY_DISCOVERY_CACHE mitigation branch as a temporary workaround until a proper fix can be implemented. This will help alleviate the performance regression and provide a more usable experience for users.

FAQ

Expected behavior

On the same hardware running v2026.4.23 (the last release widely reported as stable, per community threads referenced in #60528 and the piunikaweb 2026-04-29
article), each chat turn completed in roughly 5-10 seconds end-to-end with the LLM call being the dominant portion. The intended caching pattern is also
documented in src/plugins/CLAUDE.md:

▎ Cache concept: metadata stays fresh unless a caller owns an explicit PluginMetadataSnapshot, PluginLookUpTable, or manifest registry for the current flow. Do
▎ not add persistent metadata caches for discovery, manifest registries, installed-index reconstruction, owner lookup, model suppression, provider policy,
▎ public-artifact metadata, or similar control-plane answers.

Per that guidance, the embedded-runner per-turn entry points (loadAgentToolResultMiddlewaresForRuntime, resolvePluginTools, resolveRuntimePluginRegistry,
createOpenClawCodingTools) should be invoked with a caller-owned manifestRegistry / PluginRegistrySnapshot that is built once per session (or per process) and reused across turns, instead of each call re-deriving discovery, manifest registry, installed-plugin index, package-root walks, and per-provider
auth-profile-store reads from disk on every turn. The expected [trace:embedded-run] prep stages totalMs on this hardware is in the 5000-8000ms range, comparable to what an opt-in process-wide cache produces today (validated at 8257ms on this same hardware: see sergeyksv/openclaw survival/plugin-cache).
