openclaw - ✅(Solved) Fix [Bug]: openclaw infer hangs indefinitely on 2026.4.27 — openclaw-infer child spins at 100% CPU with zero network I/O [1 pull requests, 3 comments, 3 participants]

openclaw/openclaw#74986 · Fetched 2026-05-01 05:39:14

openclaw infer model run hangs indefinitely on OpenClaw 2026.4.27. The grandchild openclaw-infer Node.js process consumes 100% CPU but makes zero network connections and produces no output, eventually getting killed by the process timeout. The CLI never reaches the gateway and no request is logged.

This reproduces on both local Ollama models and remote API models, suggesting a pre-execution initialization regression.

Fix Action

Fixed

PR fix notes

PR #75022: fix(infer): load model catalog metadata-only for list/inspect/providers

Description (problem / solution / changelog)

Summary

  • Problem: openclaw infer model list, openclaw infer model inspect, and openclaw infer model providers hang indefinitely on 2026.4.27, with the child Node process spinning at 100% CPU, no TCP connections, and no output until timeout. Reported in #74986; the user verified --version, --help, and gateway status still work, while the catalog-listing commands wedge before any I/O lands.
  • Root Cause: All three handlers funnel into loadModelCatalog(...) in src/agents/model-catalog.ts. Even with readOnly: true (added in this PR's first revision, which skips the ensureOpenClawModelsJson mutation path at lines 142–145), the function still synchronously enters augmentModelCatalogWithProviderPlugins at line 194. That path goes through src/plugins/provider-runtime.ts:resolveProviderPluginsForCatalogHooks → resolveProviderPluginsForHooks(...), which loads each provider plugin's runtime module so it can invoke the plugin's augmentModelCatalog hook. With the 2026.4.27 manifest-driven catalog/auth refactors (commits 8a06db084, 13757465b, 20c7a98fb, b7a1bfd2d, 947aae5a9, d014b3634), the path now also fans out into the cached installed-manifest registry (b7a1bfd2d's synchronous fs.statSync per plugin plus hashJson(...)) and the bundled-plugin runtime imports per provider. That is where the reporter's run hot-spins on configs that pin a custom provider (here, the models.providers.ollama block). The exact symptom (100% CPU, zero TCP, zero stdout) matches synchronous CPU work inside provider-plugin runtime resolution rather than any network probe. The same regression class was already fixed for agents list / status in 2026.4.29 by replacing getChannelPlugin(...) with listReadOnlyChannelPluginsForConfig (8fe449c88, d5eae0d95); the established remediation pattern is "avoid loading any plugin runtime on read-only metadata paths," not "skip one mutation function." The first revision of this PR (commit 6abe657) addressed only the mutation path; the second commit closes the remaining plugin-runtime path so the read-only contract is actually metadata-only.
  • Fix: Add a new orthogonal option skipProviderPluginAugmentation?: boolean to loadModelCatalog. When true, the function returns the catalog assembled from PI SDK static rows + manifest static rows + cfg.models.providers configured rows (the same data sources that already work today) and skips the augmentModelCatalogWithProviderPlugins(...) call at line 194. Three CLI inspection commands — infer model list, infer model inspect, and infer model providers (buildModelProviders) — pass readOnly: true and skipProviderPluginAugmentation: true, mirroring the 2026.4.29 agents list "no plugin runtime on read-only paths" pattern. The option is opt-in to preserve existing readOnly: true callers (models list --all via appendCatalogSupplementRows, cli-auth-epoch.ts, etc.) which still want dynamic plugin-derived rows.
  • What changed:
    • src/agents/model-catalog.ts: add skipProviderPluginAugmentation?: boolean to loadModelCatalog's param type with a doc comment that names #74986 and explains the contract; gate the augmentModelCatalogWithProviderPlugins(...) call behind the new flag. No change to type signatures of any other export.
    • src/cli/capability-cli.ts: buildModelProviders (used by infer model providers), infer model list, and infer model inspect pass readOnly: true, skipProviderPluginAugmentation: true. Comment cites #74986 and the agents list 2026.4.29 fix pattern.
    • src/cli/capability-cli.test.ts: the three existing #74986 cases now assert that loadModelCatalog is called with both readOnly: true and skipProviderPluginAugmentation: true (not just readOnly: true).
    • src/agents/model-catalog.test.ts: one new it(...) case in the existing describe("loadModelCatalog", ...) block that primes the augmentModelCatalogWithProviderPlugins mock to return a synthetic ollama-live-only row, calls loadModelCatalog({ readOnly: true, skipProviderPluginAugmentation: true }), and asserts the synthetic row is absent and the augmentation mock was never called. Reuses the existing harness; no new fixtures.
  • What did NOT change (scope boundary):
    • CHANGELOG.md — left untouched; release-note wording is the maintainer's call.
    • Default behavior of loadModelCatalog: when skipProviderPluginAugmentation is omitted/false, the augmentation step still runs exactly as before, so models list --all (src/commands/models/list.rows.ts:347) and every other current readOnly: true caller keeps the same catalog contents.
    • ensureOpenClawModelsJson, buildShouldSuppressBuiltInModel (manifest registry resolver), the manifest planner, and the model-catalog cache: untouched.
    • infer model run (local + gateway), infer model auth, image/audio/tts/embedding subcommands: out of scope; they are write/run paths, do not go through the read-only catalog read, and any hang there needs a separate fix.
    • No new exports, no plugin-SDK / public-surface contract changes, no any introduced.
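
The gating described in the Fix bullet can be sketched roughly as follows. This is a simplified illustration, not the code in src/agents/model-catalog.ts: the CatalogRow and CatalogSources shapes and the loadCatalogSketch name are hypothetical, and only the readOnly and skipProviderPluginAugmentation option names come from the PR description.

```typescript
// Hypothetical, simplified shapes; the real types are not shown in this report.
type CatalogRow = { provider: string; id: string };

interface CatalogSources {
  // Stand-in for PI SDK static rows + manifest static rows + configured rows.
  staticRows: () => CatalogRow[];
  // Stand-in for augmentModelCatalogWithProviderPlugins, which loads plugin runtime.
  augmentWithProviderPlugins: (rows: CatalogRow[]) => Promise<CatalogRow[]>;
}

interface LoadOptions {
  readOnly?: boolean;
  skipProviderPluginAugmentation?: boolean;
}

async function loadCatalogSketch(
  sources: CatalogSources,
  opts: LoadOptions = {},
): Promise<CatalogRow[]> {
  const rows = sources.staticRows();
  // Strict === true check: omitting the flag (or passing false) keeps the
  // previous behavior, so existing readOnly callers still get plugin rows.
  if (opts.skipProviderPluginAugmentation === true) {
    return rows;
  }
  return sources.augmentWithProviderPlugins(rows);
}
```

Under this sketch, an inspection command would call loadCatalogSketch(sources, { readOnly: true, skipProviderPluginAugmentation: true }) and never touch the plugin runtime, while a caller that passes only readOnly: true still receives the augmented rows.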

Reproduction

On 2026.4.27 (or current main), with a ~/.openclaw/config.yaml similar to the reporter's:

agents:
  defaults:
    llm: { idleTimeoutSeconds: 600 }
    model: { primary: ollama/qwen3.5:397b-cloud }
models:
  providers:
    ollama:
      baseUrl: http://winhost:11434
      apiKey: ollama-local
      api: ollama

openclaw gateway status                                  # works
openclaw infer model list                                # before fix: hangs at 100% CPU until timeout
                                                         # after fix: returns the catalog and exits
openclaw infer model inspect --model openai/gpt-5.4      # same
openclaw infer model providers --json                    # same

The hung process can be confirmed with ps -o pcpu,etimes,wchan,comm -p <pid> (CPU pegged at ~100, no progress) and lsof -p <pid> (only the std{out,err} pipes, zero TCP — i.e., the work is happening before any provider network probe).

Risk / Mitigation

  • Risk 1 — different output for catalog list: Skipping augmentModelCatalogWithProviderPlugins means infer model list / inspect / providers no longer surface dynamic plugin-discovered models (e.g., live Ollama models from /api/tags). The output is now: PI SDK static rows + manifest-declared rows + cfg.models.providers configured rows.
    • Mitigation: For inspection commands this is the right trade-off — the user wants "what does the catalog know about" to return promptly, not "what does the live Ollama daemon currently expose"; the latter is what models scan / models list --all are for, both of which still go through the dynamic path (their loadModelCatalog({ readOnly: true }) call sites do not pass the new flag). The hang the reporter sees is a strictly worse failure mode than slightly-less-fresh output. The new flag is opt-in, so no other call site changes.
  • Risk 2 — test coverage: Need to lock the new metadata-only contract so a future refactor doesn't silently regress.
    • Mitigation: Three CLI tests assert the flag combination per command; one model-catalog unit test verifies that augmentModelCatalogWithProviderPlugins is genuinely not called when the flag is set, even when its mock would have produced a row. All four reuse existing harness/mocks; no new fixtures.
  • Risk 3 — typing/security review: No any introduced; the only addition is a new optional parameter (skipProviderPluginAugmentation?: boolean), consulted via a strict === true check. No change to data flow, secrets handling, plugin trust boundary, or external surface.

Update — incremental commit on this PR

The first revision of this PR (6abe657) only added readOnly: true, which skips the ensureOpenClawModelsJson mutation branch. After re-review I confirmed that loadModelCatalog's remaining call into augmentModelCatalogWithProviderPlugins (line 194 of model-catalog.ts) still synchronously loads provider plugin runtime — i.e., the very class of work the reporter's lsof/ps evidence points at. The agents-list 2026.4.29 fix (8fe449c88, d5eae0d95) addressed the same regression class by routing channel queries through listReadOnlyChannelPluginsForConfig instead of getChannelPlugin(...); this commit applies the same "no plugin runtime on read-only paths" pattern to the model catalog by making readOnly truly metadata-only via the new opt-in skipProviderPluginAugmentation flag, gated to only the three infer inspection commands.

Update 2 — read-only catalog cache

A reviewer noted that the read-only path of loadModelCatalog had no cache reuse — every call rebuilt the catalog from scratch because only the non-readOnly slot (modelCatalogPromise) was ever populated. For one-shot CLI invocations (the issue scenario) this is harmless, but long-running hosts that hit the read-only path repeatedly (cli-auth-epoch.ts:171 refresh, appendCatalogSupplementRows for models list --all) would redo the PI SDK import / registry load / manifest suppression resolver / augmentModelCatalogWithProviderPlugins every time. This commit adds a parallel readOnlyModelCatalogPromise that caches the with-augmentation read-only result, mirroring every invariant of the original cache:

  • useCache: false invalidates both slots up-front.
  • empty results clear the matching slot so the next call retries (existing comment kept).
  • catch handlers null out the matching slot so transient dynamic-import / filesystem failures don't poison the cache.
  • resetModelCatalogCache() and the test reset both clear both slots.

skipProviderPluginAugmentation callers (the #74986 inspection commands) deliberately stay uncached: their result is a strict subset of the with-augmentation result, so caching it would let a later non-skip caller silently receive the smaller set. Their rebuild is cheap because the heavy provider-runtime fan-out is bypassed. Two new unit tests in model-catalog.test.ts lock this contract: ① two consecutive readOnly: true (without skipProviderPluginAugmentation) calls reuse the cache (registry.getAll() runs once, second result === first), and ② a readOnly: true, skipProviderPluginAugmentation: true call followed by a non-skip readOnly: true call rebuilds and includes the augmentation row.
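
The cache invariants listed above can be sketched as a two-slot promise cache. This is a hedged illustration under assumptions: only the slot names modelCatalogPromise and readOnlyModelCatalogPromise and the option names come from the PR text; createCatalogLoader, Builder, and the row shape are hypothetical stand-ins for the real catalog build.

```typescript
type Row = { id: string };
type LoadOptions = {
  readOnly?: boolean;
  skipProviderPluginAugmentation?: boolean;
  useCache?: boolean;
};
// Hypothetical stand-in for the real build; `augment` says whether plugin rows are included.
type Builder = (args: { readOnly: boolean; augment: boolean }) => Promise<Row[]>;

function createCatalogLoader(build: Builder) {
  let modelCatalogPromise: Promise<Row[]> | null = null;
  let readOnlyModelCatalogPromise: Promise<Row[]> | null = null;

  const getSlot = (readOnly: boolean) =>
    readOnly ? readOnlyModelCatalogPromise : modelCatalogPromise;
  const setSlot = (readOnly: boolean, value: Promise<Row[]> | null) => {
    if (readOnly) readOnlyModelCatalogPromise = value;
    else modelCatalogPromise = value;
  };

  // Clears both slots, like resetModelCatalogCache() in the description.
  function reset(): void {
    modelCatalogPromise = null;
    readOnlyModelCatalogPromise = null;
  }

  function load(opts: LoadOptions = {}): Promise<Row[]> {
    const readOnly = opts.readOnly === true;
    if (opts.useCache === false) reset(); // invalidate both slots up-front
    if (opts.skipProviderPluginAugmentation === true) {
      // Deliberately uncached: this result is a subset of the augmented one.
      return build({ readOnly, augment: false });
    }
    const cached = getSlot(readOnly);
    if (cached) return cached;
    const pending: Promise<Row[]> = build({ readOnly, augment: true }).then(
      (rows) => {
        // Empty results clear the matching slot so the next call retries.
        if (rows.length === 0 && getSlot(readOnly) === pending) setSlot(readOnly, null);
        return rows;
      },
      (err) => {
        // Failures must not poison the cache.
        if (getSlot(readOnly) === pending) setSlot(readOnly, null);
        throw err;
      },
    );
    setSlot(readOnly, pending);
    return pending;
  }

  return { load, reset };
}
```

The design point is that the skip path returns before any slot is read or written, so a later non-skip caller can never be handed the smaller result.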

Out of scope (tracked separately)

Reviewer also flagged two further items that this PR intentionally does not address:

  • loadOpenClawPlugins performance regression in 2026.4.27 — the synchronous fs.statSync per plugin + hashJson(...) introduced by b7a1bfd2d. This is the underlying engine that any non-skip catalog refresh still hits. Worth a focused follow-up with profiler data — the symptom-to-root-cause mapping ("hot-spin" vs "slow") is not directly observable from the reporter's ps/lsof evidence alone.
  • infer model run hang — separate code path (prepareSimpleCompletionModelForAgent → resolveModelAsync); even with skipPiDiscovery: true it can still enter prepareProviderRuntimeAuth. Should be a dedicated issue.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • CLI
  • Agents/models
  • Tests

Linked Issue/PR

Fixes #74986

Changed files

  • src/agents/model-catalog.test.ts (modified, +133/-0)
  • src/agents/model-catalog.ts (modified, +51/-13)
  • src/cli/capability-cli.test.ts (modified, +54/-0)
  • src/cli/capability-cli.ts (modified, +27/-6)


Bug type

Crash (process/app exits or hangs)

Beta release blocker

No

Summary

openclaw infer model run hangs indefinitely on OpenClaw 2026.4.27. The grandchild openclaw-infer Node.js process consumes 100% CPU but makes zero network connections and produces no output, eventually getting killed by the process timeout. The CLI never reaches the gateway and no request is logged.

This reproduces on both local Ollama models and remote API models, suggesting a pre-execution initialization regression.

Environment

  • OpenClaw: 2026.4.27 (installed via npm global — /home/mlaih/.npm-global/lib/node_modules/openclaw/)
  • Node.js: v24.14.1
  • Deployment: WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2 on Windows)
  • Gateway: running at ws://127.0.0.1:18789 (healthy, responding to probe)
  • Ollama: http://winhost:11434 (healthy — direct curl to /api/chat works fine)

Steps to reproduce

  1. Ensure gateway is running: openclaw gateway status — confirms healthy
  2. Run any openclaw infer command, e.g.:
    openclaw infer model list
    openclaw infer model run --local --model ollama/qwen3.5:397b-cloud --prompt "Reply with: ok"
  3. Observe: command hangs indefinitely, eventually killed by internal timeout

Expected behavior

openclaw infer model list returns a JSON list of available models. openclaw infer model run --local --model ollama/qwen3.5:397b-cloud --prompt "..." returns the model response.

Actual behavior

Both commands hang. Process inspection during the hang reveals:

UID        PID  PPID  CMD
mlaih   253345 253344  timeout 15 node .../openclaw.mjs infer model run ...
mlaih   253347 253345  openclaw
mlaih   253354 253347  99  openclaw-infer   <-- 100% CPU, 0 network connections

The grandchild openclaw-infer process:

  • Has only 2 file descriptors open (stdout socket + stderr socket)
  • Makes zero TCP connections (verified via /proc/<pid>/net/tcp)
  • Produces no output to either fd
  • Spins at 100% CPU until killed

Gateway log shows zero requests from the infer CLI — it never connects.
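
The /proc/<pid>/net/tcp check mentioned above can be scripted. The following is a minimal sketch of parsing that file's text, assuming the standard Linux column layout and a little-endian host (addresses appear byte-reversed in hex); the function and type names are hypothetical and not part of OpenClaw. Note that this file lists TCP sockets for the process's network namespace, so a row is not necessarily owned by that PID.

```typescript
interface TcpEntry {
  localAddress: string;
  localPort: number;
  remoteAddress: string;
  remotePort: number;
  state: string;
}

// Kernel TCP state codes as printed in the "st" column.
const TCP_STATES: Record<string, string> = {
  "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV", "04": "FIN_WAIT1",
  "05": "FIN_WAIT2", "06": "TIME_WAIT", "07": "CLOSE", "08": "CLOSE_WAIT",
  "09": "LAST_ACK", "0A": "LISTEN", "0B": "CLOSING",
};

// Decode an endpoint such as "0100007F:4965" into a dotted address and a port.
function decodeEndpoint(hex: string): { address: string; port: number } {
  const [addrHex, portHex] = hex.split(":");
  const bytes: string[] = addrHex.match(/../g) ?? [];
  const address = bytes.reverse().map((b) => parseInt(b, 16)).join(".");
  return { address, port: parseInt(portHex, 16) };
}

function parseProcNetTcp(text: string): TcpEntry[] {
  return text
    .split("\n")
    .slice(1) // drop the header row
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols.length >= 4) // skip blank or truncated lines
    .map((cols) => {
      const local = decodeEndpoint(cols[1]);
      const remote = decodeEndpoint(cols[2]);
      return {
        localAddress: local.address,
        localPort: local.port,
        remoteAddress: remote.address,
        remotePort: remote.port,
        state: TCP_STATES[cols[3]] ?? cols[3],
      };
    });
}
```

Feeding it the contents of /proc/<pid>/net/tcp and filtering for state === "ESTABLISHED" gives a quick view of whether any connection to the gateway or Ollama ports exists during the hang.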

Commands tested and their results

| Command | Result |
| --- | --- |
| openclaw --version | Works |
| openclaw --help | Works |
| openclaw gateway status | Works |
| openclaw infer model list | Hangs (SIGKILL after ~20s) |
| openclaw infer model run --local --model ollama/qwen3.5:397b-cloud --prompt "..." | Hangs (SIGKILL after ~20s) |
| openclaw infer model run --model minimax/MiniMax-M2.7 --prompt "..." | Hangs (SIGKILL after ~20s) |
| openclaw agents list | Hangs (SIGKILL after ~10s) |
| curl -s http://winhost:11434/api/chat -d '{"model":"qwen3.5:397b-cloud",...}' | Works (~6s response) |
| OpenClaw image tool (via gateway) | Works |
| Gateway WebSocket probe | Works |

Relevant config

{
  "agents": {
    "defaults": {
      "llm": { "idleTimeoutSeconds": 600 },
      "model": { "primary": "ollama/qwen3.5:397b-cloud" }
    }
  },
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://winhost:11434",
        "apiKey": "ollama-local",
        "api": "ollama"
      }
    }
  }
}

Related issues

  • #72851 ("[Bug]: Ollama provider hangs on infer model run --local across 2026.4.20 / 2026.4.24 / 2026.4.25") — reported as fixed in 2026.4.26 with commit adding "lean provider path" for local Ollama probes. This system is on 2026.4.27 and still exhibits the hang, suggesting either an incomplete fix or a regression introduced by subsequent model catalog/provider index refactors in 2026.4.27.
  • The 2026.4.27 CHANGELOG shows large-scale model catalog refactors (moving Fireworks, Together AI, Qianfan, Xiaomi, NVIDIA, Cerebras, Mistral, Chutes, Kilo, OpenAI, OpenCode Go to plugin manifest modelCatalog rows) which may have touched the code path for CLI model probes.
  • The CHANGELOG entry for 2026.4.27 says: "CLI/models: keep default-model and allowlist pickers on explicit models.providers.*.models entries when models.mode is replace instead of loading the full built-in catalog. Fixes #64950." — this may be related.

Notes

  • Direct curl to Ollama API works reliably — Ollama itself is healthy
  • The image tool works correctly (it routes through the gateway internal Ollama integration, not the broken openclaw-infer CLI)
  • This suggests the bug is in the openclaw-infer CLI binary entry path, not in the Ollama provider or gateway integration
  • The openclaw-infer binary is not a standalone file — it is spawned as a Node.js child process of the openclaw CLI wrapper

TODO

  • Confirm whether this reproduces on clean 2026.4.27 install
  • Check if downgrading to 2026.4.26 resolves the issue
  • Identify which 2026.4.27 change introduced the regression

extent analysis

TL;DR

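The most likely workaround is to downgrade OpenClaw to 2026.4.26, since the hang appears to be a regression introduced in 2026.4.27. The downgrade is not verified in the issue (it is still listed under TODO); the fix for the catalog-listing commands is PR #75022.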

Guidance

  1. Verify the issue on a clean 2026.4.27 install: Confirm whether the problem reproduces on a fresh installation of OpenClaw 2026.4.27 to rule out any environmental or configuration issues.
  2. Downgrade to 2026.4.26: Attempt to resolve the issue by downgrading OpenClaw to version 2026.4.26, as the problem may have been introduced in the 2026.4.27 update.
  3. Investigate the 2026.4.27 changelog: Examine the changes made in the 2026.4.27 release, particularly the model catalog refactors, to identify the potential cause of the regression.
  4. Check for related issues: Review related issues, such as #72851, to see if they provide any insight into the problem or potential solutions.
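
When scripting the version checks in steps 1 and 2, note that dotted release strings such as 2026.4.27 do not sort correctly as plain strings (for example, "2026.4.9" would sort after "2026.4.27"). A minimal sketch of a numeric comparison follows, assuming versions are purely dot-separated integers as they appear in this report; the helper names are hypothetical.

```typescript
// Split "2026.4.27" into [2026, 4, 27].
function parseVersion(version: string): number[] {
  return version.split(".").map((part) => Number.parseInt(part, 10));
}

// Returns -1, 0, or 1, comparing segment by segment as numbers.
function compareVersions(a: string, b: string): number {
  const left = parseVersion(a);
  const right = parseVersion(b);
  const length = Math.max(left.length, right.length);
  for (let i = 0; i < length; i++) {
    // Missing segments count as 0, so "2026.4" equals "2026.4.0".
    const diff = (left[i] ?? 0) - (right[i] ?? 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}
```

For instance, compareVersions(installed, "2026.4.27") === 0 would flag the release against which this issue was filed.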

Example

No code snippet is provided, as the issue seems to be related to a specific version of OpenClaw and its internal workings.

Notes

The issue appears to be specific to the openclaw-infer CLI binary entry path and not related to the Ollama provider or gateway integration. The fact that direct curl requests to the Ollama API work reliably and the image tool functions correctly suggests that the problem lies within the OpenClaw CLI.

Recommendation

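Until a release that includes PR #75022 is installed, try downgrading to OpenClaw 2026.4.26 as a workaround. Note that PR #75022 covers only infer model list, infer model inspect, and infer model providers; the infer model run hang is a separate code path and is tracked separately.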
