openclaw - [Bug]: Provider-qualified default model resolution eagerly builds alias index and can block gateway event loop for ~80s

Bug type

Crash (process/app exits or hangs)

Beta release blocker

No

Summary

Provider-qualified default model resolution eagerly builds and normalizes the full configured-model alias index on the inbound reply hot path; with a 97-entry model catalog this blocked the gateway event loop for ~80-85s before the embedded agent startup logs appeared.

Steps to reproduce

  1. Run OpenClaw from source at 87cd6b3e923fcb8a4869dc35e5b582103be85e51 / package version 2026.5.24-beta.1 on Linux with the gateway daemon.
  2. Configure the default agent model as a provider-qualified primary model, for example:
    • agents.defaults.model.primary = "openai/gpt-5.5"
    • the routed agent inherits or uses the same provider-qualified default
  3. Configure a large agents.defaults.models catalog. The observed case has 97 entries.
  4. Send a normal inbound WhatsApp group message that does not contain a model directive, heartbeat override, or explicit model selection.
  5. Observe that the gateway logs event-loop starvation before the embedded agent startup-stage trace appears.
  6. Profile the same config with a read-only harness around the model-selection helpers. In the observed config:
    • resolveDefaultModelForAgent({ cfg, agentId: "main" }): 50154.7ms
    • buildModelAliasIndex({ cfg, defaultProvider: "openai" }): 38774.5ms
    • buildConfiguredModelCatalog({ cfg }): 0.7ms
    • parsing the first 20 alias keys with plugin normalization enabled: 29418.9ms
    • the same 20-key parse with plugin normalization disabled: 0.4ms
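
The harness used in step 6 is not included in this report. A minimal sketch of a read-only timing wrapper follows, assuming the helpers are synchronous. `timeSync` and the stand-in workload are hypothetical names; in a real checkout the callback would wrap a call such as `resolveDefaultModelForAgent({ cfg, agentId: "main" })`.

```typescript
// Hypothetical read-only timing helper; not part of OpenClaw.
interface Timing<T> {
  label: string;
  ms: number;
  result: T;
}

// Runs a synchronous function once and records its wall-clock duration.
function timeSync<T>(label: string, fn: () => T): Timing<T> {
  const start = performance.now();
  const result = fn();
  const ms = performance.now() - start;
  return { label, ms, result };
}

// Stand-in workload so the sketch runs on its own.
function standInWork(iterations: number): number {
  let acc = 0;
  for (let i = 0; i < iterations; i++) acc += i % 7;
  return acc;
}

const timing = timeSync("stand-in workload", () => standInWork(1_000_000));
console.log(`${timing.label}: ${timing.ms.toFixed(1)}ms`);
```

Because the wrapper only calls the function and reads the clock, it does not mutate the config under test.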

Expected behavior

A normal inbound reply using an already provider-qualified default model should not synchronously build and normalize the full configured-model alias index before starting the agent run.

In this path, OpenClaw should resolve openai/gpt-5.5 cheaply, only build alias data if an alias is actually needed, and avoid blocking the gateway event loop long enough to delay unrelated timers and channel health checks.
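
One way to picture this expected behavior is a fast path for provider-qualified values plus a lazily built alias index. The sketch below uses hypothetical names (`lazy`, `parseProviderQualified`, `resolveDefault`, the `fast` alias) and simplified types; it is not the OpenClaw implementation, and it ignores the case where exact alias matching must stay ahead of provider-qualified parsing for compatibility.

```typescript
// Hypothetical types and names; the real OpenClaw signatures may differ.
interface ModelRef {
  provider: string;
  model: string;
}

type AliasIndex = Map<string, ModelRef>;

// Wraps an expensive builder so it runs at most once, and only on first use.
function lazy<T>(build: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) cached = { value: build() };
    return cached.value;
  };
}

// Splits "provider/model" at the first slash; returns undefined when the
// value is not provider-qualified.
function parseProviderQualified(raw: string): ModelRef | undefined {
  const slash = raw.indexOf("/");
  if (slash <= 0 || slash === raw.length - 1) return undefined;
  return { provider: raw.slice(0, slash), model: raw.slice(slash + 1) };
}

// Resolves provider-qualified values directly; touches aliases only otherwise.
function resolveDefault(
  raw: string,
  getAliasIndex: () => AliasIndex,
): ModelRef | undefined {
  const value = raw.trim();
  const qualified = parseProviderQualified(value);
  if (qualified) return qualified;
  return getAliasIndex().get(value.toLowerCase());
}

let builds = 0;
const getAliasIndex = lazy<AliasIndex>(() => {
  builds += 1; // stands in for the expensive normalization pass
  return new Map([["fast", { provider: "openai", model: "gpt-5.5" }]]);
});

const direct = resolveDefault("openai/gpt-5.5", getAliasIndex);
const buildsAfterDirect = builds; // the index has not been built yet
const viaAlias = resolveDefault("fast", getAliasIndex);
console.log(direct, buildsAfterDirect, viaAlias, builds);
```

In this sketch the alias index is built only when a non-qualified value is looked up, so a provider-qualified default never pays for it.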

Actual behavior

The inbound reply path calls resolveDefaultModel() before it is known whether any model directive needs handling. That helper resolves the default model and also eagerly builds the full alias index.

Observed source path at 87cd6b3e923fcb8a4869dc35e5b582103be85e51:

  • src/auto-reply/reply/get-reply.ts:252 calls resolveDefaultModel({ cfg, agentId }) for every inbound reply.
  • src/auto-reply/reply/directive-handling.defaults.ts:13-22 calls both resolveDefaultModelForAgent(...) and buildModelAliasIndex(...).
  • src/agents/model-selection-shared.ts:572-581 builds the alias index before checking whether the configured default model already contains a provider slash.
  • src/agents/model-selection-shared.ts:401-420 parses/normalizes each configured-model key before checking whether the entry actually has an alias.

This made the user-visible "pre-agent" delay much larger than the later embedded startup-stage trace suggested. The startup-stage trace accounted for about 18s, while the liveness/fetch-timeout logs showed the event loop had already been blocked for ~80-85s.
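
The gap between the two figures can be checked by summing the stage durations from the startup-stage trace and comparing the total with the logged event-loop delay. The sketch below copies its values from the logs in this report; `parseStages` is a hypothetical helper, not an OpenClaw function.

```typescript
// Stage durations copied from the startup-stage trace in this report.
const stagesRaw =
  "workspace:1ms, runtime-plugins:1ms, hooks:0ms, model-resolution:1236ms, " +
  "auth:2033ms, context-engine:1ms, attempt-workspace:14817ms, " +
  "attempt-prompt:0ms, attempt-runtime-plan:3ms, attempt-dispatch:0ms";

// Parses "name:123ms" pairs into a map of stage name to milliseconds.
function parseStages(raw: string): Map<string, number> {
  const stages = new Map<string, number>();
  for (const part of raw.split(",")) {
    const match = /^\s*([\w-]+):(\d+)ms\s*$/.exec(part);
    if (match) stages.set(match[1], Number(match[2]));
  }
  return stages;
}

const stages = parseStages(stagesRaw);
let stageTotalMs = 0;
stages.forEach((ms) => {
  stageTotalMs += ms;
});

const observedDelayMs = 85765.1; // eventLoopDelayMaxMs from the liveness warning
console.log(`stage total: ${stageTotalMs}ms`);
console.log(`event-loop delay max: ${observedDelayMs}ms`);
```

The stage total equals the logged totalMs of 18092, which is far below the logged event-loop delay.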

OpenClaw version

2026.5.24-beta.1 from source checkout commit 87cd6b3e923fcb8a4869dc35e5b582103be85e51.

Operating system

Ubuntu Linux, kernel 6.17.0-14-generic, x86_64.

Install method

Source checkout built into a local gateway daemon/runtime.

Model

openai/gpt-5.5

Provider / routing chain

OpenClaw gateway -> OpenAI provider, using a provider-qualified configured model ID. No model-router/proxy behavior is required to reproduce the model-resolution overhead.

Additional provider/model setup details

The default model was already configured as openai/gpt-5.5, and the routed agent used the same effective default. The config also had 97 entries under agents.defaults.models.

The slow path appears tied to configured-model normalization and alias-index construction, not to the model provider call itself. The delay happens before useful agent execution starts.

Logs, screenshots, and evidence

# Sanitized gateway evidence from a real inbound WhatsApp group reply.
# Private channel IDs, hostnames, and token-bearing URLs are intentionally omitted.

[diagnostic] liveness warning:
  reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayP99Ms=85765.1
  eventLoopDelayMaxMs=85765.1
  eventLoopUtilization=0.999
  active=1
  work=[active=agent:main:whatsapp:group:<redacted>(processing/embedded_run,q=1,age=115s last=embedded_run:started)]

[fetch-timeout] fetch timeout after 10000ms:
  elapsed=45329ms
  timer delayed=35329ms

[fetch-timeout] fetch timeout after 10000ms:
  elapsed=85268ms
  timer delayed=75268ms

[diagnostic] liveness warning:
  eventLoopDelayP99Ms=85228.3
  eventLoopUtilization=1
  active=1
  work=[active=agent:main:whatsapp:group:<redacted>(processing/embedded_run,q=1,age=114s last=embedded_run:started)]

[agent/embedded] [trace:embedded-run] startup stages:
  totalMs=18092
  stages=workspace:1ms,
    runtime-plugins:1ms,
    hooks:0ms,
    model-resolution:1236ms,
    auth:2033ms,
    context-engine:1ms,
    attempt-workspace:14817ms,
    attempt-prompt:0ms,
    attempt-runtime-plan:3ms,
    attempt-dispatch:0ms

# Read-only local timings against the same config:
resolveDefaultModelForAgent({ cfg, agentId: "main" }): 50154.7ms
buildModelAliasIndex({ cfg, defaultProvider: "openai" }): 38774.5ms
buildConfiguredModelCatalog({ cfg }): 0.7ms
parse first 20 alias keys with plugin normalization: 29418.9ms
parse first 20 alias keys with plugin normalization disabled: 0.4ms
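
The `timer delayed` values in the fetch-timeout lines are consistent with elapsed time minus the configured 10000ms timeout. The report does not state how the field is derived, so the formula in this sketch is an assumption, and `timerDelayMs` is a hypothetical helper.

```typescript
// Values copied from the two fetch-timeout log entries in this report.
interface FetchTimeoutLog {
  timeoutMs: number;
  elapsedMs: number;
  loggedDelayMs: number;
}

const samples: FetchTimeoutLog[] = [
  { timeoutMs: 10000, elapsedMs: 45329, loggedDelayMs: 35329 },
  { timeoutMs: 10000, elapsedMs: 85268, loggedDelayMs: 75268 },
];

// Assumed derivation: how far past its deadline the timer callback ran.
function timerDelayMs(sample: FetchTimeoutLog): number {
  return sample.elapsedMs - sample.timeoutMs;
}

const recomputed = samples.map(timerDelayMs);
samples.forEach((sample, i) => {
  console.log(`logged=${sample.loggedDelayMs}ms recomputed=${recomputed[i]}ms`);
});
```

A 10-second timer that fires 35 to 75 seconds late fits the report's reading that the event loop could not run timer callbacks during that window.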

Impact and severity

Affected: gateway inbound reply handling, observed through WhatsApp group messages routed to an embedded agent.

Severity: high. This is an availability/performance issue: the gateway event loop was blocked long enough to delay timers, produce channel fetch timeouts, and make the agent appear silent before it had meaningfully started.

Frequency: observed on 2/2 inbound messages in this configuration while investigating the incident. The cost scales with configured-model catalog size and plugin/model normalization work.

Consequence: messages can sit for over a minute before visible agent progress, and other gateway/channel work can time out because the Node event loop is saturated.

Additional information

This does not look like a WhatsApp-specific bug. WhatsApp made the symptom visible, but the hot path is shared model/default resolution before agent startup.

Related upstream context:

  • #86552 (perf(agents): reuse manifest metadata during model resolution) overlaps with repeated manifest metadata loading and should reduce part of the cost, but this report covers an additional eager-work issue: alias-index construction and per-entry parsing happen even when the default model is already provider-qualified.
  • #86372 (perf(gateway): propagate config context in model normalization to avoid stale policy warning) is related model-normalization context work, but it does not by itself address eager alias-index construction on inbound replies.
  • #79899 covers DefaultResourceLoader.reload() / attempt-workspace blocking, which matches the later ~15s attempt-workspace slice in the startup trace, but it does not explain the earlier ~80s pre-agent event-loop delay.
  • #86509 is a broader event-loop-starvation regression report; this issue is the narrower model-resolution/alias-index hot path with function-level timings.

Potential fix direction:

  1. Split default-model resolution from alias-index construction on the inbound reply path. Return a lazy alias-index getter or only build aliases once a directive/override path actually needs aliases.
  2. In resolveConfiguredModelRef, avoid building a full alias index before handling provider-qualified values like openai/gpt-5.5. If exact alias matching must remain ahead of provider-qualified parsing for compatibility, scan alias strings cheaply and only parse the matched entry.
  3. In buildModelAliasIndex, read entryRaw.alias before parsing the model key, and skip normalization entirely for entries without aliases.
  4. Reuse the manifest/plugin metadata context once per resolver call, consistent with the approach in #86552.
  5. Add a regression test with a large agents.defaults.models catalog where most entries do not define alias, asserting that provider-qualified default resolution does not call plugin/manifest normalization for every catalog entry.
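
Items 3 and 5 above can be sketched together: read the alias first, normalize only entries that have one, and count normalization calls in a regression-style check. Everything below is hypothetical (types, function names, the `fast` alias, lower-casing of aliases); the real `buildModelAliasIndex` signature and config shape may differ.

```typescript
// Hypothetical shapes; the real config and index types in OpenClaw may differ.
interface ModelEntry {
  alias?: string;
}

interface ModelRef {
  provider: string;
  model: string;
}

// Builds an alias index, normalizing a model key only when its entry has an alias.
function buildAliasIndexAliasFirst(
  models: Record<string, ModelEntry>,
  normalizeKey: (key: string) => ModelRef | undefined,
): Map<string, ModelRef> {
  const index = new Map<string, ModelRef>();
  for (const key of Object.keys(models)) {
    const alias = models[key]?.alias?.trim();
    if (!alias) continue; // no alias: skip the expensive normalization
    const ref = normalizeKey(key);
    if (ref) index.set(alias.toLowerCase(), ref);
  }
  return index;
}

// Regression-style setup: 97 entries, only one of which defines an alias.
const catalog: Record<string, ModelEntry> = {};
for (let i = 0; i < 96; i++) catalog[`openai/model-${i}`] = {};
catalog["openai/gpt-5.5"] = { alias: "fast" };

let normalizeCalls = 0;
const countingNormalize = (key: string): ModelRef | undefined => {
  normalizeCalls += 1; // stands in for plugin/manifest normalization
  const slash = key.indexOf("/");
  if (slash <= 0) return undefined;
  return { provider: key.slice(0, slash), model: key.slice(slash + 1) };
};

const aliasIndex = buildAliasIndexAliasFirst(catalog, countingNormalize);
console.log(`entries=${Object.keys(catalog).length} normalizeCalls=${normalizeCalls}`);
```

With this ordering the number of normalization calls tracks the number of aliased entries rather than the catalog size, which is the property a regression test could assert.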

Last known good version: NOT_ENOUGH_INFO

First known bad version: NOT_ENOUGH_INFO

AI assistance disclosure: this report was prepared with AI assistance and manually checked against local source, sanitized logs, and read-only timing evidence.
