openclaw - ✅(Solved) Fix MODELS_JSON_STATE.readyCache permanently cold under traffic — `markAuthProfileUsed` invalidates fingerprint on every successful call [2 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#80279 (fetched 2026-05-11 03:16:56)
Comments: 2 · Participants: 3 · Timeline events: 5 · Reactions: 2
Timeline (top): commented ×2, cross-referenced ×2, closed ×1

MODELS_JSON_STATE.readyCache (the in-process cache that fronts ensureOpenClawModelsJson) is structurally guaranteed to miss on every message for any agent that's actively serving traffic. The cache fingerprint includes auth-profiles.json's mtime, and markAuthProfileUsed rewrites that file on every successful provider call (just to bump usageStats.lastUsed). Net result: per-message model-resolution pays the full uncached cost — measured at 6–13 s in 2026.5.7 with a kimi/kimi-code config on a fresh tenant container.

Error Message

Observed (agent/embedded warn):

totalMs=6321   stages=… model-resolution:6304ms@6312ms,auth:3ms@…
totalMs=13206  stages=… model-resolution:12873ms@12885ms,auth:2ms@…
totalMs=12486  stages=… model-resolution:12473ms@12480ms,auth:1ms@…

Root Cause

buildModelsJsonFingerprint in dist/models-config-BCL7xtRj.js keys on file mtimes:

async function buildModelsJsonFingerprint(params) {
    const authProfilesMtimeMs = await readFileMtimeMs(path.join(params.agentDir, "auth-profiles.json"));
    const modelsFileMtimeMs   = await readFileMtimeMs(path.join(params.agentDir, "models.json"));
    // …
    return stableStringify({
        config: params.config,
        sourceConfigForSecrets: params.sourceConfigForSecrets,
        envShape,
        authProfilesMtimeMs,   // <-- this
        modelsFileMtimeMs,
        // …
    });
}

markAuthProfileUsed in dist/usage-CQen01xn.js rewrites auth-profiles.json on every successful provider call:

async function markAuthProfileUsed(params) {
    const { store, profileId, agentDir } = params;
    const updated = await authProfileUsageDeps.updateAuthProfileStoreWithLock({
        agentDir,
        updater: (freshStore) => {
            if (!freshStore.profiles[profileId]) return false;
            updateUsageStatsEntry(freshStore, profileId, (existing) =>
                resetUsageStats(existing, { lastUsed: Date.now() }));
            return true;     // <-- triggers saveAuthProfileStore (writes the file)
        }
    });
    // …
}

So the fingerprint is a proxy for "credentials in auth-profiles.json changed" — but markAuthProfileUsed writes the file for a reason that has nothing to do with credentials. The two contracts are individually fine; their interaction is the bug.

The per-message cycle (verified against gateway logs and live stat of auth-profiles.json on a paired tenant):

  1. Embedded run → pi-embedded-*.js calls resolveModelAsync({skipPiDiscovery:true}) with empty discovery stores → returns null for plugin-backed providers → falls back to ensureOpenClawModelsJson.
  2. ensureOpenClawModelsJson reads the current auth-profiles.json mtime → fingerprint differs from any cached entry → cache miss → full re-resolution (runs the plugin's prepareProviderDynamicModel hook, plans the file, writes models.json) — ~6–13 s.
  3. LLM call succeeds → markAuthProfileUsed rewrites auth-profiles.json after the response → next message hits a fresh mtime → goto 2.

Confirmed on a live openclaw@2026.5.7 agent: auth-profiles.json mtime advanced past the latest embedded-run timestamp by several seconds, then stayed stable for 60+ s while idle, then advanced again on the next message.
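
The cycle above can be reproduced in a toy, in-memory model. This is a minimal sketch and not OpenClaw code: `FakeFile`, `fingerprint`, `resolveModels`, `markUsed`, and `handleMessage` are hypothetical names, and the mtime is a plain counter standing in for the real mtime of auth-profiles.json.

```typescript
// Toy model of the per-message cycle. All names are hypothetical stand-ins;
// only the shape of the interaction mirrors the issue.

interface FakeFile {
  mtimeMs: number; // stands in for the mtime of auth-profiles.json
}

const authProfiles: FakeFile = { mtimeMs: 1 };
const readyCache = new Map<string, string>();
let fullResolutions = 0;

// Mirrors the idea of keying the cache on the file's mtime.
function fingerprint(file: FakeFile): string {
  return JSON.stringify({ authProfilesMtimeMs: file.mtimeMs });
}

// Returns the cached result on a hit, otherwise pays the "slow" path.
function resolveModels(file: FakeFile): string {
  const key = fingerprint(file);
  const cached = readyCache.get(key);
  if (cached !== undefined) return cached;
  fullResolutions += 1; // the expensive re-resolution in the real system
  const result = `models@${key}`;
  readyCache.set(key, result);
  return result;
}

// A successful call bumps lastUsed, which rewrites the file and moves its mtime.
function markUsed(file: FakeFile): void {
  file.mtimeMs += 1;
}

function handleMessage(file: FakeFile): void {
  resolveModels(file);
  markUsed(file);
}

for (let i = 0; i < 3; i++) handleMessage(authProfiles);
console.log(fullResolutions); // 3: every message missed the cache
```

In this model the cache only hits when two resolutions happen with no successful call between them, which matches the observation that prewarm looks healthy while steady-state traffic never benefits.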

Fix Action

Fix / Workaround

Workaround for downstreams

Image-build patch on markAuthProfileUsed to elide the file write for lastUsed-only updates (return false from the updater unless cooldown / error-state changes). Tested anchor pattern is the same as managed runtime patches that wrap upstream dist/ files; happy to share specifics if useful.
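
A rough sketch of what such an updater could look like follows. It is not the reporter's patch: the store shape and the field names `cooldownUntil` and `errorCount` are assumptions made for illustration, and `markUsedUpdater` is a hypothetical name. The only idea taken from the text is that the updater returns false (no file write) unless cooldown or error state changes.

```typescript
// Hypothetical store shape; the real auth-profile store has more fields.
interface UsageStats {
  lastUsed?: number;
  cooldownUntil?: number; // assumed name for a cooldown marker
  errorCount?: number;    // assumed name for an error-state counter
}

interface AuthProfileStore {
  profiles: Record<string, unknown>;
  usageStats: Record<string, UsageStats>;
}

// Returns true only when the update must be persisted to disk.
function markUsedUpdater(
  store: AuthProfileStore,
  profileId: string,
  now: number,
): boolean {
  if (!Object.prototype.hasOwnProperty.call(store.profiles, profileId)) {
    return false;
  }
  const existing: UsageStats = store.usageStats[profileId] ?? {};
  const hadCooldownOrError =
    existing.cooldownUntil !== undefined || (existing.errorCount ?? 0) > 0;
  // Refresh the in-memory entry in every case ...
  store.usageStats[profileId] = { lastUsed: now };
  // ... but only request a file write when cooldown / error state was cleared.
  return hadCooldownOrError;
}
```

One trade-off of this approach: if the store is re-read from disk on each update, a skipped write means the persisted lastUsed value is only as fresh as the last cooldown or error-state transition.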

PR fix notes

PR #80375: perf: consolidate auth profile success writes

Description (problem / solution / changelog)

Summary

  • Add markAuthProfileSuccess to record last-good auth profile and successful usage stats in one locked auth-store update.
  • Use it after successful embedded model runs instead of separate markAuthProfileGood and markAuthProfileUsed writes.
  • Add coverage for canonical provider alias handling and successful usage-stat reset.
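
The effect of consolidating the two helpers can be illustrated with a small in-memory model. Everything here is a hypothetical stand-in (`Store`, `updateStoreWithLock`, `markGoodThenUsed`, `markSuccess`); it only shows why one locked update means one file write instead of two.

```typescript
// Hypothetical in-memory stand-in for the locked auth store.
interface Store {
  lastGood: Record<string, string>; // provider -> profile id
  usageStats: Record<string, { lastUsed: number }>;
}

let fileWrites = 0;
const store: Store = { lastGood: {}, usageStats: {} };

// Stand-in for a locked read-modify-write: one call, at most one file write.
function updateStoreWithLock(updater: (s: Store) => boolean): void {
  if (updater(store)) fileWrites += 1;
}

// Before: two helpers, two locked updates, two writes per successful run.
function markGoodThenUsed(provider: string, profileId: string, now: number): void {
  updateStoreWithLock((s) => {
    s.lastGood[provider] = profileId;
    return true;
  });
  updateStoreWithLock((s) => {
    s.usageStats[profileId] = { lastUsed: now };
    return true;
  });
}

// After: both facts recorded in a single locked update.
function markSuccess(provider: string, profileId: string, now: number): void {
  updateStoreWithLock((s) => {
    s.lastGood[provider] = profileId;
    s.usageStats[profileId] = { lastUsed: now };
    return true;
  });
}
```

Note that in this sketch the consolidated update still writes once per successful run, so on its own it reduces write volume but does not keep the file's mtime stable.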

Compatibility note

  • This intentionally removes the old markAuthProfileGood and markAuthProfileUsed exports from the deprecated openclaw/plugin-sdk/agent-runtime barrel. Repo-local usage has moved to markAuthProfileSuccess, and no in-repo imports of the old helper names remain on this PR branch.

Verification

  • git diff --check
  • pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/agents/auth-profiles/profiles.ts src/agents/auth-profiles.ts src/agents/pi-embedded-runner/run.ts src/agents/auth-profiles/order.test.ts
  • pnpm test src/agents/auth-profiles/order.test.ts src/agents/auth-profiles/usage.test.ts
  • pnpm test test/scripts/check-changelog-attributions.test.ts src/infra/changelog-unreleased.test.ts
  • pnpm tsgo:core

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/auth-profiles.ts (modified, +1/-2)
  • src/agents/auth-profiles/order.test.ts (modified, +23/-8)
  • src/agents/auth-profiles/profiles.test.ts (modified, +96/-54)
  • src/agents/auth-profiles/profiles.ts (modified, +32/-2)
  • src/agents/auth-profiles/usage-state.ts (modified, +1/-1)
  • src/agents/auth-profiles/usage.test.ts (modified, +0/-55)
  • src/agents/auth-profiles/usage.ts (modified, +0/-36)
  • src/agents/pi-embedded-runner/run.overflow-compaction.harness.ts (modified, +1/-2)
  • src/agents/pi-embedded-runner/run.ts (modified, +2/-8)
  • src/plugin-sdk/agent-runtime.ts (modified, +1/-2)

PR #73260: perf(models-config): content-hash auth-profiles + models.json drift detection

Description (problem / solution / changelog)

Summary

Splits the cache-fingerprint half of #72869 into a standalone PR.

This PR replaces mtime-based cache key inputs with content-based hashes for the ensureOpenClawModelsJson implicit-provider-discovery cache, plus a second-factor models.json content hash that catches external edits / partial corruption / sibling-process writes between cache hits. Includes the security hardening findings raised by Aisle on the original combined PR.

Why content hashes

The previous mtime-based key invalidated on every OAuth token refresh because auth-profiles.json gets rewritten with new access/refresh timestamps even when the set of available providers does not change. Same for models.json mtime: the file is the OUTPUT of this function, so each call observed its own write and invalidated the next call.

Now:

  • auth-profiles.json: SHA-256 over a stable serialization that strips volatile OAuth session fields (access, refresh, expires*, issuedAt, refreshed/lastChecked/lastRefresh/lastValidatedAt). Token rotation no longer invalidates; structural changes (added/removed profiles, rotated static type:token credentials) still do.
  • models.json: NOT included in the input fingerprint (would cause self-invalidation). Instead its content hash is captured at write time and stored alongside the readyCache entry. Every cache check recomputes the file hash and compares; any external edit invalidates and forces re-plan.
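
As an illustration of the auth-profiles.json half, here is a minimal sketch of a content hash that ignores volatile OAuth session fields. It is not the PR's implementation: `VOLATILE_FIELDS`, `isVolatile`, `stripAndSort`, and `hashAuthProfiles` are hypothetical names, the field list is copied from the description above, and treating `expires*` as a prefix match is an assumption.

```typescript
import { createHash } from "node:crypto";

// Field names taken from the PR description; "expires*" is approximated
// here as a prefix match, which is an assumption of this sketch.
const VOLATILE_FIELDS = new Set([
  "access", "refresh", "issuedAt", "refreshed",
  "lastChecked", "lastRefresh", "lastValidatedAt",
]);

function isVolatile(key: string): boolean {
  return VOLATILE_FIELDS.has(key) || key.startsWith("expires");
}

// Recursively drops volatile keys and sorts the rest so that the
// serialization does not depend on key order in the file.
function stripAndSort(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(stripAndSort);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = Object.create(null);
    for (const key of Object.keys(source).sort()) {
      if (isVolatile(key)) continue;
      out[key] = stripAndSort(source[key]);
    }
    return out;
  }
  return value;
}

// SHA-256 hex digest over the stripped, canonical serialization.
function hashAuthProfiles(parsed: unknown): string {
  const canonical = JSON.stringify(stripAndSort(parsed));
  return createHash("sha256").update(canonical).digest("hex");
}
```

With this shape, two stores that differ only in `access`, `refresh`, or `expiresAt` hash identically, while a changed `token` value on a `type: "token"` profile produces a different hash because `token` is deliberately not in the volatile set.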

Security hardening (Aisle review on PR #72869)

Severity | Finding | Fix
🟠 High #1 | CWE-59 symlink-following chmod | ensureModelsFileMode now lstats first; refuses to chmod symlinks or non-regular files
🟡 Med #3 | CWE-1321 prototype pollution | Object.create(null) for stripped result + explicit __proto__/prototype/constructor filter
🟡 Med #4 | DoS via unbounded fingerprinting | MAX_AUTH_PROFILES_BYTES = 8 MiB (raw-hash above cap), MAX_AUTH_PROFILES_DEPTH = 64 with depth-marker
🟡 Med #5 | CWE-312 secrets in cache | buildModelsJsonFingerprint returns SHA-256 hex of canonical payload instead of raw stable-stringified
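
The first finding (lstat before chmod) can be sketched as follows. `chmodRegularFileOnly` is a hypothetical helper written for illustration, not the PR's ensureModelsFileMode; it only shows the general pattern of inspecting the path without following symlinks before changing its mode.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper in the spirit of the CWE-59 fix: inspect the path with
// lstat (which does not follow symlinks) and only chmod regular files.
function chmodRegularFileOnly(filePath: string, mode: number): boolean {
  let stats: fs.Stats;
  try {
    stats = fs.lstatSync(filePath);
  } catch {
    return false; // missing or unreadable path: nothing to do
  }
  if (stats.isSymbolicLink() || !stats.isFile()) return false;
  fs.chmodSync(filePath, mode);
  return true;
}

// Usage: a regular file is chmod-ed, a symlink pointing at it is refused.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "chmod-sketch-"));
const target = path.join(dir, "models.json");
const link = path.join(dir, "models-link.json");
fs.writeFileSync(target, "{}");
fs.symlinkSync(target, link);
const changedFile = chmodRegularFileOnly(target, 0o600);
const changedLink = chmodRegularFileOnly(link, 0o644);
```

A check-then-act sequence like this still leaves a small window between the lstat and the chmod, so it should be read as an illustration of the guard rather than a complete defence.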

Token-rotation correctness

token is intentionally NOT in AUTH_PROFILE_VOLATILE_FIELDS even though OAuth-style token fields rotate. Profiles with type: "token" use the literal token key as a long-lived static credential, and stripping it would mask real auth-state changes when a user rotates that credential. Documented inline.

Tests

Two tests in models-config.fingerprint-cache.test.ts:

  • does not invalidate the cache when OAuth session fields rotate (rotates access/refresh/expires; cache stays valid)
  • DOES invalidate the cache when a static type:token credential rotates (rotates the literal token field; cache invalidates)

Existing fingerprint-cache test suite still passes.

Compatibility

MODELS_JSON_STATE.readyCache value shape extended with modelsJsonHash: { fingerprint, modelsJsonHash, result }. All three plan return paths (skip/noop/write) capture the post-write hash. The refreshedFingerprint re-key path forwards modelsJsonHash through unchanged.
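
A cache check against the extended entry shape might look roughly like the sketch below. The three field names come from the description above; `ReadyCacheEntry`, `sha256Hex`, and `lookup` are hypothetical names, and passing the file content in as a string is a simplification of this sketch.

```typescript
import { createHash } from "node:crypto";

// Shape from the PR description; the result type is left generic here.
interface ReadyCacheEntry<T> {
  fingerprint: string;
  modelsJsonHash: string;
  result: T;
}

function sha256Hex(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Hypothetical lookup: a hit needs both the input fingerprint and the
// current models.json content to match what was recorded at write time.
function lookup<T>(
  entry: ReadyCacheEntry<T> | undefined,
  fingerprint: string,
  currentModelsJson: string,
): T | undefined {
  if (!entry || entry.fingerprint !== fingerprint) return undefined;
  if (entry.modelsJsonHash !== sha256Hex(currentModelsJson)) return undefined;
  return entry.result;
}
```

Keeping the models.json hash beside the entry, instead of inside the fingerprint, is what avoids the self-invalidation described earlier: the function's own write updates the stored hash rather than changing the key.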

Related

  • Splits out of #72869 (cache-fingerprint half)
  • The targetProvider short-circuit half is in #73261

Real behavior proof

  • Behavior or issue addressed: Models-config cache fingerprint fail-closed behavior for unhashable/oversize models.json and auth-profiles.json from PR #73260.

  • Real environment tested: Local OpenClaw topic branch perf/models-config-cache-fingerprint at 05fda3d839, isolated temporary agent directory, production ensureOpenClawModelsJson invoked through a tsx runtime driver with no Vitest mocks.

  • Exact steps or command run after this patch: Ran the proof driver after the patch to warm cache on a small auth profile, grow auth-profiles.json past the 8 MiB cap, repeat calls with byte-identical and swapped oversize contents, restore a small file, then exercise oversize and symlinked models.json.

  • Evidence after fix: Full copied runtime trace is in the PR comment: https://github.com/openclaw/openclaw/pull/73260#issuecomment-4384857383

    Excerpt of copied live output:

    ensureOpenClawModelsJson warm cache hit: ~1 ms
    auth-profiles.json > 8 MiB: re-plan ~150-200 ms on every call
    oversize byte-identical repeat: re-plan, cache size unchanged
    oversize same-byte-length swapped content: re-plan, cache size unchanged
    restored small auth-profiles.json: cache hit restored (~1 ms)
    oversize models.json: full plan + rewrite
    symlinked models.json: full plan + rewrite
  • Observed result after fix: Uncacheable models content never matches cached state, oversize auth profiles bypass the ready cache instead of collapsing to a stale fingerprint, and restoring cacheable content re-enables fast hits.

  • What was not tested: Nothing else for this PR's changed behavior beyond the isolated local runtime proof and supplemental automated validation.

Changed files

  • CHANGELOG.md (modified, +5/-0)
  • src/agents/models-config-state.ts (modified, +49/-2)
  • src/agents/models-config.fingerprint-cache.test.ts (added, +469/-0)
  • src/agents/models-config.ts (modified, +546/-60)


Version

openclaw@2026.5.7 (npm, latest at time of report).

Reproduction

  1. Configure an agent with a plugin-backed provider/model (verified with kimi/kimi-code via api-key auth profile; expected to reproduce for any non-static provider).
  2. Pair the agent and let prewarmConfiguredPrimaryModel complete (logs sidecars.model-prewarm:<n>ms — completes in ~700 ms with a single configured model).
  3. Send three back-to-back messages and look at the embedded-run startup-stages traces.

Observed (agent/embedded warn):

totalMs=6321   stages=… model-resolution:6304ms@6312ms,auth:3ms@…
totalMs=13206  stages=… model-resolution:12873ms@12885ms,auth:2ms@…
totalMs=12486  stages=… model-resolution:12473ms@12480ms,auth:1ms@…

model-resolution accounts for >97 % of startup-stages time on every message; auth, runtime-plugins, hooks, and context-engine are all single-digit ms. Subsequent messages are not faster than the first — the cache is missing every time, not gradually warming.
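
The ">97 %" figure can be checked directly from the three trace lines. The parser below is a small sketch written for this page; `modelResolutionShare` is a hypothetical helper, and the regular expressions only assume the `totalMs=` and `model-resolution:<n>ms` tokens visible in the excerpt.

```typescript
// The three trace lines from the report; "…" marks elided stages.
const traces = [
  "totalMs=6321   stages=… model-resolution:6304ms@6312ms,auth:3ms@…",
  "totalMs=13206  stages=… model-resolution:12873ms@12885ms,auth:2ms@…",
  "totalMs=12486  stages=… model-resolution:12473ms@12480ms,auth:1ms@…",
];

// Returns model-resolution time as a percentage of totalMs, or null if
// the line does not contain both numbers.
function modelResolutionShare(line: string): number | null {
  const total = /totalMs=(\d+)/.exec(line);
  const stage = /model-resolution:(\d+)ms/.exec(line);
  if (!total || !stage) return null;
  return (Number(stage[1]) / Number(total[1])) * 100;
}

const shares = traces.map(modelResolutionShare);
console.log(shares.map((s) => (s === null ? "n/a" : s.toFixed(1) + "%")));
```

For these lines the shares work out to roughly 99.7 %, 97.5 %, and 99.9 %.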

Expected behavior

Once prewarmConfiguredPrimaryModel populates the cache, subsequent in-process model resolutions for the same configured model should reuse the cached result. Per-message model-resolution should be sub-100 ms with a hit.

Suggested fixes

Two paths, in order of decreasing surgical-ness:

  1. Decouple lastUsed writes from the credential-bearing portion of auth-profiles.json. Persist usageStats in a sibling file (e.g., auth-profiles-usage.json), or skip the file write for lastUsed-only updates and only flush on cooldown / error-state transitions. Either way, the credential-bearing part of auth-profiles.json keeps a stable mtime under steady-state traffic.

  2. Drop authProfilesMtimeMs (and modelsFileMtimeMs) from the fingerprint in favor of a content hash of just the credential-bearing fields (and for models.json, the resolved-model identity). The fingerprint already includes the full params.config; what an mtime check adds is an out-of-band invalidation signal for credential rotation done outside the planner. Replacing the mtime with a content hash of the fields that actually affect model resolution gives the same correctness without false invalidations on usage-stats writes.

(1) is lower-risk and doesn't change the cache invariant. (2) is a deeper fix and would also resolve any other case where the file mtime ticks without semantically meaningful changes.
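
The sibling-file variant of (1) can be sketched as below. The file name auth-profiles-usage.json comes from the suggestion above; `recordLastUsed` and the stored shape are hypothetical, and the sketch leaves out the locking and atomic-write handling a real store would need.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: usage stats live in a sibling file, so recording a
// successful call never touches auth-profiles.json.
function recordLastUsed(agentDir: string, profileId: string, now: number): void {
  const usagePath = path.join(agentDir, "auth-profiles-usage.json");
  let usage: Record<string, { lastUsed: number }> = {};
  try {
    usage = JSON.parse(fs.readFileSync(usagePath, "utf8"));
  } catch {
    // first write or unreadable file: start from an empty map
  }
  usage[profileId] = { lastUsed: now };
  fs.writeFileSync(usagePath, JSON.stringify(usage));
}

// Usage: the credential file's mtime is unchanged after a usage update.
const agentDir = fs.mkdtempSync(path.join(os.tmpdir(), "usage-sketch-"));
const authPath = path.join(agentDir, "auth-profiles.json");
fs.writeFileSync(authPath, JSON.stringify({ profiles: { p1: { type: "api_key" } } }));
const before = fs.statSync(authPath).mtimeMs;
recordLastUsed(agentDir, "p1", Date.now());
const after = fs.statSync(authPath).mtimeMs;
```

Because auth-profiles.json is never opened for writing here, an mtime-keyed fingerprint over it would stay stable under steady-state traffic, which is the property option (1) is after.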

Why this is hard to spot

  • Local single-user development rarely sends 3+ back-to-back messages and instruments the embedded-run startup-stages tracer simultaneously, so the per-message overhead reads as "model is slow" and gets attributed to the provider rather than the cache.
  • For static providers (anthropic/openai/etc. configured directly in code) the fast path through resolveModelAsync({skipPiDiscovery:true}) succeeds without consulting MODELS_JSON_STATE.readyCache, so the bug is invisible. It only bites plugin-backed providers (e.g., kimi-coding).
  • The cache exists and is wired up correctly; the symptom doesn't look like a cache bug because prewarm succeeds in <1 s.
