openclaw - 💡(How to fix) Fix [Bug]: Cold-path auth resolution: ~4s on every cold dispatch (warm 2-4ms) [3 comments, 2 participants]

openclaw · 2026-05-05 19:50:08

GitHub stats

openclaw/openclaw#78041 • Fetched 2026-05-06 06:17:39

Author

joking100182

Participants

clawsweeper[bot]

joking100182

Timeline (top)

commented ×3 • labeled ×2 • mentioned ×1 • subscribed ×1

Auth resolution shows a strong bimodal distribution: warm/cached path completes in 2-4 ms, cold path consistently lands in the 4138-4647 ms range, recurring every 1-15 minutes throughout the day.

Other providers configured but not in active use for this measurement: anthropic (currently in cooldown for billing/429), openai. Workload during measurement: Discord-driven multi-agent dispatch (Eddie + several worker agents).

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Auth resolution shows a strong bimodal distribution: warm/cached path completes in 2-4 ms, cold path consistently lands in the 4138-4647 ms range, recurring every 1-15 minutes throughout the day.

Steps to reproduce

1. Configure an agent that uses the openai-codex provider via OAuth.
2. Send a dispatch, wait long enough for the in-memory access token to be considered stale, then send another.
3. Observe the bimodal auth timing in the journal: warm dispatches at 2-4 ms, cold dispatches at ~4000+ ms.

Expected behavior

Cold-path auth resolution should land in approximately the same range as warm-path (2-4 ms), or at minimum within a single-digit-hundred ms range, since both paths resolve the same provider credential.

Actual behavior

Distribution of auth-stage timings over a 24h window, extracted from journald via:

journalctl -u openclaw.service --since 24h | grep -oE 'auth:[0-9]+ms@[0-9]+ms' | sort | uniq -c

Two distinct clusters:

- Warm/cached path: auth = 2-4 ms (vast majority of dispatches)
- Cold path: 17 distinct cold hits in the 4138-4647 ms range over 24h. Sample values: 4138, 4144, 4145, 4148, 4163, 4168, 4180, 4249, 4296, 4326, 4368, 4485, 4647 ms.

Cold hits recur every 1-15 minutes throughout the day, NOT once-per-process at startup. This rules out a one-time cold-start cost.
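
For anyone wanting to reproduce the clustering without eyeballing `uniq -c` output, the same timings could be bucketed with a few lines of Node. This is a minimal sketch under assumptions: `bucketAuthTimings`, the 100 ms warm/cold threshold, and the sample lines are made up for illustration; only the `auth:<n>ms@<n>ms` token shape is taken from the grep pattern above, and the value after `@` is not interpreted.

```javascript
// Hypothetical helper: bucket auth-stage timings parsed from journal lines.
// The 100 ms warm/cold threshold is an assumption chosen for illustration.
const AUTH_RE = /auth:(\d+)ms@(\d+)ms/g;

function bucketAuthTimings(lines, coldThresholdMs = 100) {
  const warm = [];
  const cold = [];
  for (const line of lines) {
    for (const match of line.matchAll(AUTH_RE)) {
      const authMs = Number(match[1]); // first number is the auth-stage duration
      (authMs >= coldThresholdMs ? cold : warm).push(authMs);
    }
  }
  return { warm, cold };
}

// Synthetic sample lines (not real logs); only the auth token matters.
const sample = [
  'dispatch stages auth:3ms@12ms',
  'dispatch stages auth:4138ms@4150ms',
  'dispatch stages auth:2ms@9ms',
];
const { warm, cold } = bucketAuthTimings(sample);
console.log(`warm=${warm.length} cold=${cold.length} maxCold=${Math.max(...cold)}ms`);
// prints "warm=2 cold=1 maxCold=4138ms"
```

Real journal output would need to be piped in (for example by reading stdin line by line); that part is omitted here.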

OpenClaw version

2026.5.3-1

Operating system

Ubuntu 24.04 LTS (LXC container, kernel 6.17.2-1-pve on Proxmox 9.1.1)

Install method

npm global (published package at /usr/lib/node_modules/openclaw)

Model

gpt-5.5 (via OpenAI Codex OAuth)

Provider / routing chain

openclaw -> openai-codex (OAuth)

Additional provider/model setup details

Logs, screenshots, and evidence

## Code path traced (read-only, against installed dist files)

Auth phase entry (in `pi-embedded-CElEZtBc.js`):
- Line 852: `getApiKeyForModel({ store: params.authStore, ... })` called inside the auth stage
- Line 1714: `const authStore = pluginHarnessOwnsTransport ? createEmptyAuthProfileStore() : ensureAuthProfileStoreWithoutExternalProfiles(agentDir, { allowKeychainPrompt: false });`  -- per-call profile-store ensure
- Lines 886/902/921/934: `params.authStorage.setRuntimeApiKey(...)` writes runtime API key after resolution

Refresh adapter and per-call profile lookup (in attempt-dispatch logic):
- Line 401 hook: `refreshOAuth: async (cred) => await refreshOpenAICodexOAuthCredential(cred)`  -- registered as the refresh adapter for the provider
- Line 365: `if (listProfilesForProvider(ensureAuthProfileStoreForLocalUpdate(ctx.agentDir), PROVIDER_ID).length === 0) return null;`  -- profile store is re-read from disk on each call

(Logs are extracted from `journalctl -u openclaw.service` on the host; redacted/synthetic samples can be provided on request.)

Impact and severity

Affected users/systems/channels: any agent using the openai-codex provider via OAuth on cold dispatches
Severity: annoying (adds ~4 s latency per cold dispatch but does not break functionality or cause errors)
Frequency: intermittent (17 cold hits in 24h, recurring every 1-15 minutes)
Consequence: noticeable user-facing latency on conversational dispatches that hit the cold path; warm path is unaffected

Additional information

Speculative -- hypotheses based on code-path tracing, not direct measurement

The sections below are inferred from the code paths above and have NOT been confirmed by structured timing logs around each sub-step. Filed here for triage context only; happy to add instrumentation in a PR if useful.

What this is NOT

Not the 10-min Anthropic SDK timeout (separate code path, not engaged for live Codex traffic)
Not the Anthropic billing 429 (already cooled down; surfaces as 429 and is handled by fallback)
Not attempt-dispatch.auth provider lookup itself (ruled out via cache validation in a prior session)

Suspected causes (ranked by likely contribution)

1. OAuth access-token refresh roundtrip: getApiKeyForModel for the openai-codex provider triggers refreshOpenAICodexOAuthCredential. On a small container this is plausibly a 1-3 s cost (TLS handshake + OAuth POST + JSON response). The long-lived Codex refresh token is reused, but the short-lived access token is re-fetched whenever the in-memory copy is stale.
2. Per-call profile-store reload: ensureAuthProfileStoreWithoutExternalProfiles(agentDir) and ensureAuthProfileStoreForLocalUpdate(agentDir) re-read the agent auth-profiles.json from disk on every invocation. Under load this likely adds tens to hundreds of ms, but it is unlikely to explain the ~4 s on its own.
3. Dynamic await import of agents/auth-profiles.runtime.js inside call sites: Node should cache the module after first import, but if multiple worker contexts each import on a cold path, each pays the cost once.

Suggested fix shape (open to maintainer judgment)

Two complementary changes that together should bring the cold path much closer to warm:

1. In-process OAuth access-token cache keyed by (provider, profileId), with TTL slightly less than the provider access-token expiry (so subsequent cold dispatches inside the TTL window skip the refresh roundtrip entirely). This addresses the dominant ~1-3 s OAuth cost.
2. Memoize ensureAuthProfileStoreWithoutExternalProfiles(agentDir) per-agentDir for the lifetime of a single dispatch attempt (or with a short TTL across attempts). Cheap, surgical; trims the per-call disk reads.

Cache invalidation contract that probably matters: invalidate (or refresh) the OAuth cache entry on any 401/403 from the provider, on any explicit auth-state mutation, and on agent-dir change.
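
The per-agentDir memoization could look roughly like the following. This is a sketch only: `memoizePerAgentDir`, its `get`/`invalidate` methods, and the injected clock are hypothetical names, and `loadStore` merely stands in for the real loader, whose actual signature and return type have not been checked here.

```javascript
// Hypothetical TTL memoizer keyed by agentDir. `loadStore` stands in for the
// real loader (for example ensureAuthProfileStoreWithoutExternalProfiles);
// its real signature is an assumption.
function memoizePerAgentDir(loadStore, ttlMs, now = Date.now) {
  const cache = new Map(); // agentDir -> { value, expiresAt }

  function get(agentDir) {
    const hit = cache.get(agentDir);
    if (hit && hit.expiresAt > now()) return hit.value;
    const value = loadStore(agentDir); // cache miss: the real loader reads from disk
    cache.set(agentDir, { value, expiresAt: now() + ttlMs });
    return value;
  }

  // Call on explicit auth-state mutation or when the agent directory changes.
  function invalidate(agentDir) {
    if (agentDir === undefined) cache.clear();
    else cache.delete(agentDir);
  }

  return { get, invalidate };
}

// Usage with a fake loader that counts how often it is called.
let loads = 0;
const store = memoizePerAgentDir((dir) => ({ dir, loadedAt: ++loads }), 1000);
store.get('/agents/a');
store.get('/agents/a'); // served from cache
store.invalidate('/agents/a');
store.get('/agents/a'); // reloaded after invalidation
console.log(loads); // prints 2
```

Injecting the clock keeps the TTL behaviour testable without sleeping. Whether a cached store object is safe to share across attempts depends on whether callers mutate it, which is not established in this issue.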

What we have NOT done

We have NOT instrumented getApiKeyForModel or refreshOpenAICodexOAuthCredential with structured timing logs. The 4138-4647 ms attribution to OAuth refresh is inferred from code-path tracing, not direct measurement of each sub-step. Happy to add that instrumentation in a PR if useful.
We have NOT opened a PR; this issue is for triage and to confirm we are reading the code paths correctly.
We have NOT modified anything in /usr/lib/node_modules/openclaw aside from one unrelated in-place dist patch (Anthropic SDK timeout: 60000 inside mantle-anthropic.runtime), which is documented separately in our internal handoff and is not the subject of this issue.
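
The missing instrumentation could be as small as a wrapper that times each sub-step and emits one structured line per call. The sketch below is hypothetical: `timed` and the `{ stage, ms, ok }` log shape are assumptions, not OpenClaw's logging format, and `fakeRefresh` is a 50 ms timer standing in for the real refresh call.

```javascript
const { performance } = require('node:perf_hooks');

// Hypothetical wrapper: times any async step and emits one structured log line.
// The log shape ({ stage, ms, ok }) is an assumption, not OpenClaw's format.
async function timed(stage, fn, log = console.log) {
  const start = performance.now();
  let ok = true;
  try {
    return await fn();
  } catch (err) {
    ok = false;
    throw err; // rethrow so callers still see the original failure
  } finally {
    const ms = Math.round(performance.now() - start);
    log(JSON.stringify({ stage, ms, ok }));
  }
}

// Usage with a stand-in for the refresh call (a 50 ms timer, not a real request).
async function fakeRefresh() {
  await new Promise((resolve) => setTimeout(resolve, 50));
  return 'token';
}

timed('oauth-refresh', fakeRefresh).then((token) => console.log(token));
```

Wrapping the refresh call, the profile-store load, and the dynamic import separately would show how the ~4 s splits across the three suspected causes.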

Thanks for openclaw.

extent analysis

TL;DR

Implement an in-process OAuth access-token cache and memoize the ensureAuthProfileStoreWithoutExternalProfiles function to reduce cold path latency.

Guidance

Implement OAuth access-token cache: Create a cache that stores access tokens for each provider and profile ID, with a TTL slightly less than the provider's access token expiry.
Memoize ensureAuthProfileStoreWithoutExternalProfiles: Cache the result of this function per agent directory for a short period to reduce disk reads.
Invalidate cache on auth-state changes: Invalidate the OAuth cache entry on 401/403 errors, explicit auth-state mutations, or agent directory changes.
Verify cache effectiveness: Add structured timing logs to measure the impact of the cache on cold path latency.
Test and refine: Test the changes and refine the cache implementation as needed to ensure it effectively reduces cold path latency.

Example

// Example OAuth access-token cache (sketch; fetchAccessToken is supplied by the caller)
const oauthCache = new Map();
const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes; must stay below the real access-token lifetime

async function getAccessToken(provider, profileId, fetchAccessToken) {
  const cacheKey = `${provider}:${profileId}`;
  const cached = oauthCache.get(cacheKey);
  if (cached && cached.expires > Date.now()) {
    return cached.token; // still fresh: skip the refresh round trip
  }
  // Fetch a new access token (assumed to be async) and cache it
  const token = await fetchAccessToken(provider, profileId);
  oauthCache.set(cacheKey, { token, expires: Date.now() + CACHE_TTL_MS });
  return token;
}
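
The invalidation guidance (drop the entry on 401/403) and the "TTL slightly less than expiry" rule could be combined as sketched below. Everything here is an assumption for illustration: `createTokenCache`, the `{ token, expiresInMs }` shape returned by `refresh`, the 30 s safety margin, and the `{ status }` shape returned by `request` are hypothetical and not taken from the OpenClaw code.

```javascript
// Hypothetical sketch: a token cache whose entry is dropped when the provider
// rejects the token. `refresh` and `request` are caller-supplied stand-ins.
function createTokenCache(refresh, safetyMarginMs = 30000) {
  const entries = new Map(); // "provider:profileId" -> { token, expiresAt }

  async function getToken(provider, profileId) {
    const key = `${provider}:${profileId}`;
    const hit = entries.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.token;
    const { token, expiresInMs } = await refresh(provider, profileId);
    // Expire slightly before the provider does, so a stale token is never served.
    entries.set(key, { token, expiresAt: Date.now() + expiresInMs - safetyMarginMs });
    return token;
  }

  function invalidate(provider, profileId) {
    entries.delete(`${provider}:${profileId}`);
  }

  // Runs request(token); on a 401/403 status, invalidates and retries once.
  async function withAuth(provider, profileId, request) {
    let res = await request(await getToken(provider, profileId));
    if (res.status === 401 || res.status === 403) {
      invalidate(provider, profileId);
      res = await request(await getToken(provider, profileId));
    }
    return res;
  }

  return { getToken, invalidate, withAuth };
}
```

A single retry keeps a revoked token from causing a loop. Concurrent cold dispatches would still each trigger their own refresh in this sketch; deduplicating in-flight refreshes is left out.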

Notes

The suggested fix shape is based on the provided code paths and may require refinement or additional changes to effectively address the issue. The cache implementation should be tested and verified to ensure it reduces cold path latency as expected.

Recommendation

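Apply the suggested fix: implement an in-process OAuth access-token cache and memoize the ensureAuthProfileStoreWithoutExternalProfiles function. Because the attribution of the ~4 s to the OAuth refresh is inferred rather than measured, add timing instrumentation first to confirm where the time goes.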
