openclaw - ✅(Solved) Fix gateway/usage: costUsageCache has no cap/prune — stale entries accumulate across days [5 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#68841 (fetched 2026-04-19 15:06:52)
Comments: 0
Participants: 1
Timeline: 10
Reactions: 0
Timeline (top): cross-referenced ×5, referenced ×5

Fix Action

Fixed

PR fix notes

PR #68842: fix(gateway): bound costUsageCache with MAX + FIFO eviction

Description (problem / solution / changelog)

Summary

  • Problem: `costUsageCache` in `src/gateway/server-methods/usage.ts:65` has no delete/prune/evict path. The 30s TTL only gates stale reads; on a miss after expiry, `set()` overwrites the same key but never removes stale keys. `parseDateRange` derives `cacheKey` from `getTodayStartMs`, so `cacheKey` rolls at every UTC 00:00, and additional axes (days / startDate / endDate / utcOffset) multiply cardinality.
  • Why it matters: the macOS menu polls usage.cost every ~45s with no params (`MenuSessionsInjector.swift`), exercising `parseDateRange`'s default branch on every UTC day rollover. Over gateway uptime the Map grows monotonically.
  • What changed: adds `COST_USAGE_CACHE_MAX = 256` + a `setCostUsageCache` helper that evicts the oldest key when a new key would exceed the cap. Mirrors the pattern already used by `resolvedSessionKeyByRunId`, `TRANSCRIPT_SESSION_KEY_CACHE`, and `sessionTitleFieldsCache` in the same subsystem.
  • What did NOT change: TTL-on-read, in-flight dedup, and overwrite-on-same-key semantics are all preserved. No public API or config surface touched.
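
The MAX + FIFO eviction described above could look roughly like the following. This is a minimal sketch, not the PR's actual diff: the `CostUsageCacheEntry` shape is an assumption inferred from the fields mentioned in this page (`summary`, `updatedAt`, `inFlight`), and the helper body is illustrative.

```typescript
// Assumed entry shape; the real CostUsageCacheEntry type is not shown in the PR text.
type CostUsageCacheEntry = {
  summary?: unknown;
  updatedAt?: number;
  inFlight?: Promise<unknown>;
};

const COST_USAGE_CACHE_MAX = 256;
const costUsageCache = new Map<string, CostUsageCacheEntry>();

function setCostUsageCache(cacheKey: string, entry: CostUsageCacheEntry): void {
  // Overwriting an existing key never grows the Map, so evict only for new keys.
  if (!costUsageCache.has(cacheKey) && costUsageCache.size >= COST_USAGE_CACHE_MAX) {
    // A Map iterates in insertion order, so the first key is the oldest inserted.
    const oldestKey = costUsageCache.keys().next().value;
    if (oldestKey !== undefined) {
      costUsageCache.delete(oldestKey);
    }
  }
  costUsageCache.set(cacheKey, entry);
}
```

Because eviction relies on Map insertion order, no timestamps or extra bookkeeping are needed. Re-setting an existing key keeps its original position, which is what makes this FIFO rather than LRU.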

Change Type

  • Bug fix

Scope

  • gateway

Linked Issue

Closes #68841

Root Cause

The write path never considered cache size. Missing guardrail: a bound matching the sibling caches in the same file tree. Three other gateway caches (`resolvedSessionKeyByRunId`, `TRANSCRIPT_SESSION_KEY_CACHE`, `sessionTitleFieldsCache`) already implement MAX + FIFO eviction; `costUsageCache` alone was an outlier.

Regression Test Plan

Added `src/gateway/server-methods/usage.cost-usage-cache.test.ts`:

  • Drives growth through `__test.loadCostUsageSummaryCached` (same seam `usage.test.ts` already uses).
  • Only the external `loadCostUsageSummary` dependency is mocked (same pattern as the existing `usage.test.ts` top-level `vi.mock`).
  • 600 distinct (startMs, endMs) pairs that mirror day rollover + range switches.

Pre-fix: Map grows to 600. Post-fix: Map plateaus at the cap, the last-inserted key is retained, and the first-inserted key is evicted (FIFO).
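
The pre-fix and post-fix behaviour can be illustrated without the real test seam. The sketch below is a self-contained simulation, not the added Vitest file: `fillCache` is a hypothetical helper, and the way the 600 (startMs, endMs) pairs are generated is an assumption chosen to mimic day rollover.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Insert `count` distinct "startMs-endMs" keys into a fresh Map, evicting the
// oldest key (FIFO) whenever a new key would exceed `cap`.
function fillCache(count: number, cap: number): Map<string, number> {
  const cache = new Map<string, number>();
  for (let i = 0; i < count; i++) {
    const startMs = i * DAY_MS; // simulated day rollover
    const endMs = startMs + 30 * DAY_MS; // simulated 30-day range
    const key = `${startMs}-${endMs}`;
    if (!cache.has(key) && cache.size >= cap) {
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    cache.set(key, i);
  }
  return cache;
}

const unbounded = fillCache(600, Number.POSITIVE_INFINITY); // pre-fix: no cap
const bounded = fillCache(600, 256); // post-fix: cap of 256
const firstKey = `0-${30 * DAY_MS}`;
const lastKey = `${599 * DAY_MS}-${629 * DAY_MS}`;
```

In this simulation `unbounded.size` reaches 600, while `bounded` plateaus at 256, keeps `lastKey`, and no longer contains `firstKey`.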

Security Impact

None. No new permissions, secrets, network calls, or data scope change.

Repro + Verification

Environment: Node 22, macOS, `pnpm 10.33.0`.

Steps:

  1. `pnpm test src/gateway/server-methods/usage.cost-usage-cache.test.ts`

Expected (post-fix): 1 passed. Actual (pre-fix): the primary assertion `size < 600` failed (all 600 entries retained).

Evidence

  • Pre-fix (stashed): 1 failed.
  • Post-fix: 1 passed.
  • Full `src/gateway/server-methods/` suite: 36 files / 432 tests passed.
  • `pnpm check` + `pnpm build` clean.

Human Verification

  • Confirmed sibling pattern in `src/gateway/server-session-key.ts:22-34`, `src/gateway/session-transcript-key.ts:141-150`, and `src/gateway/session-utils.fs.ts:57-69`.
  • Verified the three `costUsageCache.set(...)` call sites (line 329 on-success, 342 in-flight clear, 347 initial insert) all route through the new helper.
  • `MAX = 256` matches the smaller sibling caps (RUN_LOOKUP_CACHE_LIMIT = 256, TRANSCRIPT_SESSION_KEY_CACHE_MAX = 256) rather than the largest (sessionTitleFieldsCache = 5000); 256 is sufficient headroom for the day × range × utcOffset axes the cache is actually indexed by.

Review Conversations

Greptile + Codex reviews will run on the PR; will respond to any flagged items.

Prior Art

  • PR #36682 (CLOSED, author self-closed): attempted a related LRU + MAX=64 eviction with broader scope. This PR differs: FIFO (not LRU) matches the three siblings in this file tree; MAX=256 matches those siblings; scope is strictly this one cache so the change stays XS.
  • PR #56318 (OPEN, bundled change covering multiple gateway areas): does not touch `costUsageCache`. This PR is intentionally scoped to the single outlier.

Compatibility / Migration

None. Internal behavioral fix; no public API or config surface touched.

Risks and Mitigations

  • Risk: a cap of 256 is too aggressive and forces frequent re-fetches of cold entries. Mitigation: 256 matches two sibling caps in the same subsystem; usage queries are operator-facing and tolerate a cold re-fetch (entries are only served for ~30s under the TTL anyway).
  • Risk: FIFO evicts a key that is still hot (e.g. a pinned date range an operator keeps switching back to). Mitigation: the three sibling caches use the same FIFO semantics; LRU would diverge from the established pattern and requires extra bookkeeping. If hotness becomes a concern, a follow-up PR can promote all four caches to LRU together.
  • Risk: eviction during an in-flight request. Mitigation: `setCostUsageCache` preserves the existing 3-step write flow (initial insert with inFlight, on-success set, in-flight clear); eviction only happens when adding a new key. An in-flight entry for the same key is overwritten, not evicted.

AI-assisted (fully tested). Generated via openclaw-audit pipeline (gatekeeper approve + post-harness cross-review 5/5 real-problem-real-fix + pre-pr cross-review 3/3 real-problem-real-fix).

Changed files

  • src/gateway/server-methods/usage.cost-usage-cache.test.ts (added, +90/-0)
  • src/gateway/server-methods/usage.ts (modified, +18/-3)

PR #68881: fix(gateway): cap and prune costUsageCache to prevent unbounded growth

Description (problem / solution / changelog)

Summary

costUsageCache in usage.ts stores CostUsageSummary results keyed by (startMs, endMs). The 30s TTL only causes stale reads to be ignored on the read path — it never deletes entries from the Map. Each day a new todayStartMs produces a new cache key, so entries accumulate indefinitely on long-running gateway processes.

Unlike the other module-level caches in the same codebase (sessionTitleFieldsCache with FIFO evict, resolvedSessionKeyByRunId with oldest-first evict), costUsageCache has no cap or prune mechanism.

Changes

  • Add MAX_COST_USAGE_CACHE_ENTRIES = 100 constant
  • Add pruneCostUsageCache(now) helper with two phases:
    1. TTL prune: Remove entries where updatedAt is past TTL and no inFlight promise is active
    2. FIFO evict: Delete oldest entries (Map insertion order) when size exceeds cap
  • Call pruneCostUsageCache after each successful summary load (the .then() callback)
  • Expose pruneCostUsageCache via __test for unit testing
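
The two phases listed above might be sketched as follows. This is an illustration under stated assumptions rather than the PR's code: the entry shape is inferred from the field names in this description, and the exact expiry comparison is a guess.

```typescript
// Assumed entry shape, inferred from the fields named in the PR description.
type CostUsageCacheEntry = {
  summary?: unknown;
  updatedAt?: number;
  inFlight?: Promise<unknown>;
};

const COST_USAGE_CACHE_TTL_MS = 30_000;
const MAX_COST_USAGE_CACHE_ENTRIES = 100;
const costUsageCache = new Map<string, CostUsageCacheEntry>();

function pruneCostUsageCache(now: number): void {
  // Phase 1 (TTL prune): drop entries past the TTL unless a load is in flight.
  // Deleting the current key while iterating a Map is safe in JavaScript.
  for (const [key, entry] of costUsageCache) {
    const expired =
      entry.updatedAt !== undefined && now - entry.updatedAt > COST_USAGE_CACHE_TTL_MS;
    if (expired && !entry.inFlight) {
      costUsageCache.delete(key);
    }
  }
  // Phase 2 (FIFO evict): remove oldest keys (insertion order) while over the cap.
  while (costUsageCache.size > MAX_COST_USAGE_CACHE_ENTRIES) {
    const oldestKey = costUsageCache.keys().next().value;
    if (oldestKey === undefined) break;
    costUsageCache.delete(oldestKey);
  }
}
```

Running the TTL phase first means the FIFO phase only has to evict entries that are still fresh, which should be rare under the polling pattern described in the issue.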

Impact

  • Bounds the cache to at most 100 entries (generous headroom for any realistic usage pattern)
  • Follows the same eviction pattern as sessionTitleFieldsCache and resolvedSessionKeyByRunId in the same codebase
  • No behavioral change for the read path — TTL semantics are preserved
  • Single file change, zero risk to existing functionality

Fixes #68841

Changed files

  • src/gateway/server-methods/usage.ts (modified, +25/-0)

PR #68905: fix(gateway): add lazy eviction to costUsageCache

Description (problem / solution / changelog)

Summary

costUsageCache uses a TTL (COST_USAGE_CACHE_TTL_MS = 30_000) to skip stale entries on the read path, but never deletes them. Since cache keys are derived from (startMs, endMs), each new day produces new keys while old entries accumulate indefinitely, causing unbounded memory growth.

Fix

Add lazy eviction: when a stale entry is detected on the read path, delete it from the Map. This prevents unbounded growth while keeping the implementation simple and avoiding the need for a separate cleanup timer.
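
A minimal sketch of lazy eviction on the read path is shown below. The helper name `readFreshSummary` and the simplified entry type are hypothetical; the actual change in this PR is only a few lines inside the existing read path and is not reproduced here.

```typescript
// Simplified, assumed entry shape for illustration.
type CachedSummary = { summary: unknown; updatedAt: number };

const COST_USAGE_CACHE_TTL_MS = 30_000;
const costUsageCache = new Map<string, CachedSummary>();

// Hypothetical read helper: returns a fresh summary, or undefined on a miss.
function readFreshSummary(cacheKey: string, now: number): unknown {
  const cached = costUsageCache.get(cacheKey);
  if (!cached) return undefined;
  if (now - cached.updatedAt < COST_USAGE_CACHE_TTL_MS) {
    return cached.summary; // still within the TTL: serve from cache
  }
  // Stale: delete the entry instead of merely ignoring it.
  costUsageCache.delete(cacheKey);
  return undefined;
}
```

In this sketch an entry is removed only when its own key is read again after expiry. A key that is never requested again, such as one derived from a previous day's range, would stay in the Map.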

Testing

  • Stale entries are now evicted on read
  • Fresh entries continue to be served from cache
  • No behavioral change for valid cache hits

Closes #68841

Changed files


PR #68913: fix(gateway): add lazy eviction to costUsageCache

Description (problem / solution / changelog)

costUsageCache uses a TTL to skip stale entries on read, but never deletes them. Since cache keys are derived from (startMs, endMs), each new day produces new keys while old entries accumulate indefinitely. Add lazy eviction on read to prevent unbounded memory growth.

Closes #68841

Changed files

  • src/gateway/server-methods/usage.ts (modified, +4/-0)

PR #68974: fix(gateway): add lazy eviction to costUsageCache

Description (problem / solution / changelog)

Delete stale entries on read.

Closes #68841

Changed files

RAW_BUFFER

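Summary

gateway/usage: costUsageCache has no cap/prune, so an entry accumulates permanently for every distinct (startMs, endMs)

Common pattern

Single CAND based on a single FIND. costUsageCache: Map<string, CostUsageCacheEntry>, declared at src/gateway/server-methods/usage.ts:65, stores CostUsageSummary results keyed by (startMs, endMs). The TTL (COST_USAGE_CACHE_TTL_MS = 30_000) is used only to ignore stale values on the read path; in production there is no path that removes entries.

Related FIND

  • FIND-gateway-memory-001: parseDateRange is based on getTodayStartMs(now, ...), so a new cacheKey is generated every day. Entries under earlier keys remain indefinitely. The Map grows in proportion to how often the operator dashboard is shown.

Evidence locations

  • Declaration: src/gateway/server-methods/usage.ts:65
  • Leak path: src/gateway/server-methods/usage.ts:302-352
  • test-only clear: src/gateway/server-methods/usage.ts:365 (__test.costUsageCache.clear())
  • Contrast (same file, correct eviction): the oldest-first FIFO of resolvedSessionKeyByRunId at L22-30, and the while-loop FIFO evict of sessionTitleFieldsCache at L63-69

Impact

  • impact_hypothesis: memory-growth (slow leak)
  • 30 days of operation ≈ 30 stale entries. CostUsageSummary is a session aggregation result, so its size depends on scale.
  • Not an immediate OOM, but heap drift on servers that run for a long time (several months).
  • P3: accumulation is slow, and entry size depends on session scale.

Remediation direction (suggestion only)

Refer to the sessionTitleFieldsCache pattern in the same file (MAX + while-loop FIFO evict + optional TTL prune-on-set). The concrete implementation is left to the SOL stage.

Counter-evidence notes

  • Not confirmed whether the config-reload.ts path resets the cache (stated in the FIND self-check).
  • The size of loadCostUsageSummary results was not profiled; if session scale is small, this could be downgraded to P4.

Related Finding details

1. costUsageCache has no cap/prune, so an entry accumulates permanently for every distinct (startMs, endMs)

  • File: src/gateway/server-methods/usage.ts:302-352
  • Symptom type: memory-leak
  • Expected impact: memory-growth. Quantitative upper bound (no production measurements; model-based):
  • Number of distinct (startMs, endMs) entries ≈ (days of operation) × (number of date range options exposed in the UI, e.g. 1 day / 7 days / 30 days) × (number of UTC offset combinations).
  • For a single operator over one year of operation, roughly 1,000 to 3,000 entries. The CostUsageSummary in each entry is an aggregation result proportional to the number of sessions, estimated at 10 to 100 KB at production session scale.
  • Total ≈ tens of MB. It shows up as long-term heap growth (slow leak) rather than progressing to OOM.
  • GC-hostile: the entries are held by the Map, so even a major GC does not reclaim them.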
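
The model-based estimate can be worked through as simple arithmetic. All inputs below are illustrative assumptions (one year, three range options, one UTC offset, and 10 KB to 100 KB per entry as bounds), not measurements; `estimateEntries` is a hypothetical helper.

```typescript
// entries ≈ days of operation × date range options × UTC offset combinations
function estimateEntries(days: number, rangeOptions: number, utcOffsets: number): number {
  return days * rangeOptions * utcOffsets;
}

const entries = estimateEntries(365, 3, 1); // 1095 entries after one year
const lowMb = (entries * 10) / 1024; // about 10.7 MB at 10 KB per entry
const highMb = (entries * 100) / 1024; // about 106.9 MB at 100 KB per entry
```

Under these assumptions the retained heap lands between roughly 11 MB and 107 MB after a year, which is consistent with a slow leak rather than an immediate OOM.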
<details><summary>Evidence / mechanism / root cause</summary>

costUsageCache has no cap/prune, so an entry accumulates permanently for every distinct (startMs, endMs)

Problem

costUsageCache is a module-level Map that stores CostUsageSummary results keyed by the (startMs, endMs) pair. The TTL (COST_USAGE_CACHE_TTL_MS = 30_000) is used only to ignore stale values on the read path, and no production path removes entries from the Map. When an operator opens the usage dashboard each day, todayStartMs changes, a new cacheKey accumulates, and the previous keys remain forever.

Trigger mechanism

  1. The operator / CLI calls the usage.cost or sessions.usage RPC.
  2. parseDateRange computes (startMs, endMs) relative to getTodayStartMs(now, interpretation). The default is last-30-days, combined with the optional arguments days / startDate / endDate / utcOffset.
  3. loadCostUsageSummaryCached(params) builds cacheKey = ${params.startMs}-${params.endMs} and then calls costUsageCache.get.
  4. On a miss it creates a new promise and calls costUsageCache.set(cacheKey, { inFlight }) (L346); after the promise resolves it calls costUsageCache.set(cacheKey, { summary, updatedAt }) (L328). Even on success the entry stays in the Map.
  5. The next day todayStartMs shifts by one day, so the previous key is no longer updated or deleted. The Map grows monotonically.
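
The steps above can be simulated in a few lines. This is a sketch under stated assumptions: `todayStartUtcMs` and `defaultRange` are hypothetical stand-ins for getTodayStartMs and parseDateRange, whose real signatures and range arithmetic are not shown in this issue.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumption: "today start" is the UTC midnight at or before `now`.
function todayStartUtcMs(now: number): number {
  return Math.floor(now / DAY_MS) * DAY_MS;
}

// Assumption: the default range covers the last `days` days, ending at the end of today.
function defaultRange(now: number, days = 30): { startMs: number; endMs: number } {
  const todayStart = todayStartUtcMs(now);
  return { startMs: todayStart - (days - 1) * DAY_MS, endMs: todayStart + DAY_MS - 1 };
}

const cache = new Map<string, { updatedAt: number }>();
const firstCall = Date.UTC(2026, 0, 1, 12, 0, 0);

// Poll three times a day (45s apart) for 30 days, with no eviction path.
for (let day = 0; day < 30; day++) {
  for (let poll = 0; poll < 3; poll++) {
    const now = firstCall + day * DAY_MS + poll * 45_000;
    const { startMs, endMs } = defaultRange(now);
    // Same-day polls overwrite one key; each new day adds a new key.
    cache.set(`${startMs}-${endMs}`, { updatedAt: now });
  }
}
```

After the loop `cache.size` is 30: polls within one day reuse a single key, while each day rollover leaves the previous key behind.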

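Root cause analysis

The result of loadCostUsageSummary is a structurally large object containing per-session token/cost aggregates (CostUsageSummary, src/agents/usage.ts). A cache that stores it per key is memory-sensitive. Compared with the other caches in the same file (e.g. the while-loop FIFO evict of sessionTitleFieldsCache at L63-69; the oldest-first evict of resolvedSessionKeyByRunId at L22-30), only the usage cache lacks a cap/evict. The design appears to have mistaken the TTL = 30s read-invalidation for a "short-lived cache", but the key space expands over time, so stale values are merely not read; they still remain in the Map.

Impact

  • Impact type: memory-growth (slow leak).
  • Observation: process heap increases over time. Not an immediate OOM.
  • Reproduction: call usage.cost daily or weekly while varying the days value; after 30 days Map.size ≥ 30.
  • severity P3: accumulation is slow and entry size depends on session scale. A problem on servers that run for several months to a year or more.

Counter-evidence search

Category 1 (is there already a cleanup?): searched with R-3 Grep for costUsageCache.(delete|clear|evict|splice|shift|pop). No production delete path. __test.costUsageCache.clear() is test-only. A same-key overwrite does not reduce the number of entries.

Category 2 (external boundary mechanisms): the intervals in server-maintenance.ts (dedupe cleanup, tick, health) do not touch this cache. The Map is released when the server exits, but entries pile up until the process restarts. There is no clear path in graceful shutdown.

Category 3 (call context): usage.cost / sessions.usage are called from the usage dashboard in the operator UI. A typical operator accesses it at least weekly. If an automatically polling client exists, accumulation is faster.

Category 4 (existing tests): usage.test.ts resets the cache with __test.costUsageCache.clear() at line 27. There is no test for the accumulation scenario.

Category 5 (comments/intent): the file contains no "unbounded" or "intentional" comment. This appears to be an oversight.

Primary-path inversion: for "entries do not accumulate" to be true, every request would have to use the same cacheKey, or the process would have to restart at short intervals. todayStartMs changes daily and days/utcOffset vary, so this does not hold.

Self-check

Evidence I am confident in

  • Confirmed src/gateway/server-methods/usage.ts:65 and 302-352 with Read. No delete/prune path.
  • Confirmed 0 matches for a production delete path with R-3 Grep.
  • Compared with the cap/FIFO implementations of the other caches in the same file (resolvedSessionKeyByRunId L22-30, sessionTitleFieldsCache L63-69); only this cache lacks one.

Assumptions I made

  • The estimate that CostUsageSummary size is on the order of a few KB per session (no production measurements).
  • The assumption that process uptime is several weeks to several months. Reasonable in launchd / systemd environments.

Unverified items that could affect the assessment

  • The actual size of the loadCostUsageSummary result object was not profiled. With few sessions this could drop to P4.
  • Whether this Map is reset on the config reload path: config-reload.ts was not traced directly. If it is cleared on reload, severity decreases further.
  • There are no metrics on how often operators actually call usage.cost.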
</details>

<sub>This issue was generated by the openclaw-audit local reliability audit pipeline. The reproduction test and the fix are included in a separate PR.</sub>

<!-- openclaw-audit: cand=CAND-014 fingerprints=544464e55f3f at=2026-04-19T06:30:39+00:00 -->

extent analysis

TL;DR

Implement a cap and prune mechanism for the costUsageCache to prevent memory growth due to the accumulation of distinct (startMs, endMs) entries.

Guidance

  • Review the sessionTitleFieldsCache pattern in the same file for a possible implementation reference, which uses a MAX limit and a while-loop FIFO evict mechanism.
  • Consider adding a TTL prune-on-set mechanism to remove stale entries from the costUsageCache.
  • Investigate the actual size of CostUsageSummary objects to better understand the memory impact and adjust the cap accordingly.
  • Monitor the frequency of usage.cost calls by operators to assess the potential impact on memory growth.

Example

// Example of a simple cap-and-evict mechanism (FIFO)
const MAX_CACHE_SIZE = 1000; // Adjust based on memory constraints
const costUsageCache = new Map();

// ...

// When adding a new entry: a Map iterates in insertion order, so the first
// key is the oldest. Evict only when the key is new, because overwriting an
// existing key does not grow the Map.
if (!costUsageCache.has(cacheKey) && costUsageCache.size >= MAX_CACHE_SIZE) {
  const oldestKey = costUsageCache.keys().next().value;
  if (oldestKey !== undefined) costUsageCache.delete(oldestKey);
}
costUsageCache.set(cacheKey, { summary, updatedAt });

Notes

  • The actual implementation should consider the specific requirements and constraints of the costUsageCache.
  • The example provided is a simplified illustration and may need to be adapted to the existing codebase.
  • Further investigation is needed to determine the optimal cap size and prune mechanism for the costUsageCache.

Recommendation

Apply a workaround by implementing a cap and prune mechanism for the costUsageCache, as the current implementation lacks a mechanism to remove stale entries, leading to memory growth.
