openclaw - ✅(Solved) Fix Bug: agent-job agentRunCache grows unbounded under sustained run fan-out [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#77976 (fetched 2026-05-06 06:18:29)
Comments: 1
Participants: 2
Timeline: 2 (commented ×1, cross-referenced ×1)
Reactions: 2

Fix Action

Fixed

PR fix notes

PR #77973: fix(gateway): cap agentRunCache to prevent unbounded growth under run fan-out

Description (problem / solution / changelog)

Closes #77976

Summary

src/gateway/server-methods/agent-job.ts keeps an in-process agentRunCache: Map<runId, AgentRunSnapshot> populated by every terminal lifecycle event. There is a 10-minute TTL, pruned on every set via pruneAgentRunCache, but no max-size cap. The TTL bounds entry age, not entry count: under sustained subagent fan-out (lots of agent.run lifecycle traffic), the snapshot map held "all runs in the last 10 minutes", so its size grew in lockstep with run rate. Same shape as the Discord REST entity cache fixed in #77952.

This change adds:

  • AGENT_RUN_CACHE_MAX_ENTRIES (default 5,000) and an enforceAgentRunCacheMaxEntries pass that runs after each agentRunCache.set, dropping oldest entries by Map insertion order until size ≤ max.
  • __testing.getAgentRunCacheSize(), __testing.resetAgentRunCache(), and __testing.agentRunCacheMaxEntries so the cap is regression-testable.
  • A regression test in server-methods.test.ts that drives MAX + 25 unique end-lifecycle events through the real emitAgentEvent path and asserts size pins at the cap.
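The cap described in the first bullet can be sketched as follows. This is a minimal illustration, not the patched module: the constant name, function name, and default of 5,000 come from the PR text, while the AgentRunSnapshot fields and the recordAgentRunSnapshot helper are hypothetical stand-ins.

```typescript
// Hypothetical snapshot shape; the real AgentRunSnapshot may differ.
type AgentRunSnapshot = { runId: string; outcome: "ok" | "error"; ts: number };

const AGENT_RUN_CACHE_MAX_ENTRIES = 5_000;
const agentRunCache = new Map<string, AgentRunSnapshot>();

// Drop the oldest entries (a Map iterates in insertion order) until size <= max.
function enforceAgentRunCacheMaxEntries(
  cache: Map<string, AgentRunSnapshot>,
  max: number = AGENT_RUN_CACHE_MAX_ENTRIES,
): void {
  while (cache.size > max) {
    const oldest = cache.keys().next();
    if (oldest.done) break;
    cache.delete(oldest.value);
  }
}

// Hypothetical record path: insert first, then enforce the cap.
function recordAgentRunSnapshot(snapshot: AgentRunSnapshot): void {
  agentRunCache.set(snapshot.runId, snapshot);
  enforceAgentRunCacheMaxEntries(agentRunCache);
}

// Mirror the regression test's shape: MAX + 25 unique run ids.
for (let i = 0; i < AGENT_RUN_CACHE_MAX_ENTRIES + 25; i++) {
  recordAgentRunSnapshot({ runId: `run-${i}`, outcome: "ok", ts: Date.now() });
}

console.log(agentRunCache.size); // 5000
console.log(agentRunCache.has("run-0"), agentRunCache.has("run-25")); // false true
```

Because each insert adds at most one entry, the while loop runs at most once per call in steady state, so the cap adds constant work per set.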

The sibling Maps in this file (agentRunStarts, agentRunWaiterCounts, pendingAgentRunErrors, pendingAgentRunTimeouts) are already lifecycle-bound via start/end/wait-register paths; only the terminal-snapshot cache lacked a hard cap.

Verification

  • pnpm test src/gateway/server-methods/server-methods.test.ts → 82/82 pass, including the new test "caps agentRunCache at AGENT_RUN_CACHE_MAX_ENTRIES via FIFO drop".
  • pnpm exec oxfmt --check --threads=1 src/gateway/server-methods/agent-job.ts src/gateway/server-methods/server-methods.test.ts CHANGELOG.md reports clean.
  • Live tsx runtime proof included below in Real behavior proof.

Real behavior proof

  • Behavior addressed: agentRunCache had a TTL but no hard cap; under sustained run fan-out it could hold every run snapshot from the past 10 minutes simultaneously, scaling with run rate × 10 min and capped only by process restart or quiet periods.
  • Real environment tested: local Node v22 runtime against the real patched agent-job.ts module (no mocks, no test framework). Driven via pnpm exec tsx against the actual emitAgentEvent lifecycle path that production code uses.
  • Exact steps or command run after this patch: ran /tmp/agent-job-cache-proof.mts which imports the real patched module and emitAgentEvent, then drives MAX + 50 = 5,050 unique terminal lifecycle events through the real listener and prints __testing.getAgentRunCacheSize() at intervals.
  • Evidence after fix: live console output captured directly from the node runtime:
agentRunCacheMaxEntries: 5000
emitting 5050 unique end-lifecycle events...
  emitted     0  cacheSize=1
  emitted  1000  cacheSize=1001
  emitted  2000  cacheSize=2001
  emitted  3000  cacheSize=3001
  emitted  4000  cacheSize=4001
  emitted  5000  cacheSize=5000
final cacheSize after 5050 unique runs: 5000
expected ≤ 5000: PASS
  • Observed result after fix: cache grows linearly to 5,000, then enforceAgentRunCacheMaxEntries pins it at exactly 5,000 across the remaining 50 inserts. Oldest snapshots are evicted by Map insertion order, which is FIFO for unique-runId floods. For repeat runs, note that a JavaScript Map keeps an existing key at its original position when it is set again, so the policy is close to LRU only if the record path deletes the key before re-inserting it; otherwise it stays FIFO by first insertion.
  • What was not tested: behavior against a real long-running gateway with thousands of concurrent subagents. The fix is purely a Map cap on a deterministic insert path and the runtime demo exercises the same listener / record path that production runs through.
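The insertion-order behavior that the eviction policy relies on can be checked in isolation. The snippet below uses only the standard Map API; the run ids and values are made up for illustration.

```typescript
const snapshots = new Map<string, string>([
  ["run-a", "ok"],
  ["run-b", "ok"],
  ["run-c", "ok"],
]);

// Overwriting an existing key updates the value but keeps its original position.
snapshots.set("run-a", "error");
console.log(Array.from(snapshots.keys())); // [ 'run-a', 'run-b', 'run-c' ]

// A FIFO drop therefore removes run-a, even though it was written most recently.
const head = snapshots.keys().next().value;
if (head !== undefined) snapshots.delete(head);
console.log(Array.from(snapshots.keys())); // [ 'run-b', 'run-c' ]
```

For a flood of unique run ids this distinction does not matter, since no key is ever set twice.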

Notes for reviewer

  • The TTL prune (pruneAgentRunCache) and the new cap (enforceAgentRunCacheMaxEntries) compose: TTL prunes before insert, cap enforces after insert. Either alone is insufficient — TTL doesn't bound under sustained insert rate, cap doesn't free expired-but-still-inside-cap entries.
  • Cap value 5,000 was chosen to mirror the Discord cache cap precedent and gives ~10 minutes of headroom at ~8 runs/sec sustained terminal events, which is generously above any realistic single-gateway run rate.
  • This is part of a small audit sweep following #77952; same shape, different module.
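How the two mechanisms compose can be sketched with an injectable clock. This is an assumption-laden illustration: createRunCache, its record method, and the Entry shape are hypothetical and are not taken from agent-job.ts; only the ordering (TTL prune before insert, cap after insert) follows the note above.

```typescript
type Entry = { runId: string; ts: number };

// Hypothetical factory combining a TTL prune and a max-entries cap.
function createRunCache(ttlMs: number, maxEntries: number) {
  const entries = new Map<string, Entry>();
  return {
    record(runId: string, now: number): void {
      // 1. TTL prune before insert: frees expired entries even when under the cap.
      for (const [key, entry] of entries) {
        if (now - entry.ts > ttlMs) entries.delete(key);
      }
      // 2. Insert the new terminal snapshot.
      entries.set(runId, { runId, ts: now });
      // 3. Cap after insert: bounds size even when nothing has expired yet.
      while (entries.size > maxEntries) {
        const oldest = entries.keys().next();
        if (oldest.done) break;
        entries.delete(oldest.value);
      }
    },
    size: (): number => entries.size,
  };
}

const TTL_MS = 10 * 60 * 1000;
const runCache = createRunCache(TTL_MS, 5_000);

// Burst: 6,000 runs at the same instant. A TTL alone would keep all 6,000.
for (let i = 0; i < 6_000; i++) runCache.record(`burst-${i}`, 1_000);
const sizeAfterBurst = runCache.size();
console.log(sizeAfterBurst); // 5000

// Quiet period: one run 11 minutes later. A cap alone would keep 5,000 stale entries.
runCache.record("late", 1_000 + 11 * 60 * 1000);
const sizeAfterQuiet = runCache.size();
console.log(sizeAfterQuiet); // 1
```

The full scan in step 1 mirrors a prune that runs on every set; a production implementation might stop at the first unexpired entry instead.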

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/gateway/server-methods/agent-job.ts (modified, +24/-0)
  • src/gateway/server-methods/server-methods.test.ts (modified, +17/-1)

Problem

src/gateway/server-methods/agent-job.ts keeps an in-process agentRunCache: Map<runId, AgentRunSnapshot> populated by every terminal lifecycle event. There is a 10-minute TTL pruned on every set via pruneAgentRunCache, but no max-size cap.

Under sustained subagent fan-out (lots of agent.run lifecycle traffic), the snapshot map grows in lockstep with run rate up to "all runs in the last 10 minutes" before TTL prune can bound it. Same shape as the Discord REST entity cache leak (#77975).

Sibling Maps in this file (agentRunStarts, agentRunWaiterCounts, pendingAgentRunErrors, pendingAgentRunTimeouts) are already lifecycle-bound; only the terminal-snapshot cache lacks a hard cap.
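To make the growth pattern concrete, the following simulation feeds a TTL-only cache at a constant rate and reports the peak entry count. The function name, the chosen rates, and the 15-minute duration are illustrative assumptions; the 10-minute TTL is the only value taken from the issue.

```typescript
const TTL_MS = 10 * 60 * 1000;

// Simulate a TTL-only cache fed at a constant rate; return the peak entry count.
function peakSizeTtlOnly(runsPerSecond: number, durationMs: number): number {
  const intervalMs = 1000 / runsPerSecond;
  const cache = new Map<number, number>(); // run index -> insertion timestamp
  let peak = 0;
  for (let i = 0; i * intervalMs < durationMs; i++) {
    const now = i * intervalMs;
    // Prune expired entries from the head (insertion order equals time order here).
    for (const [key, ts] of cache) {
      if (now - ts < TTL_MS) break;
      cache.delete(key);
    }
    cache.set(i, now);
    peak = Math.max(peak, cache.size);
  }
  return peak;
}

const FIFTEEN_MINUTES_MS = 15 * 60 * 1000;
for (const rate of [1, 8, 50]) {
  console.log(`${rate} runs/sec -> peak ${peakSizeTtlOnly(rate, FIFTEEN_MINUTES_MS)} entries`);
}
// 1 runs/sec -> peak 600 entries
// 8 runs/sec -> peak 4800 entries
// 50 runs/sec -> peak 30000 entries
```

The peak equals run rate × TTL in every case, which is why a TTL by itself does not give a size bound that is independent of traffic.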

Tracking PR

Fix in #77973.

extent analysis

TL;DR

Implement a max-size cap for the agentRunCache to prevent unbounded growth.

Guidance

  • Review the existing pruneAgentRunCache function to understand the current TTL-based pruning mechanism and consider how to integrate a size-based cap.
  • Investigate the sibling maps (agentRunStarts, agentRunWaiterCounts, pendingAgentRunErrors, pendingAgentRunTimeouts) to determine if their lifecycle-bound approach can be applied or adapted for agentRunCache.
  • Consider implementing a Least Recently Used (LRU) eviction policy or a simple size-based removal strategy to maintain a reasonable cache size.
  • Evaluate the trade-offs between cache size, performance, and data retention requirements to determine an appropriate max-size cap.
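For comparison with the size-based removal strategy, the LRU option mentioned above could look like the sketch below. LruMap is a hypothetical class written for this example; it is not what PR #77973 implements, which drops entries by insertion order.

```typescript
// Hypothetical bounded LRU built on Map insertion order.
class LruMap<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so the key becomes the most recently used (tail) entry.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    // Delete first so an overwrite also moves the key to the tail.
    this.entries.delete(key);
    this.entries.set(key, value);
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next();
      if (oldest.done) break;
      this.entries.delete(oldest.value);
    }
  }

  has(key: K): boolean {
    return this.entries.has(key);
  }

  get size(): number {
    return this.entries.size;
  }
}

const lru = new LruMap<string, number>(2);
lru.set("a", 1);
lru.set("b", 2);
lru.get("a");    // "a" is now most recently used
lru.set("c", 3); // evicts "b", the least recently used
console.log(lru.has("a"), lru.has("b"), lru.has("c")); // true false true
```

LRU costs an extra delete and set on every read; for a cache of terminal snapshots keyed by unique run ids, plain insertion-order eviction gives the same result with less bookkeeping.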

Notes

The fix should be guided by the principles applied to the sibling maps and the existing TTL pruning mechanism, ensuring consistency and effectiveness in managing the cache size.

Recommendation

Apply the fix from #77973: add a max-size cap to agentRunCache, alongside the existing TTL prune, so the cache cannot grow without bound under sustained run fan-out.
