Fix Action

Fixed

Fixed by PR: fix(gateway): cap agentRunCache to prevent unbounded growth under run fan-out (https://github.com/openclaw/openclaw/pull/77973)

PR fix notes

PR #77973: fix(gateway): cap agentRunCache to prevent unbounded growth under run fan-out

fede-kamel · 2026-05-05T17:39:03Z

Repository: openclaw/openclaw
Author: fede-kamel
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/77973

Description (problem / solution / changelog)

Closes #77976

Summary

src/gateway/server-methods/agent-job.ts keeps an in-process agentRunCache: Map<runId, AgentRunSnapshot> populated by every terminal lifecycle event. There is a 10-minute TTL pruned on every set via pruneAgentRunCache, but no max-size cap. Under sustained subagent fan-out (lots of agent.run lifecycle traffic), the snapshot map grew in lockstep with run rate up to "all runs in the last 10 minutes" before the next TTL sweep could bound it. Same shape as the Discord REST entity cache fixed in #77952.
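The growth described above is simple arithmetic: with only a TTL, the map holds roughly run rate × 10 minutes of snapshots. A rough sketch of that estimate follows, assuming every run has a unique runId and arrives at a steady rate. The function names and sample rates are illustrative only and are not taken from the PR.

```typescript
const TTL_SECONDS = 10 * 60; // 10-minute TTL from the description
const MAX_ENTRIES = 5_000; // cap added by this change

// Entries held at steady state with TTL only: every run from the last 10 minutes.
function entriesWithTtlOnly(runsPerSecond: number): number {
  return Math.round(runsPerSecond * TTL_SECONDS);
}

// With the cap in place, the TTL-only figure is clamped to MAX_ENTRIES.
function entriesWithCap(runsPerSecond: number): number {
  return Math.min(MAX_ENTRIES, entriesWithTtlOnly(runsPerSecond));
}

// Rate above which the cap, rather than the TTL, becomes the binding limit.
const breakEvenRunsPerSecond = MAX_ENTRIES / TTL_SECONDS; // about 8.33

// Illustrative rates (assumed, not measured).
for (const rate of [2, 8, 50]) {
  console.log(rate, entriesWithTtlOnly(rate), entriesWithCap(rate));
}
```

At 50 runs/sec the TTL-only estimate is 30,000 live snapshots, while the capped cache stays at 5,000.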

This change adds:

AGENT_RUN_CACHE_MAX_ENTRIES (default 5,000) and an enforceAgentRunCacheMaxEntries pass that runs after each agentRunCache.set, dropping oldest entries by Map insertion order until size ≤ max.
__testing.getAgentRunCacheSize(), __testing.resetAgentRunCache(), and __testing.agentRunCacheMaxEntries so the cap is regression-testable.
A regression test in server-methods.test.ts that drives MAX + 25 unique end-lifecycle events through the real emitAgentEvent path and asserts size pins at the cap.
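A minimal sketch of what a cap pass of this kind might look like. The constant and function names come from the description above, but the bodies, the snapshot fields, and the recordAgentRunSnapshot helper are assumptions made for illustration, not the code in agent-job.ts.

```typescript
// Hypothetical stand-in for the real snapshot type; the fields are assumptions.
type AgentRunSnapshot = { runId: string; status: "ok" | "error"; ts: number };

const AGENT_RUN_CACHE_MAX_ENTRIES = 5_000;
const agentRunCache = new Map<string, AgentRunSnapshot>();

// Drop the oldest entries until size <= max. A Map iterates in insertion
// order, so the first key returned by keys() is the oldest one.
function enforceAgentRunCacheMaxEntries(max: number = AGENT_RUN_CACHE_MAX_ENTRIES): void {
  while (agentRunCache.size > max) {
    const oldest = agentRunCache.keys().next();
    if (oldest.done) break;
    agentRunCache.delete(oldest.value);
  }
}

// Hypothetical insert path: set first, then enforce the cap.
function recordAgentRunSnapshot(snapshot: AgentRunSnapshot): void {
  agentRunCache.set(snapshot.runId, snapshot);
  enforceAgentRunCacheMaxEntries();
}

// Same shape as the regression test: MAX + 25 unique runIds.
for (let i = 0; i < AGENT_RUN_CACHE_MAX_ENTRIES + 25; i++) {
  recordAgentRunSnapshot({ runId: `run-${i}`, status: "ok", ts: Date.now() });
}
console.log(agentRunCache.size); // 5000
```

In this sketch the 25 oldest runIds (run-0 through run-24) are the ones evicted, which is the FIFO behavior the regression test name refers to.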

The sibling Maps in this file (agentRunStarts, agentRunWaiterCounts, pendingAgentRunErrors, pendingAgentRunTimeouts) are already lifecycle-bound via start/end/wait-register paths; only the terminal-snapshot cache lacked a hard cap.

Verification

pnpm test src/gateway/server-methods/server-methods.test.ts → 82/82 pass, including the new test "caps agentRunCache at AGENT_RUN_CACHE_MAX_ENTRIES via FIFO drop".
pnpm exec oxfmt --check --threads=1 src/gateway/server-methods/agent-job.ts src/gateway/server-methods/server-methods.test.ts CHANGELOG.md → clean.
Live tsx runtime proof included below in Real behavior proof.

Real behavior proof

Behavior addressed: agentRunCache had a TTL but no hard cap; under sustained run fan-out it could hold every run snapshot from the past 10 minutes simultaneously, scaling with run rate × 10 min and capped only by process restart or quiet periods.
Real environment tested: local Node v22 runtime against the real patched agent-job.ts module (no mocks, no test framework). Driven via pnpm exec tsx against the actual emitAgentEvent lifecycle path that production code uses.
Exact steps or command run after this patch: ran /tmp/agent-job-cache-proof.mts which imports the real patched module and emitAgentEvent, then drives MAX + 50 = 5,050 unique terminal lifecycle events through the real listener and prints __testing.getAgentRunCacheSize() at intervals.
Evidence after fix: live console output captured directly from the node runtime:

agentRunCacheMaxEntries: 5000
emitting 5050 unique end-lifecycle events...
  emitted     0  cacheSize=1
  emitted  1000  cacheSize=1001
  emitted  2000  cacheSize=2001
  emitted  3000  cacheSize=3001
  emitted  4000  cacheSize=4001
  emitted  5000  cacheSize=5000
final cacheSize after 5050 unique runs: 5000
expected ≤ 5000: PASS

Observed result after fix: cache grows linearly to 5,000, then enforceAgentRunCacheMaxEntries pins it at exactly 5,000 across the remaining 50 inserts. Oldest snapshots are evicted by Map insertion order, which is FIFO for unique-runId floods. For a repeated runId, Map.set on an existing key updates the value but keeps the key's original position, so eviction is close to LRU for repeat runs only if the key is deleted before it is set again.
What was not tested: behavior against a real long-running gateway with thousands of concurrent subagents. The fix is purely a Map cap on a deterministic insert path and the runtime demo exercises the same listener / record path that production runs through.
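The eviction order discussed above rests on how a JavaScript Map tracks insertion order. The following snippet shows only standard Map behavior and is independent of the gateway code. The run-a, run-b, and run-c keys are made up for the example.

```typescript
// Plain Map semantics; nothing here comes from agent-job.ts.
const order = new Map<string, number>([
  ["run-a", 1],
  ["run-b", 2],
  ["run-c", 3],
]);

// Re-setting an existing key updates the value but keeps its position.
order.set("run-a", 10);
const afterPlainSet = Array.from(order.keys()); // ["run-a", "run-b", "run-c"]

// Deleting first and then setting moves the key to the tail.
order.delete("run-a");
order.set("run-a", 11);
const afterDeleteThenSet = Array.from(order.keys()); // ["run-b", "run-c", "run-a"]

console.log(afterPlainSet, afterDeleteThenSet);
```

So a cap that evicts from the head of the Map is strictly FIFO over first insertion unless the insert path deletes a key before setting it again.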

Notes for reviewer

The TTL prune (pruneAgentRunCache) and the new cap (enforceAgentRunCacheMaxEntries) compose: TTL prunes before insert, cap enforces after insert. Either alone is insufficient: the TTL does not bound size under a sustained insert rate, and the cap does not free entries that have expired but still fit inside the cap.
Cap value 5,000 was chosen to mirror the Discord cache cap precedent. It covers the full 10-minute TTL window at up to ~8 runs/sec of sustained terminal events (5,000 / 600 s ≈ 8.3), which is generously above any realistic single-gateway run rate.
This is part of a small audit sweep following #77952; same shape, different module.
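To illustrate how the two passes compose, here is a small self-contained sketch with a tiny cap and an explicit clock. All names, the strict `>` expiry comparison, and the cap of 3 are assumptions chosen for the demo and are not taken from agent-job.ts.

```typescript
type Entry = { ts: number };

const TTL_MS = 10 * 60 * 1000;
const MAX_ENTRIES = 3; // deliberately tiny so the demo is easy to follow
const cache = new Map<string, Entry>();

// TTL pass: drop entries older than TTL_MS (deleting during forEach is safe for Map).
function pruneExpired(now: number): void {
  cache.forEach((entry, key) => {
    if (now - entry.ts > TTL_MS) cache.delete(key);
  });
}

// Cap pass: drop oldest-inserted entries until size <= MAX_ENTRIES.
function enforceMaxEntries(): void {
  while (cache.size > MAX_ENTRIES) {
    const oldest = cache.keys().next();
    if (oldest.done) break;
    cache.delete(oldest.value);
  }
}

// TTL prunes before the insert, the cap enforces after it.
function record(key: string, now: number): void {
  pruneExpired(now);
  cache.set(key, { ts: now });
  enforceMaxEntries();
}

// Burst inside the TTL window: nothing has expired, so only the cap bounds the size.
["a", "b", "c", "d"].forEach((key) => record(key, 0));
const afterBurst = Array.from(cache.keys()); // ["b", "c", "d"]

// One insert after the TTL has elapsed: the cap alone would keep the stale
// entries (size is within the cap), but the TTL pass removes them.
record("e", TTL_MS + 1);
const afterQuietPeriod = Array.from(cache.keys()); // ["e"]

console.log(afterBurst, afterQuietPeriod);
```

The first phase shows the case the TTL cannot bound, and the second shows the case the cap cannot clean up.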

Changed files

CHANGELOG.md (modified, +1/-0)
src/gateway/server-methods/agent-job.ts (modified, +24/-0)
src/gateway/server-methods/server-methods.test.ts (modified, +17/-1)

Problem

Under sustained subagent fan-out (lots of agent.run lifecycle traffic), the snapshot map grows in lockstep with run rate up to "all runs in the last 10 minutes" before TTL prune can bound it. Same shape as the Discord REST entity cache leak (#77975).

Sibling Maps in this file (agentRunStarts, agentRunWaiterCounts, pendingAgentRunErrors, pendingAgentRunTimeouts) are already lifecycle-bound; only the terminal-snapshot cache lacks a hard cap.

Extent analysis

TL;DR

Implement a max-size cap for the agentRunCache to prevent unbounded growth.

Guidance

Review the existing pruneAgentRunCache function to understand the current TTL-based pruning mechanism and consider how to integrate a size-based cap.
Investigate the sibling maps (agentRunStarts, agentRunWaiterCounts, pendingAgentRunErrors, pendingAgentRunTimeouts) to determine if their lifecycle-bound approach can be applied or adapted for agentRunCache.
Consider implementing a Least Recently Used (LRU) eviction policy or a simple size-based removal strategy to maintain a reasonable cache size.
Evaluate the trade-offs between cache size, performance, and data retention requirements to determine an appropriate max-size cap.

Notes

The fix should be guided by the principles applied to the sibling maps and the existing TTL pruning mechanism, ensuring consistency and effectiveness in managing the cache size.

Recommendation

Apply workaround: Implement a max-size cap for the agentRunCache to prevent unbounded growth, as seen in the fix proposed in #77973, to address the potential cache leak issue.

openclaw - ✅(Solved) Fix Bug: agent-job agentRunCache grows unbounded under sustained run fan-out [1 pull requests, 1 comments, 2 participants]
