openclaw - ✅(Solved) Fix Bug: skills refresh-state workspaceVersions map retains entries after watcher teardown [1 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#77997Fetched 2026-05-06 06:18:04
View on GitHub
Comments
1
Participants
2
Timeline
2
Reactions
2
Timeline (top)
commented ×1cross-referenced ×1

Fix Action

Fix / Workaround

Drove the unpatched module against 5,000 distinct workspace dirs:

PR fix notes

PR #77998: fix(skills): clear workspaceVersions entry when skills watcher is disabled

Description (problem / solution / changelog)

Closes #77997

Summary

Two-commit fix that bounds the skills watcher subsystem to its working set:

Commit 1 (1eb1cbb8d1): clear workspaceVersions entry when a watcher is explicitly disabled.

src/agents/skills/refresh-state.ts:8 kept a module-level Map<workspaceDir, number> populated by every skills file-watch event but had no production removal path. The sibling watchers Map cleaned up on watchEnabled=false teardown, but workspaceVersions was forgotten. Adds clearSkillsSnapshotVersionForWorkspace and wires it into the existing teardown.

Commit 2 (376a9e0f03): evict idle chokidar watchers opportunistically on the next ensureSkillsWatcher call.

The bigger leak is watchers itself: ensureSkillsWatcher is called on every session update (auto-reply/reply/session-updates.ts:159) but never torn down on session end, so watchers accumulated one chokidar instance + one version entry per distinct workspaceDir over the gateway lifetime.

Adds lastEnsuredAt: number per watcher (refreshed on every ensureSkillsWatcher call) and an evictIdleSkillsWatchers sweep that runs at the end of every ensureSkillsWatcher invocation, disposing any watcher with lastEnsuredAt < now - 60min. Active sessions re-ensure on every turn so their watchers always stay fresh; only abandoned watchers (no session has touched them for >=1h) get torn down.

After both commits: watchers.size ≤ # workspaces with a session active in the last 1h, workspaceVersions ⊆ watchers as keysets.

Verification

  • pnpm test src/agents/skills/refresh.test.ts → 8/8 pass including 3 new cases (clears the per-workspace snapshot version when the watcher is disabled, evicts idle watchers (>1h since last ensure) on subsequent ensure calls, does not evict watchers refreshed within the idle TTL window).
  • pnpm exec oxfmt --check --threads=1 clean on touched files.
  • Live tsx runtime proof captured below.

Real behavior proof

  • Behavior addressed: workspaceVersions: Map<string, number> in src/agents/skills/refresh-state.ts:8 and watchers: Map<string, SkillsWatchState> in src/agents/skills/refresh.ts:36 accumulated one entry per distinct workspaceDir ever touched by a session. There was no public per-workspace clear API and no production teardown path beyond the rare watchEnabled=false config flip, so a long-lived gateway leaked one chokidar watcher (FDs + memory) and one version entry per distinct workspace forever.
  • Real environment tested: local Node v22 runtime against the patched refresh-state.ts module via pnpm exec tsx. No mocks, no test framework. Drove the public API (bumpSkillsSnapshotVersion, clearSkillsSnapshotVersionForWorkspace, getSkillsSnapshotVersion) with the same call shape production fires through ensureSkillsWatcher's debounced schedule callback.
  • Exact steps or command run after this patch: created /tmp/skills-version-cleanup-proof.mts that imports the real patched refresh-state.ts module, calls bumpSkillsSnapshotVersion for 5,000 distinct workspace dirs (matching the watcher-event path), reads getSkillsSnapshotVersion for each to confirm presence, then calls clearSkillsSnapshotVersionForWorkspace for each (the same call now wired into watcher teardown) and re-reads to confirm cleanup. Ran via pnpm exec tsx /tmp/skills-version-cleanup-proof.mts. Also drove the unpatched origin/main module the same way to demonstrate the leak shape.
  • Evidence after fix: live console output from the Node runtime.

Unpatched (origin/main) — leak demonstrated:

=== exported names ===
  bumpSkillsSnapshotVersion
  getSkillsSnapshotVersion
  registerSkillsChangeListener
  resetSkillsRefreshStateForTest
  setSkillsChangeListenerErrorHandler
  shouldRefreshSnapshotForVersion
per-workspace clear API present: false

=== after 5000 bumps + simulated teardown ===
workspaces still reporting non-zero version: 5000
workspaces back to zero (cleaned up):       0
fresh untouched workspace reports: 0

Patched — leak fixed:

=== before teardown: 5000 entries present ===
=== after teardown: 0 entries present ===
result: PASS
  • Observed result after fix: 5,000 distinct workspace entries created via the same call shape production fires on watcher events. After calling clearSkillsSnapshotVersionForWorkspace for each (the same call now wired into the watchEnabled=false teardown and the new idle-eviction sweep), all 5,000 entries are gone and getSkillsSnapshotVersion correctly reads 0. The new evictIdleSkillsWatchers invariant means the same cleanup happens automatically when watchers go idle for >=1h instead of requiring an explicit teardown event.
  • What was not tested: behavior against a real chokidar watcher on a real filesystem inside a long-lived gateway. The fix is a Map lifecycle wiring change on a deterministic add/remove path; the unit tests in refresh.test.ts cover the wiring against the same chokidar mock the rest of the file uses.

Correctness notes

  • getSkillsSnapshotVersion(workspaceDir) falls back to globalVersion when the entry is missing, and shouldRefreshSnapshotForVersion(cached, 0) returns cached > 0 — so removing an entry forces a refresh for any consumer holding a non-zero cached version. Conservative-correct.
  • The 1h idle TTL is calibrated for production usage: sessions re-call ensureSkillsWatcher on every turn, so practically only abandoned workspaces are evicted. If a session falls silent for >1h and a skills file changes during the silent window, the next turn re-attaches a fresh watcher and the global version bump path catches up.
  • The "watcher replaced because targets changed" path (line 168) is intentionally NOT routed through disposeSkillsWatcher — that path keeps the same workspaceDir watched, just with a different chokidar instance, so dropping the version entry would be an artificial regression.

Audit context

Part of the memory-leak audit sweep following #77952 (Discord cache), #77973 (agent-job cache), and #77987 (auth-rate-limit cap). Same theme: bind map lifetimes to the resources they describe.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/skills/refresh-state.ts (modified, +13/-0)
  • src/agents/skills/refresh.test.ts (modified, +81/-0)
  • src/agents/skills/refresh.ts (modified, +35/-6)

Code Example

=== exported names ===
  bumpSkillsSnapshotVersion
  getSkillsSnapshotVersion
  registerSkillsChangeListener
  resetSkillsRefreshStateForTest
  setSkillsChangeListenerErrorHandler
  shouldRefreshSnapshotForVersion
per-workspace clear API present: false

=== after 5000 bumps + simulated teardown ===
workspaces still reporting non-zero version: 5000
workspaces back to zero (cleaned up):       0
fresh untouched workspace reports: 0
RAW_BUFFERClick to expand / collapse

Problem

src/agents/skills/refresh.ts and src/agents/skills/refresh-state.ts accumulate two leaks under the existing call pattern:

  1. workspaceVersions: Map<string, number> in refresh-state.ts:8 — populated on every skills file-watch event via bumpSkillsSnapshotVersion({ workspaceDir }). No production removal path. Only resetSkillsRefreshStateForTest clears it (test-only).

  2. watchers: Map<string, SkillsWatchState> in refresh.ts:36 — populated on every ensureSkillsWatcher(workspaceDir) call. The only production teardown is the rare watchEnabled=false config path. The hot caller (auto-reply/reply/session-updates.ts:159) calls ensureSkillsWatcher on every session turn but never disposes on session end, so each distinct workspaceDir touched by any session adds a chokidar instance that lives for the gateway lifetime.

Why it leaks

Drove the unpatched module against 5,000 distinct workspace dirs:

=== exported names ===
  bumpSkillsSnapshotVersion
  getSkillsSnapshotVersion
  registerSkillsChangeListener
  resetSkillsRefreshStateForTest
  setSkillsChangeListenerErrorHandler
  shouldRefreshSnapshotForVersion
per-workspace clear API present: false

=== after 5000 bumps + simulated teardown ===
workspaces still reporting non-zero version: 5000
workspaces back to zero (cleaned up):       0
fresh untouched workspace reports: 0

Every entry persists; a fresh untouched workspace correctly reports 0, confirming the persistence is from real Map entries (not a global fallback).

Impact

  • Memory creep: ~150-600 bytes per workspaceVersions entry, plus full chokidar watcher (FDs + larger overhead) per watchers entry, scaling with distinct workspace dirs ever touched. Never freed until process restart.
  • FD pressure: each chokidar watcher holds inotify/fseventsd handles; on long-lived gateways managing many ephemeral sandboxes, this can hit FD limits before memory.
  • Lifecycle inconsistency: watchers cleans up on watchEnabled=false, workspaceVersions does not — both keyed by workspaceDir but different lifetimes.

Tracking PR

Fix in #77998. The PR adds (1) clearSkillsSnapshotVersionForWorkspace for safe per-workspace version cleanup, (2) disposeSkillsWatcher for unified teardown, and (3) opportunistic idle eviction (1h TTL) on every ensureSkillsWatcher call, so watchers.size ≤ # workspaces touched in the last 1h and workspaceVersions ⊆ watchers as keysets.

extent analysis

TL;DR

Apply the fixes from PR #77998 to address memory leaks caused by unclosed workspaceVersions and watchers maps.

Guidance

  • Review the refresh.ts and refresh-state.ts files to understand the current implementation of workspaceVersions and watchers maps.
  • Implement the clearSkillsSnapshotVersionForWorkspace function to safely clean up per-workspace version data.
  • Use the disposeSkillsWatcher function to unify teardown and prevent memory leaks.
  • Consider implementing opportunistic idle eviction with a 1-hour TTL to limit the size of the watchers map.

Example

No code snippet is provided as the issue does not contain sufficient information to generate a minimal example.

Notes

The provided PR #77998 seems to address the memory leak issues, but its implementation details are not available in the issue body. It is recommended to review the PR carefully before applying the fixes.

Recommendation

Apply the workaround by implementing the fixes from PR #77998, as it addresses the memory leaks and provides a unified teardown mechanism. This should help mitigate the memory creep and FD pressure issues.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING