openclaw - [Bug]: Gateway accumulates >12K read-only file descriptors on workspace memory/**; correlated with memory_search tool activity in 1 of 5 observed captures

openclaw, 2026-05-25 19:15:54

Across 5 observed captures, the gateway accumulated read-only REG-type FDs on .md files under <workspace>/memory/, reaching up to 12,591 memory-tree REGs / 12,761 total FDs. No FD release was observed in the 30s sampling windows, and the FD count stayed flat until process restart. In 1 of the 5 captures, a memory_search tool call was observed in the gateway log ~76 seconds before the FD count crossed the storm threshold.

Root Cause

External corroboration: #77750 (filed 2026-05-05) reports the same symptom (14,039 open FDs, of which 13,982 are .md files, plus EBADF on child_process.spawn) on 2026.5.3 upstream npm. That is a confounder-free reproduction from a different reporter, and it weights the local-patches confounder here close to "not the cause".

Update 2026-05-25 22:30 UTC — deterministic reproducer now available: a single authorized POST /tools/invoke memory_search call against a multi-thousand-.md workspace opens one FD per .md file in the memory tree, never released. Reproduced both on the deployed gateway that produced these captures (0 → 7,613 mem-tree REGs in ~150s, then 12,591 via natural traffic) and on clean upstream-main (HEAD 01c5ab8d13) against a synthetic 12,391-file workspace (0 → 12,392 REGs, flat after). See comment 4537621029 for the full setup, timing table, and second-call behaviour. The "Steps to reproduce" section below predates this finding and can be replaced by the curl-based recipe in that comment.

Bug type

Regression (worked before, now fails)

Beta release blocker

Summary

Steps to reproduce

No deterministic single-command repro is available; the leak is workload- and uptime-correlated. The closest-grounded reproduction sequence, derived from the 2026-05-25 storm B observation (13 min uptime to reproduce):

1. Run the gateway as a long-lived launchd daemon (KeepAlive), against a workspace whose memory/ directory contains hundreds-to-thousands of .md files across multiple subdirectories (including transcripts/, structured-md/, and *.archived/ siblings).
2. Run multiple agents performing coordinated cron / Matrix-channel work, with at least one agent configured to invoke the memory_search tool against the workspace.

3. After enough cumulative memory_search activity to traverse the tree (observed at 13 min uptime in storm B, 8h–17h uptime in earlier captures), inspect FD state:

GW_PID=$(pgrep -f "openclaw.*gateway" | head -1)
lsof -p $GW_PID | awk '$5=="REG" {print $NF}' | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -10

4. Observe many REG-type FDs under <workspace>/memory/** subdirectories, all in r mode, approximately one per .md file in the tree.
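
For anyone who prefers to post-process a saved lsof capture rather than run the pipeline live, the per-directory count can be sketched in Python. This is a minimal illustration, not part of the original tooling: the function name count_reg_fds_by_dir and the sample rows are made up for the example, and like awk's $NF it takes the last whitespace-separated field, so paths containing spaces would be truncated.

```python
from collections import Counter
import posixpath


def count_reg_fds_by_dir(lsof_text: str) -> Counter:
    """Count REG-type FDs per parent directory from `lsof -p <pid>` text."""
    counts: Counter = Counter()
    for line in lsof_text.splitlines():
        fields = line.split()
        # lsof columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
        if len(fields) < 9 or fields[4] != "REG":
            continue
        # Last field is the path (same limitation as awk's $NF for paths with spaces).
        counts[posixpath.dirname(fields[-1])] += 1
    return counts


# Synthetic sample rows (hypothetical PID, user, and paths), for illustration only.
sample = """\
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
node 4242 me 86r REG 1,7 2262 1001 /ws/memory/transcripts/a.md
node 4242 me 87r REG 1,7 1200 1002 /ws/memory/transcripts/b.md
node 4242 me 88r REG 1,7 900 1003 /ws/memory/structured-md/lessons/c.md
node 4242 me 12u IPv4 0x1 0t0 TCP 127.0.0.1:8080
"""

for directory, n in count_reg_fds_by_dir(sample).most_common(10):
    print(f"{n:6d} {directory}")
```

Feeding it the text of a real capture (for example the output of lsof -p redirected to a file) should give the same kind of per-directory ranking as the shell pipeline.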

Expected behavior

After read operations complete, FD count for the gateway process should not grow over time. On an earlier dist of the same gateway (a 2026-05-13-era build) the gateway was observed with ~250 total FDs after 1d 4h uptime under similar workload, with zero REG FDs into <workspace>/memory/**.

Actual behavior

In each of 5 captures, the gateway holds open REG-type FDs in r mode pointing to .md files under <workspace>/memory/. FD counts at moment of capture (total FDs / memory-tree REG FDs):

11,830 / 10,103 (workspace-A, deep tree, 16h35m uptime, 2026-05-21)
12,013 / 11,877 (workspace-A, deep tree, 12h44m uptime, 2026-05-22)
~200 / 46 (workspace-B, flat tree with dated .md files, ~12h uptime, 2026-05-23 — early-stage capture; precise total FD count was not separately recorded for this row)
12,761 / 12,591 (workspace-A, deep tree, 8h58m uptime, 2026-05-25 storm A — V8 heap snapshot captured)
12,728 / 12,591 (workspace-A, same tree, ~13 minutes after a fresh gateway restart, 2026-05-25 storm B — second V8 heap snapshot captured; same per-subdirectory breakdown as storm A)

Across the 30-second sampling window used for each capture, no FD count decreases were observed; the count stayed flat until the process was restarted.
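
The "no decrease within the sampling window" check can be sketched as follows. This is an assumption-laden illustration rather than the monitor actually used: the 5-second interval, the function names, and the choice to count lsof rows (which also include non-descriptor rows such as cwd and txt) are all hypothetical.

```python
import subprocess
import time
from typing import Callable, List


def count_lsof_rows(pid: int) -> int:
    """Approximate open-FD count: number of `lsof -p <pid>` rows minus the header."""
    out = subprocess.run(
        ["lsof", "-p", str(pid)], capture_output=True, text=True, check=False
    ).stdout
    return max(len(out.splitlines()) - 1, 0)


def sample_fd_counts(
    pid: int,
    duration_s: float = 30.0,
    interval_s: float = 5.0,  # assumed interval; the report only states the 30s window
    counter: Callable[[int], int] = count_lsof_rows,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Collect FD counts at a fixed interval across the sampling window."""
    n_samples = int(duration_s // interval_s) + 1
    samples = []
    for i in range(n_samples):
        samples.append(counter(pid))
        if i < n_samples - 1:
            sleep(interval_s)
    return samples


def release_observed(samples: List[int]) -> bool:
    """True if any sample is lower than the one before it."""
    return any(later < earlier for earlier, later in zip(samples, samples[1:]))
```

A flat or growing series returns False from release_observed, which is the shape reported for all five captures.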

The gateway remained responsive (/health returned 200 in <10ms) at each capture. During the 2026-05-25 storm A capture, a sub-agent process spawned during the same window reported EBADF. The causal chain between the gateway's high FD count and the sub-agent's EBADF was not isolated. EBADF is the bad-file-descriptor errno, not the typical FD-exhaustion errno (EMFILE/ENFILE).

OpenClaw version

2026.5.19. Local build off upstream/main merge-base a13468320c (public), with 14 unpublished local commits on top. The 14 commits touch src/agents/, src/auto-reply/, src/cron/, src/infra/, src/logging/, and src/trajectory/ only — no files under extensions/memory-core/, no chokidar usage, no fs-safe.ts, no fs.open/fs.watch call sites (verified by git diff --name-only a13468320c..HEAD).

Operating system

macOS 14.7.6 (Darwin 23.6.0, x86_64)

Install method

pnpm dev build (pnpm build), gateway running under launchd via /Library/LaunchDaemons/ai.openclaw.gateway.plist.

Model

openai-codex/gpt-5.5 (thinking=high, fast=off) — recorded as the agent model in the storm B gateway startup log line [gateway] agent model: openai-codex/gpt-5.5 (thinking=high, fast=off). The bug is at the gateway/process level and is not believed to depend on this; included for completeness.

Provider / routing chain

openclaw gateway → memory-core plugin → workspace memory/ tree on local disk. No external LLM-provider routing path is implicated; the FD accumulation occurs entirely inside the gateway parent process.

Additional provider/model setup details

External routing/auth: NOT_ENOUGH_INFO (no provider/auth path implicated).

Memory-plugin config state at the time of the captures (relevant only as triage context):

memory-core plugin is loaded at gateway start (gateway log shows http server listening (3 plugins: matrix, memory-core, tavily; 8.8s)).
plugins.entries does not contain memory-core; it appears in plugins.allow but with no explicit enable entry.
No agent has memorySearch explicitly enabled in openclaw.json. The memory_search tool is nevertheless callable by agents (observed in the 2026-05-25 storm B logs).
messages.sync.watch is not set in the config used during the captures.

Logs, screenshots, and evidence

FD type histogram (2026-05-22 capture, 12,013 total)

Only the REG count is anomalous:

11968 REG       ← 11,877 of these are workspace memory/**.md files (the leak)
   16 unix
   13 IPv4
    4 PIPE
    4 DIR
    3 KQUEUE
    2 CHR
    1 systm
    1 NPOLICY
    1 IPv6

FD-by-directory (top entries, 2026-05-22 capture)

All REG-type, all r-mode:

8694 <workspace>/memory/transcripts
1695 <workspace>/memory/transcripts.archived
 267 <workspace>/memory/structured-md/lessons
 214 <workspace>/memory/structured-md/decisions
 213 <workspace>/memory/structured-md/lessons.archived
 209 <workspace>/memory/structured-md/procedures
 151 <workspace>/memory/structured-md/decisions.archived
 137 <workspace>/memory
 126 <workspace>/memory/structured-md/procedures.archived
  81 <workspace>/memory/structured-md/projects
  63 <workspace>/memory/structured-md/projects.archived

The retained FD paths span multiple subdirectories under <workspace>/memory/, including *.archived/ sibling directories I use to keep older items out of the active set.

Representative lsof line

REG = regular file; 86r = FD 86, read-only mode; opened once, no close observed in the 30s sampling window:

node    <pid>   <user>  86r  REG  1,7   2262   <inode>   <workspace>/memory/transcripts/<uuid>.md

Sample stack trace — first observation

A macOS sample of the gateway during the leak shows libuv workers actively performing scandir/stat/open/close syscalls — but for other paths. The 11,877 memory FDs themselves sat idle during the 30s sampling window. The close path is operational for other FDs in the same process.

Sample stack trace — second observation

A second sample taken while the FD count was actively growing on workspace-B caught node::fs::ReadDir(v8::FunctionCallbackInfo<v8::Value> const&) (Node's C++ binding behind fs.readdir, which sits above libuv) at the bottom of ~89 frame samples, called via Builtins_CallApiCallbackOptimizedNoProfiling. The JS frames above the V8 builtin show as ??? (in <unknown binary>) because they are JIT-compiled and carry no symbols. So the active syscall path is fs.readdir; the JS-level entry point requires a heap snapshot to identify.

V8 heap snapshot — 2026-05-25 storm

Captured during the 2026-05-25 capture (gateway PID known, 12,761 total FDs / 12,591 memory-tree REGs at capture time, 535 MB .heapsnapshot, captured via --inspect=127.0.0.1:9229 armed in the launchd plist). Snapshot stats: 6,439,689 nodes / 26,210,859 edges / 425,208 strings.

Class histogram (counts of leak-relevant types):

12,760  FSWatcher       (count close to memory-tree FD count)
12,755  FSEvent         (held as V8 Global handles)
12,765  listener        (closure)
12,756  handleEvent     (closure)
12,754  Stats           (object)
12,573  validate        (closure)
   415  WatchHelper     (object)
    11  FileHandle      (object)
     2  FSReqCallback
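
A histogram of this kind can be derived from the .heapsnapshot JSON itself. The sketch below is not the companion analyzer mentioned under "Artifacts available on request"; it is a minimal illustration that assumes the standard V8 snapshot layout (snapshot.meta.node_fields, a flat nodes array, and a strings table) and counts only nodes of type object or closure. The function name class_histogram is hypothetical.

```python
from collections import Counter
import json


def class_histogram(snapshot: dict) -> Counter:
    """Count heap nodes by (node type, name) for object and closure nodes."""
    meta = snapshot["snapshot"]["meta"]
    fields = meta["node_fields"]
    stride = len(fields)                      # ints per node in the flat array
    type_idx = fields.index("type")
    name_idx = fields.index("name")
    type_names = meta["node_types"][type_idx]  # enum of node type names
    nodes = snapshot["nodes"]
    strings = snapshot["strings"]

    hist: Counter = Counter()
    for offset in range(0, len(nodes), stride):
        kind = type_names[nodes[offset + type_idx]]
        if kind in ("object", "closure"):
            hist[(kind, strings[nodes[offset + name_idx]])] += 1
    return hist


def load_snapshot(path: str) -> dict:
    """Load a snapshot fully into memory; a multi-hundred-MB file needs ample RAM."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Calling class_histogram(load_snapshot(path)).most_common() on a capture would list the dominant classes; whether the counts match the table above exactly depends on how the original analyzer grouped nodes, which is not stated here.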

Retainer chain — WatchHelper (matches chokidar's internal class layout):

[object] WatchHelper  --(property:fsw)-->
  [object] FSWatcher  --(property:_closers)-->
    [object] Map  --(internal:table)-->
      [array]  --(internal:[N])-->
        [array] (object elements)  --(internal:[0])-->
          [closure]  --(internal:context)-->
            [object] system/Context  --(context:listener)-->
              [closure]  --(internal:context)-->
                [object] system/Context  --(context:wh)-->
                  [object] WatchHelper           ← cycle back

Retainer chain — Stats (shows prevStats captured in listener context):

[WatchHelper] -> (fsw) -> [FSWatcher] -> (_closers) -> [Map] ->
  [Array elements] -> [closure listener] ->
    [system/Context: prevStats] -> [Stats]

Retainer chain — FSEvent (pinned as V8 Global handles, i.e., active C++-side subscriptions):

[synthetic] (GC roots)  --(element:[10])-->
  [synthetic] (Global handles)  --(internal:[N])-->
    [object] FSEvent

FileHandle observation: lsof shows 12,685 REG-type FDs on .md files in the same capture, but the heap snapshot contains only 11 FileHandle JS objects. The retained lsof FDs are therefore not represented as fs.promises.FileHandle objects in the heap snapshot. The FSReqCallback count of 2 indicates the snapshot did not contain thousands of in-flight async fs callbacks.

Trigger correlation observed in 2026-05-25 storm B

Storm B fired at 17:59:15 UTC on a gateway that started at 17:46:30 UTC — 13 minutes uptime. The gateway log + err log in the 17:46-17:59 window shows the following relevant lines (timestamps in UTC):

17:46:30 — gateway becomes ready ([gateway] ready), 3 plugins loaded: matrix, memory-core, tavily.
17:57:59 — liveness warning records agent:<redacted>:matrix:channel:<redacted> (processing/tool_call,q=1,age=54s last=tool:memory_search:started) — i.e., a memory_search tool call had been running for 54s.
17:58:53 — [agent/embedded] codex dynamic tool timeout: tool=memory_search toolTimeoutMs=30000; per-tool-call watchdog, not session idle — the 30-second per-tool watchdog for memory_search fired.
17:59:15 — local FD-edge monitor records FD BAND TRANSITION pid=<redacted> clean -> overgrown (total=10441 memory=10301 transcripts=7104 archived=2248) — the FD count had grown to >10K REGs by this moment; growth continued to 12,728 / 12,591 in the subsequent ~30 seconds.

So the order observed in the log + monitor data: memory_search tool call → 30s tool-watchdog fires → FD count climbs past the storm threshold within the next ~22 seconds. The log does not directly show the walker; the correlation is between the memory_search activity and the FD-count growth.
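
The intervals quoted above follow directly from the four log timestamps. A small sketch of the arithmetic, with the event labels chosen here for illustration (they are not log field names):

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Whole seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# UTC timestamps taken from the storm B log lines listed above.
events = {
    "gateway_ready": "17:46:30",
    "liveness_warning": "17:57:59",    # memory_search reported as running for 54s
    "tool_watchdog": "17:58:53",       # 30s per-tool watchdog fired
    "fd_band_transition": "17:59:15",  # clean -> overgrown
}

print(seconds_between(events["liveness_warning"], events["fd_band_transition"]))  # 76
print(seconds_between(events["tool_watchdog"], events["fd_band_transition"]))     # 22
print(seconds_between(events["gateway_ready"], events["fd_band_transition"]))     # 765
```

The last figure, 765 seconds, is the 12 min 45 s that the report rounds to 13 minutes of uptime.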

Storm A's logs at the equivalent pre-storm window were not exhaustively reviewed; storm A's call path may or may not be the same.

Negative log evidence

In the 15-minute window during which FDs grew from <500 to 12,013 on 2026-05-22 (06:54 UTC → 07:09 UTC), the gateway log contains only routine [agent/embedded] codex plugin thread config eligibility lines and a couple of WebSocket sessions.list / sessions.resolve round-trips. The opener did not write anything to the log during that window.

Hypothesis-test results (what was probed by observation)

Hypothesis	Test performed	Observed result
Isolated cron sub-agents leak FDs to the gateway parent	Triggered an isolated `QMD high-churn embed` cron via `openclaw cron run`; sampled gateway-parent FDs across 30s	Isolated sub-agents are separate processes; gateway-parent FD count did not change during the cron run.
Native vector-store (LanceDB / sqlite-vec) holds FDs open	Looked for `.lance/*` or memory DB FDs in `lsof`; checked storm heap snapshot for vec-store class names	None observed in `lsof` or heap.
FD leak goes through `fs.promises.open()`	Counted `FileHandle` objects in the storm heap snapshot	11 `FileHandle` JS objects vs 12,685 `lsof` REG FDs in the same capture — the retained FDs are not visible as `fs.promises.FileHandle` JS objects in this snapshot.
`chokidar` watcher is dormant per config gate	(a) Touched a file under `<workspace>/memory/` and watched `lsof` + gateway log for 30s; (b) counted `FSWatcher` objects in the storm heap snapshot	(a) No file-touch reaction observed; (b) heap snapshot contains 12,760 `FSWatcher` and 12,755 `FSEvent` JS objects. The test (a) result and observation (b) are not in direct conflict — (a) probed whether new file events were being delivered, while (b) shows the watchers exist in heap.

Impact and severity

Affected: The 2 workspaces and 1 macOS launchd-daemon configuration observed (workspace-A: 2026-05-21, 2026-05-22, 2026-05-25; workspace-B: 2026-05-23). Reproduced on the current dist (build off a13468320c) and on an earlier dist (separate branch rebased onto upstream/main 2026-05-20, observed 2026-05-21).
Severity: Stability risk. The gateway remained responsive at all 5 captures, but FD count reached the >12K range and one co-resident sub-agent reported EBADF during the 2026-05-25 storm A capture.
Frequency: 5 of 5 observed reproductions across 2 workspaces and 2 distinct bad-dist builds — 4 of 4 on the current dist (2026-05-22, 2026-05-23, 2026-05-25 storm A, 2026-05-25 storm B) + 1 of 1 on the earlier build (2026-05-21). 0 reproductions on a single 2026-05-13-era healthy-baseline observation (1d 4h uptime, ~250 total FDs, 0 memory-tree REGs). Storm B reproduced 13 minutes after a fresh restart in a single memory_search tool call window.
Consequence: In the captures we observed, a process restart was required to reset FD count; no in-process release was observed.

Additional information

Reproducibility timeline

Date	Build	Workspace	Uptime	Total FDs	Memory-tree FDs
2026-05-13 area	earlier build (pre-2026-05-13 healthy baseline)	workspace-A	1d 4h	~250	0 (healthy baseline)
2026-05-21 07:40 EDT	separate branch rebased onto upstream/main 2026-05-20	workspace-A (deep, transcripts)	16h 35m	11,830	10,103
2026-05-22 03:09 EDT	local build off `a13468320c`	workspace-A (deep, transcripts)	12h 44m	12,013	11,877
2026-05-23 14:01 UTC	same dist as row 3	workspace-B (flat, dated `.md`s)	~12h	~200	46 (early-stage)
2026-05-25 15:31 UTC	same dist as row 3	workspace-A (deep, transcripts)	8h 58m	12,761	12,591 (storm A — heap snapshot captured)
2026-05-25 17:59 UTC	same dist as row 3	workspace-A (deep, transcripts)	13 min (post-restart)	12,728	12,591 (storm B — second heap snapshot captured; `memory_search` tool call observed ~76s pre-storm in gateway log)

Last known good observed: row 1 (2026-05-13 area, ~250 total, 0 memory-tree). First known bad observed: row 2 (2026-05-21). The build delta between those rows has not been bisected.

All cycles cleared by process restart. Non-overnight observation windows on the current dist showed clean baselines (<200 FDs total).

Known confounders

All 5 bad-capture dists in the timeline include unpublished local commits on top of public openclaw merge-bases (14 commits ahead of a13468320c for the current-dist bad captures; a separate branch for the 2026-05-21 row). The 14 commits (verified via git diff --name-only a13468320c..HEAD) touch src/agents/, src/auto-reply/, src/cron/, src/infra/, src/logging/, and src/trajectory/ only: no files under extensions/memory-core/, no chokidar usage, no fs-safe.ts, no fs.open/fs.watch call sites. At the time of filing, the bug had not been reproduced on a clean upstream-only build in my environment, so the local commits were a formal confounder, although their diff has no plausible direct mechanism for accumulating r-mode REG FDs on <workspace>/memory/. The 2026-05-25 22:30 UTC update at the top of this report has since reproduced the leak on clean upstream-main (HEAD 01c5ab8d13).

Source-code audit results (where I looked and what I found)

I audited the extensions/memory-core/src/memory/ tree at a13468320c for the obvious leak shapes. Results, so maintainers can skip re-deriving them:

Line citations below refer to upstream openclaw/openclaw at the merge-base a13468320c (verified against a clean upstream-main worktree at HEAD 01c5ab8d13 to confirm line numbers haven't drifted across nearby commits).

Audit target	Finding
`chokidar.watch` call sites in memory-core	Two: `manager-sync-ops.ts:460` (memory tree, gated at `:431` by `!this.sources.has("memory") || !this.settings.sync.watch || this.watcher`) and `qmd-manager.ts:1566` (QMD collections, gated at `:1550` by `!this.syncSettings?.watch || this.watcher || this.closed`). Default for `sync.watch` is `false` (`manager-sync-control.ts:184` shows `sync: { watch: false, onSessionStart: false, onSearch: false }`). I did not find a third `chokidar.watch` call in memory-core.
`fs.open()` / `fs.promises.open()` call sites in memory-core	Two: `manager-sync-ops.ts:699` (`countNewlines`) and `qmd-manager.ts:2173` (`readSelectedLines`). Both use `try { … } finally { await handle.close(); }`. Neither iterates over `<workspace>/memory/**`.
`fs.readFile` in memory-core load path	`manager-embedding-ops.ts:684` and `:714` use `await fs.readFile(entry.absPath, "utf-8")`, which auto-closes after read. Used in `indexFile`.
chokidar v5's own `fs.open` invocation	One, in `chokidar/handler.js:182`, but guarded by `if (isWindows && error.code === 'EPERM')` — a Windows EPERM workaround, not active on macOS.
Native readers / vec-store bindings	The memory-core bundle ships native bindings (sqlite-vec, embeddings). I did not audit native code; if a native binding opens files via `uv_fs_open` outside the JS-tracked path, that would match the lsof-vs-heap discrepancy (12,685 REG FDs vs 11 `FileHandle` objects).

So the visible-from-source openers all look correct, and the visible-from-source watchers all look gated off. The actual call path producing the 12,685 r-mode REG FDs and the 12,760 FSWatcher JS objects in the captured runs is not identified by source-only inspection.

Fix hypotheses (speculative — for maintainer evaluation, not for implementation)

Listing these only as starting points; I don't have evidence strong enough to prefer one.

Watcher leak across manager re-instantiations: if a per-agent memory manager is recreated (e.g. on session/cron lifecycle change) without await this.watcher.close() on the prior instance, the prior FSWatcher + its _closers Map of listener closures stay retained as observed (12,760 active in heap; FSEvent pinned as Global handles).
Watcher fired by a non-sync.watch path: if some other call site invokes chokidar.watch on the memory tree outside the audited two, it would bypass the gate.
Native plugin opener without close: if a native module (sqlite-vec / embedding / multimodal) opens .md files via uv_fs_open() to compute hashes or extract content, and the handles aren't released after the call, that would match the lsof-only / not-in-heap signature.
memory_search runtime path opening files we don't see: storm B correlated with a memory_search tool call that hit the 30s tool watchdog. If the search path enumerates and opens corpus files outside the audited code paths (e.g. in loadMemoryToolRuntime() or its transitive native dependencies), the leak could live there.

Open questions for maintainers

The 2026-05-25 storm B trigger appears to be a memory_search tool call (gateway log shows the call ~76s before the FD-count overgrown transition). Is that the intended call path for the walker, and is memory_search expected to leave per-file FDs open between successive calls?
The storm heap snapshot also contains 12,760 FSWatcher objects; I have not identified which code path creates them in this configuration (messages.sync.watch is unset). Is there a separate watcher fan-out distinct from memory_search?
Is there a code path that opens .md files via callback-form fs.open(...) / fs.openSync(...) / native uv_fs_open() and retains the resulting FDs? The 12,685 lsof REG FDs are not visible as fs.promises.FileHandle JS objects in the heap.
On a rescan/reload of the memory tree (or on completion of a memory_search call), what is the expected cleanup contract for prior FSWatcher instances and prior open FDs?

Related upstream issues (for cross-referencing during triage)

I checked the openclaw issue tracker before filing; the following are open and adjacent. None duplicate this report; this one is the upstream-side root-cause description that several of them point at downstream.

#77750 (created 2026-05-05, OPEN) — spawn EBADF when gateway file descriptor count is high. Same FD-exhaustion symptom on 2026.5.3 upstream npm (14,039 FDs, 13,982 .md files). That issue scopes itself to a child-spawn EBADF fallback in src/process/supervisor/adapters/child.ts — i.e. a downstream mitigation — and acknowledges the root-cause leak is a separate problem (clawsweeper comment: "the related FD-leak fixes reduce one pressure source, but they do not implement the child-spawn fallback requested here"). This issue is intended to give that root cause concrete forensic shape.
#84820 (created 2026-05-21, OPEN, P1, impact:crash-loop) — Unclosed FileHandle on session JSONL lock crashes gateway on Node ≥24 under sustained session-store load. Same retention shape (unclosed FileHandle in long-running gateway), different code path (session .jsonl.lock vs memory tree .md). Both point at the same class of resource-cleanup gap.
#67461 (created 2026-04-16, OPEN, P1, impact:crash-loop) — Gateway leaks undici sockets on every streamed Anthropic API call (buildManagedResponse missing finalize on GC). Different resource (sockets, not file FDs), same long-running-gateway / resource-leak / crash-loop family.
#77997 (created 2026-05-05, OPEN) — skills refresh-state workspaceVersions map retains entries after watcher teardown. Same retention-after-watcher-teardown shape, in skills runtime instead of memory-core.
#71335 (created 2026-04-25, OPEN, P2, impact:crash-loop) — sync.watch should default to false in gateway mode. The default is already false in the version observed here (manager-sync-control.ts:184), but that issue's impact:crash-loop label corroborates that sync.watch's watcher fan-out has previously been associated with crash-loop pressure.
#40088 (created 2026-03-08, OPEN, P1) — memory_search: chokidar file watcher silently stops delivering events, index goes stale. Same memory_search + chokidar surface area, but opposite symptom (watcher silently stops vs my report's watcher overretention). Listed for completeness; not a duplicate.

Artifacts available on request

535 MB .heapsnapshot from the 2026-05-25 capture (PID known, captured via --inspect=127.0.0.1:9229).
Full lsof capture (2.1 MB) from the same moment.
Three earlier diagnostic bundles (sample stacks, log tails, FD histograms) from 2026-05-21, 2026-05-22, 2026-05-23.
Companion Python heap-snapshot analyzer (handles snapshots >512 MB) used to extract the class histograms and retainer chains above.

Happy to attach a sanitized excerpt or a JSON node-by-class histogram if useful for triage.
