hermes - ✅(Solved) Fix lsp: idle subprocesses are never reaped — long-lived gateways accumulate ~200 MB per server forever [4 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#25016Fetched 2026-05-14 03:49:48
View on GitHub
Comments
0
Participants
1
Timeline
10
Reactions
0
Participants
Timeline (top)
cross-referenced ×5labeled ×3mentioned ×1subscribed ×1

agent/lsp/manager.py defines DEFAULT_IDLE_TIMEOUT = 600 (line 66) and assigns it to self._idle_timeout (line 167), but no reaper exists. Idle LSP subprocesses live forever inside a long-running gateway / CLI session.

A long-lived gateway process accumulates one LSP subprocess per (language, workspace) ever touched. Realistic memory cost:

  • pyright ~200 MB
  • gopls ~80 MB
  • tsserver ~150 MB
  • rust-analyzer ~300+ MB

A gateway that sees edits across 5–10 different repos in different languages over a day-long session ends up holding ~1–2 GB of dead LSP subprocesses with no path to reclaim until the gateway restarts.

Root Cause

agent/lsp/manager.py defines DEFAULT_IDLE_TIMEOUT = 600 (line 66) and assigns it to self._idle_timeout (line 167), but no reaper exists. Idle LSP subprocesses live forever inside a long-running gateway / CLI session.

A long-lived gateway process accumulates one LSP subprocess per (language, workspace) ever touched. Realistic memory cost:

  • pyright ~200 MB
  • gopls ~80 MB
  • tsserver ~150 MB
  • rust-analyzer ~300+ MB

A gateway that sees edits across 5–10 different repos in different languages over a day-long session ends up holding ~1–2 GB of dead LSP subprocesses with no path to reclaim until the gateway restarts.

Fix Action

Fixed

PR fix notes

PR #24467: consolidation: fix 7 defects in PR #24168 LSP feature (orphan procs, patch_replace double-lint, opt-in defaults, +4 more)

Description (problem / solution / changelog)

Consolidation of PR #24168 (LSP semantic diagnostics) — 7 defect fixes

Builds on top of #24168 HEAD (7fa5657). Supersedes the competing #24155 (simpler/smaller approach to the same feature). Addresses every issue raised in the #24168 review thread plus three additional defects found during a holistic re-review.

Context

PR #24168 adds a substantial LSP integration (~4200 LOC across 2 commits) — full agent/lsp/ module spawning real language servers (pyright, gopls, rust-analyzer, etc.) as subprocesses, feeding semantic diagnostics into write_file/patch_replace post-write lint. Architecture is sound; the issues below are wiring/lifecycle/defaults bugs that would bite production users.

Defects fixed

#DefectSeverity
D1patch_replace double-lint silently drops every LSP diagnostic. The recursive write_file call rolls the baseline forward; the outer _check_lint_delta then sees an empty delta. Net effect: patch_replace would never surface a single LSP semantic error. Independently diagnosed in the #24168 bot review addendum.Critical
D2Path-keyed _delta_baseline collides under concurrent gateway load — two chats editing the same path in different worktrees stomp each other's baseline.High
D3Idle LSP subprocesses never reaped. Long-lived gateway accumulates one server per (language, workspace) forever (pyright ~200 MB, gopls ~80 MB, tsserver ~150 MB). DEFAULT_IDLE_TIMEOUT = 600 existed but was never wired to a reaper.High
D4Workspace cache unbounded dict, not thread-safe. Long-lived process grows without ceiling; concurrent resolve can corrupt the dict.High
D5Default lsp.enabled: true + install_strategy: auto triggers silent npm install pyright on first .py edit in a git repo — SOC 2 / ISO 27001 finding in audited environments. Also raised as a blocker in the #24168 bot review.High
D6INSTALL_RECIPES unpinned (e.g. pyright@latest, gopls@latest). Supply-chain risk; same Dependabot discipline that applies to application deps applies to tools we shell out to.Medium
D7Orphaned LSP subprocesses on hermes exit. Daemon thread death doesn't propagate SIGTERM to child servers — they live on holding 80–200 MB each until the kernel reaps them. Raised as blocker (1) in the #24168 bot review.High

Fix summary

  • D1: Token-keyed _delta_baseline (UUID per snapshot). patch_replace captures its own token before the write, passes it to write_file via _lsp_baseline_token=, and the inner _check_lint_delta runs with skip_lsp=True so the token survives to the outer post-write check.
  • D2: Same token-keyed map. Each snapshot returns a fresh token; no implicit global state for concurrent writers to collide on.
  • D3: Background _reaper_loop coroutine scheduled on _BackgroundLoop. _reap_idle() is the unit-testable single pass.
  • D4: OrderedDict LRU (cap 1024) + threading.Lock. Added public invalidate_workspace_cache(path) + clear_workspace_cache() for callers observing workspace topology changes (git init, worktree add/remove).
  • D5: Defaults flipped to lsp.enabled: false + install_strategy: "manual". Both opt-in. Existing-binary discovery still works in manual mode (servers on PATH).
  • D6: Every auto-install recipe pinned to a specific verified version. Inline CVE notes for [email protected] (CVE-2026-33532, CVSS 4.3 low, DoS-only on deeply-nested YAML, server is sandboxed) and gopls v0.21.1 (CVE-2026-42503, CVSS 8.8 high — exploit gated on -listen/-port flags our default invocation doesn't pass).
  • D7: atexit.register(shutdown_service) on first successful get_service() + explicit shutdown_service() call wired into cli._run_cleanup. Belt + suspenders. shutdown_service is idempotent and swallows inner exceptions so an exit hook can never crash the process.

Test coverage

  • 204/204 unit tests pass (26 new regression tests in tests/agent/lsp/test_consolidation_regressions.py — one per defect, plus thread-safety + idempotency + ordering pins).
  • Live verification of D7: spawned a child Python process that imported the LSP service, opened a real pyright subprocess (PID 77138), and exited normally. After exit, no orphan pyright process survived. Atexit hook fired correctly.

Diffstat

agent/lsp/__init__.py                              |  24 +-
agent/lsp/install.py                               |  61 +-
agent/lsp/manager.py                               | 230 ++++++-
agent/lsp/workspace.py                             | 110 ++-
cli.py                                             |  10 +
hermes_cli/config.py                               |  29 +-
tests/agent/lsp/test_consolidation_regressions.py  | 902 ++++++++++++++++++++  (new)
tests/agent/lsp/test_service.py                    |  11 +-
tools/file_operations.py                           | 107 ++-
9 files changed, 1400 insertions(+), 84 deletions(-)

#24168 bot review cross-reference

Bot findingResolution here
(1) Orphan LSP server processes (blocker)✅ D7 — atexit + cli._run_cleanup, live-verified
(2) Enabled by default + auto-install too aggressive✅ D5
(3) Five test_client_e2e.py failuresAlready passing 6/6 on this branch (stale finding)
(4) Competing #24155This consolidation supersedes both
(5) Stale parent commits in #24168Will be addressed when @teknium1 rebases #24168; not in scope here
Minor: dead DEFAULT_IDLE_TIMEOUT✅ D3 — reaper is now live, constant is load-bearing
Addendum: patch_replace LSP double-call✅ D1 — independently diagnosed, same fix shape

Open items, non-blocking

  • CVE notes inline for yaml-language-server and gopls — both bounded by deployment posture (sandboxed YAML parsing, no flags). Track upstream patches separately.
  • Pre-existing race on _service singleton in get_service() not introduced by this PR; low probability in practice (single-threaded tool dispatch in hermes main loop).

Verifying locally

git fetch https://github.com/scubamount/hermes-agent.git lsp-consolidation-pr24168-fixes
git checkout FETCH_HEAD
python -m pytest tests/agent/lsp/ tests/tools/test_file_operations.py \
                 tests/tools/test_file_operations_edge_cases.py \
                 tests/tools/test_file_write_safety.py

Expected: 204 passed.


cc @teknium1 (PR #24168 author) — this is positioned as a friendly consolidation, not a takeover; happy to close in favor of you pulling these patches into #24168 directly, or to keep this PR live as a stacked target. Whichever you prefer.

Changed files

  • agent/lsp/__init__.py (added, +83/-0)
  • agent/lsp/cli.py (added, +270/-0)
  • agent/lsp/client.py (added, +930/-0)
  • agent/lsp/eventlog.py (added, +213/-0)
  • agent/lsp/install.py (added, +382/-0)
  • agent/lsp/manager.py (added, +692/-0)
  • agent/lsp/protocol.py (added, +196/-0)
  • agent/lsp/reporter.py (added, +78/-0)
  • agent/lsp/servers.py (added, +1025/-0)
  • agent/lsp/workspace.py (added, +323/-0)
  • cli.py (modified, +10/-0)
  • hermes_cli/config.py (modified, +64/-0)
  • hermes_cli/main.py (modified, +11/-0)
  • tests/agent/lsp/__init__.py (added, +1/-0)
  • tests/agent/lsp/_mock_lsp_server.py (added, +159/-0)
  • tests/agent/lsp/test_backend_gate.py (added, +108/-0)
  • tests/agent/lsp/test_client_e2e.py (added, +143/-0)
  • tests/agent/lsp/test_consolidation_regressions.py (added, +902/-0)
  • tests/agent/lsp/test_eventlog.py (added, +199/-0)
  • tests/agent/lsp/test_protocol.py (added, +197/-0)
  • tests/agent/lsp/test_reporter.py (added, +94/-0)
  • tests/agent/lsp/test_service.py (added, +152/-0)
  • tests/agent/lsp/test_workspace.py (added, +139/-0)
  • tools/file_operations.py (modified, +200/-21)
  • website/docs/reference/cli-commands.md (modified, +28/-0)
  • website/docs/user-guide/features/lsp.md (added, +223/-0)
  • website/sidebars.ts (modified, +1/-0)

PR #25021: fix(lsp): add idle-subprocess reaper to reclaim memory during long sessions

Description (problem / solution / changelog)

Summary

LSPService defines DEFAULT_IDLE_TIMEOUT = 600 and tracks _last_used timestamps per (server_id, workspace_root) key, but no reaper coroutine ever consumes them. Idle LSP subprocesses live forever inside a long-running gateway/CLI session, accumulating hundreds of MB of dead memory.

Changes

  • agent/lsp/manager.py:

    • Add _BackgroundLoop.schedule() method to fire-and-forget a coroutine on the background event loop.
    • Add LSPService._reaper_loop() — an async loop that wakes every idle_timeout / 2 seconds and calls _reap_idle().
    • Add LSPService._reap_idle() — single pass that iterates _last_used, shuts down clients idle beyond _idle_timeout, and removes them from _clients and _last_used.
    • Start the reaper in __init__ when LSP is enabled; cancel it in shutdown().
  • tests/agent/lsp/test_reaper.py (new): 3 unit tests covering:

    • Stale clients are reaped and shutdown() is awaited.
    • Active clients are preserved.
    • Reaper continues gracefully if client.shutdown() raises.

Test Results

All existing LSP tests pass; 3 new tests pass.

Fixes #25016

Changed files

  • agent/lsp/manager.py (modified, +43/-0)
  • tests/agent/lsp/test_reaper.py (added, +81/-0)

PR #25037: fix(lsp): default to opt-in (enabled=False, install_strategy=manual) (#25015)

Description (problem / solution / changelog)

Summary

Flips the two canonical LSP defaults in DEFAULT_CONFIG["lsp"] to opt-in so a fresh install never silently spawns pyright / gopls / tsserver or auto-installs language servers via npm / go / pip on the first file edit.

  • enabled: TrueFalse
  • install_strategy: "auto""manual"

Also flips the matching fallbacks in LSPService.create_from_config() so users on a config.yaml that predates the lsp: section see the same opt-in behaviour at runtime.

Fixes #25015.

The bug

hermes_cli/config.py:1516,1530 shipped with enabled: True + install_strategy: "auto". With LSP gated only on git-workspace detection, the first .py / .go / .rs / .ts edit inside any git repo silently triggers:

  • npm install -g pyright@latest (or go install gopls@latest, cargo install rust-analyzer, etc.) into <HERMES_HOME>/lsp/bin/
  • a long-lived language-server subprocess (pyright ~200 MB, gopls ~80 MB, tsserver ~150 MB)

For users in audited environments (SOC 2, ISO 27001) this falls under the same compliance umbrella that already gates pip / npm installs in CI: any background process or package fetch needs explicit consent.

Reported as defect D5 in scubamount's #24467 and verified-real on current main by @kshitijk4poor when that PR was split into three focused issues. This is the standalone follow-up they asked for.

The fix

Two literal flips in the canonical defaults, plus matching fallback flips in the consumer so existing configs behave the same way:

 # hermes_cli/config.py
 "lsp": {
-    "enabled": True,
+    "enabled": False,
-    "install_strategy": "auto",
+    "install_strategy": "manual",
 }
 # agent/lsp/manager.py — LSPService.create_from_config
-enabled = bool(lsp_cfg.get("enabled", True))
-install_strategy = lsp_cfg.get("install_strategy", "auto")
+enabled = bool(lsp_cfg.get("enabled", False))
+install_strategy = lsp_cfg.get("install_strategy", "manual")

Why also flip the manager fallbacks: LSPService.create_from_config() defaults are what kicks in when config.yaml predates the lsp: section. Flipping only DEFAULT_CONFIG would mean fresh installs get opt-in defaults while existing users silently keep enabled=True / install_strategy="auto" — divergent behaviour for the same product state, and the worst of both worlds.

Why "manual" and not "off": "manual" preserves existing-binary discovery. A user who opts back in with enabled: true (without also opting in to auto-install) still gets diagnostics from any pyright / gopls / tsserver already on PATH, with a clear "not installed" error otherwise. "off" would be a stronger downgrade that punishes users who already have the binaries.

Test plan

  • Focused regression test (TestLSPOptInDefaults, 3 tests): asserts DEFAULT_CONFIG["lsp"]["enabled"] is False, install_strategy == "manual", and that the manager fallbacks match.
  • Adjacent suite: tests/hermes_cli/test_config.py + tests/agent/lsp/ → 169 passed locally.
  • Regression guard: confirmed in-process that the old defaults (True / "auto") would fail the new assertions, and the new defaults pass.

Related

  • Source analysis: scubamount's closed #24467 (defect D5).
  • Sibling issues filed alongside #25015 by the same split:
    • #25016 — idle-subprocess reaper (already in flight as luyao618's #25021).
    • #25017 — INSTALL_RECIPES @latest pinning.

Sibling code paths that may need the same conservative-default treatment: agent/lsp/servers.py INSTALL_RECIPES (currently pulls @latest, #25017). Intentionally left out of this PR's scope to keep the diff focused on the opt-in toggle — happy to widen if preferred.

Changed files

  • agent/lsp/manager.py (modified, +6/-2)
  • hermes_cli/config.py (modified, +16/-6)
  • tests/hermes_cli/test_config.py (modified, +33/-0)

PR #25109: fix: reap idle LSP subprocesses to stop memory leak

Description (problem / solution / changelog)

This PR implements a missing reaping logic in the LSP manager to ensure idle subprocesses are terminated after the default timeout, fixing a long-standing memory leak issue.

Changed files

  • .repatch/state.json (added, +57/-0)
  • agent/lsp/manager.py (modified, +68/-0)

Code Example

async def _reaper_loop(self):
    while True:
        await asyncio.sleep(self._idle_timeout / 2)
        self._reap_idle()
RAW_BUFFERClick to expand / collapse

Summary

agent/lsp/manager.py defines DEFAULT_IDLE_TIMEOUT = 600 (line 66) and assigns it to self._idle_timeout (line 167), but no reaper exists. Idle LSP subprocesses live forever inside a long-running gateway / CLI session.

A long-lived gateway process accumulates one LSP subprocess per (language, workspace) ever touched. Realistic memory cost:

  • pyright ~200 MB
  • gopls ~80 MB
  • tsserver ~150 MB
  • rust-analyzer ~300+ MB

A gateway that sees edits across 5–10 different repos in different languages over a day-long session ends up holding ~1–2 GB of dead LSP subprocesses with no path to reclaim until the gateway restarts.

Repro

Spawn a gateway, edit .py in repo A, edit .go in repo B, edit .rs in repo C, leave the gateway running for an hour, watch ps aux | grep -E 'pyright|gopls|rust-analyzer' — none of them get reaped.

Locations

  • agent/lsp/manager.py:66DEFAULT_IDLE_TIMEOUT = 600
  • agent/lsp/manager.py:157idle_timeout: float = DEFAULT_IDLE_TIMEOUT,
  • agent/lsp/manager.py:167self._idle_timeout = idle_timeout

grep -n '_reaper_loop\|_reap_idle' agent/lsp/manager.py returns no hits — the constant is wired into the constructor signature but nothing consumes it.

Proposed fix

Background reaper coroutine scheduled on _BackgroundLoop:

async def _reaper_loop(self):
    while True:
        await asyncio.sleep(self._idle_timeout / 2)
        self._reap_idle()

_reap_idle() is the unit-testable single pass — iterate self._last_used, tear down clients whose timestamp is older than now - self._idle_timeout. Schedule via _BackgroundLoop.schedule() in __init__ when enabled; cancel the handle in shutdown().

A note on cleanup: agent/lsp/__init__.py:75 already registers an atexit hook that tears down everything on process exit, so this issue is purely about reclaiming memory during a long-lived session, not at exit.

This was verified-real in scubamount's PR #24467 (defect D3). Credit to @scubamount for the original analysis. Filing as a standalone issue with a focused fix scope rather than the bundled refactor in that PR.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix lsp: idle subprocesses are never reaped — long-lived gateways accumulate ~200 MB per server forever [4 pull requests, 1 participants]