RFC: Production-grade multi-profile gateway supervisor with adopt-orphan + staggered spawn + Windows-native lifecycle

hermes · 2026-05-24 12:57:03

Root Cause

Each of the supervision problems described below is solvable in 5–50 lines of glue. The problem is that every multi-profile user reinvents the same 200–500 lines because none of it lives in core. I have it; #14009 (orchestrator profiles via ACP) implies others do too.

TL;DR

Hermes today targets the single-profile + cloud-VPS deployment shape beautifully — hermes gateway --supervisor plus $5 VPS / GPU cluster / Modal / Daytona covers most users in one or two commands.

It does not target a deployment shape that turns out to matter a lot in production: multi-profile on a Windows desktop, surviving login/logout cycles, with one watchdog supervising N independent gateways and a documented recovery contract. I've been running exactly this for ~3 months — three Hermes profiles, 24/7, on a Windows workstation that gets rebooted irregularly — and almost every meaningful incident has been about supervision, not about the gateway itself.

Filing this issue to gauge whether upstreaming the supervision layer (with explicit adopt-orphan + stagger + lock-file semantics + Windows service install) belongs in core or out-of-tree.

Why this is real

Most Hermes users have one gateway running. If it crashes, they restart their terminal. They don't notice and don't care.

Once you go to N>1 profiles on the same host — a common arrangement is "fast executor + reasoning advisor + long-context analyst", three profiles sharing one chat channel — the operational story changes:

| Problem | Single-profile | N-profile |
|---|---|---|
| One process dies | restart manually | the other N-1 must keep running, no cascading restart |
| Two watchdogs spawn (race during boot) | unusual | likely; needs lock + `--replace` semantics |
| Gateways start simultaneously | OK | port-bind / Feishu-websocket-handshake race; needs spawn stagger |
| Crash during sleep/lock | restart manually on wake | needs to come back without user intervention |
| User logs out | gateway dies | unacceptable; must survive logout via service / Scheduled Task |
| One profile gets stuck in an API hang | restart that one | must NOT trigger restart of healthy siblings |

What's actually running here (~3 months production)

                                                       ┌──────────────────────┐
                                                       │ Profile A (executor) │
                                                       └──────────────────────┘
Windows Scheduled Task                                   ┌──────────────────────┐
"Hermes_Gateway"  ───►  hidden VBS  ───►  Node     ───► │ Profile B (advisor)  │
(trigger: At Logon)     (no console)      supervisor    └──────────────────────┘
                                          ▲              ┌──────────────────────┐
                                          │              │ Profile C (analyst)  │
                                          │              └──────────────────────┘
                                          │              ▲
                                          └──────────────┴── single watchdog,
                                                            adopt-orphan policy,
                                                            staggered spawn (2.5s)

The supervision contract (the part that matters)

Single watchdog owns all profiles. Not one watchdog per gateway. Two watchdogs both holding the same set is an own-goal — they --replace each other and you end up with an empty fleet plus a confused log.
adopt-orphan on profile death. When profile B's gateway exits non-zero, the supervisor spawns a new B; profiles A and C are not touched. Any "rolling restart" semantics are explicitly rejected, because they create false correlation in the incident timeline.
Staggered spawn (2.5s default). Three gateways trying to bind their per-profile sockets simultaneously is racy on Windows. 2.5s is a sloppy upper bound that prevents the race without making cold-start painful.
Lock file + PID file protocol. An advisory, flock-like ~/.hermes/gateway/.supervisor.lock plus a per-profile ~/.hermes/gateway/<profile>.pid (with a stale-PID reclaim check) prevents the "two supervisors both --replace-ing each other" failure mode I hit in week 1.
Stale PID reclaim. PID file present but the OS doesn't recognize the PID → reclaim it; don't refuse to start. Windows process death doesn't reliably clean up PID files; the supervisor has to.
Loopback proxy bridging. Gateways in China-region deployments must route Feishu / Telegram websockets through a local SOCKS/HTTP proxy (v2ray / clash, typically 127.0.0.1:10808). Bare websocket → WinError 1225. The supervisor sets NO_PROXY=127.0.0.1,localhost before spawning so internal health checks bypass the proxy (sibling issue: #31421).
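
The lock-file and stale-PID rules above can be sketched roughly as follows. This is a minimal illustration, not the code running in the deployment described here: the function names (`read_pid`, `claim_pid_file`, `acquire_supervisor_lock`) are hypothetical, and the liveness probe is passed in as `is_alive` because the right check is platform-specific. In particular, `os.kill(pid, 0)` is a harmless probe on POSIX, but on Windows Python maps it to TerminateProcess, so it must not be used there.

```python
import os
from pathlib import Path
from typing import Callable, Optional


def read_pid(pid_file: Path) -> Optional[int]:
    """Return the PID stored in pid_file, or None if missing or unparsable."""
    try:
        return int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def claim_pid_file(pid_file: Path, new_pid: int,
                   is_alive: Callable[[int], bool]) -> bool:
    """Write new_pid unless a live process already owns the file.

    Returns True if the file was claimed (fresh or stale), False if a live
    owner exists and the caller should back off.
    """
    old_pid = read_pid(pid_file)
    if old_pid is not None and old_pid != new_pid and is_alive(old_pid):
        return False
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(str(new_pid))
    return True


def acquire_supervisor_lock(lock_file: Path, my_pid: int,
                            is_alive: Callable[[int], bool]) -> bool:
    """Advisory lock via exclusive create; a stale lock is reclaimed once."""
    lock_file.parent.mkdir(parents=True, exist_ok=True)
    for _ in range(2):
        try:
            fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            owner = read_pid(lock_file)
            if owner is not None and is_alive(owner):
                return False  # a live supervisor holds the lock
            # Stale or unreadable lock: reclaim and retry. (Sketch only: an
            # owner caught mid-write would also look stale here.)
            lock_file.unlink(missing_ok=True)
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(my_pid))
        return True
    return False
```

The design point is that a PID file on its own never blocks startup: only a PID that the operating system still recognizes does.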

The Windows-native lifecycle layer

The other half of "production on Windows" is making sure the supervisor itself survives:

Scheduled Task "At Logon" rather than a Windows Service. A Service requires an admin install; a Scheduled Task is per-user and easier to onboard.
.cmd entry point because Scheduled Task → .cmd is the path of least resistance for environment-variable plumbing.
Hidden VBS wrapper so the user doesn't see a console window after every boot. Pure cosmetics, but mandatory for desktop use — without it the workflow is unacceptable.
Supervisor does not self-restart. Trigger is At Logon only; if the supervisor dies post-login, the next Start-ScheduledTask (manual or scripted) is what brings it back. Self-restart loops are a bigger footgun than the missed-restart cost.
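
As a rough illustration of the Scheduled Task → VBS → .cmd chain described above, the sketch below only builds the two artifacts as strings, so it can be inspected on any OS. It relies on two standard Windows facilities: `schtasks /Create ... /SC ONLOGON /F` (with `/F` overwriting an existing task, which makes re-installation idempotent) and `WScript.Shell.Run` with window style `0` (hidden). The function names and file paths are hypothetical, and whether task creation needs elevation on a given machine has not been verified here.

```python
from pathlib import PureWindowsPath
from typing import List

TASK_NAME = "Hermes_Gateway"  # task name taken from the issue text


def vbs_wrapper(cmd_path: PureWindowsPath) -> str:
    """VBScript that launches cmd_path hidden (style 0) without waiting."""
    # VBScript escapes a double quote inside a string literal by doubling it,
    # so """C:\x.cmd""" evaluates to "C:\x.cmd" including the quotes.
    quoted = '"""' + str(cmd_path) + '"""'
    return (
        'Set shell = CreateObject("WScript.Shell")\r\n'
        f"shell.Run {quoted}, 0, False\r\n"
    )


def schtasks_create_args(vbs_path: PureWindowsPath) -> List[str]:
    """Argument list for creating (or overwriting) the At Logon task."""
    return [
        "schtasks", "/Create",
        "/TN", TASK_NAME,
        "/TR", f'wscript.exe "{vbs_path}"',  # wscript.exe shows no console
        "/SC", "ONLOGON",
        "/F",  # overwrite an existing task of the same name
    ]


if __name__ == "__main__":
    base = PureWindowsPath(r"C:\Users\me\.hermes\gateway")  # hypothetical path
    print(vbs_wrapper(base / "start.cmd"))
    print(schtasks_create_args(base / "start.vbs"))
```

On a Windows host the argument list could be handed to `subprocess.run(args, check=True)` after writing the wrapper text to the `.vbs` file.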

Why this isn't covered today

Reading hermes_cli/gateway.py + hermes_cli/gateway_windows.py + the existing supervisor scaffolding, the building blocks are there for single-profile + single-watchdog. What's missing is:

A first-class "supervisor owns N profiles" command shape
Standardized adopt-orphan semantics (not just "the supervisor restarts the gateway")
A Windows-service install command that wires the Scheduled Task + .cmd + VBS chain
A standardized lock-file / PID-reclaim protocol that future multi-profile users don't have to re-derive
Documented loopback-proxy bridging for China-region deployments (this is also addressed at the HTTP layer in #31421, but the gateway/supervisor needs to set NO_PROXY before spawning anything)
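
To make the missing pieces concrete, here is a minimal sketch of a supervisor that owns N profiles, staggers their startup, respawns only the profile that died, and sets NO_PROXY before spawning. The `Supervisor` class, `child_env`, and the injected `spawn` callable are hypothetical names for illustration; the actual per-profile gateway command line is not specified in this issue, so spawning is left to the caller (for example a wrapper around `subprocess.Popen`). Treating exit code 0 as an intentional stop is an assumption of this sketch.

```python
import os
import time
from typing import Any, Callable, Dict, Iterable, List, Mapping


def child_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy the environment and make loopback traffic bypass any proxy."""
    env = dict(base)
    env["NO_PROXY"] = "127.0.0.1,localhost"
    return env


class Supervisor:
    """Single watchdog that owns every profile's gateway process."""

    def __init__(self, profiles: Iterable[str],
                 spawn: Callable[[str, Dict[str, str]], Any],
                 stagger: float = 2.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.profiles = list(profiles)
        self.spawn = spawn      # spawn(profile, env) -> object with .poll()
        self.stagger = stagger  # seconds between consecutive spawns
        self.sleep = sleep
        self.env = child_env(os.environ)
        self.procs: Dict[str, Any] = {}

    def start_all(self) -> None:
        """Spawn each profile in order, pausing between spawns."""
        for index, profile in enumerate(self.profiles):
            if index:
                self.sleep(self.stagger)
            self.procs[profile] = self.spawn(profile, self.env)

    def reap_once(self) -> List[str]:
        """Respawn profiles that exited non-zero; leave siblings untouched."""
        restarted = []
        for profile, proc in list(self.procs.items()):
            code = proc.poll()
            if code is not None and code != 0:
                self.procs[profile] = self.spawn(profile, self.env)
                restarted.append(profile)
        return restarted
```

A watchdog loop would call `reap_once()` on a timer. Because each profile is checked independently, a crash in one profile never causes a restart of a healthy sibling.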

Proposed contribution (if in scope)

# New CLI surface (all backward-compatible — single-profile path unchanged)
hermes gateway --multi-profile profile_a,profile_b,profile_c \
               --stagger 2.5 \
               --single-watchdog \
               --adopt-orphan

# Windows-native lifecycle install (idempotent)
hermes service install --platform windows
# → creates Scheduled Task "Hermes_Gateway" + .cmd entry point + hidden VBS wrapper

hermes service status
hermes service start
hermes service uninstall

# Documented lock + PID semantics
~/.hermes/gateway/.supervisor.lock         # advisory; two supervisors detect each other
~/.hermes/gateway/<profile>.pid            # stale-PID reclaim built in
~/.hermes/gateway/<profile>.adopt.log      # per-profile orphan-adoption audit
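
For discussion purposes, the proposed flags could be parsed along these lines. This is a hypothetical argparse sketch of the surface proposed above, not existing Hermes code: `build_parser` and `profile_list` are illustrative names, and the defaults (2.5 s stagger, flags off unless given) are assumptions drawn from the example invocation.

```python
import argparse
from typing import List


def profile_list(value: str) -> List[str]:
    """Parse 'a,b,c' into a list of unique, non-empty profile names."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    if not names:
        raise argparse.ArgumentTypeError("expected at least one profile name")
    if len(set(names)) != len(names):
        raise argparse.ArgumentTypeError("duplicate profile name")
    return names


def build_parser() -> argparse.ArgumentParser:
    """Parser for the proposed multi-profile flags (hypothetical)."""
    parser = argparse.ArgumentParser(prog="hermes gateway")
    parser.add_argument("--multi-profile", type=profile_list, default=None,
                        metavar="A,B,C",
                        help="comma-separated profiles; omit for single-profile")
    parser.add_argument("--stagger", type=float, default=2.5,
                        help="seconds to wait between consecutive spawns")
    parser.add_argument("--single-watchdog", action="store_true",
                        help="one supervisor owns all listed profiles")
    parser.add_argument("--adopt-orphan", action="store_true",
                        help="respawn only the profile that died")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Leaving `--multi-profile` unset keeps the single-profile path unchanged, which matches the backward-compatibility goal stated above.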

Roughly 600–800 LoC of cross-platform Python + a tested Scheduled Task / launchd / systemd-user adapter. Already running in production; the upstream port is mostly extracting the right abstractions and adding the multi-OS launcher matrix.

Why this matters as a system feature (not just an ops trick)

The orchestrator proposals (#14009, #18493, #31392) talk about agent-to-agent task relay — that's the application layer. This is the process layer underneath: agents can't relay tasks if their gateways aren't reliably alive.

The combination of:

multi-profile supervision (this issue),
multi-agent task relay (#31392),
shared memory governance (#31388),
skill scheduling (#31399),
output compression (#31400),
subprocess execution bridge (#31385),
write-intent registry (#31401),

...is what made it possible to run three profiles 24/7 without me sitting at the keyboard. Each one alone is a P3 feature; the assembled stack is a multi-agent OS for project work. The supervision layer is the kernel that keeps it running.

Questions before opening a PR

In scope? Does multi-profile + Windows-native lifecycle belong in the hermes CLI core, or is the right answer "out-of-tree hermes-fleet extension"?
Adopt-orphan semantics conflict? Is there an existing process-supervision model in Hermes (subagent isolation, supervisor abstractions) that this would interact with strangely?
Service backend matrix. Windows Scheduled Task + macOS launchd + Linux systemd-user covers the Big 3. Acceptable scope, or want to start narrower (Windows-only Phase 1)?
Naming. hermes gateway --multi-profile vs hermes fleet vs hermes supervisor?

Happy to open a draft PR if there's interest, or stage as a series (Phase 1: supervisor + adopt-orphan + stagger; Phase 2: Windows service install; Phase 3: launchd / systemd-user) if the surface is too wide for a single review.

cc @alt-glitch / @teknium1. Related: #31385 / #31388 / #31392 / #31399 / #31400 / #31401 / #31421.
