hermes: Dynamic s6 gateway stop records want down but leaves live child running

Summary

When gateway-default is registered as a dynamic s6 service, both hermes gateway stop and raw /command/s6-svc -d /run/service/gateway-default return successfully and record want down, but the live gateway child remains running under the same s6-supervise gateway-default process.

The strongest diagnostic evidence is that S6ServiceManager().unregister_profile_gateway("default") removes /run/service/gateway-default, after which s6-svstat can no longer observe the service, but the same supervisor and child remain alive. The supervisor continues through deleted service-directory handles:

cwd=/run/service/gateway-default (deleted)
3 -> /run/service/gateway-default/supervise/lock (deleted)
4 -> /run/service/gateway-default/supervise/control (deleted)
5 -> /run/service/gateway-default/supervise/control (deleted)
6 -> anon_inode:[signalfd]

This creates an unobservable-but-still-running dynamic gateway state.
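The deleted-handle state above can be detected programmatically. The following is a minimal sketch, assuming a Linux /proc filesystem and sufficient permission to read the target process; the helper names `deleted_targets` and `read_proc_links` are hypothetical and not part of Hermes.

```python
import os
from typing import Dict, List


def deleted_targets(links: Dict[str, str]) -> List[str]:
    """Return the link names whose target carries the kernel's ' (deleted)' suffix."""
    return sorted(
        name for name, target in links.items() if target.endswith(" (deleted)")
    )


def read_proc_links(pid: int) -> Dict[str, str]:
    """Collect the cwd and fd symlink targets of a process from /proc (Linux only)."""
    base = f"/proc/{pid}"
    links = {"cwd": os.readlink(f"{base}/cwd")}
    for fd in os.listdir(f"{base}/fd"):
        try:
            links[fd] = os.readlink(f"{base}/fd/{fd}")
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return links


# Snapshot copied from the report; a live check would call read_proc_links(pid).
snapshot = {
    "cwd": "/run/service/gateway-default (deleted)",
    "3": "/run/service/gateway-default/supervise/lock (deleted)",
    "4": "/run/service/gateway-default/supervise/control (deleted)",
    "5": "/run/service/gateway-default/supervise/control (deleted)",
    "6": "anon_inode:[signalfd]",
}
orphaned = deleted_targets(snapshot)
```

A non-empty result for a supervisor PID would indicate that the supervisor has outlived its service directory.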

Environment

  • Hermes pinned release: v2026.5.29
  • Pinned fixture commit: e71a2bd11b733f3be7cf99deafde0066c343d462
  • Current-upstream executable comparison previously reproduced the child-exit failure at 32899279a744805350be891ccf3ae08289efc702.
  • Later P2/P3 comparison reproduced the same failure shape at 8f4c8e7c8297ffe0d11914e761cd0e738ab05b0d.
  • Issue-draft-time main was observed at 1fc7bdc5e64e052bc61d3ddb9e6f96cf6c7461dc, but that moving ref has not yet been smoke-tested.
  • Pre-submission main spot-check observed 04bb74c58eff5ac972e31bcf2fa2c7c7aaf5105b; that later moving ref has also not been smoke-tested.

The reproducer uses a slim Docker fixture with s6-overlay. It does not mount production state, auth stores, host Docker socket, or host ports.

Expected behavior

One of these should happen when a live gateway-default child is stopped via Hermes' intended s6 path:

  1. hermes gateway stop and raw /command/s6-svc -d terminate the live child;
  2. or they return a visible failure/timeout when the child remains alive.

Silent success with want down plus a live child should not be reported as a successful stop.
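The second option, a visible failure when the child survives, could look roughly like the sketch below. This is an illustration only, not the Hermes implementation: `stop_and_verify`, `StopTimeout`, and both callables are hypothetical names, and the timeout values are arbitrary.

```python
import time
from typing import Callable


class StopTimeout(RuntimeError):
    """Raised when the child is still alive after the stop deadline."""


def stop_and_verify(
    send_stop: Callable[[], None],
    child_alive: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> None:
    """Send stop intent, then poll until the child exits or the deadline passes."""
    send_stop()  # hypothetical hook, e.g. a wrapper around `s6-svc -d`
    deadline = time.monotonic() + timeout
    while child_alive():
        if time.monotonic() >= deadline:
            raise StopTimeout(f"child still alive {timeout:.1f}s after stop intent")
        time.sleep(interval)
```

Injecting `child_alive` as a callable keeps the liveness check (PID probe, s6-svstat parse, or something else) a separate decision from the timeout contract.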

Actual behavior

Both stop paths return successfully:

hermes gateway stop -> ok_child_alive
/command/s6-svc -d /run/service/gateway-default -> ok_child_alive

The service state changes to:

up (...), normally down, want down

but the same gateway child PID remains alive under the same s6-supervise gateway-default process.
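The mismatch between recorded intent and actual state can be read off the status line itself. The sketch below assumes the comma-separated shape shown above ("up (...), normally down, want down"); `SvState` and `parse_svstat` are hypothetical names, and real s6-svstat output may carry additional fields that this parser ignores.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SvState:
    up: bool
    normally_down: bool
    want_down: bool

    @property
    def stop_pending(self) -> bool:
        """True when stop intent is recorded but the service still reports up."""
        return self.up and self.want_down


def parse_svstat(line: str) -> SvState:
    """Parse one status line of the form 'up (...), normally down, want down'."""
    text = line.strip()
    fields = [field.strip() for field in text.split(",")]
    return SvState(
        up=re.match(r"up\b", text) is not None,
        normally_down="normally down" in fields,
        want_down="want down" in fields,
    )


state = parse_svstat("up (...), normally down, want down")
```

For the line reported in this issue, `state.stop_pending` is true, which is the condition a caller would need to treat as "not yet stopped".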

S6ServiceManager().unregister_profile_gateway("default") also returns ok and removes the service directory. After that, s6-svstat fails because the service directory is gone, but the same supervisor and child remain alive through deleted cwd/control/lock handles.

Reproducer

The investigation used these fixture-only smoke scripts:

deploy/docker-shadow/smoke-hermes-gateway-s6-child-exit.sh
deploy/docker-shadow/smoke-hermes-gateway-s6-p2-p3-compare.sh
deploy/docker-shadow/smoke-hermes-gateway-s6-unregister-teardown.sh
deploy/docker-shadow/smoke-hermes-gateway-s6-p4-supervisor.sh

Primary reproducer:

HOS_REPO_ROOT=/path/to/repo \
  deploy/docker-shadow/smoke-hermes-gateway-s6-child-exit.sh

Supervisor/fd diagnostic:

HOS_REPO_ROOT=/path/to/repo \
  deploy/docker-shadow/smoke-hermes-gateway-s6-p4-supervisor.sh

Both scripts build deploy/docker-shadow/Dockerfile.hermes-s6-fixture, which clones https://github.com/NousResearch/hermes-agent.git at a configurable HERMES_REF.

Key observations

  • gateway-default/run is generated as:

      #!/command/with-contenv sh
      set -e
      export HOME=/opt/data
      cd /opt/data
      . /opt/hermes/.venv/bin/activate
      export HERMES_S6_SUPERVISED_CHILD=1
      exec s6-setuidgid hermes hermes gateway run

  • There is no surviving shell wrapper between s6-supervise gateway-default and the Hermes gateway child.
  • /command/s6-svc -d /run/service/gateway-default reaches the supervisor, at least far enough for s6-svstat to report want down.
  • The same child remains parented to the same supervisor one second after stop intent.
  • s6-svwait -D exits with nonzero status because the service never goes down, but it does not terminate the child.
  • Removing or re-adding the persistent down marker did not affect child exit.
  • Unregister removes the service directory but leaves the supervisor and child alive through deleted service-directory handles.
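The parent/child relationship in these observations can be checked without ptrace by reading the child's /proc stat record, which is generally world-readable on Linux. This is a sketch under that assumption; `parse_ppid` and `still_parented` are hypothetical helpers, not part of the smoke scripts.

```python
def parse_ppid(stat_text: str) -> int:
    """Extract the parent PID from the contents of /proc/<pid>/stat."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the last ')' instead of splitting on whitespace.
    after_comm = stat_text.rsplit(")", 1)[1].split()
    # after_comm[0] is the process state, after_comm[1] is the parent PID.
    return int(after_comm[1])


def still_parented(child_pid: int, supervisor_pid: int) -> bool:
    """Return True if child_pid exists and its parent is supervisor_pid."""
    try:
        with open(f"/proc/{child_pid}/stat") as handle:
            return parse_ppid(handle.read()) == supervisor_pid
    except (FileNotFoundError, ProcessLookupError):
        return False  # the child has already exited
```

Sampling `still_parented` before and after stop intent gives the same invariant the report relies on, independent of the concrete PIDs in a given run.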

Known limitations

  • The issue-draft-time main ref 1fc7bdc5e64e052bc61d3ddb9e6f96cf6c7461dc is recorded for moving-target awareness, but it has not yet been smoke-tested.
  • The pre-submission main ref 04bb74c58eff5ac972e31bcf2fa2c7c7aaf5105b is recorded for moving-target awareness only and has not been smoke-tested.
  • The hardened fixture keeps --cap-drop ALL and does not add SYS_PTRACE. Because of that, the supervisor fd/cwd snapshot is complete, but the child fd/cwd snapshot returns PermissionError:13.
  • Process IDs vary between runs. The invariant is the relationship: the same child stays parented to the same s6-supervise gateway-default process.
  • The one-second post-stop sample does not prove no signal was ever sent. It proves that after stop intent, the same child and supervisor remain alive and s6-svstat reports only want down.

Requested maintainer guidance

I would appreciate guidance on the intended contract for dynamic gateway services:

  1. Should s6-svc -d terminate a currently live child when the service state is up but normally down because of a persistent down marker?
  2. Is it expected that unregister_profile_gateway() removes the service directory while the corresponding s6-supervise process and child remain alive?
  3. Should S6ServiceManager.stop() verify child exit and return failure if the process remains up after stop intent?
  4. Should unregister treat a still-running supervisor/child after service-dir removal as a failure rather than successful teardown?

Proposed next step

I can contribute a regression test that asserts live child exit, or at minimum asserts that S6ServiceManager.stop() does not silently report success when s6-svc -d records want down but the child remains alive.
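As a rough illustration of the weaker assertion, the sketch below uses a fake in place of the real supervisor. Everything here is hypothetical: `FakeSupervisedService` and `checked_stop` are stand-ins invented for this example and do not reflect the actual S6ServiceManager API.

```python
import unittest


class FakeSupervisedService:
    """Hypothetical stand-in that records stop intent but never exits."""

    def __init__(self) -> None:
        self.want_down = False
        self.child_alive = True

    def request_down(self) -> None:
        # Mirrors the reported failure shape: intent is recorded, child stays up.
        self.want_down = True


def checked_stop(service: FakeSupervisedService) -> bool:
    """Hypothetical stop wrapper: success only if the child is actually gone."""
    service.request_down()
    return service.want_down and not service.child_alive


class StopContractTest(unittest.TestCase):
    def test_want_down_with_live_child_is_not_success(self) -> None:
        service = FakeSupervisedService()
        self.assertFalse(checked_stop(service))
        self.assertTrue(service.want_down)
```

A real regression test would replace the fake with the Docker fixture and assert on the observed child PID rather than on a boolean flag.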

I do not want to propose a local force-kill fallback without maintainer input, because that may bypass intended s6 semantics rather than fix the dynamic service contract.
