hermes - ✅ (Solved) Add deployment-aware gateway/scheduler health status [1 pull request, 1 participant]

GitHub stats
NousResearch/hermes-agent#18641 · Fetched 2026-05-03 04:55:17
Comments: 0 · Participants: 1 · Timeline: 5 · Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×1

Error Message

  • Operators can run one command and get machine-readable health with explicit ok, warn, critical, or unknown state.

Root Cause

Hermes cron jobs depend on the gateway process because the scheduler runs inside the gateway. In containerized or sandboxed deployments, hermes cron status can be incomplete or misleading across PID namespaces, so operators may not know whether the scheduler is actually alive unless they separately inspect Docker/systemd/service state, gateway logs, recent cron run timestamps, and local health artifacts.

Fix Action

Fixed

PR fix notes

PR #18640: docs: clarify gateway scheduler health checks

Description (problem / solution / changelog)

Summary

  • Adds gateway troubleshooting guidance that cron jobs depend on gateway/scheduler liveness.
  • Warns that hermes cron status can be incomplete or misleading across Docker/sandbox PID namespaces.
  • Regenerates the bundled Hermes Agent skill website page from the source skill.

Validation

  • python website/scripts/generate-skill-docs.py — passed.
  • python /work/.hermes-data/skills/software-development/requesting-code-review/scripts/static_scan_diff.py --cached — passed.
  • HERMES_TDD_EVIDENCE="N/A docs-only change; validation: python website/scripts/generate-skill-docs.py; static_scan_diff.py --cached; independent reviewer passed" /work/.hermes-data/scripts/code_work_guard.py --mode final — passed.
  • Independent reviewer passed; no security concerns or blocking issues. Suggested wording precision was applied.

Notes

  • Runtime watchdog scripts and local health reports were intentionally not included; they are environment-specific local operations artifacts.
  • Strict RED/GREEN TDD is not applicable for this docs-only change.

Follow-up

A viable follow-up issue remains: add a deployment-aware JSON health/status surface for gateway + cron scheduler liveness. Filed follow-up issue: https://github.com/NousResearch/hermes-agent/issues/18641

Changed files

  • .devlogs/2026-05-02-gateway-health-docs.md (added, +85/-0)
  • .hermes/plans/2026-05-02-gateway-health-docs.md (added, +35/-0)
  • skills/autonomous-ai-agents/hermes-agent/SKILL.md (modified, +2/-0)
  • website/docs/user-guide/skills/bundled/autonomous-ai-agents/autonomous-ai-agents-hermes-agent.md (modified, +2/-0)

Problem / Opportunity

Hermes cron jobs depend on the gateway process because the scheduler runs inside the gateway. In containerized or sandboxed deployments, hermes cron status can be incomplete or misleading across PID namespaces, so operators may not know whether the scheduler is actually alive unless they separately inspect Docker/systemd/service state, gateway logs, recent cron run timestamps, and local health artifacts.

This came up while stabilizing a Hermes environment where the gateway was down, cron jobs existed but were not firing, and a local sandbox process could make status checks ambiguous relative to the dedicated gateway container.

Goal

Provide a first-class, deployment-aware gateway/scheduler health signal that operators and external monitors can trust.

Proposed System

  • Add or improve a command such as hermes gateway status --json or hermes health --json that reports:
    • gateway process/container/service identity when discoverable
    • scheduler liveness
    • configured cron job count
    • most recent scheduler tick / job run timestamp
    • stale/missing health artifact warnings when configured
    • clear uncertainty when process visibility is limited by PID namespaces
  • Document recommended Docker/systemd healthcheck integration.
  • Make status output distinguish:
    • verified alive
    • verified down
    • unknown/inconclusive due to namespace/supervisor boundary
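The three-way distinction in the list above can be sketched as a small classifier. This is a minimal sketch under stated assumptions, not Hermes code: `Liveness`, `classify`, and both parameters are hypothetical names, and it assumes the caller has some way to know whether its view of the process table is authoritative for the gateway.

```python
from enum import Enum


class Liveness(Enum):
    ALIVE = "verified_alive"
    DOWN = "verified_down"
    UNKNOWN = "unknown"


def classify(process_found: bool, view_is_authoritative: bool) -> Liveness:
    """Classify gateway liveness from a process lookup.

    view_is_authoritative is a hypothetical flag: True only when the caller
    shares the gateway's PID namespace or asked its supervisor directly.
    """
    if not view_is_authoritative:
        # A hit could be an unrelated local sandbox process, and a miss could
        # be a gateway running in another container, so neither is proof.
        return Liveness.UNKNOWN
    return Liveness.ALIVE if process_found else Liveness.DOWN
```

The design choice here is that an inconclusive view never collapses into "alive" or "down"; the caller has to surface the uncertainty.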

Initial Scope

In scope:

  • A reliable JSON health/status surface for gateway + cron scheduler liveness.
  • Documentation for outside-in monitoring in Docker/systemd/containerized deployments.
  • Tests or fixtures covering inconclusive PID namespace/service-state scenarios where feasible.

Out of scope:

  • Building a full monitoring system.
  • Hosting SaaS alerts.
  • Changing cron job execution semantics.

Evidence to Capture

  • Current implementation path for gateway status and cron status.
  • Gateway logs and scheduler state artifacts.
  • Docker/systemd process-discovery behavior, especially when Hermes runs inside a container or sandbox.

Acceptance Criteria

  • Operators can run one command and get machine-readable health with explicit ok, warn, critical, or unknown state.
  • The command does not report a false healthy state when the gateway process is unreachable from the current PID namespace.
  • Documentation recommends external healthchecks for gateway/scheduler liveness.
  • Tests cover status rendering for alive, down, and inconclusive states.
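One way to read these criteria is as a mapping from liveness plus scheduler tick age to the four states. The sketch below is illustrative only: `render_health`, the `stale_after_s` threshold, and the output keys are assumptions, since the issue leaves the canonical schema open.

```python
import json
from typing import Optional


def render_health(liveness: str, tick_age_s: Optional[float],
                  stale_after_s: float = 120.0) -> str:
    """Render a machine-readable health report as a JSON string.

    liveness is one of "alive", "down", "unknown" (hypothetical values).
    tick_age_s is seconds since the last scheduler tick, or None if unrecorded.
    """
    warnings = []
    if liveness == "unknown":
        # Never report healthy when the gateway cannot be verified.
        status = "unknown"
        warnings.append("gateway not verifiable from this PID namespace")
    elif liveness == "down":
        status = "critical"
    elif liveness == "alive":
        if tick_age_s is None:
            status = "warn"
            warnings.append("no scheduler tick recorded")
        elif tick_age_s > stale_after_s:
            status = "warn"
            warnings.append("scheduler tick is stale")
        else:
            status = "ok"
    else:
        raise ValueError(f"unexpected liveness value: {liveness!r}")
    return json.dumps({"status": status, "warnings": warnings}, sort_keys=True)
```

Because the function is pure, rendering tests for the alive, down, and inconclusive cases need no running gateway.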

Open Questions

  • Should this live under hermes gateway status --json, hermes cron status --json, hermes doctor, or a new hermes health command?
  • Should Hermes persist scheduler heartbeat/tick state to disk so status can avoid relying only on process discovery?
  • What is the canonical status schema for external monitors?
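To make the heartbeat question concrete, here is a minimal sketch of persisting a scheduler tick to disk and reading its age back. Nothing here is existing Hermes behavior: the function names, the file layout, and the `last_tick` key are all hypothetical.

```python
import json
import os
import tempfile
import time
from typing import Optional


def write_heartbeat(path: str, now: Optional[float] = None) -> None:
    """Atomically record the latest scheduler tick (hypothetical file format)."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".heartbeat-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"last_tick": now}, f)
        # os.replace swaps the file in one step, so readers never see a
        # half-written heartbeat.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def heartbeat_age(path: str, now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the last tick, or None if missing or unreadable."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            return now - float(json.load(f)["last_tick"])
    except (OSError, ValueError, KeyError, TypeError):
        return None
```

A status command that reads such a file through a shared volume would not depend on seeing the gateway process at all, which is the point of the question above.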

extent analysis

TL;DR

Implement a deployment-aware hermes health command to provide a reliable JSON health/status surface for gateway and cron scheduler liveness.

Guidance

  • Introduce a new hermes health command with a JSON output to report gateway process identity, scheduler liveness, and cron job status.
  • Document recommended Docker/systemd healthcheck integration so external monitors can check gateway/scheduler liveness from outside the container.
  • Develop tests to cover scenarios where process visibility is limited by PID namespaces, ensuring the command accurately reports unknown or inconclusive states.
  • Consider persisting scheduler heartbeat/tick state to disk to improve status accuracy and reduce reliance on process discovery.

Example

{
  "gateway": {
    "status": "ok",
    "identity": "hermes-gateway-123"
  },
  "scheduler": {
    "liveness": true,
    "last_tick": "2023-02-20T14:30:00Z"
  },
  "cron_jobs": {
    "count": 5,
    "last_run": "2023-02-20T14:25:00Z"
  },
  "health": {
    "status": "ok",
    "warnings": []
  }
}
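An external monitor could consume a report shaped like the example above and turn it into an exit code. This is a sketch only: `exit_code_for` is a hypothetical helper, and the `health.status` field comes from the illustrative example, not from a confirmed Hermes schema.

```python
import json


def exit_code_for(report_text: str) -> int:
    """Return 0 only when the report explicitly says health.status == "ok".

    Malformed output, missing fields, and every other state (warn, critical,
    unknown) map to 1, so an unverifiable gateway is never reported healthy.
    """
    try:
        report = json.loads(report_text)
        status = report["health"]["status"]
    except (ValueError, KeyError, TypeError):
        return 1
    return 0 if status == "ok" else 1
```

Docker's HEALTHCHECK treats exit code 0 as healthy and 1 as unhealthy, so a wrapper script that exits with this value would fit that convention.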

Notes

The proposed solution focuses on providing a reliable health signal for the gateway and scheduler. However, the canonical status schema for external monitors and the persistence of scheduler heartbeat state require further discussion and clarification.

Recommendation

Introduce a new hermes health command (or an equivalent --json status surface) that reports gateway and scheduler liveness explicitly, including an unknown state when process visibility is limited. This addresses the gap that hermes cron status alone can be incomplete or misleading across PID namespaces.
