hermes - 💡(How to fix) Fix [Feature]: Use -z (oneshot) mode for kanban worker spawn instead of chat -q [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#28992Fetched 2026-05-20 04:00:42
View on GitHub
Comments
0
Participants
1
Timeline
6
Reactions
0
Author
Participants
Timeline (top)
labeled ×3cross-referenced ×1mentioned ×1subscribed ×1

Error Message

| Provider failure | Graceful — catches exception, prints message, exits 0 | RuntimeError crash, exits 1 | | Provider error handling | Graceful | RuntimeError | | Traceback on failure | No | Yes |

Root Cause

PTY allocation in the subprocess. The kanban dispatcher could allocate a PTY for each worker. This would solve the TTY issue but adds significant complexity (PTY lifecycle management, signal forwarding) and doesn't address the approval-hang root cause.

Fix Action

Fix / Workaround

Kanban workers dispatched via _default_spawn() in kanban_db.py use chat -q mode under the hood — a subprocess spawned with stdin=subprocess.DEVNULL, stdout/stderr to a log file, no TTY available.

PTY allocation in the subprocess. The kanban dispatcher could allocate a PTY for each worker. This would solve the TTY issue but adds significant complexity (PTY lifecycle management, signal forwarding) and doesn't address the approval-hang root cause.

Subprocess cleanup: The kanban dispatcher detects crashes via the returned PID (reaped with WNOHANG on the next tick) and the claim TTL. Neither chat -q's extra signal handlers nor -z's absence of them changes the cleanup path.

Code Example

flowchart TD
    A[argparse: cmd_chat]
    A --> B[cli.py:main]
    B --> C[HermesCLI.__init__SessionDB]
    C --> D[_ensure_runtime_credentials]
    D --> E[_init_agent → AIAgent]
    E --> F[agent.run_conversation]
    F --> G[print response to stdout]
    F --> H[session_id to stderr]
    F --> I[sys.exit]

---

flowchart TD
    A[argparse: run_oneshot]
    A --> B[logging.disable CRITICAL]
    B --> C[HERMES_YOLO_MODE=1 + HERMES_ACCEPT_HOOKS=1]
    C --> D[redirect stdout + stderr to /dev/null]
    D --> E[_run_agent]
    E --> F[load_config + resolve runtime]
    F --> G[_create_session_db_for_oneshot]
    G --> H[AIAgent with _oneshot_clarify_callback]
    H --> I[agent.chat]
    I --> J[real_stdout: final response only]
    I --> K[return 0]
RAW_BUFFERClick to expand / collapse

Problem or Use Case

Kanban workers dispatched via _default_spawn() in kanban_db.py use chat -q mode under the hood — a subprocess spawned with stdin=subprocess.DEVNULL, stdout/stderr to a log file, no TTY available.

This works in most cases, but has a latent risk: chat -q keeps interactive approval callbacks active (dangerous-command approval, shell-hook first-use, sudo password, clarify prompts). When one of these fires in a headless subprocess with DEVNULL stdin, the worker hangs indefinitely — it's waiting for input that will never arrive. The task eventually gets auto-blocked after failure_limit consecutive timeout deaths.

PR #23851 proposed swapping to -z (oneshot) mode, which auto-bypasses approvals via HERMES_YOLO_MODE=1 and HERMES_ACCEPT_HOOKS=1, replacing interactive callbacks with synthetic "pick a default" responders.

That PR was closed with the following reasoning from @teknium1:

The change from chat -q to -z is tiny but behavior-changing in a way that needs a maintainer decision: -z and -q have different lifecycle semantics for worker session handling. Main still uses chat -q deliberately. If you want to revive this, please open an issue first laying out which subprocess-cleanup or output-buffering behavior -z would buy us that -q doesn't.

This issue provides that analysis.

Proposed Solution

Replace chat -q with -z (oneshot) in _default_spawn(), based on evidence that:

1. Both modes have equivalent session lifecycle. Both create a SessionDB, wire it into AIAgent.__init__, and persist the conversation. The -z path's _create_session_db_for_oneshot() is lightweight but functionally identical for the kanban use case.

2. Both modes work without a TTY. Neither crashes with hermes-tui: no TTY — that check only fires when --tui is explicitly used. The chat -q path never enters the Ink TUI; it uses the prompt_toolkit REPL only in interactive run() mode, not single-query mode.

3. -z is strictly better in headless contexts:

  • Auto-bypasses approvals (no hang risk)
  • 2-3x faster startup (skips HermesCLI banner, tool enumeration, exit summary)
  • 63% smaller log output (silences stdlib loggers, no session footer)
  • Session persistence is identical

Alternatives Considered

Keep chat -q but add HERMES_YOLO_MODE=1 to the spawn env. This is a lighter-touch fix. Discussed in the PR thread. It addresses the hang risk but doesn't capture the startup speed or log clarity benefits. Either approach is acceptable — the key is closing the hang vector.

PTY allocation in the subprocess. The kanban dispatcher could allocate a PTY for each worker. This would solve the TTY issue but adds significant complexity (PTY lifecycle management, signal forwarding) and doesn't address the approval-hang root cause.

Detect non-TTY stdin in chat -q and auto-enable HERMES_YOLO_MODE. This would make chat -q self-scoping, but it's a behavioral change to a widely-used flag that could surprise existing users.

Testing Methodology

Reproducible Docker test. Container: python:3.11-slim + Hermes Agent v0.14.0. Mock server: OpenAI-compatible HTTP echo on 127.0.0.1:9999. Both modes were spawned as subprocesses with stdin=subprocess.DEVNULL, stdout/stderr to log file — identical to _default_spawn().

Evidence

chat -q code path (current _default_spawn):

flowchart TD
    A[argparse: cmd_chat]
    A --> B[cli.py:main]
    B --> C[HermesCLI.__init__ — SessionDB]
    C --> D[_ensure_runtime_credentials]
    D --> E[_init_agent → AIAgent]
    E --> F[agent.run_conversation]
    F --> G[print response to stdout]
    F --> H[session_id to stderr]
    F --> I[sys.exit]

-z / oneshot code path (proposed):

flowchart TD
    A[argparse: run_oneshot]
    A --> B[logging.disable CRITICAL]
    B --> C[HERMES_YOLO_MODE=1 + HERMES_ACCEPT_HOOKS=1]
    C --> D[redirect stdout + stderr to /dev/null]
    D --> E[_run_agent]
    E --> F[load_config + resolve runtime]
    F --> G[_create_session_db_for_oneshot]
    G --> H[AIAgent with _oneshot_clarify_callback]
    H --> I[agent.chat]
    I --> J[real_stdout: final response only]
    I --> K[return 0]

Key differences between the two paths:

Aspectchat -q-z
Approval handlingInteractive callbacks active — can hang on DEVNULL stdinYOLO_MODE=1, ACCEPT_HOOKS=1, synthetic clarify — no hang
Startup overheadFull HermesCLI init (banner, tool enumeration, session DB)Minimal — logging silenced, no banner, no TUI init
OutputResponse + banner + session footer to stdout/stderrOnly final response to stdout, everything else to /dev/null
StderrSession ID, warnings, progress linesSilenced entirely
Signal handlersSIGTERM/SIGHUP → agent.interrupt()None (kanban uses PID detection + claim TTL)
Provider failureGraceful — catches exception, prints message, exits 0RuntimeError crash, exits 1

Docker test results

Metricchat -q-z (oneshot)
Exit code00
No TTY crash
Session persisted
Elapsed (basic)5.61s2.03s
Log size (kanban)1408b520b
Approval hang risk⚠ YesNone
Provider error handlingGracefulRuntimeError
Traceback on failureNoYes

Addressing the lifecycle concern

The concern was: do -z and -q have different lifecycle semantics for worker session handling?

Session lifecycle is identical. Both paths call the same AgentDB / SessionDB constructor and pass the session to AIAgent.__init__. The -z path's session initialization is in _create_session_db_for_oneshot() (oneshot.py:202-215) and chat -q's is in HermesCLI.__init__ (cli.py:2780-2792). Both use hermes_state.SessionDB().

Output buffering differs between modes, but in the kanban context all stdout/stderr goes to a log file (stdout=log_f, stderr=subprocess.STDOUT). The -z mode silences stderr and loggers, producing cleaner logs. This is a benefit, not a regression.

Subprocess cleanup: The kanban dispatcher detects crashes via the returned PID (reaped with WNOHANG on the next tick) and the claim TTL. Neither chat -q's extra signal handlers nor -z's absence of them changes the cleanup path.

The real lifecycle difference is approval handling, and it's a reason to switch: -z is safe for headless subprocesses, chat -q is not.

Feature Type

  • CLI improvement
  • Performance / reliability

Scope

  • Small (single file, ~2 lines changed in kanban_db.py + test assertion updates)

Contribution

I'd like to implement this myself and submit a PR.


Filed by Jasper (AI agent on behalf of Magnus Hedemark). Analysis — including Docker reproduction methodology, code path tracing, and side-by-side testing — was performed in collaboration with Magnus. Reproducible test suite: https://gist.github.com/magnus919/e3972d3cd2bcb2eed460a83c0ec3f9dc

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix [Feature]: Use -z (oneshot) mode for kanban worker spawn instead of chat -q [1 participants]