hermes - ✅ (Solved) Kanban dispatcher should validate assignee profile readiness before spawning workers [4 pull requests, 1 participant]

GitHub stats
NousResearch/hermes-agent#20054
Fetched 2026-05-06 06:39:01
Comments
0
Participants
1
Timeline
8
Reactions
0
Timeline (top)
cross-referenced ×4, labeled ×3, closed ×1

Error Message

Kanban is a multi-profile orchestration primitive. A stale or half-created profile currently creates an opaque failure mode: the board says a task is assigned and dispatchable, but the worker cannot actually start correctly. Failing fast with a precise profile-readiness error would save debugging time and prevent dispatcher churn.


Fix Action

Fix / Workaround

Kanban task assignment treats assignee as a Hermes profile slug. If the profile directory exists but is incomplete or not runnable, the dispatcher can still claim the task and spawn hermes -p <profile> chat -q .... The failure then surfaces later as confusing worker/provider/auth behavior instead of a clear Kanban/profile-readiness diagnostic.

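A Kanban task assigned to debugger-v1 was still eligible for dispatch. The resulting worker startup path failed indirectly through provider/auth fallback behavior rather than reporting something like "assignee profile debugger-v1 is not runnable: missing config.yaml / credentials".

The current dispatch path appears to normalize the assignee but not validate profile readiness before claiming or spawning: _canonical_assignee() only calls normalize_profile_name(assignee), dispatch_once() claims ready assigned tasks before spawn, and _default_spawn() builds the hermes -p <profile> command without checking that the profile is runnable.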

PR fix notes

PR #20065: fix(kanban): validate worker profile before spawn

Description (problem / solution / changelog)

Summary

  • Add a cheap pre-spawn readiness guard for Kanban worker profiles
  • Fail fast when a non-default assignee profile does not exist or lacks config.yaml
  • Prevent _default_spawn from launching hermes -p <profile> for half-created profile directories
  • Update spawn-env tests to model runnable profiles explicitly

Fixes #20054

Scope

This intentionally checks deterministic local readiness only:

  • profile exists
  • profile has config.yaml

It does not attempt provider-specific credential validation; bad/expired credentials can still fail during worker startup and are reported through the existing spawn-failure path.
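
The scope above can be sketched as a small guard that runs before the worker command is built. This is a minimal sketch under stated assumptions, not the PR's actual code: `ProfileNotRunnable`, `ensure_profile_runnable`, `build_worker_cmd`, the `profiles_root` parameter, and the treatment of a profile named "default" as always runnable are hypothetical names and choices made for illustration. Only the two checks and the command shape come from the issue.

```python
from pathlib import Path


class ProfileNotRunnable(RuntimeError):
    """Raised when an assignee profile fails the local readiness checks (hypothetical name)."""


def ensure_profile_runnable(profiles_root: Path, profile: str) -> None:
    # Assumption: the default profile is the Hermes home itself, so it is not checked here.
    if profile == "default":
        return
    profile_dir = profiles_root / profile
    if not profile_dir.is_dir():
        raise ProfileNotRunnable(f"Profile {profile} is not runnable: missing {profile_dir}")
    config = profile_dir / "config.yaml"
    if not config.is_file():
        raise ProfileNotRunnable(f"Profile {profile} is not runnable: missing {config}")


def build_worker_cmd(profiles_root: Path, profile: str, prompt: str) -> list[str]:
    # Check readiness first so a half-created profile never reaches the subprocess call.
    ensure_profile_runnable(profiles_root, profile)
    return ["hermes", "-p", profile, "--skills", "kanban-worker", "chat", "-q", prompt]
```

Raising before the command list exists keeps the check deterministic and cheap. Credential problems are deliberately left to the existing spawn-failure path, as the scope note says.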

Test Plan

  • venv/bin/python -m pytest tests/hermes_cli/test_kanban_boards.py::TestWorkerSpawnEnv::test_default_spawn_rejects_half_created_profile -q -o 'addopts=' — watched fail before implementation, then pass
  • venv/bin/python -m pytest tests/hermes_cli/test_kanban_boards.py::TestWorkerSpawnEnv -q -o 'addopts=' → 3 passed
  • venv/bin/python -m pytest tests/hermes_cli/test_kanban_boards.py tests/hermes_cli/test_kanban_db.py tests/hermes_cli/test_kanban_cli.py -q -o 'addopts=' → 124 passed
  • venv/bin/python -m pytest tests/tools/test_kanban_tools.py tests/plugins/test_kanban_dashboard_plugin.py -q -o 'addopts=' → 89 passed, 2 unrelated deprecation warnings
  • venv/bin/python -m py_compile hermes_cli/kanban_db.py tests/hermes_cli/test_kanban_boards.py tests/hermes_cli/test_kanban_db.py → passed

Changed files

  • hermes_cli/kanban_db.py (modified, +30/-0)
  • tests/hermes_cli/test_kanban_boards.py (modified, +39/-0)
  • tests/hermes_cli/test_kanban_db.py (modified, +6/-0)

PR #20067: fix(cli): validate assignee profile readiness before kanban dispatch

Description (problem / solution / changelog)

Validate that the assignee profile is runnable (directory exists and contains config.yaml) before the Kanban dispatcher claims a task and spawns hermes -p <profile> chat -q .... Half-created profile directories now fail fast with a precise diagnostic instead of cascading into opaque provider/auth errors.

What changed and why

  • Add profile_readiness_error(name) in hermes_cli/profiles.py that returns None for runnable profiles or a diagnostic string identifying what's missing (invalid name / dir missing / config.yaml missing). The default profile is always runnable since it IS HERMES_HOME.
  • In hermes_cli/kanban_db.py::dispatch_once, validate the assignee profile before claiming for the default-spawn path. Unrunnable profiles are claimed and auto-blocked immediately via the existing _record_spawn_failure circuit breaker with failure_limit=1 — readiness errors don't fix themselves between ticks, so retrying N times before blocking is wasted churn.
  • The gate is skipped when a custom spawn_fn is passed (tests, simulators, alternate worker hosts) so existing assignee semantics are preserved.
  • Updated test_workspace_resolution_failure_also_counts to seed a runnable worker profile so the new readiness gate doesn't pre-empt the workspace-resolution test path.
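
The readiness check and the failure_limit=1 auto-block described above might look roughly like the following. It is a hedged sketch: the `Task` dataclass, `record_spawn_failure`, `gate_task`, the name pattern, and the extra `profiles_root` parameter are assumptions for illustration. The real `profile_readiness_error(name)` and `_record_spawn_failure` signatures are not shown in the PR description.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Assumption: a conservative slug pattern; the real rules live in hermes_cli/profiles.py.
_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")


def profile_readiness_error(profiles_root: Path, name: str) -> Optional[str]:
    """Return None if the profile looks runnable, else a short diagnostic string."""
    if name == "default":
        return None  # the default profile is the Hermes home itself
    if not _NAME_RE.match(name):
        return f"invalid profile name: {name!r}"
    profile_dir = profiles_root / name
    if not profile_dir.is_dir():
        return f"profile directory missing: {profile_dir}"
    if not (profile_dir / "config.yaml").is_file():
        return f"config.yaml missing: {profile_dir / 'config.yaml'}"
    return None


@dataclass
class Task:
    id: str
    assignee: str
    status: str = "ready"
    spawn_failures: int = 0
    last_spawn_error: Optional[str] = None


def record_spawn_failure(task: Task, error: str, failure_limit: int) -> None:
    # Circuit breaker: block the task once it has failed failure_limit times.
    task.spawn_failures += 1
    task.last_spawn_error = error
    task.status = "blocked" if task.spawn_failures >= failure_limit else "ready"


def gate_task(task: Task, profiles_root: Path) -> bool:
    """Return True if the task may proceed to spawn; otherwise block it immediately."""
    error = profile_readiness_error(profiles_root, task.assignee)
    if error is None:
        return True
    # failure_limit=1: a missing config.yaml will not fix itself between ticks.
    record_spawn_failure(task, f"profile_not_runnable: {error}", failure_limit=1)
    return False
```

Returning a string rather than raising lets the dispatcher store the reason in last_spawn_error without exception handling at every call site.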

How to test

  • pytest tests/hermes_cli/test_profiles.py::TestProfileReadinessError -q
  • pytest tests/hermes_cli/test_kanban_db.py -q -k dispatch
  • pytest tests/hermes_cli/test_kanban_core_functionality.py -q
  • New test_dispatch_blocks_unrunnable_assignee_profile reproduces the issue scenario (profile dir with SOUL.md but no config.yaml) and asserts the task ends in blocked with a profile_not_runnable: ... reason in last_spawn_error.
  • New test_dispatch_allows_runnable_assignee_profile confirms a profile with config.yaml passes the gate and reaches the spawn path.

What platforms tested on

  • macOS on darwin-arm64 (local) — full tests/hermes_cli/test_kanban*.py and tests/hermes_cli/test_profiles.py pass (347 tests).

Fixes #20054


Changed files

  • hermes_cli/kanban_db.py (modified, +25/-0)
  • hermes_cli/profiles.py (modified, +34/-0)
  • tests/hermes_cli/test_kanban_core_functionality.py (modified, +7/-0)
  • tests/hermes_cli/test_kanban_db.py (modified, +58/-0)
  • tests/hermes_cli/test_profiles.py (modified, +46/-0)

PR #20105: fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile

Description (problem / solution / changelog)

Summary

The kanban dispatcher's _default_spawn invokes hermes -p <task.assignee> chat -q .... When assignee names a control-plane lane (e.g. an interactive Claude Code terminal like orion-cc / orion-research) instead of a real Hermes profile, the subprocess fails on startup with Profile 'X' does not exist, gets reaped as a zombie, the TTL/crash detector reclaims the task back to ready, and the next tick re-spawns the same crashing worker.

Result: a permanent crash loop emitting spawned=N reclaimed=0 crashed=N in the gateway log every minute, two zombie processes per affected task, and CPU burn until someone notices.

Reproduce

# 1. Create a kanban task whose assignee names a non-profile.
hermes kanban create --assignee orion-cc --status ready \
    --title "Review PR #N" --body "..."
# 2. Start the gateway with the embedded dispatcher.
hermes gateway run

# gateway.log emits every minute:
#   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...
# Per-task log /home/<u>/.hermes/<profile>/kanban/logs/<task_id>.log:
#   Error: Profile 'orion-cc' does not exist. Create it with:
#       hermes profile create orion-cc
# ps -ef | grep '[h]ermes.*defunct' — zombies pile up until reaped.

Fix

dispatch_once() now pre-checks hermes_cli.profiles.profile_exists(assignee) before claiming. If the profile does NOT exist, the row is appended to skipped_unassigned (semantically: it's unassigned to an executable profile) and the dispatcher moves on without claiming, spawning, or counting a crash.

The import is locally scoped + try/except wrapped, so if profile_exists is missing or fails to import (test isolation, future module restructure) the original behaviour is preserved unchanged.
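
The defensive import described above can be illustrated with a small helper. This is a sketch under assumptions rather than the patch itself: `load_profile_check` and `should_dispatch` are hypothetical names, and the module and attribute are passed as parameters only so the fallback can be exercised without Hermes installed.

```python
import importlib
from typing import Callable, Optional


def load_profile_check(
    module_name: str = "hermes_cli.profiles",
    attr: str = "profile_exists",
) -> Optional[Callable[[str], bool]]:
    """Return the checker if it can be imported, else None (caller keeps legacy behaviour)."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except Exception:
        # Missing module, renamed attribute, or import-time failure: do not gate dispatch.
        return None


def should_dispatch(assignee: str, check: Optional[Callable[[str], bool]]) -> bool:
    if check is None:
        return True  # fall through to the original, unguarded behaviour
    return bool(check(assignee))
```

The trade-off is that a broken install silently loses the guard, so the dispatcher behaves exactly as it did before the patch instead of refusing to dispatch at all.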

Why profile-existence over a config flag

The kanban task body (t_2bab06e3 on Brecht-H's local kanban) hinted at gating behind a config flag like assignee=hermes|auto. Profile-existence is a strictly tighter check:

  • Self-documenting — the operator already knows whether they have an orion-cc profile; no allowlist to maintain.
  • Forward-compatible — the moment a new lane gets a real hermes profile create <name>, it auto-qualifies for spawn.
  • No new config surface — zero new keys in config.yaml.

Operators who want the "config flag" semantics can still opt in via creating an empty placeholder profile.

Validated live (Orion machine)

Two orion-research-assigned tasks (t_a14dc1d5 Bug-C investigation, t_646c96f2 provider-routing validation) had been crash-looping since 2026-05-05 06:58 UTC after Mac switched the lane workflow to kanban-pull-by-terminal. Pre-patch:

2026-05-05 07:30:05 INFO gateway.run: kanban dispatcher: tick spawned=2 reclaimed=0 crashed=2 timed_out=0 promoted=0 auto_blocked=0
2026-05-05 07:31:05 INFO gateway.run: kanban dispatcher: tick spawned=2 reclaimed=0 crashed=2 timed_out=0 promoted=0 auto_blocked=0
... (every minute, 2 hours+)

Post-patch (gateway restart at 07:41:39):

2026-05-05 07:41:39 INFO gateway.run: kanban dispatcher: embedded in gateway (interval=60.0s)
( silent — spawn_any=False on every tick, log line guarded behind `if res.spawned` )

Live state:

  • Stale running claims auto-reclaimed to ready on the first post-patch tick.
  • Tasks now sit at status=ready, claim_lock=None, worker_pid=None, spawn_failures=0 — clean, ready for terminal pull.
  • Dashboard / telegram / freqtrade / committee_listener all unaffected (only the dispatcher path changed).

Test plan

  • Live verification on Orion: 2-hour crash loop terminated, dispatcher silent, no defuncts pile up
  • Tasks reclaim cleanly to ready post-restart
  • Existing well-behaved tasks (assignee=daily) still spawn (counterfactual: profile_exists("daily") = True confirmed via Python REPL)
  • Defensive import — if hermes_cli.profiles ever moves, fall-through to original behaviour

🤖 Generated with Claude Code

Changed files

  • hermes_cli/kanban_db.py (modified, +17/-0)

PR #20165: fix(kanban): skip dispatch for tasks assigned to non-profile lanes (salvages #20105, #20134)

Description (problem / solution / changelog)

Kanban dispatcher no longer crash-loops on tasks assigned to names that aren't real Hermes profiles, and the stuck-queue warning only fires when there's genuine spawnable work sitting idle.

Root cause: dispatch_once() claimed any ready+assigned task and shelled out hermes -p <assignee> chat -q .... When <assignee> named a control-plane terminal lane (e.g. orion-cc) rather than a profile on disk, the subprocess died with "Profile 'X' does not exist", was reaped as a zombie, the TTL detector released the claim back to ready, and the next tick re-spawned the same failing worker — forever.

Salvaged from #20105 + #20134 (@Brecht-H).

Changes

  • hermes_cli/kanban_db.py: dispatch_once() pre-checks profile_exists(assignee) before claiming; non-matches route into a new DispatchResult.skipped_nonspawnable bucket (separate from skipped_unassigned).
  • hermes_cli/kanban_db.py: new has_spawnable_ready(conn) helper returns True only if ≥1 ready+assigned+unclaimed task has an assignee that resolves to a real profile.
  • gateway/run.py + hermes_cli/kanban.py: both dispatchers swap their ready_nonempty probe to has_spawnable_ready, so "dispatcher stuck" WARN no longer fires on multi-lane hosts where the queue is healthy but none of the ready tasks target a spawnable profile.
  • tests/hermes_cli/conftest.py: new all_assignees_spawnable fixture monkeypatches profile_exists → True for tests that use synthetic assignees. Threaded through 8 dispatcher tests that the profile-exists guard would otherwise have silently broken.
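
A rough sketch of how the skipped_nonspawnable bucket and the has_spawnable_ready probe fit together, using an in-memory SQLite table. The table name `tasks`, its columns, and the injected `profile_exists` callable are assumptions for illustration. The real `has_spawnable_ready(conn)` takes only a connection, and the real schema is not shown in the PR description.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DispatchResult:
    spawned: List[str] = field(default_factory=list)
    skipped_nonspawnable: List[str] = field(default_factory=list)


def _ready_assigned(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    # Assumed schema: tasks(id, status, assignee, claim_lock).
    return conn.execute(
        "SELECT id, assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL AND claim_lock IS NULL "
        "ORDER BY id"
    ).fetchall()


def has_spawnable_ready(conn: sqlite3.Connection,
                        profile_exists: Callable[[str], bool]) -> bool:
    # True only if at least one ready, unclaimed task targets a real profile.
    return any(profile_exists(assignee) for _id, assignee in _ready_assigned(conn))


def dispatch_once(conn: sqlite3.Connection,
                  profile_exists: Callable[[str], bool]) -> DispatchResult:
    result = DispatchResult()
    for task_id, assignee in _ready_assigned(conn):
        if not profile_exists(assignee):
            # Not claimed, not spawned, not counted as a crash.
            result.skipped_nonspawnable.append(task_id)
            continue
        conn.execute(
            "UPDATE tasks SET status = 'running', claim_lock = ? WHERE id = ?",
            ("dispatcher", task_id),
        )
        result.spawned.append(task_id)  # the real dispatcher would spawn a worker here
    return result
```

Because skipped tasks are never claimed, they stay at status=ready with no claim lock, which is what lets a terminal lane pull them later.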

Defensive import: both profile_exists lookups fall back to legacy "any ready+assigned" behavior if hermes_cli.profiles is unimportable, so degraded installs still surface the original warn.

Validation

| Scenario | Before | After |
| --- | --- | --- |
| Task assigned to orion-cc (not a profile) | permanent crash loop, 2 zombies/tick, spawned=1 crashed=1 every minute | silent skip, skipped_nonspawnable=1, no claim, no zombie |
| Multi-lane queue full of terminal-lane assignees | dispatcher stuck WARN every 5 min | silent (has_spawnable_ready=False) |
| Real profile missing PATH/venv/creds | dispatcher stuck WARN still fires after 6 ticks | unchanged (safety net intact) |

Targeted tests: 246/246 pass (test_kanban_{db,cli,boards,core_functionality})

Live-verified by @Brecht-H on his Orion multi-lane host: 2-hour crash loop on t_a14dc1d5 + t_646c96f2 terminated on gateway restart; dispatcher silent on every subsequent tick; stale running claims reclaimed cleanly to ready.

Closes #20054
Closes #20105
Closes #20134
Supersedes #20065 (readiness check lives at a tighter call site: before claim, not before spawn)

Co-authored-by: Brecht-H

Changed files

  • gateway/run.py (modified, +12/-7)
  • hermes_cli/kanban.py (modified, +16/-8)
  • hermes_cli/kanban_db.py (modified, +64/-0)
  • tests/hermes_cli/conftest.py (added, +19/-0)
  • tests/hermes_cli/test_kanban_core_functionality.py (modified, +6/-6)
  • tests/hermes_cli/test_kanban_db.py (modified, +51/-3)

Code Example

~/.hermes/profiles/debugger-v1/

---

assignee profile debugger-v1 is not runnable: missing config.yaml / credentials

---

cmd = ["hermes", "-p", profile_arg, "--skills", "kanban-worker", "chat", "-q", prompt]

---

kanban dispatcher stuck: ready queue non-empty ... Check profile health (venv, PATH, credentials)

---

Profile debugger-v1 is not runnable: missing ~/.hermes/profiles/debugger-v1/config.yaml

Bug Description

Kanban task assignment treats assignee as a Hermes profile slug. If the profile directory exists but is incomplete or not runnable, the dispatcher can still claim the task and spawn hermes -p <profile> chat -q .... The failure then surfaces later as confusing worker/provider/auth behavior instead of a clear Kanban/profile-readiness diagnostic.

Observed Behavior

While recovering a local multi-agent Kanban setup, I found a profile directory:

~/.hermes/profiles/debugger-v1/

The directory existed and contained profile material such as skills / SOUL.md, but was missing the runnable profile prerequisites:

  • config.yaml
  • .env
  • auth.json

A Kanban task assigned to debugger-v1 was still eligible for dispatch. The resulting worker startup path failed indirectly through provider/auth fallback behavior rather than reporting something like:

assignee profile debugger-v1 is not runnable: missing config.yaml / credentials

After creating a proper profile config and syncing credential state, the same assignee worked and completed a smoke-test Kanban task.

Code Path / Evidence

Current dispatch path appears to normalize the assignee but not validate profile readiness before claiming/spawning:

  • hermes_cli/kanban_db.py::_canonical_assignee() only calls normalize_profile_name(assignee).
  • hermes_cli/kanban_db.py::dispatch_once() claims ready assigned tasks before spawn.
  • hermes_cli/kanban_db.py::_default_spawn() builds:
cmd = ["hermes", "-p", profile_arg, "--skills", "kanban-worker", "chat", "-q", prompt]

but does not appear to check that profile_arg is a runnable profile before spawning.

There is profile-name validation support in hermes_cli/profiles.py, including:

  • normalize_profile_name()
  • validate_profile_name()
  • profile_exists()

However, profile_exists() only checks that the profile directory exists. A half-created profile directory can therefore pass the rough existence condition while still being unrunnable.

The gateway dispatcher has aggregate stuck-queue telemetry:

kanban dispatcher stuck: ready queue non-empty ... Check profile health (venv, PATH, credentials)

but the per-task failure could be diagnosed earlier and more precisely.

Expected Behavior

Before claiming/spawning a Kanban worker, the dispatcher should validate that the assignee profile is runnable. For example:

  • profile name is valid and normalized
  • profile exists, unless default
  • profile has a readable config.yaml, or documented/default inheritance applies
  • provider/model can be resolved from the profile config
  • required credential state is present or explicitly inherited
  • optional: a cheap non-interactive profile startup/config check passes

If validation fails, Kanban should not spawn the worker. It should either:

  1. leave the task unclaimed and comment with the readiness failure, or
  2. auto-block the task with a clear reason, e.g.:
Profile debugger-v1 is not runnable: missing ~/.hermes/profiles/debugger-v1/config.yaml

Why This Matters

Kanban is a multi-profile orchestration primitive. A stale or half-created profile currently creates an opaque failure mode: the board says a task is assigned and dispatchable, but the worker cannot actually start correctly. Failing fast with a precise profile-readiness error would save debugging time and prevent dispatcher churn.

Related Issues / Distinction

This is related to Kanban multi-profile robustness, but distinct from:

  • #18442 — Kanban DB profile-scoping / shared board visibility
  • #18498 — case sensitivity in assignee/profile validation

This report is specifically about validating whether an assignee profile is runnable before worker spawn.

Environment

  • Hermes Agent v0.12.0 (2026.4.30)
  • Source commit: b816fd4e2
  • OS: Linux desktop 6.17.0-22-generic x86_64
  • Python: 3.11.15

extent analysis

TL;DR

Validate the assignee profile's readiness before claiming and spawning a Kanban worker to prevent opaque failure modes.

Guidance

  • Modify the dispatch_once() function in hermes_cli/kanban_db.py to call a new validate_profile_readiness() function that checks for the existence of required files like config.yaml, .env, and auth.json in the profile directory.
  • Use existing functions like normalize_profile_name() and profile_exists() from hermes_cli/profiles.py as a starting point for the new validation function.
  • Consider adding a cheap non-interactive profile startup/config check to the validation function to ensure the profile is runnable.
  • If validation fails, leave the task unclaimed and comment with the readiness failure or auto-block the task with a clear reason.

Example

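import os

def validate_profile_readiness(profile_name):
    # expanduser() is required: os.path functions do not expand "~" on their own.
    profile_dir = os.path.expanduser(os.path.join("~/.hermes/profiles", profile_name))
    # Assumed prerequisites, taken from the issue report; adjust to the real requirements.
    required_files = ["config.yaml", ".env", "auth.json"]
    for name in required_files:
        if not os.path.isfile(os.path.join(profile_dir, name)):
            return False
    return True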

Notes

The proposed solution assumes that the required files are necessary for a profile to be considered runnable. Additional validation may be necessary depending on the specific requirements of the Hermes Agent and Kanban setup.

Recommendation

Apply a workaround by modifying the dispatch_once() function to validate the assignee profile's readiness before claiming and spawning a Kanban worker. This will prevent opaque failure modes and provide more informative error messages.
