hermes - ✅(Solved) Fix [Feature]: Harden Kanban task validity, dispatch preflight, and worker ownership [1 pull request, 1 participant]

GitHub stats
NousResearch/hermes-agent#23209 (fetched 2026-05-11 03:30:30)
Comments: 0
Participants: 1
Timeline: 4
Reactions: 0
Timeline (top): labeled ×3, cross-referenced ×1

PR fix notes

PR #23334: fix(kanban): harden worker ownership and recovery paths

Description (problem / solution / changelog)

What does this PR do?

This PR is a focused Kanban resubmission against current main.

Create-time validation for toolset names in task.skills has already landed separately in #23273. This PR intentionally does not reimplement that path.

It resubmits only the still-novel pieces from the earlier stacked PRs:

  • diagnostics for stuck / undispatchable tasks
  • narrow recovery commands and defensive dispatch preflight
  • worker startup ownership guard before any model API call

Related Issue

Related to #22925, #22926, #22927.

Follow-up to #23209.

Builds on #23273, which already landed create-time validation for toolset names in task.skills.

Supersedes the earlier stacked PRs #22974, #23154, and #23183 by resubmitting only the still-novel Kanban hardening pieces against current main.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Add read-only diagnostics for invalid_task_skills, assignee_profile_not_found, and stale_running_claim
  • Add narrow operator recovery commands:
    • hermes kanban edit <task> --clear-skills
    • hermes kanban edit <task> --reset-failures
    • hermes kanban edit <task> --clear-claim
  • Teach dispatch to hard-skip ready tasks whose persisted state already proves they cannot spawn correctly:
    • tasks with historical invalid persisted skills
    • tasks assigned to missing profiles
  • Add a read-only worker startup ownership guard before any model API call
  • Require the Kanban startup guard to validate task status, run ownership, and claim lock ownership
  • Treat malformed Kanban worker ownership env as a benign startup-guard skip instead of silently disabling the check
  • Factor the startup-guard early return through a shared helper in run_agent.py instead of hand-rolling large inline result dicts
  • Add targeted regression tests for diagnostics, recovery commands, dispatch skip behavior, and worker startup guard behavior
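
The three diagnostic kinds listed above can be illustrated with a small read-only sketch. Only the kind names `invalid_task_skills`, `assignee_profile_not_found`, and `stale_running_claim` come from the PR description. The `Task` fields, the `KNOWN_TOOLSETS` set, and the `diagnose` function are hypothetical and are not the code in hermes_cli/kanban_diagnostics.py.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical toolset names; the real registry is not shown in the PR.
KNOWN_TOOLSETS = {"web", "terminal", "file"}


@dataclass
class Task:
    id: str
    status: str
    assignee: Optional[str] = None
    skills: List[str] = field(default_factory=list)
    # Hypothetical claim-lock expiry as epoch seconds.
    claim_expires_at: Optional[float] = None


def diagnose(task: Task, profiles: Set[str], now: float) -> List[str]:
    """Return the diagnostic kinds that apply to one task, without mutating it."""
    kinds = []
    # A toolset name stored in task.skills is treated as invalid.
    if any(skill in KNOWN_TOOLSETS for skill in task.skills):
        kinds.append("invalid_task_skills")
    # The assignee must refer to an existing profile.
    if task.assignee is not None and task.assignee not in profiles:
        kinds.append("assignee_profile_not_found")
    # A running task whose claim has expired is reported as stale.
    if (
        task.status == "running"
        and task.claim_expires_at is not None
        and task.claim_expires_at < now
    ):
        kinds.append("stale_running_claim")
    return kinds
```

Because the function only reads the task, it can be run against historical rows without risk of changing queue state, which matches the "read-only" wording in the change list.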

How to Test

  1. Run targeted Kanban tests:
     pytest -q \
       tests/hermes_cli/test_kanban_diagnostics.py \
       tests/hermes_cli/test_kanban_db.py::test_reset_task_failures_clears_counter_and_emits_event \
       tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_on_non_running_task \
       tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_keeps_terminal_run_terminal \
       tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_on_non_running_task \
       tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_rejects_running_task \
       tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_rejects_result_fields \
       tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_reset_failures \
       tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_reset_failures_rejects_result_fields \
       tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_claim \
       tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_claim_rejects_result_fields \
       tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_reclaimed_run \
       tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_superseded_run_without_failure \
       tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_requires_claim_lock \
       tests/run_agent/test_kanban_worker_startup_guard.py
  2. Run the focused dispatch-preflight regression: pytest -q tests/hermes_cli/test_kanban_db.py::test_dispatch_skips_invalid_task_skills_and_keeps_ready
  3. Create or patch a ready task so tasks.skills contains a toolset name, then run hermes kanban dispatch --json and verify it is reported under skipped_invalid_skills and remains ready
  4. Use hermes kanban edit <task_id> --reset-failures and hermes kanban edit <task_id> --clear-claim to verify both recovery actions succeed on eligible tasks
  5. Start a dispatcher-spawned worker against a reclaimed or superseded task and confirm the worker exits before any model API call
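
The behavior that step 3 checks (a task with a toolset name in its skills is reported as skipped and stays ready) can be modeled with a minimal sketch. The report key `skipped_invalid_skills` is taken from the PR description. Everything else here is an assumption for illustration: the `ReadyTask` type, the `KNOWN_TOOLSETS` set, the `skipped_missing_profile` and `spawn` keys, and the `dispatch_preflight` function are hypothetical and do not reflect the actual dispatcher code or its JSON output shape.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Hypothetical toolset names; the real registry is not shown in the PR.
KNOWN_TOOLSETS = {"web", "terminal", "file"}


@dataclass
class ReadyTask:
    id: str
    assignee: Optional[str]
    skills: List[str] = field(default_factory=list)
    status: str = "ready"


def dispatch_preflight(tasks: List[ReadyTask], profiles: Set[str]) -> Dict[str, List[str]]:
    """Sort ready tasks into spawnable and hard-skipped groups.

    Task status is never modified, so a skipped task remains ready.
    """
    report: Dict[str, List[str]] = {
        "spawn": [],
        "skipped_invalid_skills": [],
        "skipped_missing_profile": [],  # hypothetical key name
    }
    for task in tasks:
        if task.status != "ready":
            continue
        if any(skill in KNOWN_TOOLSETS for skill in task.skills):
            report["skipped_invalid_skills"].append(task.id)
        elif task.assignee not in profiles:
            report["skipped_missing_profile"].append(task.id)
        else:
            report["spawn"].append(task.id)
    return report
```

Leaving the status untouched is the point of a hard skip: an operator can repair the row (for example with `--clear-skills`) and the next dispatch pass picks the task up again.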

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Linux (WSL-style dev environment)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • Targeted verification passed: 44 passed in 38.48s
  • Focused dispatch-preflight regression passed locally

Changed files

  • hermes_cli/kanban.py (modified, +115/-3)
  • hermes_cli/kanban_db.py (modified, +242/-5)
  • hermes_cli/kanban_diagnostics.py (modified, +165/-0)
  • run_agent.py (modified, +132/-0)
  • tests/hermes_cli/test_kanban_core_functionality.py (modified, +203/-1)
  • tests/hermes_cli/test_kanban_db.py (modified, +123/-0)
  • tests/hermes_cli/test_kanban_diagnostics.py (modified, +78/-0)
  • tests/run_agent/test_kanban_worker_startup_guard.py (added, +140/-0)

Problem or Use Case

Kanban has recently grown into a durable multi-process task queue, but some queue invariants are still only enforced indirectly through prompts, skill docs, or manual operator recovery.

This shows up as a cluster of related failure modes rather than one isolated bug:

  • tasks can be created with invalid skills values that are actually toolset names
  • ready tasks can remain dispatchable even when they are obviously undispatchable
  • stale / reclaimed / superseded workers can still begin work unless ownership is checked at startup
  • operators sometimes need diagnostics and narrow recovery commands for stuck task state

Related issues include:

  • #22921
  • #22922
  • #22924
  • #22925
  • #22926
  • #22927

The problem is not that Kanban needs a redesign. The problem is that a few important invariants are not yet enforced consistently across task creation, dispatch, and worker startup.

Proposed Solution

Do a narrow Kanban hardening pass across three layers:

  1. Task validity and diagnostics
  • reject known toolset names in task.skills at create time
  • surface historical bad rows through diagnostics
  • add minimal recovery support for invalid persisted skills
  2. Dispatch preflight for obviously undispatchable tasks
  • hard-skip ready tasks that are clearly invalid before spawn
  • initial hard-skip cases:
    • invalid persisted task skills
    • missing assignee profile
  • keep softer capability concerns advisory-only rather than blocking dispatch
  3. Worker startup ownership guard
  • verify task/run/claim ownership at worker startup before useful work begins
  • benign-exit stale / reclaimed / superseded workers
  • avoid counting those ownership/lifecycle exits as real worker failures
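
The third layer can be sketched as a read-only check that runs before the worker does any useful work. This is an illustration under stated assumptions, not the guard in run_agent.py: the `TaskRow` and `GuardResult` types, the reason strings, and the `startup_guard` function are all hypothetical. The three things it checks (task status, run ownership, claim lock ownership) are the ones the linked PR says the guard validates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskRow:
    """Hypothetical snapshot of the persisted task state."""
    status: str
    current_run_id: Optional[str]
    claim_lock: Optional[str]


@dataclass(frozen=True)
class GuardResult:
    proceed: bool
    reason: str
    # Ownership and lifecycle exits are benign, so they never count as failures.
    counts_as_failure: bool = False


def startup_guard(row: Optional[TaskRow], run_id: str, claim_lock: str) -> GuardResult:
    """Decide whether this worker still owns the task it was spawned for."""
    if row is None:
        return GuardResult(False, "task_not_found")
    if row.status != "running":
        return GuardResult(False, "task_not_running")
    if row.current_run_id != run_id:
        # Another run has replaced this one (reclaimed or superseded).
        return GuardResult(False, "run_superseded")
    if row.claim_lock != claim_lock:
        return GuardResult(False, "claim_lock_not_owned")
    return GuardResult(True, "owned")
```

A worker would call this once at startup and exit cleanly when `proceed` is false. Keeping `counts_as_failure` false on every rejection path mirrors the requirement that stale workers should not inflate a task's failure counter.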

This should stay intentionally narrow:

  • no required_toolsets manifest
  • no default profile permission expansion
  • no full capability model
  • no broad dispatcher redesign

Alternatives Considered

A few broader approaches were considered, but deferred on purpose:

  • Full capability modeling for profiles and tasks
    • likely useful later, but too large for the current bug cluster
  • Expanding default profile toolsets
    • changes default permissions and invites a bigger policy debate
  • A broader dispatcher redesign
    • unnecessary for the current failure chain
  • Relying only on prompt/skill guidance
    • this is the current weak point; core ownership and validity checks should live in the system, not only in model behavior

The proposed approach is better because it closes the concrete failure chain with a small number of reviewable changes.

Feature Type

Reliability / correctness

Scope

Medium (few files, < 300 lines)

Contribution

  • I'd like to implement this myself and submit a PR

Linked / stacked PRs

  • #22974 — initial validity + diagnostics + recovery surface
  • #23154 — dispatch preflight + recovery follow-up
  • #23183 — worker startup ownership guard
