hermes - ✅(Solved) Fix [Feature]: Harden Kanban task validity, dispatch preflight, and worker ownership [1 pull requests, 1 participants]

qWaitCrypto · 2026-05-10T13:38:46Z

# PR #23334: fix(kanban): harden worker ownership and recovery paths

- Repository: NousResearch/hermes-agent
- Author: qWaitCrypto
- State: open | merged: False
- Link: https://github.com/NousResearch/hermes-agent/pull/23334

## Description (problem / solution / changelog)

## What does this PR do?

This PR is a focused Kanban resubmission against current `main`. Create-time validation for toolset names in `task.skills` has already landed separately in #23273. This PR intentionally does not reimplement that path.

It resubmits only the still-novel pieces from the earlier stacked PRs:

- diagnostics for stuck / undispatchable tasks
- narrow recovery commands and defensive dispatch preflight
- worker startup ownership guard before any model API call

## Related Issue

Related to #22925, #22926, #22927. Follow-up to #23209. Builds on #23273, which already landed create-time validation for toolset names in `task.skills`. Supersedes the earlier stacked PRs #22974, #23154, and #23183 by resubmitting only the still-novel Kanban hardening pieces against current `main`.

## Type of Change

- [x] 🐛 Bug fix (non-breaking change that fixes an issue)
- [ ] ✨ New feature (non-breaking change that adds functionality)
- [ ] 🔒 Security fix
- [ ] 📝 Documentation update
- [x] ✅ Tests (adding or improving test coverage)
- [x] ♻️ Refactor (no behavior change)
- [ ] 🎯 New skill (bundled or hub)

## Changes Made

- Add read-only diagnostics for `invalid_task_skills`, `assignee_profile_not_found`, and `stale_running_claim`
- Add narrow operator recovery commands:
  - `hermes kanban edit --clear-skills`
  - `hermes kanban edit --reset-failures`
  - `hermes kanban edit --clear-claim`
- Teach dispatch to hard-skip ready tasks whose persisted state already proves they cannot spawn correctly:
  - tasks with historical invalid persisted `skills`
  - tasks assigned to missing profiles
- Add a read-only worker startup ownership guard before any model API call
- Require the Kanban startup guard to validate task status, run ownership, and claim lock ownership
- Treat malformed Kanban worker ownership env as a benign startup-guard skip instead of silently disabling the check
- Factor the startup-guard early return through a shared helper in `run_agent.py` instead of hand-rolling large inline result dicts
- Add targeted regression tests for diagnostics, recovery commands, dispatch skip behavior, and worker startup guard behavior

## How to Test

1. Run targeted Kanban tests: `pytest -q tests/hermes_cli/test_kanban_diagnostics.py tests/hermes_cli/test_kanban_db.py::test_reset_task_failures_clears_counter_and_emits_event tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_on_non_running_task tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_keeps_terminal_run_terminal tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_on_non_running_task tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_rejects_running_task tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_rejects_result_fields tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_reset_failures tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_reset_failures_rejects_result_fields tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_claim tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_claim_rejects_result_fields tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_reclaimed_run tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_superseded_run_without_failure tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_requires_claim_lock tests/run_agent/test_kanban_worker_startup_guard.py`
2. Run the focused dispatch-preflight regression: `pytest -q tests/hermes_cli/test_kanban_db.py::test_dispatch_skips_invalid_task_skills_and_keeps_ready`
3. Create or patch a `ready` task so `tasks.skills` contains a toolset name, then run `hermes kanban dispatch --json` and verify it is reported under `skipped_invalid_skills` and remains `ready`
4. Use `hermes kanban edit --reset-failures` and `hermes kanban edit --clear-claim` to verify both recovery actions succeed on eligible tasks
5. Start a dispatcher-spawned worker against a reclaimed or superseded task and confirm the worker exits before any model API call

## Checklist

### Code

- [x] I've read the [Contributing Guide](https://github.com/NousResearch/hermes-agent/blob/main/CONTRIBUTING.md)
- [x] My commit messages follow [Conventional Commits](https://www.conventionalcommits.org/) (`fix(scope):`, `feat(scope):`, etc.)
- [x] I searched for [existing PRs](https://github.com/NousRes

GitHub stats

NousResearch/hermes-agent#23209 • Fetched 2026-05-11 03:30:30

Problem or Use Case

Kanban has recently grown into a durable multi-process task queue, but some queue invariants are still only enforced indirectly through prompts, skill docs, or manual operator recovery.

This shows up as a cluster of related failure modes rather than one isolated bug:

- tasks can be created with invalid `skills` values that are actually toolset names
- ready tasks can remain dispatchable even when they are obviously undispatchable
- stale / reclaimed / superseded workers can still begin work unless ownership is checked at startup
- operators sometimes need diagnostics and narrow recovery commands for stuck task state

Related issues include:

- #22921
- #22922
- #22924
- #22925
- #22926
- #22927

The problem is not that Kanban needs a redesign. The problem is that a few important invariants are not yet enforced consistently across task creation, dispatch, and worker startup.

Proposed Solution

Do a narrow Kanban hardening pass across three layers:

Task validity and diagnostics

- reject known toolset names in `task.skills` at create time
- surface historical bad rows through diagnostics
- add minimal recovery support for invalid persisted skills
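
The validity layer above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `KNOWN_TOOLSETS`, `find_toolset_names`, `validate_task_skills`, and `diagnose_persisted_skills` are hypothetical names, and the toolset names in the set are placeholders. Only the diagnostic kind `invalid_task_skills` is taken from the PR description.

```python
from typing import Dict, Iterable, List

# Hypothetical placeholder registry; the real toolset names live in hermes-agent.
KNOWN_TOOLSETS = frozenset({"terminal", "web", "file"})


class InvalidTaskSkillsError(ValueError):
    """Raised at create time when task.skills contains a toolset name."""


def find_toolset_names(skills: Iterable[str],
                       known_toolsets=KNOWN_TOOLSETS) -> List[str]:
    """Return the entries of `skills` that are really toolset names."""
    return [s for s in skills if s.strip().lower() in known_toolsets]


def validate_task_skills(skills: Iterable[str],
                         known_toolsets=KNOWN_TOOLSETS) -> List[str]:
    """Create-time check: reject the task instead of persisting bad skills."""
    skills = list(skills)
    bad = find_toolset_names(skills, known_toolsets)
    if bad:
        raise InvalidTaskSkillsError(
            f"task.skills contains toolset names: {', '.join(bad)}"
        )
    return skills


def diagnose_persisted_skills(rows: Dict[str, List[str]],
                              known_toolsets=KNOWN_TOOLSETS) -> List[dict]:
    """Read-only diagnostic over already-stored rows (task id -> skills)."""
    findings = []
    for task_id, skills in rows.items():
        bad = find_toolset_names(skills, known_toolsets)
        if bad:
            findings.append(
                {"task_id": task_id, "kind": "invalid_task_skills", "skills": bad}
            )
    return findings
```

Splitting the check into a raising validator and a read-only diagnostic mirrors the distinction in the list above: new tasks are rejected outright, while historical rows are only reported so an operator can decide how to recover them.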

Dispatch preflight for obviously undispatchable tasks

- hard-skip ready tasks that are clearly invalid before spawn
- initial hard-skip cases:
  - invalid persisted task skills
  - missing assignee profile
- keep softer capability concerns advisory-only rather than blocking dispatch
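
A dispatch preflight of this kind might look like the sketch below. It assumes a simplified in-memory `Task` record and a hypothetical `preflight` function; the real dispatcher works against the Kanban database. The `skipped_invalid_skills` key matches the name mentioned in the PR's test steps, while `skipped_missing_profile` and `dispatchable` are invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Task:
    # Simplified stand-in for a persisted Kanban task row.
    id: str
    status: str
    assignee: Optional[str]
    skills: List[str] = field(default_factory=list)


def preflight(tasks: Iterable[Task],
              known_profiles: Set[str],
              known_toolsets: Set[str]) -> Dict[str, List[str]]:
    """Classify ready tasks without mutating them.

    Hard-skipped tasks keep status "ready" so that an operator can repair
    them and a later dispatch pass can pick them up.
    """
    report: Dict[str, List[str]] = {
        "dispatchable": [],
        "skipped_invalid_skills": [],
        "skipped_missing_profile": [],  # hypothetical key name
    }
    for task in tasks:
        if task.status != "ready":
            continue  # preflight only concerns ready tasks
        if any(skill in known_toolsets for skill in task.skills):
            report["skipped_invalid_skills"].append(task.id)
        elif task.assignee not in known_profiles:
            report["skipped_missing_profile"].append(task.id)
        else:
            report["dispatchable"].append(task.id)
    return report
```

Because the function only reads task state, the two hard-skip cases never count as spawn failures; softer capability concerns would be reported elsewhere as advisories rather than added to this skip list.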

Worker startup ownership guard

- verify task/run/claim ownership at worker startup before useful work begins
- benign-exit stale / reclaimed / superseded workers
- avoid counting those ownership/lifecycle exits as real worker failures
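
The startup guard described above could be sketched as a pure decision function. Everything named here is an assumption made for illustration: the environment keys `HYPOTHETICAL_KANBAN_RUN_ID` and `HYPOTHETICAL_KANBAN_CLAIM_LOCK`, the `TaskState` and `GuardDecision` types, and the reason strings are not taken from hermes-agent. The sketch also reads "benign skip" as the worker exiting without doing work, and it covers only dispatcher-spawned workers.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TaskState:
    # Hypothetical snapshot of what the queue currently records for the task.
    status: str
    current_run_id: Optional[int]
    claim_lock: Optional[str]


@dataclass(frozen=True)
class GuardDecision:
    proceed: bool
    reason: str
    counts_as_failure: bool = False  # ownership exits are never failures


def startup_guard(env: Mapping[str, str],
                  state: Optional[TaskState]) -> GuardDecision:
    """Decide, before any model API call, whether this worker still owns its task."""
    try:
        run_id = int(env["HYPOTHETICAL_KANBAN_RUN_ID"])
        claim_lock = env["HYPOTHETICAL_KANBAN_CLAIM_LOCK"]
    except (KeyError, ValueError):
        # Malformed ownership env: exit benignly rather than skip the check.
        return GuardDecision(False, "malformed_ownership_env")
    if not claim_lock:
        return GuardDecision(False, "malformed_ownership_env")
    if state is None:
        return GuardDecision(False, "task_not_found")
    if state.status != "running":
        return GuardDecision(False, "task_not_running")
    if state.current_run_id != run_id:
        return GuardDecision(False, "run_superseded")
    if state.claim_lock != claim_lock:
        return GuardDecision(False, "claim_lock_mismatch")
    return GuardDecision(True, "owner")
```

A worker would call this once at startup and return early when `proceed` is false. Keeping `counts_as_failure` false on every rejection path is what prevents stale, reclaimed, or superseded workers from inflating the task's failure counter.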

This should stay intentionally narrow:

- no `required_toolsets` manifest
- no default profile permission expansion
- no full capability model
- no broad dispatcher redesign

Alternatives Considered

A few broader approaches were considered, but deferred on purpose:

- Full capability modeling for profiles and tasks
  - likely useful later, but too large for the current bug cluster
- Expanding default profile toolsets
  - changes default permissions and invites a bigger policy debate
- A broader dispatcher redesign
  - unnecessary for the current failure chain
- Relying only on prompt/skill guidance
  - this is the current weak point; core ownership and validity checks should live in the system, not only in model behavior

The proposed approach is better because it closes the concrete failure chain with a small number of reviewable changes.

Feature Type

Reliability / correctness

Scope

Medium (few files, < 300 lines)

Contribution

I'd like to implement this myself and submit a PR

Linked / stacked PRs

- #22974: initial validity + diagnostics + recovery surface
- #23154: dispatch preflight + recovery follow-up
- #23183: worker startup ownership guard
