hermes: [Bug] kanban-worker exits cleanly (rc=0) on iteration-budget exhaustion without calling kanban_complete or kanban_block — protocol violation strands downstream tasks (solved; 1 pull request, 1 participant)

GitHub stats
NousResearch/hermes-agent#23216 (fetched 2026-05-11 03:30:28)
Comments: 0
Participants: 1
Timeline events: 6
Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×1, referenced ×1

A kanban-worker subprocess that hits max_turns (iteration budget exhaustion) exits with rc=0 after the agent loop's "asking model to summarise" path, without ever calling kanban_complete or kanban_block. The dispatcher correctly detects this as a protocol violation (hermes_cli/kanban_db.py:3127), but in production crons this surfaces as a confusing gave_up after 1 failure, with no clear recovery signal for the operator.

Root Cause

run_agent.py:14232 is the iteration-exhaustion path:

f"⚠️ Iteration budget exhausted ({api_call_count}/{self.max_iterations}) "
"— asking model to summarise"

This path:

  1. Asks the model to produce a final summary message.
  2. Returns the summary as the conversation's final response.
  3. Exits the agent loop with rc=0.

The agent loop has no awareness of kanban-worker context. The kanban-worker contract (call kanban_complete or kanban_block before exiting) lives entirely in the kanban-worker SKILL prompt at skills/devops/kanban-worker/SKILL.md. Iteration-exhaustion bypasses the skill's contract because the model is given the summary directive directly by the agent loop, not by the skill.

The dispatcher (hermes_cli/kanban_db.py:3099-3170) detects this correctly, but it treats the protocol violation as fatal: it gives up after a single failure (effective_limit: 1) and offers no path for operator-driven recovery.

PR fix notes

PR #23228: fix(kanban): call kanban_block on iteration-budget exhaustion to prevent protocol violation

Description (problem / solution / changelog)

Summary

When a kanban worker subprocess hits the iteration budget, the agent loop strips tools and asks the model for a summary via _handle_max_iterations(). The model cannot call kanban_block at that point (tools are gone), so the process exits with rc=0 without ever calling kanban_complete or kanban_block. The dispatcher correctly detects this as a protocol violation but treats it as fatal — giving up after 1 failure with effective_limit: 1, stranding all downstream tasks.

Root cause

The iteration-exhaustion path in run_agent.py (line ~14944) calls _handle_max_iterations() which makes a toolless API call for a summary. After the summary returns, the agent loop exits normally. There is no hook to notify the kanban dispatcher that the worker could not complete its task.

The kanban worker contract (call kanban_complete or kanban_block before exiting) lives in the kanban-worker SKILL prompt, but the iteration-exhaustion path bypasses the skill entirely — the model receives the summary directive from the agent loop, not from the skill.

Fix

After _handle_max_iterations() returns, check if HERMES_KANBAN_TASK is set (indicating the agent is running as a kanban worker). If so, call kanban_block via handle_function_call with a reason describing the exhaustion. The dispatcher then sees a clean block transition instead of a protocol violation, and the task can be retried or escalated by a human.

The kanban_block call is wrapped in a try/except to prevent failures from crashing the agent loop — if the block call fails, we log a warning and continue with the normal exit path.
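
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name block_kanban_task_on_exhaustion, the (name, args) calling convention assumed for handle_function_call, and the exact reason string are all assumptions. Only the HERMES_KANBAN_TASK variable, the kanban_block tool name, and the try/except behaviour come from the PR description.

```python
import logging
import os

logger = logging.getLogger(__name__)


def block_kanban_task_on_exhaustion(handle_function_call, api_call_count,
                                    max_iterations, environ=None):
    """Hypothetical hook run after the toolless summary call returns.

    Returns True if kanban_block was issued successfully, False otherwise.
    """
    environ = os.environ if environ is None else environ
    task_id = environ.get("HERMES_KANBAN_TASK")
    if not task_id:
        # Not running as a kanban worker: keep the normal exit path.
        return False

    reason = f"iteration budget exhausted ({api_call_count}/{max_iterations})"
    try:
        # Assumed signature: handle_function_call(tool_name, arguments).
        handle_function_call("kanban_block",
                             {"task_id": task_id, "reason": reason})
        return True
    except Exception as exc:
        # A failed block call must not crash the agent loop; log and continue.
        logger.warning("kanban_block failed after iteration exhaustion: %s", exc)
        return False
```

Passing the environment and the tool dispatcher in as parameters is a choice made here for testability; the real patch may read os.environ and call handle_function_call directly.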

Regression coverage

Two new tests in tests/run_agent/test_run_agent.py:

  1. test_kanban_block_called_on_iteration_exhaustion — sets HERMES_KANBAN_TASK, exhausts the iteration budget, and asserts that handle_function_call is called exactly once with kanban_block and the correct task_id/reason.

  2. test_no_kanban_block_when_not_in_kanban_mode — exhausts the iteration budget without HERMES_KANBAN_TASK set and asserts that kanban_block is never called (no spurious side effects).
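
For illustration, the two tests might look something like the sketch below. It is written against a stand-in function (on_iteration_exhaustion, a hypothetical name) rather than the real agent class, and it assumes handle_function_call receives the tool name and an argument dict; the actual tests in tests/run_agent/test_run_agent.py may be structured quite differently.

```python
import os
from unittest import mock


def on_iteration_exhaustion(handle_function_call, used, limit):
    """Stand-in for the patched exit path (hypothetical shape)."""
    task_id = os.environ.get("HERMES_KANBAN_TASK")
    if task_id:
        handle_function_call(
            "kanban_block",
            {"task_id": task_id,
             "reason": f"iteration budget exhausted ({used}/{limit})"},
        )


def test_kanban_block_called_on_iteration_exhaustion():
    handler = mock.MagicMock()
    with mock.patch.dict(os.environ, {"HERMES_KANBAN_TASK": "t_demo"}):
        on_iteration_exhaustion(handler, 30, 30)
    # Exactly one call, naming kanban_block with the task id and a reason.
    handler.assert_called_once()
    name, args = handler.call_args.args
    assert name == "kanban_block"
    assert args["task_id"] == "t_demo"
    assert "30/30" in args["reason"]


def test_no_kanban_block_when_not_in_kanban_mode():
    handler = mock.MagicMock()
    # clear=True removes HERMES_KANBAN_TASK for the duration of the block.
    with mock.patch.dict(os.environ, {}, clear=True):
        on_iteration_exhaustion(handler, 30, 30)
    handler.assert_not_called()
```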

Testing

  • All 325 tests in tests/run_agent/test_run_agent.py pass (including the 2 new ones).
  • No new test failures introduced.

Fixes [Bug] kanban-worker exits cleanly (rc=0) on iteration-budget exhaustion without calling kanban_complete or kanban_block — protocol violation strands downstream tasks #23216

Changed files

  • run_agent.py (modified, +35/-1)
  • tests/run_agent/test_run_agent.py (modified, +82/-0)


Summary

A kanban-worker subprocess that hits max_turns (iteration budget exhaustion) exits with rc=0 after the agent loop's "asking model to summarise" path, without ever calling kanban_complete or kanban_block. The dispatcher correctly detects this as a protocol violation (hermes_cli/kanban_db.py:3127), but in production crons this surfaces as a confusing gave_up after 1 failure, with no clear recovery signal for the operator.

Environment

  • Hermes Agent: v0.13.0 (v2026.5.7), commit eeef486 baseline + two local cherry-picks (aaa700c65 = PR #12953 keepalive bypass, 4ce6c96e2 = PR #19485 runtime TLS).
  • Kanban-driven workload: a multi-stage DAG of profile-specific worker tasks (e.g. digest-writer).
  • Worker config: agent.max_turns: 30 in profile config (per-profile cap, lower than global 100).

Real-world reproduction

A morning-cron-driven digest pipeline ran on 2026-05-10 at 07:18 CT. Lanes T1 through T8 completed cleanly. The T9 writer task (t_b1376310) was claimed by a kanban-worker subprocess (PID 13754) at 07:49 CT. The worker progressed through ~30 successful agent iterations, did initial preparation work (reading the upstream payload, validating evidence), then hit:

⚠️  Iteration budget reached (30/30) — response may be incomplete

The agent's final response listed unfinished steps:

Not completed:
- I did not yet assemble the final render-payload.json.
- I did not yet run workers.helpers.render_charts.
- I did not yet run workers.helpers.render_html.
- I did not yet generate Telegram text.
- I did not yet run workers.helpers.writer_postwrite.
- I did not yet complete the kanban task via complete_validated.

The worker process then exited rc=0. The dispatcher recorded:

event_kind = "protocol_violation"
error_text = "worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation"
gave_up { 'failures': 1, 'effective_limit': 1, 'limit_source': 'dispatcher' }

The downstream T10 deliverer task remained todo and never fired. The morning digest did not deliver to the user.
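
The dispatcher outcome recorded above can be modelled with a small sketch. This is not the logic in hermes_cli/kanban_db.py: the WorkerExit type, the function names, and the "crashed" label are assumptions made for illustration. Only the protocol_violation event kind and the failures/effective_limit fields come from the recorded event.

```python
from dataclasses import dataclass
from typing import Optional

TERMINAL_CALLS = ("kanban_complete", "kanban_block")


@dataclass
class WorkerExit:
    returncode: int
    terminal_call: Optional[str]  # one of TERMINAL_CALLS, or None


def classify_exit(worker_exit: WorkerExit) -> str:
    """Map a finished worker to an event kind (hypothetical names)."""
    if worker_exit.terminal_call in TERMINAL_CALLS:
        return worker_exit.terminal_call
    if worker_exit.returncode == 0:
        # Clean exit with no terminal call: the worker contract was broken.
        return "protocol_violation"
    return "crashed"


def should_give_up(failures: int, effective_limit: int) -> bool:
    """With effective_limit == 1, the first failure is already final."""
    return failures >= effective_limit


event = classify_exit(WorkerExit(returncode=0, terminal_call=None))
gave_up = should_give_up(failures=1, effective_limit=1)
```

Under these assumptions, the T9 worker's exit classifies as a protocol violation and the single failure already meets the limit, which matches the gave_up record.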

Root cause

run_agent.py:14232 is the iteration-exhaustion path:

f"⚠️ Iteration budget exhausted ({api_call_count}/{self.max_iterations}) "
"— asking model to summarise"

This path:

  1. Asks the model to produce a final summary message.
  2. Returns the summary as the conversation's final response.
  3. Exits the agent loop with rc=0.

The agent loop has no awareness of kanban-worker context. The kanban-worker contract (call kanban_complete or kanban_block before exiting) lives entirely in the kanban-worker SKILL prompt at skills/devops/kanban-worker/SKILL.md. Iteration-exhaustion bypasses the skill's contract because the model is given the summary directive directly by the agent loop, not by the skill.

The dispatcher (hermes_cli/kanban_db.py:3099-3170) detects this correctly, but it treats the protocol violation as fatal: it gives up after a single failure (effective_limit: 1) and offers no path for operator-driven recovery.
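
To make the failure mode concrete, here is a toy model of the exhaustion path. It is not the code in run_agent.py: run_agent_loop, the action dictionaries, and the read_file tool name are invented for illustration. It only shows how a loop with no kanban awareness can return a summary and exit with rc=0 without any terminal call having been made.

```python
TERMINAL_CALLS = ("kanban_complete", "kanban_block")


def run_agent_loop(max_iterations, model_step, summarise):
    """Toy agent loop: returns (final_text, names_of_tool_calls_made)."""
    tool_calls = []
    for _ in range(max_iterations):
        action = model_step()
        if action["type"] == "tool":
            tool_calls.append(action["name"])
            if action["name"] in TERMINAL_CALLS:
                return action.get("text", ""), tool_calls
        else:
            return action["text"], tool_calls
    # Budget exhausted: the loop asks for a summary, and no tool call
    # (terminal or otherwise) can happen from here on.
    return summarise(), tool_calls


def main():
    text, calls = run_agent_loop(
        max_iterations=30,
        # A model that keeps doing preparatory work and never finishes.
        model_step=lambda: {"type": "tool", "name": "read_file"},
        summarise=lambda: "Not completed: ...",
    )
    terminal = [c for c in calls if c in TERMINAL_CALLS]
    # The process still exits cleanly even though terminal is empty.
    return 0, text, terminal
```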

Why the worker can't fix itself

The kanban-worker skill text already documents the contract clearly. But:

  • The model can't call kanban_block from inside the iteration-exhaustion summary because at that point the agent loop has already taken control of the prompt and is asking for a summary, not a final tool call.
  • Even if the model tried to call kanban_block from the summary path, the call would land in the summary text rather than as a real tool invocation. The worker would still exit rc=0, leaving the dispatcher confused either way.

Proposed fix shapes

Three reasonable fix surfaces, in order of invasiveness:

1. Runtime patch in run_agent.py

When the iteration budget is exhausted AND the agent is running under a kanban-worker context (detect via env var HERMES_KANBAN_TASK_ID or equivalent), auto-emit a kanban_block tool call as the final action before returning the summary. The block reason would be "iteration budget exhausted (N/N); state preserved at <workspace>".

Pros: most explicit, reliable. Cons: cross-cuts the agent loop with kanban-specific behavior.

2. Dispatcher policy change in hermes_cli/kanban_db.py

Map the protocol_violation event to auto_blocked instead of gave_up on first occurrence (the effective_limit: 1 path). This way the task ends up explicitly blocked with a clear reason, rather than gave_up which suggests the dispatcher gave up on retrying.

Pros: smallest surface area, no run_agent changes. Cons: doesn't fix the underlying contract violation, just relabels its outcome. It does, however, produce the right operator UX: the task is blocked, can be unblocked, and the dispatcher will retry.
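
As a rough sketch of this policy change, the mapping could look like the function below. The function name, its parameters, and the "retry" status are assumptions; only protocol_violation, auto_blocked, gave_up, and effective_limit appear in this issue, and the real dispatcher logic may be organised differently.

```python
def next_status(event_kind: str, failures: int, effective_limit: int) -> str:
    """Hypothetical policy: a protocol violation blocks instead of giving up."""
    if event_kind == "protocol_violation":
        # Explicitly blocked with a clear reason, so an operator can unblock.
        return "auto_blocked"
    if failures >= effective_limit:
        return "gave_up"
    return "retry"
```

This simplified version maps every protocol violation to auto_blocked; restricting the mapping to the first occurrence, as proposed, would need an extra check on the failure count.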

3. Skill prompt adjustment in skills/devops/kanban-worker/SKILL.md

Add explicit text: "If you sense you are approaching max_turns and have not yet completed the task, your last act must be a real kanban_block tool call, not a free-text response."

Pros: zero runtime change. Cons: depends on model compliance; the summary prompt is added by the agent loop, not the skill, and the model may not honor the skill instruction once the summary directive is in play.

Asks

  1. Confirm whether the protocol-violation pattern is intended behavior or a known gap.
  2. Pick a fix shape (or combination) you'd accept upstream.
  3. If a runtime patch is welcome, I am willing to send a PR.

Related

  • Detection of the protocol violation: hermes_cli/kanban_db.py:3119-3170 (correct, just under-acted-upon).
  • Iteration-exhaustion summary path: run_agent.py:14232.
  • Per-task max_retries override (PR #21330, merged): provides a control surface for retry policy but doesn't address the violation path itself.
