hermes - 💡(How to fix) Fix Kanban worker reports 'protocol_violation' when agent ends turn with text response instead of calling kanban_complete/kanban_block

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Kanban worker exits cleanly (rc=0) with protocol_violation when the agent ends its final turn with a text response (finish_reason=stop) instead of calling kanban_complete or kanban_block. There is no nudge back to the agent, no fallback in the worker, and the dispatcher counts each occurrence as a consecutive_failures tick — so after the retry limit the task is hard-blocked despite the agent having completed (or partially completed) substantive work.

Error Message

I have full agent.log and gateway.log excerpts containing all three failure shapes (~250 KB each, with matter-specific paths). Happy to attach a redacted subset on request; the session IDs and timing above should be enough to triangulate on the Hermes side. Relevant session IDs:

  • 20260516_145050_c0bdcb — failure shape 1 (workspace contention + model auth, mixed signal)
  • 20260516_152454_13e840 — failure shape 2 (clean model auth failure on gpt-5.5-codex)
  • 20260516_163300_efbf57 — failure shape 3 (clean gpt-5.3-codex run, agent finished with text response without checkpoint — the canonical repro of this issue)

Root Cause

I observed three different root causes for this same dispatcher-level symptom across three runs of the same task:

Fix Action

Workaround

For now: if you suspect this failure mode (worker ran for non-trivial wall-clock time, tool_turns > 5, but no file changes on disk), assume the agent did exploratory work and decided "I'm done" without checkpointing. Either:

  • hermes kanban --board <slug> unblock <task_id> and let the dispatcher retry, OR
  • Transfer the work to another agent / do it manually, then unblockcomplete --result "..." --metadata '{...}'. Note: complete doesn't accept --note; use comment first if you want a narrative record.

Thanks for the great tool — this is a sharp edge but the kanban pattern itself is excellent and worth the polish.

Code Example

Turn ended: reason=text_response(finish_reason=stop)
     model=gpt-5.3-codex api_calls=29/60 budget=29/60 tool_turns=28
     last_msg_role=assistant response_len=668
     session=20260516_163300_efbf57

---

protocol_violation: worker exited cleanly (rc=0) without calling
     kanban_complete or kanban_block — protocol violation
RAW_BUFFERClick to expand / collapse

Summary

Kanban worker exits cleanly (rc=0) with protocol_violation when the agent ends its final turn with a text response (finish_reason=stop) instead of calling kanban_complete or kanban_block. There is no nudge back to the agent, no fallback in the worker, and the dispatcher counts each occurrence as a consecutive_failures tick — so after the retry limit the task is hard-blocked despite the agent having completed (or partially completed) substantive work.

Environment

  • Hermes version: v0.13.0 (2026.5.7)
  • Platform: macOS (Darwin 25.3.0)
  • Provider: openai-codex (ChatGPT subscription OAuth)
  • Model: gpt-5.3-codex
  • Board: custom legal board on a single-host dir workspace (workspace: dir @ /path/to/repo)
  • Worker invocation: dispatched by ai.hermes.gateway launchd service (60s tick)

Reproduction

  1. Create a kanban task that asks the agent to author a substantial markdown file (in this repro: a California family-law authority note, ~5 KB target with structured sections).
  2. Gateway dispatcher claims the task and spawns a worker process.
  3. The agent works through ~28 tool turns / ~29 API calls over ~3 minutes — search_files, read_file, terminal. One tool error in this trace (pdftotext: command not found, exit_code 127) but the agent recovered.
  4. Final API call returns:
    Turn ended: reason=text_response(finish_reason=stop)
      model=gpt-5.3-codex api_calls=29/60 budget=29/60 tool_turns=28
      last_msg_role=assistant response_len=668
      session=20260516_163300_efbf57
  5. The worker proceeds with post-turn housekeeping — the bg-review skill-library updater is kicked off, terminal environment is cleaned up — then exits rc=0.
  6. The agent never called kanban_complete or kanban_block at any point in the session, and never wrote any file either.
  7. Dispatcher logs:
    protocol_violation: worker exited cleanly (rc=0) without calling
      kanban_complete or kanban_block — protocol violation
    and ticks the failure counter. After 3 such occurrences the task is set to blocked with diagnostic Agent crash x3.

In the original case I hit, the dispatcher even applied effective_limit: 1, limit_source: 'dispatcher' on the first failure (not the max-retries: 2 from the task config), making it harder to recover without manual unblock.

Why this is a worker-side bug, not a model bug

The agent reached finish_reason=stop with a 668-character text response — from the agent's perspective, the turn ended normally. There is no signal in the conversation history that "you ended your turn without checkpointing the kanban task." The worker's protocol enforcement is invisible to the agent itself.

I observed three different root causes for this same dispatcher-level symptom across three runs of the same task:

  1. Workspace contention — worker spawned while another worker for an earlier task was still using the same dir workspace. Crashed in <60s.
  2. Model authconfig.yaml default had been flipped to a model not available on the user's ChatGPT account (gpt-5.5-codex); HTTP 400 from the codex backend was non-retryable, the configured anthropic fallback failed because the auxiliary client only reads auth.json (not the fallback_providers block in config.yaml), so the worker exited rc=0.
  3. Text-response-without-checkpoint — described above; agent finished its turn cleanly and exited.

All three surface to the dispatcher with the identical protocol_violation string, which made root-cause diagnosis significantly slower than it needed to be.

Suggested fixes (any one would help)

  1. Inject a checkpoint reminder on finish_reason=stop without prior checkpoint. Before the worker exits, if the agent ended its turn with a text response and never called kanban_complete / kanban_block, inject a <system-reminder>-style message back into the conversation: "You ended your turn without calling kanban_complete or kanban_block. Please call the appropriate one to checkpoint your work." Give the agent one more turn to comply.

  2. Differentiate the dispatcher error message. protocol_violation should at minimum distinguish between:

    • "worker process crashed unexpectedly" (true crash)
    • "agent reached finish_reason=stop without calling protocol tool" (clean exit, no checkpoint)
    • "model/auth error caused immediate exit" (worker never got a turn)

    Different upstream causes deserve different remediation paths.

  3. Render the checkpoint requirement at every turn, not just the initial prompt. Kanban-work system prompts typically describe the checkpoint requirement in the opening text — by turn 28 it's outside the model's effective recency window for instruction-following. A short, persistent reminder appended to each user/tool turn (à la Claude Code's <system-reminder> blocks) would substantially reduce this failure mode.

  4. Don't auto-block on the first checkpoint-omission failure. The dispatcher's effective_limit: 1 override of the task's max-retries: 2 makes this failure mode unrecoverable without manual unblock. Where the worker exit shape is "agent thinks it's done, just forgot to checkpoint," a single automatic retry with an injected reminder is much more useful than immediate hard-block.

Logs

I have full agent.log and gateway.log excerpts containing all three failure shapes (~250 KB each, with matter-specific paths). Happy to attach a redacted subset on request; the session IDs and timing above should be enough to triangulate on the Hermes side. Relevant session IDs:

  • 20260516_145050_c0bdcb — failure shape 1 (workspace contention + model auth, mixed signal)
  • 20260516_152454_13e840 — failure shape 2 (clean model auth failure on gpt-5.5-codex)
  • 20260516_163300_efbf57 — failure shape 3 (clean gpt-5.3-codex run, agent finished with text response without checkpoint — the canonical repro of this issue)

Workaround

For now: if you suspect this failure mode (worker ran for non-trivial wall-clock time, tool_turns > 5, but no file changes on disk), assume the agent did exploratory work and decided "I'm done" without checkpointing. Either:

  • hermes kanban --board <slug> unblock <task_id> and let the dispatcher retry, OR
  • Transfer the work to another agent / do it manually, then unblockcomplete --result "..." --metadata '{...}'. Note: complete doesn't accept --note; use comment first if you want a narrative record.

Thanks for the great tool — this is a sharp edge but the kanban pattern itself is excellent and worth the polish.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix Kanban worker reports 'protocol_violation' when agent ends turn with text response instead of calling kanban_complete/kanban_block