hermes - 💡(How to fix) Fix Kanban worker reports 'protocol_violation' when agent ends turn with text response instead of calling kanban_complete/kanban

StepCodex · 2026-05-16T23:58:08Z

[hermes] Kanban worker exits cleanly rc=0 with protocol violation when the agent ends its final turn with a text response finish reason=stop instead of calling… Kanban worker exits cleanly (rc=0) with `protocol_violation` when the agent ends its final turn with a text response (`finish_reason=stop`) instead of calling `kanban_complete` or `kanban_block`. There is no nudge back to the agent, no fallback in the worker, and the dispatcher counts each occurrence as a `consecutive_failures` tick — so after the retry limit the task is hard-blocked despite the agent having completed (or partially completed) substantive work. ## Workaround For now: if you suspect this failure mode (worker ran for non-trivial wall-clock time, `tool_turns` > 5, but no file changes on disk), assume the agent did exploratory work and decided "I'm done" without checkpointing. Either: - `hermes kanban --board unblock ` and let the dispatcher retry, OR - Transfer the work to another agent / do it manually, then `unblock` → `complete --result "..." --metadata '{...}'`. Note: `complete` doesn't accept `--note`; use `comment` first if you want a narrative record. Thanks for the great tool — this is a sharp edge but the kanban pattern itself is excellent and worth the polish. ## Summary Kanban worker exits cleanly (rc=0) with `protocol_violation` when the agent ends its final turn with a text response (`finish_reason=stop`) instead of calling `kanban_complete` or `kanban_block`. There is no nudge back to the agent, no fallback in the worker, and the dispatcher counts each occurrence as a `consecutive_failures` tick — so after the retry limit the task is hard-blocked despite the agent having completed (or partially completed) substantive work. ## Environment - **Hermes version:** v0.13.0 (2026.5.7) - **Platform:** macOS (Darwin 25.3.0) - **Provider:** `openai-codex` (ChatGPT subscription OAuth) - **Model:** `gpt-5.3-codex` - **Board:** custom `legal` board on a single-host `dir` workspace (`workspace: dir @ /path/to/repo`) - **Worker invocation:** dispatched by `ai.hermes.gateway` launchd service (60s tick) ## Reproduction 1. Create a kanban task that asks the agent to author a substantial markdown file (in this repro: a California family-law authority note, ~5 KB target with structured sections). 2. Gateway dispatcher claims the task and spawns a worker process. 3. The agent works through ~28 tool turns / ~29 API calls over ~3 minutes — `search_files`, `read_file`, `terminal`. One tool error in this trace (`pdftotext: command not found, exit_code 127`) but the agent recovered. 4. Final API call returns: ``` Turn ended: reason=text_response(finish_reason=stop) model=gpt-5.3-codex api_calls=29/60 budget=29/60 tool_turns=28 last_msg_role=assistant response_len=668 session=20260516_163300_efbf57 ``` 5. The worker proceeds with post-turn housekeeping — the `bg-review` skill-library updater is kicked off, terminal environment is cleaned up — then exits rc=0. 6. **The agent never called `kanban_complete` or `kanban_block`** at any point in the session, and never wrote any file either. 7. Dispatcher logs: ``` protocol_violation: worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation ``` and ticks the failure counter. After 3 such occurrences the task is set to `blocked` with diagnostic `Agent crash x3`. In the original case I hit, the dispatcher even applied `effective_limit: 1, limit_source: 'dispatcher'` on the first failure (not the `max-retries: 2` from the task config), making it harder to recover without manual `unblock`. ## Why this is a worker-side bug, not a model bug The agent reached `finish_reason=stop` with a 668-character text response — from the agent's perspective, the turn ended normally. There is no signal in the conversation history that "you ended your turn without checkpointing the kanban task." The worker's protocol enforcement is invisible to the agent itself. I observed three different root causes for this same dispatcher-level symptom across three runs of the same task: 1. **Workspace contention** — worker spawned while another worker for an earlier task was still using the same `dir` workspace. Crashed in <60s. 2. **Model auth** — `config.yaml` default had been flipped to a model not available on the user's ChatGPT account (`gpt-5.5-codex`); HTTP 400 from the codex backend was non-retryable, the configured anthropic fallback failed because the auxiliary client only reads `auth.json` (not the `fallback_providers` block in `config.yaml`), so the worker exited rc=0. 3. **Text-response-without-checkpoint** — described above; agent finished its turn cleanly and exited. All three surface to the dispatcher with the identical `protocol_violation` string, which made root-cause diagnosis significantly slower than it needed to be. ## Suggested fixes (any one would help) 1. **Inject a checkpoint reminder on `finish_reason=stop` without prior checkpoint.** Before the worker exits

Error Message

I have full agent.log and gateway.log excerpts containing all three failure shapes (~250 KB each, with matter-specific paths). Happy to attach a redacted subset on request; the session IDs and timing above should be enough to triangulate on the Hermes side. Relevant session IDs:

20260516_145050_c0bdcb — failure shape 1 (workspace contention + model auth, mixed signal)
20260516_152454_13e840 — failure shape 2 (clean model auth failure on gpt-5.5-codex)
20260516_163300_efbf57 — failure shape 3 (clean gpt-5.3-codex run, agent finished with text response without checkpoint — the canonical repro of this issue)

Fix Action

Workaround

For now: if you suspect this failure mode (worker ran for non-trivial wall-clock time, tool_turns > 5, but no file changes on disk), assume the agent did exploratory work and decided "I'm done" without checkpointing. Either:

hermes kanban --board <slug> unblock <task_id> and let the dispatcher retry, OR
Transfer the work to another agent / do it manually, then unblock → complete --result "..." --metadata '{...}'. Note: complete doesn't accept --note; use comment first if you want a narrative record.

Thanks for the great tool — this is a sharp edge but the kanban pattern itself is excellent and worth the polish.

Code Example

Turn ended: reason=text_response(finish_reason=stop)
     model=gpt-5.3-codex api_calls=29/60 budget=29/60 tool_turns=28
     last_msg_role=assistant response_len=668
     session=20260516_163300_efbf57

---

protocol_violation: worker exited cleanly (rc=0) without calling
     kanban_complete or kanban_block — protocol violation

Summary

Kanban worker exits cleanly (rc=0) with protocol_violation when the agent ends its final turn with a text response (finish_reason=stop) instead of calling kanban_complete or kanban_block. There is no nudge back to the agent, no fallback in the worker, and the dispatcher counts each occurrence as a consecutive_failures tick — so after the retry limit the task is hard-blocked despite the agent having completed (or partially completed) substantive work.

Environment

Hermes version: v0.13.0 (2026.5.7)
Platform: macOS (Darwin 25.3.0)
Provider: openai-codex (ChatGPT subscription OAuth)
Model: gpt-5.3-codex
Board: custom legal board on a single-host dir workspace (workspace: dir @ /path/to/repo)
Worker invocation: dispatched by ai.hermes.gateway launchd service (60s tick)

Reproduction

Create a kanban task that asks the agent to author a substantial markdown file (in this repro: a California family-law authority note, ~5 KB target with structured sections).
Gateway dispatcher claims the task and spawns a worker process.
The agent works through ~28 tool turns / ~29 API calls over ~3 minutes — search_files, read_file, terminal. One tool error in this trace (pdftotext: command not found, exit_code 127) but the agent recovered.

Final API call returns:

Turn ended: reason=text_response(finish_reason=stop)
  model=gpt-5.3-codex api_calls=29/60 budget=29/60 tool_turns=28
  last_msg_role=assistant response_len=668
  session=20260516_163300_efbf57

The worker proceeds with post-turn housekeeping — the bg-review skill-library updater is kicked off, terminal environment is cleaned up — then exits rc=0.
The agent never called kanban_complete or kanban_block at any point in the session, and never wrote any file either.
Dispatcher logs:
```
protocol_violation: worker exited cleanly (rc=0) without calling
  kanban_complete or kanban_block — protocol violation
```
and ticks the failure counter. After 3 such occurrences the task is set to blocked with diagnostic Agent crash x3.

In the original case I hit, the dispatcher even applied effective_limit: 1, limit_source: 'dispatcher' on the first failure (not the max-retries: 2 from the task config), making it harder to recover without manual unblock.

Why this is a worker-side bug, not a model bug

The agent reached finish_reason=stop with a 668-character text response — from the agent's perspective, the turn ended normally. There is no signal in the conversation history that "you ended your turn without checkpointing the kanban task." The worker's protocol enforcement is invisible to the agent itself.

I observed three different root causes for this same dispatcher-level symptom across three runs of the same task:

Workspace contention — worker spawned while another worker for an earlier task was still using the same dir workspace. Crashed in <60s.
Model auth — config.yaml default had been flipped to a model not available on the user's ChatGPT account (gpt-5.5-codex); HTTP 400 from the codex backend was non-retryable, the configured anthropic fallback failed because the auxiliary client only reads auth.json (not the fallback_providers block in config.yaml), so the worker exited rc=0.
Text-response-without-checkpoint — described above; agent finished its turn cleanly and exited.

All three surface to the dispatcher with the identical protocol_violation string, which made root-cause diagnosis significantly slower than it needed to be.

Suggested fixes (any one would help)

Inject a checkpoint reminder on finish_reason=stop without prior checkpoint. Before the worker exits, if the agent ended its turn with a text response and never called kanban_complete / kanban_block, inject a <system-reminder>-style message back into the conversation: "You ended your turn without calling kanban_complete or kanban_block. Please call the appropriate one to checkpoint your work." Give the agent one more turn to comply.
Differentiate the dispatcher error message. protocol_violation should at minimum distinguish between:
- "worker process crashed unexpectedly" (true crash)
- "agent reached finish_reason=stop without calling protocol tool" (clean exit, no checkpoint)
- "model/auth error caused immediate exit" (worker never got a turn)
Different upstream causes deserve different remediation paths.
Render the checkpoint requirement at every turn, not just the initial prompt. Kanban-work system prompts typically describe the checkpoint requirement in the opening text — by turn 28 it's outside the model's effective recency window for instruction-following. A short, persistent reminder appended to each user/tool turn (à la Claude Code's <system-reminder> blocks) would substantially reduce this failure mode.
Don't auto-block on the first checkpoint-omission failure. The dispatcher's effective_limit: 1 override of the task's max-retries: 2 makes this failure mode unrecoverable without manual unblock. Where the worker exit shape is "agent thinks it's done, just forgot to checkpoint," a single automatic retry with an injected reminder is much more useful than immediate hard-block.

Logs

20260516_145050_c0bdcb — failure shape 1 (workspace contention + model auth, mixed signal)
20260516_152454_13e840 — failure shape 2 (clean model auth failure on gpt-5.5-codex)
20260516_163300_efbf57 — failure shape 3 (clean gpt-5.3-codex run, agent finished with text response without checkpoint — the canonical repro of this issue)

Workaround

hermes kanban --board <slug> unblock <task_id> and let the dispatcher retry, OR
Transfer the work to another agent / do it manually, then unblock → complete --result "..." --metadata '{...}'. Note: complete doesn't accept --note; use comment first if you want a narrative record.

Thanks for the great tool — this is a sharp edge but the kanban pattern itself is excellent and worth the polish.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

hermes - 💡(How to fix) Fix Kanban worker reports 'protocol_violation' when agent ends turn with text response instead of calling kanban_complete/kanban_block

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Workaround

Code Example

Summary

Environment

Reproduction

Why this is a worker-side bug, not a model bug

Suggested fixes (any one would help)

Logs

Workaround

Still need to ship something?

TRENDING

hermes - 💡(How to fix) Fix Kanban worker reports 'protocol_violation' when agent ends turn with text response instead of calling kanban_complete/kanban_block

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Workaround

Code Example

Summary

Environment

Reproduction

Why this is a worker-side bug, not a model bug

Suggested fixes (any one would help)

Logs

Workaround

Still need to ship something?

RELATED_DISCOVERY

TRENDING