feat(kanban): persist worker session_id per run and pass --resume on respawn after unblock

When a task ends up blocked — either because a worker called kanban_block deliberately (review-required, ambiguity, partial progress) or because the worker process died and the dispatcher auto-blocked it (laptop sleep / container freeze / Copilot blip / OOM / protocol violation) — and the human later unblocks the task (e.g. moves the task back to ready), the dispatcher spawns a fresh worker with the original task body. The prior worker's session JSON — with the full conversation, every tool call, every decision — sits on disk and is ignored.

Workers already write resumable session files under profiles/<assignee>/sessions/session_*.json, and hermes chat --resume <session_id> reloads the conversation with full tool-call history. The dispatcher just doesn't wire it up.

Filing as a feature request: persist the worker's session id per run, and pass --resume <id> to the spawn after an unblock so the new worker continues from where the prior one left off.

This is orthogonal to (and complements) the dispatcher-classification work in #27178 / #27924 / #29747 / #32747 — those make the dispatcher smarter about whether to retry. This makes the retry itself useful when the human says "go".

Motivating scenarios

A) The intended review-required lifecycle

  1. Worker A picks up a task, gets to a review-required handoff, calls kanban_block(reason="review-required: ...").
  2. Human reviews the worker's MR/output, leaves a kanban_comment with corrections or a "looks good, continue with X".
  3. Human runs hermes kanban unblock <task_id>.
  4. Dispatcher spawns Worker B — and Worker B has no idea what A did.

What Worker B actually sees is the original task body, as if it had just claimed a fresh task. The human's comment can be read from the kanban board, but the agent has to rediscover everything Worker A learned: which files were touched, which design decisions were made, which dead ends were ruled out, and what the partial state on disk actually means.

For long, exploratory tasks (the kind where review-required handoffs are most useful) this is a meaningful regression: the entire reason we blocked-with-handoff was to checkpoint the agent's reasoning for human review. Re-spawning fresh discards that checkpoint instead of continuing from it.

B) Auto-block recovery (laptop sleep, container freeze, Copilot blip, OOM)

Same shape, different trigger. Single-container Hermes (gateway run, dispatch_in_gateway: true) on a laptop. Worker mid-task. User closes the laptop overnight, or a transient Copilot connection error exhausts the worker's retries, or the container restarts. Worker dies without calling kanban_complete/kanban_block → dispatcher auto-blocks with protocol_violation.

Next morning the human realises, runs unblock, expects work to continue. Instead Worker B spawns fresh — same problem as scenario A, except now the human didn't even get a kanban_comment opportunity to summarise where things stood. The agent's own reasoning was the only record of what was happening, and it's been discarded.

We hit both flavours this week. Wrote a manual SQL+env-pin recipe to resume the dead session for B; got the agent back exactly where it left off with full context (184 messages of prior tool calls). That's what convinced us the dispatcher could do the same automatically.

What we did manually to validate the idea

To prove the primitives already exist, we wrote a "manual resume" recipe:

  1. Insert a fresh task_runs row with status='running', claim_lock='manual-resume', claim_expires=now+86400.
  2. Update the tasks row: status='running', same claim_lock, current_run_id=last_insert_rowid().
  3. Exec hermes -p <profile> chat --resume <session_id> -q "<follow-up message>" with HERMES_KANBAN_TASK, HERMES_KANBAN_CLAIM_LOCK, HERMES_KANBAN_RUN_ID pinned in env.

The dispatcher leaves the row alone (only claims ready); the resumed agent passes _enforce_worker_task_ownership and the claim-lock check on kanban_complete/kanban_heartbeat; when it calls kanban_complete, the run row closes normally and current_run_id resets to NULL.

The follow-up message becomes the next user turn — e.g. "the human reviewed and asks you to use approach X instead of Y, then push and complete." The agent reads it, sees its prior 184 messages of context, and continues.

This is what convinced us the dispatcher-side change is small: --resume, env-pinned claim_lock, run_id ownership all already exist. The dispatcher just doesn't wire them together on unblock.
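
For reference, the three manual steps can be sketched in Python roughly as follows. This is a sketch under assumptions, not the recipe we actually ran: the toy `tasks` / `task_runs` schema only contains the columns named above, the text task id `t1` and the session id `abc123` are made up, and the function only builds the command and environment rather than exec'ing `hermes`.

```python
import os
import sqlite3
import time


def manual_resume(conn, task_id, profile, session_id, follow_up, now=None):
    """Claim the task by hand, then build the resume command and its environment."""
    now = int(time.time()) if now is None else now
    claim_lock = "manual-resume"
    with conn:  # one transaction: both writes land or neither does
        cur = conn.execute(
            "INSERT INTO task_runs (task_id, status, claim_lock, claim_expires) "
            "VALUES (?, 'running', ?, ?)",
            (task_id, claim_lock, now + 86400),
        )
        run_id = cur.lastrowid  # same value last_insert_rowid() would return
        conn.execute(
            "UPDATE tasks SET status = 'running', claim_lock = ?, current_run_id = ? "
            "WHERE id = ?",
            (claim_lock, run_id, task_id),
        )
    cmd = ["hermes", "-p", profile, "chat", "--resume", session_id, "-q", follow_up]
    env = dict(os.environ)
    env.update(
        HERMES_KANBAN_TASK=str(task_id),
        HERMES_KANBAN_CLAIM_LOCK=claim_lock,
        HERMES_KANBAN_RUN_ID=str(run_id),
    )
    return cmd, env


# Toy schema for demonstration only; the real tables are assumed to have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, claim_lock TEXT,
                        current_run_id INTEGER);
    CREATE TABLE task_runs (id INTEGER PRIMARY KEY, task_id TEXT, status TEXT,
                            claim_lock TEXT, claim_expires INTEGER);
    INSERT INTO tasks (id, status) VALUES ('t1', 'blocked');
    """
)
cmd, env = manual_resume(
    conn, "t1", "sonnet-dev", "abc123",
    "Use approach X, then push and complete.", now=1000,
)
```

The returned `cmd` and `env` could then be handed to `subprocess.run(cmd, env=env)`; keeping the exec out of the function makes the claim step easy to check on its own.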

Proposed change

Schema: add session_id TEXT to task_runs (per-run, not per-task — multiple runs accumulate across block/unblock cycles).
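
A minimal sketch of that migration, assuming the kanban store is SQLite (the manual recipe's `last_insert_rowid()` suggests so) and that no migration helper already exists. The function name `add_session_id_column` and the toy table are hypothetical.

```python
import sqlite3


def add_session_id_column(conn):
    """Add task_runs.session_id if it is missing; return True when a change was made."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(task_runs)")]
    if "session_id" in columns:
        return False
    with conn:
        conn.execute("ALTER TABLE task_runs ADD COLUMN session_id TEXT")
    return True


# Toy table for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_runs (id INTEGER PRIMARY KEY, task_id TEXT, status TEXT)")
first = add_session_id_column(conn)   # adds the column
second = add_session_id_column(conn)  # already present, so nothing to do
```

Checking `PRAGMA table_info` first keeps the migration safe to run on every startup; existing rows simply get `NULL` in the new column.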

Capture: when the worker starts up under a kanban task, write its session id back to the task_runs row keyed by HERMES_KANBAN_RUN_ID. Probably cleanest via a kanban_register_session MCP tool the worker calls once on startup (the agent already has the session id from chat boot output).
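
One possible shape for the handler behind such a tool, sketched under assumptions: `register_session` is a hypothetical name, the run id is assumed to be the integer rowid of `task_runs`, and "first registration wins" is a design choice of this sketch rather than anything decided in the issue.

```python
import os
import sqlite3


def register_session(conn, session_id, environ=os.environ):
    """Store the worker's session id on the run named by HERMES_KANBAN_RUN_ID."""
    run_id = environ.get("HERMES_KANBAN_RUN_ID")
    if not run_id:
        return False  # not running under a kanban task
    with conn:
        cur = conn.execute(
            # Only fill an empty slot, so a later call cannot overwrite the first id.
            "UPDATE task_runs SET session_id = ? WHERE id = ? AND session_id IS NULL",
            (session_id, int(run_id)),
        )
    return cur.rowcount == 1


# Toy table for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_runs (id INTEGER PRIMARY KEY, task_id TEXT, session_id TEXT)")
conn.execute("INSERT INTO task_runs (task_id) VALUES ('t1')")

registered = register_session(conn, "abc123", {"HERMES_KANBAN_RUN_ID": "1"})
repeated = register_session(conn, "zzz999", {"HERMES_KANBAN_RUN_ID": "1"})
outside_kanban = register_session(conn, "abc123", {})
```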

Resume on unblock: in _default_spawn (or wherever the cmd is constructed in kanban_db.py), look up the prior run's session_id for the same task. If present and the session file still exists on disk, append --resume <session_id> to the cmd. The -q "<prompt>" becomes a follow-up turn in the existing conversation rather than a fresh prompt; the prompt can be either the original task body (idempotent — the agent will see "this is my own prior task, I've been resumed") or, ideally, the contents of any kanban_comment posted since the block (the human's "go" message becomes the new user turn naturally).
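
A rough sketch of that lookup and command construction. Several things here are assumptions and not the real `_default_spawn`: the fresh-spawn command shape `hermes -p <profile> chat -q <prompt>`, the file name `session_<id>.json` (the issue only gives the glob `session_*.json`), and the helper names `prior_session_id` and `build_spawn_cmd`.

```python
import sqlite3
import tempfile
from pathlib import Path


def prior_session_id(conn, task_id):
    """Return the session id of the most recent run of this task that recorded one."""
    row = conn.execute(
        "SELECT session_id FROM task_runs "
        "WHERE task_id = ? AND session_id IS NOT NULL "
        "ORDER BY id DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    return row[0] if row else None


def build_spawn_cmd(conn, task_id, profile, prompt, profiles_root):
    """Build the worker command, adding --resume only when the session file exists."""
    cmd = ["hermes", "-p", profile, "chat"]
    session_id = prior_session_id(conn, task_id)
    if session_id:
        # Assumed layout: <profiles_root>/<profile>/sessions/session_<id>.json
        session_file = Path(profiles_root) / profile / "sessions" / f"session_{session_id}.json"
        if session_file.is_file():
            cmd += ["--resume", session_id]
    cmd += ["-q", prompt]
    return cmd


# Toy data for demonstration only: one finished run with a session, one new run without.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_runs (id INTEGER PRIMARY KEY, task_id TEXT, session_id TEXT)")
conn.executemany(
    "INSERT INTO task_runs (task_id, session_id) VALUES (?, ?)",
    [("t1", "abc123"), ("t1", None)],
)
root = Path(tempfile.mkdtemp())
(root / "sonnet-dev" / "sessions").mkdir(parents=True)
(root / "sonnet-dev" / "sessions" / "session_abc123.json").write_text("{}")

resumed = build_spawn_cmd(conn, "t1", "sonnet-dev", "continue with X", root)
fresh = build_spawn_cmd(conn, "t2", "sonnet-dev", "original task body", root)
```

If the session file has been removed, the same call quietly produces the fresh-spawn command; a real implementation would probably also log a warning at that point.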

Opt-out: a kanban.resume_on_respawn config knob (default false while bedding in, true once proven). Or per-task at submit time.
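
How the knob and a per-task setting might combine, as a small sketch. The nested-dict config shape and the function name `should_resume` are assumptions; only the key name `kanban.resume_on_respawn` comes from the proposal.

```python
def should_resume(config, task_override=None):
    """Per-task setting wins when present; otherwise use the global knob (default off)."""
    if task_override is not None:
        return bool(task_override)
    return bool(config.get("kanban", {}).get("resume_on_respawn", False))


default_off = should_resume({})
global_on = should_resume({"kanban": {"resume_on_respawn": True}})
task_opts_out = should_resume({"kanban": {"resume_on_respawn": True}}, task_override=False)
task_opts_in = should_resume({}, task_override=True)
```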

Both scenarios A and B share the same code path on the dispatcher side — they're both "blocked → unblocked → respawn." The change is symmetric.

Open design questions

Worth a maintainer steer:

  1. Which session id wins when there are multiple prior runs? Most-recent run with session_id IS NOT NULL seems right but might not be — e.g. cycle N might have crashed in <60s before the prompt was displayed.
  2. What prompt to pass on the --resume spawn? --resume alone is a no-op (the agent is just resumed); the -q arg becomes a fresh user turn. Best UX is probably "the most recent kanban_comment since the prior block" — that's the human's "here's what to do next" message and it naturally becomes the new user turn. Falls back to the original task body if no comment exists.
  3. What if the session file is gone? Profile change, manual cleanup, etc. Fall back to fresh-spawn with a logged warning.
  4. Opt-in per task or global? Some workloads might prefer "always fresh on respawn" semantics (idempotent batch jobs). A task.resume_on_respawn BOOLEAN column + CLI flag could express this; default to global config knob if unset.
  5. Multi-profile: a task assigned to sonnet-dev writes sessions under profiles/sonnet-dev/sessions/. On respawn the dispatcher re-spawns hermes -p sonnet-dev, which reads from the same dir. Should be fine but worth confirming the session file path is profile-scoped under the new spawn.
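
To make question 2 concrete, here is one way the prompt selection could look. It is a sketch of the option described above, not a settled design: the `Comment` record, its integer `created_at` timestamp, and the function name `choose_resume_prompt` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    created_at: int  # assumed: seconds since epoch
    body: str


def choose_resume_prompt(task_body, comments, blocked_at):
    """Use the newest comment posted after the block, else fall back to the task body."""
    newer = [c for c in comments if c.created_at > blocked_at]
    if not newer:
        return task_body
    return max(newer, key=lambda c: c.created_at).body


comments = [
    Comment(created_at=100, body="early note, before the block"),
    Comment(created_at=250, body="use approach X instead of Y"),
    Comment(created_at=300, body="then push and complete"),
]
with_comment = choose_resume_prompt("original task body", comments, blocked_at=200)
without_comment = choose_resume_prompt("original task body", comments, blocked_at=400)
```

Taking only the newest comment keeps the new user turn short; concatenating every comment since the block would be an equally reasonable variant.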

Why we think it's worth doing

The block-with-handoff → human review → unblock pattern is the intended lifecycle for any task complex enough to need a human checkpoint. The auto-block-on-crash → human-notices → unblock pattern is the same lifecycle, just triggered by infrastructure instead of the agent itself. Today both lifecycles silently discard the agent's prior reasoning at the unblock step.

The dispatcher-side cost is small: ~1 schema column, ~1 capture mechanism, ~10 lines in _default_spawn plus the prior-run lookup. The primitives all exist; they just need to be wired.

Happy to take a stab at the implementation if the design direction looks right.
