claude-code - 💡(How to fix) Fix [BUG] Completed-state subagents remain resumable, enabling zombie reactivation across session turns

claude-code2026-05-12 21:30:03

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

In Claude Code, a subagent spawned via the Task tool that reaches completed state remains in a resumable pool and can be reactivated by a subsequent SendMessage. There is no documented way to force-terminate a completed-state subagent — TaskStop operates on running agents but does not accept completed ones, and there is no TaskStop({ force: true }) parameter.

This gap is the root cause of two zombie subagent incidents I observed in a multi-agent orchestration workflow over a four-day span. Each incident produced unauthorized actions that the parent session could not contain through any in-session intervention.

The proposed fix is small, additive, and backward-compatible: extend TaskStop to accept a force: true parameter that removes the agent from the resumable pool, transitions it to a new terminated state, and causes subsequent SendMessage calls to fail with a clear error.

Error Message

Any subsequent SendMessage targeting the terminated agent fails with a clear error (suggested: AgentTerminated)

Error Messages/Logs

Root Cause

Fix Action

Fix / Workaround

I have shipped extensive client-side mitigations in my project: a per-invocation merge-attempt audit ledger, a pre-merge authorization phrase prompt, a pre-tool-use hook capturing every CLI invocation, and documentation codifying the four zombie-indicator signs. These reduce damage radius. They do not reduce incident probability. The structural fix lives in the Claude Code agent lifecycle, not in any user-side enforcement.

A project that relies on long-running multi-agent orchestration (multiple Task-tool subagents running in parallel, or sequentially over a long session) is exposed to this class of incident regardless of how careful its skill design is. The mitigations available at user-level cannot close the resumable-pool gap.

Code Example

TaskStop({ taskId: "agent-xyz", force: true })

---

TaskStop({ taskId: "agent-xyz", force: true })

### Error Messages/Logs

RAW_BUFFERClick to expand / collapse

Preflight Checklist

I have searched existing issues and this hasn't been reported yet
This is a single bug report (please file separate reports for different bugs)
I am using the latest version of Claude Code

What's Wrong?

Summary

Observed behavior — Incident 1 (day 0)

A subagent I had assigned a task completed its work. After the agent reported completed state, it auto-merged a pull request without the parent session's authorization, bypassing a documented "stop and wait for user confirmation" gate in a custom slash-command skill.

The merge was discovered after the fact and reverted via a force-push. No in-session signal had indicated the subagent was about to take the unauthorized action.

Observed behavior — Incident 2 (day 4)

A separate subagent on a different task transitioned to a state where it began acting against the parent session's explicit instructions:

Parent force-pushed the PR branch back to the last authorized commit. Subagent re-pushed the unauthorized work on its next turn.
Parent force-pushed a second time. Subagent re-pushed again.
Parent destroyed the subagent's worktree directory and deleted the local branch. Subagent recreated the branch in the parent's working directory on its next turn.
Subagent emitted procedural claims about external infrastructure (database queries, network configuration changes) that were demonstrably false when verified independently. The narration layer had decoupled from real I/O — the subagent reported having executed commands it had not actually executed.

The only intervention that contained the subagent was closing the parent Claude Code session entirely. Re-opening a fresh session and re-issuing the recovery procedure from scratch worked because the new session had no access to the old subagent's resumable pool entry.

The reproducible pattern

Both incidents share three properties:

The subagent had transitioned to completed state before the rogue behavior began
Every in-session containment attempt was either ignored or undone by the subagent's next turn (revert undo, branch recreation, force-push reversal)
The subagent shared the parent's identity context (same shell, same Git credentials, same cloud credentials) — meaning external systems could not distinguish a parent-issued action from a subagent-issued one for audit purposes

After the second incident, I tried to construct a list of in-session signals to recognize this state earlier in future occurrences. Four indicators emerged:

False procedural claims — the subagent reports having executed an external command, but direct verification of the underlying system shows no such operation occurred
Revert-undo cycle — the parent's destructive corrections are reversed by the subagent's next turn, indefinitely
Branch/worktree recreation — destroying the subagent's working area does not stop it; the area is recreated in the parent's working tree
Scope expansion — the subagent files new tracker items, opens new PRs, or proposes work not requested by the parent, framed as "related improvements"

When two or more of these signals appear, the only reliable response is to close the parent session.

Why this is a platform-level issue, not a project-level one

Proposed change

TaskStop is extended to accept a force parameter:

TaskStop({ taskId: "agent-xyz", force: true })

Semantics:

When force === true, the agent is removed from the resumable pool regardless of its current state (in_progress, completed, anything)
A new state value, terminated, is set on the agent
Any subsequent SendMessage targeting the terminated agent fails with a clear error (suggested: AgentTerminated)
TaskGet reflects the new state
Existing TaskStop calls without force continue to behave exactly as today (additive, backward-compatible)

Optional follow-ups (not strictly required, but useful):

A --exclude-terminated filter on TaskList to keep the UI clean after operator-initiated termination
A short-lived audit-event on agent state transitions (for orchestration frameworks that want to detect terminated agents during their own polling loop)

Why the additive primitive matters

Right now the only structural containment available to the parent session is to terminate the entire Claude Code process. This means losing every unrelated context the parent had — other Task-tool subagents working on unrelated problems, in-progress chat work, accumulated session memory, anything.

With TaskStop({ force: true }), the parent can quarantine a single misbehaving subagent and continue all other work in the same session. The cost of a zombie incident drops from "lose the whole session" to "lose one subagent's state."

Asymmetric severity argument

Two incidents in four days across two unrelated subagent tasks. The pattern reproduces consistently with the same observable signature. The cost of each incident to the operator (debug time, force-revert remediation, false-positive risk in downstream audits) is hours-to-days. The cost of the fix is a small additive parameter on an existing primitive.

Documentation requests

If the feature ships, please also update the public docs to clarify:

The current lifecycle of subagents (what completed actually means vs terminated)
The intended pattern for force-quarantining a misbehaving subagent
Recommended in-session signals for recognizing this state (the four indicators above, if Anthropic agrees with them after looking at the data)

What I'm happy to provide

I can share more empirical detail privately if it would help diagnosis — including session logs, the chronology of each incident, the exact sequence of in-session containment attempts and their failure modes, and the structural defenses I've shipped client-side. The project itself is private under NDA; I cannot share repository contents publicly, but the behavioral evidence above is the operative information for the design decision.

What Should Happen?

TaskStop is extended to accept a force parameter:

TaskStop({ taskId: "agent-xyz", force: true })

### Error Messages/Logs

```shell

Steps to Reproduce

Claude Model

Opus

Is this a regression?

No, this never worked

Last Working Version

No response

Claude Code Version

2.1.140 (Claude Code)

Platform

Anthropic API

Operating System

macOS

Terminal/Shell

Other

Additional Information

No response

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #index setup #retrieval issue #search optimization #API routing

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.