claude-code: Background agents silently die on session pause/resume (no completion notification, no work recovery)

Summary

Background agents (Agent tool invoked with run_in_background: true) are terminated when the Claude Code session is paused (e.g. closing the laptop, OS sleep, long idle). When the session resumes hours later, those agents are gone from the UI, no completion notification ever arrives for them, and any unfinished work is left stranded in their isolated worktrees, uncommitted or unpushed. The model has no signal that the agents died: they look indistinguishable from agents that completed successfully.

Repro

Dispatch N background agents via the Agent tool with run_in_background: true and isolation: "worktree".
Close the laptop / let the OS sleep / leave the Claude Code session idle for several hours.
Reopen / resume the Claude Code session the next day.
Check the agent panel — agents are gone (no longer listed as running).
Check task notifications — NO completion notifications arrive for the dispatched agents.
Check the agents' isolated worktrees — partial uncommitted work; some may have committed locally but never pushed.
The model continues as if the agents finished successfully and dispatches the next batch, only to discover via manual investigation (git ls-remote + worktree inspection) that nothing landed on origin.

Evidence from a recent multi-day workflow

Pause window 1 (overnight gap ~16 hours): 3 background agents (finishers for stalled PRs) dispatched at 18:07 IST. Session resumed at 14:51 IST the next day. All 3 agents were gone, their worktrees were left mid-rebase, and no completion notification ever fired. Manual diagnosis was needed to work out what had landed and what was lost.
Pause window 2 (overnight gap ~16 hours): A batch of 6 background agents dispatched at 20:44 IST. Session resumed at 12:25 IST the next day. ZERO of the 6 agents produced a PR on origin. ZERO completion notifications fired. All 6 worktrees were in stale partial-work states.

Each loss required manual investigation by both the operator and the model to determine:

Which agents had committed locally but never pushed
Which had pushed but never PR'd
Which had PR'd but never merged
Which had done nothing at all
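
The triage above can be partly automated. The following is a minimal sketch, not part of Claude Code: the names `AgentOutcome`, `WorktreeFacts`, `classify`, and `collect_git_facts` are hypothetical, the base branch `origin/main` is an assumption, and PR status has to be supplied by the caller because git alone cannot report it. It also adds an "uncommitted changes only" bucket for the partial work described in the repro.

```python
import subprocess
from dataclasses import dataclass
from enum import Enum


class AgentOutcome(Enum):
    NOTHING_DONE = "did nothing at all"
    UNCOMMITTED = "uncommitted changes only"
    COMMITTED_NOT_PUSHED = "committed locally, never pushed"
    PUSHED_NO_PR = "pushed, never PR'd"
    PR_NOT_MERGED = "PR'd, never merged"
    MERGED = "merged"


@dataclass
class WorktreeFacts:
    dirty: bool             # `git status --porcelain` printed something
    commits_ahead: int      # commits on HEAD that are not on the base branch
    branch_on_origin: bool  # `git ls-remote --heads origin <branch>` printed something
    pr_open: bool = False   # supplied by the caller; git cannot tell
    pr_merged: bool = False


def classify(facts: WorktreeFacts) -> AgentOutcome:
    """Map observed facts to the furthest stage the agent reached."""
    if facts.pr_merged:
        return AgentOutcome.MERGED
    if facts.pr_open:
        return AgentOutcome.PR_NOT_MERGED
    if facts.branch_on_origin:
        # Simplification: does not check whether origin has every local commit.
        return AgentOutcome.PUSHED_NO_PR
    if facts.commits_ahead > 0:
        return AgentOutcome.COMMITTED_NOT_PUSHED
    if facts.dirty:
        return AgentOutcome.UNCOMMITTED
    return AgentOutcome.NOTHING_DONE


def _git(worktree: str, *args: str) -> str:
    """Run a git command inside the given worktree and return its stdout."""
    result = subprocess.run(
        ["git", "-C", worktree, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def collect_git_facts(worktree: str, branch: str, base: str = "origin/main") -> WorktreeFacts:
    """Gather the git-observable facts; PR flags stay False unless set by the caller."""
    return WorktreeFacts(
        dirty=bool(_git(worktree, "status", "--porcelain")),
        commits_ahead=int(_git(worktree, "rev-list", "--count", f"{base}..HEAD")),
        branch_on_origin=bool(_git(worktree, "ls-remote", "--heads", "origin", branch)),
    )
```

The branch name is passed in explicitly because a worktree left mid-rebase has a detached HEAD, so it cannot be read reliably from the worktree itself.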

Expected behavior — pick one

Option A: Agents persist across session pause/resume. Long-running agents pick back up where they left off; their state is checkpointed to disk on pause.
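
As an illustration of what checkpointing to disk could involve, here is a minimal sketch under stated assumptions. The state layout and the function names `save_checkpoint` and `load_checkpoint` are hypothetical; nothing here reflects how Claude Code actually stores agent state. The only technique shown is an atomic write, so that a freeze in the middle of a save cannot leave a half-written checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state as JSON atomically: temp file first, then rename over the target."""
    target = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reach disk before the rename
    os.replace(tmp_name, target)   # atomic on the same filesystem


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

A resumed session could call `load_checkpoint` for each agent and either restart the agent from the recorded step or report it as interrupted.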

Option B: On session resume, the harness reports clearly: "N background agents were terminated by session pause at TIMESTAMP. Their outputs may be incomplete. Last-known states: ..." — so the model and operator both know to investigate before continuing.
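
A rough sketch of how such a report could be assembled, assuming the harness keeps one record per dispatched agent. `AgentRecord` and `build_resume_report` are hypothetical names invented for illustration; the message wording follows the text of Option B.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class AgentRecord:
    agent_id: str
    last_known_state: str  # free-text description recorded before the pause
    finished: bool         # True only if the agent reported completion


def build_resume_report(records: List[AgentRecord], paused_at: datetime) -> str:
    """Return a resume-time warning, or an empty string if no agent was interrupted."""
    lost = [record for record in records if not record.finished]
    if not lost:
        return ""
    lines = [
        f"{len(lost)} background agent(s) were terminated by session pause at "
        f"{paused_at.isoformat()}. Their outputs may be incomplete. Last-known states:"
    ]
    lines += [f"- {record.agent_id}: {record.last_known_state}" for record in lost]
    return "\n".join(lines)
```

Returning an empty string when every agent finished keeps the reminder out of sessions where nothing was lost, so the warning stays meaningful when it does appear.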

Actual behavior

Silent loss. Indistinguishable from "agents completed successfully and the notifications just haven't been read yet." Defeats the purpose of background agents for any workflow that spans more than a few hours of active session time.

Impact

Token waste — agents that almost-finished get nuked, and follow-up agents have to redo work
Wall-clock waste — operators discover the loss hours later after dispatching dependent batches
Decision making — the model believes the work shipped and proceeds; downstream batches reference work that doesn't actually exist on origin
Trust — operators stop trusting background-agent reports because they can't tell live from dead

Frequency

Encountered 4 times across multi-day sessions. Reproduces reliably whenever a Claude Code session with running background agents is paused for several hours or more.

Suggested mitigations

Persist agent state to disk across pause / resume (Option A above) — ideal.
At minimum, on session resume, emit a system reminder listing the agents terminated by the pause, with their IDs and last-known states (Option B above). This would close the silent-loss trap.
Add a session_pause hook that lets the harness or user-defined scripts gracefully checkpoint / commit / push work-in-progress before the freeze.
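
To make the hook idea concrete, here is a sketch of what a user-defined checkpoint script might do if such a hook existed. The `session_pause` hook is only a proposal in this report, and `wip_ref` and `snapshot_worktree` are hypothetical names. The sketch uses `git stash create`, which builds a commit object from tracked changes without touching the worktree, the index, or the stash list, and then pushes that commit to a side branch on the remote. Untracked files are not captured, and behaviour in a worktree that is mid-rebase has not been verified here.

```python
import subprocess
from typing import Callable, Optional


def wip_ref(agent_id: str) -> str:
    """Remote branch that receives the snapshot (naming scheme is an assumption)."""
    return f"refs/heads/wip/{agent_id}"


def snapshot_worktree(
    worktree: str,
    agent_id: str,
    remote: str = "origin",
    run: Callable = subprocess.run,
) -> Optional[str]:
    """Snapshot tracked work-in-progress and push it; return the commit SHA or None."""
    created = run(
        ["git", "-C", worktree, "stash", "create", f"pause checkpoint for {agent_id}"],
        capture_output=True, text=True, check=True,
    )
    sha = created.stdout.strip()
    if not sha:
        return None  # no tracked changes, nothing to save
    run(
        ["git", "-C", worktree, "push", remote, f"{sha}:{wip_ref(agent_id)}"],
        capture_output=True, text=True, check=True,
    )
    return sha
```

Pushing to a dedicated `wip/` branch rather than the agent's working branch means the snapshot cannot be mistaken for finished work, while still leaving something on origin that can be inspected after a resume.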
