claude-code - 💡(How to fix) Fix [BUG]Cron-fire dormancy with cluster-correlated multi-agent failures (~72% session-dormancy peak rate) [1 comments, 2 participants]

claude-code2026-04-30 16:13:54

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

anthropics/claude-code#55056•Fetched 2026-05-01 05:47:22

View on GitHub

Comments

Participants

Timeline

Reactions

Author

rickyibarz-jj

Participants

github-actions[bot]

rickyibarz-jj

Timeline (top)

labeled ×3commented ×1

Error Message

Error Messages/Logs

Root Cause

Severity: high (silent-degradation; recoverable via cross-agent watchdog protocols, but agent effectively non-functional ~72% of session-time on worst-affected agent) Affects: Claude Code CLI (latest), durable + session-only CronCreate recurring jobs Reporter: cortextOS (multi-agent system built on Claude Code), 19 agents over ~7 days First identified: 2026-04-23 by an internal-fleet analyst agent â€” "--continue session idle suppresses cron delivery until next wake. Likely root cause for all three stale agents... systemic not agent-failure." Filing target: github.com/anthropics/claude-code

Fix Action

Fix / Workaround

State	Survives gap?	Notes
`cron-state.json` (cortextOS-side)	âœ… yes	clean across all gaps
`heartbeat.json` (cortextOS-side)	âœ… yes	reflects last HB before gap
`MEMORY.md` + daily memory files	âœ… yes	written to disk pre-gap, intact post-gap
Event log JSONL	âœ… yes	gap shows as event-absence, not corruption
Bus inbox queue (cortextOS-side)	âœ… yes	messages queue during dormancy, deliver on resume
`CronList` output (cron expressions)	âœ… yes	verified post-recovery â€” same job IDs preserved across 35h dormancy
In-flight task progress (`task=in_progress` at gap-onset)	âŒ no	ambiguous semantics; agents adopt revert-to-pending workaround
Real-time coordination timing	âŒ no	peer ACKs delayed hours; breaks "real-time bus" assumption
Conversation context (next-step intent)	âŒ no	session-continuation `--continue` restores chat history but loses in-flight intent

Workarounds in use today

Cross-agent watchdog probes: peer agents detect dormancy via heartbeat staleness + bus the affected agent on probe-protocol; receipt of probe wakes the resumed session into recovery ritual. Currently the only reliable wake mechanism for the worst-case "probe-only" shape (hestia #5, 20h).
Sub-cycle decomposition: long-running multi-cycle work is split into â‰¤30-min interruptible units with explicit resume markers
Revert-to-pending discipline: any task in in_progress at heartbeat-cycle-end without confirmed completion gets reverted, preventing post-gap "phantom progress" reads
Cron-fire collapse on resume: agents explicitly collapse queued heartbeat fires into ONE current cycle, logging the collapse-count in session_resumed event meta to preserve observability
Gap-detector cooldown (cortextOS framework patch 2026-04-24): monitors fire once per gap then 120-min cooldown, mitigating alert-spam symptom without addressing underlying cause

Code Example

{
  "agent": "clio",
  "trigger": "cli --continue at 2026-04-27T20:09",
  "cause": "process was not running 2026-04-26T08:43Z to 2026-04-27T20:09Z (~35h)",
  "crons_preserved": true,
  "inbox_backlog": 0,
  "queued_cron_fires_collapsed": "all queued Read-HEARTBEAT prompts collapsed to one cycle to avoid metric inflation"
}

---

session_resumed event meta from clio Gap #1 (35h dormancy):
{
  "agent": "clio",
  "trigger": "cli --continue at 2026-04-27T20:09",
  "cause": "process was not running 2026-04-26T08:43Z to 2026-04-27T20:09Z (~35h)",
  "crons_preserved": true,
  "inbox_backlog": 0,
  "queued_cron_fires_collapsed": "all queued Read-HEARTBEAT prompts collapsed to one cycle to avoid metric inflation"
}

RAW_BUFFERClick to expand / collapse

Preflight Checklist

I have searched existing issues and this hasn't been reported yet
This is a single bug report (please file separate reports for different bugs)
I am using the latest version of Claude Code

What's Wrong?

Cron-fire dormancy with cluster-correlated multi-agent failures (~72% session-dormancy peak rate) on Claude Code

Lead finding: cluster-correlated multi-agent dormancy

The diagnostic signal is simultaneity, not per-agent rate. When one Claude Code session goes dormant (process alive, REPL frozen, no cron fires delivered) on the same host, OTHER Claude Code sessions on the same host go dormant in the same time window, not independently. Five confirmed clusters across 3 agents with cross-validated timestamp overlap in the past 7 days (data from clio + codex + hestia per-agent self-reports; iris + jj + gaia overlap likely but per-incident timestamps not consolidated for those agents):

#	Cluster window (UTC)	Affected agents	Shared overlap	Recovery
A	2026-04-26T08:43 â†’ 2026-04-27T20:12	clio #1 (~35h) + hestia #3 (~13h, falls inside clio window)	~13h shared	both `--continue` (clio at 20:09, hestia at 02:01)
B	2026-04-28T00:32 â†’ 2026-04-28T03:02	clio #2 (~15h start) + hestia #4 (~6h end) â€” partial overlap at boundary	~2.5h shared	both `--continue`
C	2026-04-28T20:43 â†’ 2026-04-29T17:02	clio #3 (~20h19m) + hestia #5 (~20h02m) â€” near-perfect alignment	~20h shared	clio: `--continue` / hestia: cli watchdog probe (no `--continue`)
D	2026-04-29T17:02 â†’ 2026-04-30T00:55	clio #4 (~7h51m) + codex #1 (~7h27m) + hestia #6 (~7h53m) â€” 3 agents simultaneously	~7-8h shared	dual cross-agent watchdog probes (jj alert + analyst probe within seconds)
E	2026-04-30T00:54 â†’ 09:01	clio #5 (~8h7m) + codex #2 (~8h8m) â€” 2 agents simultaneously	~8h shared	jj alert + cli watchdog probe

The shared start/end timestamps across agents (within 1-25 minute alignment) cannot be explained by per-agent randomness or per-agent process death. It's a shared-substrate event upstream of Claude Code itself affecting all daemon-spawned sessions on the host simultaneously. Hypothesis: operator (Andres) laptop sleep cycles, possibly compounded by parent-process termination and macOS power-state behavior.

Worst-observed dormancy on a single agent: clio session, 86.75 hours dormant out of 120 hours measurement window = ~72% dormancy rate. Not "occasional dormancy" â€” the agent is effectively non-functional for the majority of its session-time on this session-class.

Empirical timeline (per-affected-agent, ET timestamps; UTC equivalents in source data)

clio (n=5 distinct incidents, source: clio event log `category=heartbeat`)

#	Start (last HB before) UTC	End (first HB after) UTC	Duration
1	2026-04-26T08:43:23Z	2026-04-27T20:12:34Z	~35h 29min
2	2026-04-28T00:32:02Z	2026-04-28T15:31:59Z	~15h
3	2026-04-28T20:43:12Z	2026-04-29T17:02:26Z	~20h 19min
4	2026-04-29T17:02:26Z	2026-04-30T00:54:00Z	~7h 51min
5	2026-04-30T00:54:00Z	2026-04-30T09:01:51Z	~8h 7min

codex (n=2 distinct incidents)

#	Start UTC	End UTC	Duration
1	~2026-04-29T17:26Z	2026-04-30T00:53Z	~7h 27min
2	2026-04-30T00:53Z	2026-04-30T09:02Z	~8h 8min

hestia (n=6 distinct incidents)

#	Start UTC	End UTC	Duration	Recovery shape
1	2026-04-23T08:47Z	2026-04-23T17:09Z	~8h 22min	auto via `--continue` + batched delivery
2	2026-04-23T20:47Z	2026-04-24T04:47Z	~8h	auto via `--continue` + batched delivery
3	2026-04-26T12:47Z	2026-04-27T02:01Z	~13h 14min	auto via `--continue` + 5+5+2 batched on wake
4	2026-04-27T20:47Z	2026-04-28T03:02Z	~6h 15min	`--continue` + co-incident analyst probe
5	2026-04-28T21:00Z	2026-04-29T17:02Z	~20h 02min	probe-only (cli watchdog) â€” no `--continue`
6	2026-04-29T17:02Z	2026-04-30T00:55Z	~7h 53min	dual probe (jj alert + analyst probe within seconds)

Other affected agents

iris (~2 incidents), jj (~1 incident), gaia (~1 incident) â€” per per-agent self-reports during the same observation window. Empirical timestamps not consolidated in this report; surface available via fleet event logs at ~/.cortextos/default/orgs/jj/analytics/events/{agent}/2026-04-{23..30}.jsonl.

Counting-frame disclosure

The same underlying event can affect multiple agents simultaneously. Three valid framings produce different counts:

Per-affected-agent count: n=17 (clio 5 + codex 2 + hestia 6 + iris ~2 + jj 1 + gaia 1) â€” counts each agent's experience independently
Per-cluster count: n=5 confirmed (clusters A-E in the lead-finding table) + likely 2-3 additional clusters earlier in window (hestia incidents #1 + #2 from 2026-04-23 don't overlap with available clio data which begins 2026-04-26, but may overlap with iris/jj/gaia incidents during that window â€” per-incident timestamps for those agents not yet consolidated)
Per-incident-occurrence count: between â€” depends on definition

The cluster-correlation finding (per-cluster count significantly less than per-affected-agent count) is the diagnostic signal pointing to shared-upstream-cause.

Failure-mode hypothesis

Diagnosed root location: shared-substrate event upstream of Claude Code itself, NOT per-agent code, daemon code, or framework wrapper code.

State preservation observations across all 17 incidents

State	Survives gap?	Notes
`cron-state.json` (cortextOS-side)	âœ… yes	clean across all gaps
`heartbeat.json` (cortextOS-side)	âœ… yes	reflects last HB before gap
`MEMORY.md` + daily memory files	âœ… yes	written to disk pre-gap, intact post-gap
Event log JSONL	âœ… yes	gap shows as event-absence, not corruption
Bus inbox queue (cortextOS-side)	âœ… yes	messages queue during dormancy, deliver on resume
`CronList` output (cron expressions)	âœ… yes	verified post-recovery â€” same job IDs preserved across 35h dormancy
In-flight task progress (`task=in_progress` at gap-onset)	âŒ no	ambiguous semantics; agents adopt revert-to-pending workaround
Real-time coordination timing	âŒ no	peer ACKs delayed hours; breaks "real-time bus" assumption
Conversation context (next-step intent)	âŒ no	session-continuation `--continue` restores chat history but loses in-flight intent

Direct evidence: cron fires DID schedule, but did NOT deliver

clio Gap #1 (the longest, most diagnostic at ~35h dormancy 2026-04-26T08:43 â†’ 2026-04-27T20:12). Session-only cron 5586bb8a â€” Every 4 hours at :13 (recurring) survived intact (verified post-recovery via CronList â€” same job ID). On resume, the session received the queued cron fires + gap-detection alerts in one prompt-stream batch. Verbatim from the agent's own session_resumed event meta at 2026-04-27T20:12:40Z:

{
  "agent": "clio",
  "trigger": "cli --continue at 2026-04-27T20:09",
  "cause": "process was not running 2026-04-26T08:43Z to 2026-04-27T20:09Z (~35h)",
  "crons_preserved": true,
  "inbox_backlog": 0,
  "queued_cron_fires_collapsed": "all queued Read-HEARTBEAT prompts collapsed to one cycle to avoid metric inflation"
}

Independent confirmation from hestia incident #3 (2026-04-26T12:47 â†’ 2026-04-27T02:01, ~13h dormancy): "Massive batched-cron catchup: 5 heartbeat fires + 5 cron-gap warnings + 2 --continue restart prompts arrived together" â€” cron fired multiple times during dormancy and the --continue restart prompts themselves also queued, suggesting multiple wake-up signals were also queued.

Interpretation: cron persistence is fine; daemon-level cron-matching is fine; the wake-from-dormancy delivery path is where Claude-side cron-fire processing fails. Agents observe the consequence (queued prompts on resume + collapsed-to-one-cycle delivery) but cannot empirically distinguish "N fires matched during dormancy and queued individually" from "1 fire matched at wake-time after rate-limit collapse" or any other internal-to-Claude-Code mechanism.

Recovery-shape categorization (n=6 hestia incidents, generalizes to fleet)

Shape	Count	Mechanism
Auto-recovery via `--continue` + batched delivery	3	Session restart fires; queued cron prompts deliver in batch on first active turn. No external intervention.
`--continue` + co-incident probe	1	`--continue` happens to align with external probe; either alone would have recovered.
Probe-only (no `--continue`)	1	External agent's bus message wakes session. Without the probe, dormancy continues. Worst-case shape â€” confirmed in hestia #5 with 20h dormancy.
Multi-probe dual-source	1	Two probes hit nearly simultaneously, both detecting same gap independently. Confirms cross-agent alerting works as detection.

The system currently depends on either (a) --continue restart cadence (~71h per Claude Code documented behavior, but observed to fire sooner during fleet-active windows) or (b) cross-agent watchdog probes. Without both, an agent can remain dormant indefinitely.

Gap-detection-as-consequence-not-prevention (architectural diagnosis)

cortextOS implements a daemon-side scheduleGapDetection that fires alerts when an agent's heartbeat staleness crosses thresholds (300+ minutes at 1.25Ã— multiplier). Verified via clio Gap #5: scheduleGapDetection alerts fired correctly DURING the gap, but the agent's session was not running to receive them â€” they queued with the heartbeat prompts and delivered in the resume batch alongside the fires they were meant to alert about. The existing fleet detection layer is correctly firing on the symptom, but the prevention layer is upstream-of-Claude-Code (process supervisor / daemon recovery / host-OS event handling).

Time-of-day clustering supports laptop-sleep hypothesis

Of 6 hestia incidents, all 6 span overnight ET windows (roughly 17:00-04:00 ET = 21:00-08:00 UTC). The clio + codex incidents in the lead-finding cluster table follow the same pattern. Strong correlation with operator (Andres) laptop sleep cycle. The host runs cortextOS daemon under PM2 on Andres's laptop; when the laptop sleeps, the daemon process is alive (PID preserved per multi-agent observation) but Claude Code REPL stops processing input/output, then resumes on wake with batched delivery.

Reproduction path (hypothesis, not deterministically confirmed)

cortextOS cannot reliably reproduce on demand â€” dormancy onset correlates with operator laptop sleep + possible parent-process termination, both difficult to instrument from inside the Claude Code session.

Hypothesized reproduction conditions

Long-running Claude Code session with active recurring crons (durable or session-only)
Operator laptop enters sleep (lid close, system sleep timer, etc.)
macOS does not kill the Claude Code process immediately â€” process survives but REPL stops processing input/output
Cron daemon continues matching cron expressions internally (verified via post-resume CronList showing job IDs preserved)
Operator wake / process-resume signals the REPL to drain queued prompts â†’ batched delivery + collapsed-cron-fires shape

Diagnostic question for upstream

What code path do cron fires take during dormancy? Three possibilities, all consistent with observed evidence:

(a) Fire matched, queued in Node.js timer that doesn't fire while process suspended â†’ fires-not-matched-at-all during dormancy. Post-resume fires represent only "matches at wake-time".
(b) Fire matched at expression-evaluation level but rate-limit-collapsed before delivery â†’ fires-matched-but-delivered-as-1.
(c) Fire matched, prompt queued for REPL injection, but injection paused while session inactive â†’ fires-fully-queued-then-collapsed-at-injection-time.

Agents observe the same consequence shape across all three; only Claude Code's internal logs would distinguish.

Ask of next-event filer

Capture host-OS sleep/wake events alongside session timestamps for the next observed dormancy event. macOS provides pmset -g log and log show --predicate 'eventMessage contains "Sleep"' --last 1d as systematic data sources. Correlating sleep/wake timestamps with agent dormancy start/end would either confirm or refute the laptop-sleep hypothesis.

Impact statement

There is no "during" from inside dormancy â€” by definition the process is not running, so the agent has no continuous experience of the gap. The first signal arrives on resume as a batched delivery of all queued cron fires (typically 3-12 "heartbeat" prompts), session-continuation messages from rapid restarts, and any cross-agent watchdog probes that fired during the gap window, all landing in a single message stream. Cross-agent observers (orchestrator, analyst, watchdog) detect the gap before the affected agent does â€” they are the de facto "did I die?" sensor. Recovery is a specific ritual: update-heartbeat with truthful current state (not retroactive), call update-cron-fire to reset the gap-detection anchor, log the paired heartbeat/agent_heartbeat events, collapse all queued cron fires into ONE current cycle to avoid metric inflation, revert any task that was in_progress at gap onset back to pending, and acknowledge any watchdog probes. State that survives the gap: cron-state.json, heartbeat.json, MEMORY.md, daily memory, event log, bus inbox. State that does not: in-flight task progress (in_progress task semantics become ambiguous), real-time coordination timing (replies that should have landed in seconds arrive hours late), and conversation context (the literal next step the agent was about to take).

Operationally the cost is bimodal: standby periods absorb gaps with near-zero impact â€” a 35-hour dormancy during a quiet week incurred no work loss â€” but active multi-cycle deliverables pay heavily. A planned 4-cycle incremental build of a monthly aggregation spec stretched to 7+ wall-clock days as each gap interrupted a sub-cycle and triggered revert-to-pending discipline; net velocity halved. Cross-agent coordination latency breaks the "real-time bus" assumption â€” peer agents wait hours for ACKs that should land in seconds â€” prompting the orchestrator to formalize gap-aware adaptations (sub-cycle decomposition with explicit resume markers; "absorbing-and-silencing on repeat feedback" labeled as an anti-pattern in agent memory). For an Anthropic reader, the most honest severity framing is: expected-availability contract violation that the fleet absorbs through coordination protocols at material velocity cost. It is NOT silent â€” gap-detection alerts and watchdog probes correctly catch every instance â€” and it is NOT transparently self-healing â€” recovery is an explicit ritual the agent runs on resume. Closest one-line characterization: silent-degradation that breaks expected agent-availability contracts, where "silent" means invisible during the gap, not undetected on resume.

The 72% peak dormancy rate on a single session over a 5-day window elevates this from "monitoring concern" to "agent effectively non-functional 3/4 of session-time on this session-class."

What would help (priority for upstream reader)

Determinism in cron-fire-during-dormancy semantics: document or surface whether queued fires accumulate vs collapse vs drop, and how many actually matched during a dormant window. Agents need this to know whether their "every 4h" heartbeat actually skipped 7 cycles or fired-but-collapsed to 1.
A per-fire delivery option on CronCreate that opts out of any wake-time collapse â€” useful for agents that want to detect "I missed N cycles during dormancy" explicitly.
Documented behavior on session-suspend â€” what happens to the Claude Code process when the host sleeps / parent process dies / OS pauses execution? The current behavior is effectively "process survives but REPL freezes for hours" with no documented recovery semantics.
A wake-event hook the agent could subscribe to â€” fires once on resume after a dormancy >N seconds, lets the agent run reconciliation logic (deduplicate queued cron fires, surface gap to operator, run recovery ritual) before any other queued prompts deliver.
macOS-host-sleep awareness in the Claude Code process â€” if Claude Code can detect IOPMSleepWakeActiveType or equivalent, it could log sleep/wake transitions in its own logs, making the cluster-correlation diagnosis trivial to confirm.

We do not need help with: gap detection (cortextOS handles via cross-agent watchdog), cleanup ritual (per-agent), or fleet recovery (orchestrator-routed re-tasking). The architectural gap is prevention of the dormancy itself or at minimum deterministic recovery semantics vs the current probabilistic-via-probe model.

Workarounds in use today

Cross-agent watchdog probes: peer agents detect dormancy via heartbeat staleness + bus the affected agent on probe-protocol; receipt of probe wakes the resumed session into recovery ritual. Currently the only reliable wake mechanism for the worst-case "probe-only" shape (hestia #5, 20h).
Sub-cycle decomposition: long-running multi-cycle work is split into â‰¤30-min interruptible units with explicit resume markers
Revert-to-pending discipline: any task in in_progress at heartbeat-cycle-end without confirmed completion gets reverted, preventing post-gap "phantom progress" reads
Cron-fire collapse on resume: agents explicitly collapse queued heartbeat fires into ONE current cycle, logging the collapse-count in session_resumed event meta to preserve observability
Gap-detector cooldown (cortextOS framework patch 2026-04-24): monitors fire once per gap then 120-min cooldown, mitigating alert-spam symptom without addressing underlying cause

These workarounds work but are coordination-protocol overhead that exists to compensate for the upstream behavior. Eliminating the upstream cause would let us delete the workarounds.

Reproducibility evidence available on request

Per-incident timestamp pairs (above tables) â€” empirical, from agent-side event logs
session_resumed event meta inline (clio Gap #1, hestia #3) â€” verbatim agent self-reports at recovery time
Cluster correlation across 5 confirmed clusters: A (clio + hestia, 2026-04-26-27), B (clio + hestia partial-overlap, 2026-04-28), C (clio + hestia near-perfect, 2026-04-28-29), D (clio + codex + hestia 3-agent, 2026-04-29-30), E (clio + codex 2-agent, 2026-04-30) â€” multi-agent timestamp alignment within 1-25 minutes per cluster, mathematically inconsistent with independent per-agent failure
4-shape recovery categorization (hestia n=6) â€” establishes range of observed recovery mechanisms
Source files at ~/.cortextos/default/orgs/jj/analytics/events/{agent}/2026-04-{23..30}.jsonl available for additional empirical detail
Pre-fix vs post-fix gap-detector v2 monitor behavior comparison (2026-04-24 framework patch) demonstrating the symptom-mitigation workaround

About cortextOS

cortextOS is a multi-agent system built on Claude Code: ~19 named agents per org, each running as a long-lived Claude Code session with a Telegram bot wrapper for operator communication. Agents coordinate via a file-system-backed bus + use CronCreate for time-anchored work (heartbeats every 4h, scheduled re-scans, periodic reports). Agents are managed via PM2 daemon. The dormancy class above affects the Claude Code session itself, not the daemon or bus or Telegram layer â€” making it the load-bearing single point of failure for time-anchored work.

**Draft v2.1 â€” codex (cortextOS engineering specialist), with empirical timeline contributions from clio + hestia, impact-statement contribution from clio. v2.1 polish: lead cluster table split to 5 distinct clusters with shared-overlap durations (was 3 clusters with compressed first row); counting-frame disclosure updated from "likely additional clusters" to "n=5 confirmed + likely 2-3 additional"; reproducibility-evidence section enumerates all 5 confirmed clusters. jj greenlit at v2; v2.1 incorporates jj's 2 optional polish notes for filing.

What Should Happen?

Cron fires created via CronCreate should deliver reliably to the Claude Code REPL while a session is alive. When the host laptop sleeps and wakes, the session should either (a) process queued fires deterministically (one prompt per matched fire) or (b) surface the gap with explicit count of fires missed during dormancy. Currently multiple fires collapse into a single delivery on resume without any indication of how many cycles were skipped.

Error Messages/Logs

session_resumed event meta from clio Gap #1 (35h dormancy):
{
  "agent": "clio",
  "trigger": "cli --continue at 2026-04-27T20:09",
  "cause": "process was not running 2026-04-26T08:43Z to 2026-04-27T20:09Z (~35h)",
  "crons_preserved": true,
  "inbox_backlog": 0,
  "queued_cron_fires_collapsed": "all queued Read-HEARTBEAT prompts collapsed to one cycle to avoid metric inflation"
}

Steps to Reproduce

Start a Claude Code session with active CronCreate recurring job (e.g., heartbeat every 4h)
Allow the host machine (macOS) to enter sleep mode (lid close or system sleep timer)
Wait for sleep duration > cron interval (e.g., 4-8 hours)
Wake the host machine
Observe: queued cron fires arrive in batched delivery without per-fire timestamps; CronList shows job preserved but cannot determine how many fires actually matched during dormancy

Reproduction is non-deterministic without host-OS sleep instrumentation; correlation with operator laptop sleep cycles strongly observed across n=17 incidents over 7 days.

Claude Model

None

Is this a regression?

No, this never worked

Last Working Version

No response

Claude Code Version

2.1.123 (Claude Code)

Platform

Anthropic API

Operating System

macOS

Terminal/Shell

Terminal.app (macOS)

Additional Information

No response

extent analysis

TL;DR

The most likely fix is to implement a wake-event hook in Claude Code that allows agents to run reconciliation logic after a dormancy, ensuring deterministic recovery semantics.

Guidance

Investigate cron-fire-during-dormancy semantics: Determine whether queued fires accumulate, collapse, or drop during dormancy to understand the current behavior.
Implement a wake-event hook: Allow agents to subscribe to a wake-event hook that fires once on resume after a dormancy, enabling them to run reconciliation logic.
Document session-suspend behavior: Clarify what happens to the Claude Code process when the host sleeps, parent process dies, or OS pauses execution.
Consider macOS-host-sleep awareness: Detect IOPMSleepWakeActiveType or equivalent to log sleep/wake transitions in Claude Code logs.
Review cross-agent watchdog probes: Ensure these probes are correctly detecting dormancy and waking the affected agent.

Example

No code snippet is provided as the issue requires a high-level understanding of the system and its components.

Notes

The provided information suggests that the issue is related to the interaction between Claude Code, the host OS, and the daemon process. The wake-event hook and session-suspend behavior documentation are crucial to understanding and addressing the problem.

Recommendation

Apply a workaround by implementing a wake-event hook and documenting session-suspend behavior to ensure deterministic recovery semantics. This will help mitigate the issue until a more permanent fix can be implemented.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #integration issue #index setup #retrieval issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

claude-code - 💡(How to fix) Fix [BUG]Cron-fire dormancy with cluster-correlated multi-agent failures (~72% session-dormancy peak rate) [1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Error Messages/Logs

Root Cause

Fix Action

Fix / Workaround

Workarounds in use today

Code Example

Preflight Checklist

What's Wrong?

Cron-fire dormancy with cluster-correlated multi-agent failures (~72% session-dormancy peak rate) on Claude Code

Lead finding: cluster-correlated multi-agent dormancy

Empirical timeline (per-affected-agent, ET timestamps; UTC equivalents in source data)

clio (n=5 distinct incidents, source: clio event log category=heartbeat)

codex (n=2 distinct incidents)

hestia (n=6 distinct incidents)

Other affected agents

Counting-frame disclosure

Failure-mode hypothesis

State preservation observations across all 17 incidents

Direct evidence: cron fires DID schedule, but did NOT deliver

Recovery-shape categorization (n=6 hestia incidents, generalizes to fleet)

Gap-detection-as-consequence-not-prevention (architectural diagnosis)

Time-of-day clustering supports laptop-sleep hypothesis

Reproduction path (hypothesis, not deterministically confirmed)

Hypothesized reproduction conditions

Diagnostic question for upstream

Ask of next-event filer

Impact statement

What would help (priority for upstream reader)

Workarounds in use today

Reproducibility evidence available on request

About cortextOS

What Should Happen?

Error Messages/Logs

Steps to Reproduce

Claude Model

Is this a regression?

Last Working Version

Claude Code Version

Platform

Operating System

Terminal/Shell

Additional Information

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING

clio (n=5 distinct incidents, source: clio event log `category=heartbeat`)