Fix at-cron silent-drop on `session:cody-coder`: 8 documented incidents (May 4-8, 2026), `runningAtMs` set but spawn never lands, no errors, no on-disk evidence


Summary

One-shot at-kind crons that target a session-label (in our case --session-key session:cody-coder) silently fail to spawn the agent. The cron row shows runningAtMs set at the scheduled time, but lastRunAtMs stays null, the runs[] array stays empty, no session is created, and there is zero on-disk evidence that the spawn ever started (no workspace files, no session jsonl, no log line).

We've documented 8+ incidents in a 5‑day window (May 4‑8 2026) on a real production deployment driving a Slack-bot agent fleet. Recovery requires a manual force‑recycle (openclaw cron rm <id> then re-add) or a hard openclaw gateway restart. Without external watchdog scaffolding the work is just lost.

We're filing this with all the diagnostic detail we have because the failure mode is silent — the cron's own state looks healthy, no errors are logged, and we can't repro it on demand. We'd really appreciate any pointer toward where to add instrumentation, or whether this is a known issue with a planned fix.

Environment

  • OpenClaw: 2026.4.26 (commit be8c246)
  • Host OS: Darwin 25.4.0 (arm64) — Mac mini (M-series), running 24/7 on AC
  • Node: v22.22.2
  • Gateway: local (gui/$(id -u)/ai.openclaw.gateway LaunchAgent)
  • ~/.openclaw/cron/jobs.json size during incidents: ranged from ~280 KB to 1.1 MB (342 jobs at peak; see "Possible contributing factor" below)
  • Affected cron pattern (verbatim):
    openclaw cron add \
      --at +15s \
      --delete-after-run \
      --session-key session:cody-coder \
      --message "<dispatch brief...>" \
      --tools "exec read write edit ..."
    Crons are added by an every‑1‑minute queue processor (process_dispatch_queue.py) that drains JSON queue files into one at-job per dispatch. Same code path each time.
  • Cron runtime: agentTurn payload spawning a session-label inside the main agent (session:cody-coder); the session label is not a Tier 1 isolated agent, so this is the main session being woken via session-key routing.

Symptom

What we see when the bug fires:

  1. Queue processor writes the dispatch queue file → calls openclaw cron add ... --at +15s --session-key session:cody-coder ... → returns OK with a cron id.
  2. openclaw cron show <id> --json at scheduled time shows:
    • enabled: true
    • runningAtMs: <scheduled epoch ms> ← gets set on time
    • lastRunAtMs: null ← stays null forever
    • runs: [] ← stays empty forever
    • nextRunAtMs: null (it's a one-shot at job)
  3. Filesystem evidence the spawn never happened:
    • no new entry under ~/.openclaw/agents/main/sessions/
    • no session jsonl modified
    • no data/sentinels/dispatch-<id>-cody.start file (the agent writes this as its first action)
    • no Slack DM, no commit, no DB row update
  4. No errors anywhere — ~/.openclaw/logs/gateway.err.log*, the queue processor log, and cron stale-running watchdog logs all clean.
  5. The cron row stays in this state indefinitely until force-recycled. openclaw doctor does not report it as unhealthy.
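The symptom above can be turned into a mechanical check. The following is a minimal sketch, assuming the output of openclaw cron show <id> --json parses to a flat object with the field names quoted in step 2; that shape is an assumption, as is the helper name looks_silent_dropped. The 90-second grace window is the one the report's own auto-recycle rule uses.

```python
from typing import Any, Mapping, Optional

GRACE_MS = 90_000  # 90-second window, taken from the report's auto-recycle rule


def looks_silent_dropped(job: Mapping[str, Any], now_ms: int,
                         grace_ms: int = GRACE_MS) -> bool:
    """Return True when a cron snapshot matches the symptom listed above.

    `job` is assumed to be the parsed JSON of one cron row; the key names
    come from the report, the flat layout is an assumption.
    """
    running_at: Optional[int] = job.get("runningAtMs")
    if running_at is None:
        return False  # never marked running, so not this failure mode
    if job.get("lastRunAtMs") is not None:
        return False  # a run was recorded
    if job.get("runs"):
        return False  # run history is non-empty
    # Marked running with nothing recorded: suspicious only after the grace window.
    return now_ms - running_at > grace_ms
```

A watchdog could call this on each one-shot row and only then go on to check for on-disk evidence, which keeps the cheap test ahead of the filesystem work.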

Reproduction context (we cannot repro on demand)

  • Hits roughly 1 in 6–10 cron-routed dispatches over the last week. No obvious time-of-day pattern.
  • Strong correlation with gateway congestion / load — when the system has many concurrent active sessions or shortly after a gateway restart, the rate goes up. Three of the eight incidents (May 6) clustered around a gateway-restart event that killed three concurrent agentification dispatches mid-flight.
  • Also hits during the bursty period right after launchd respawns the gateway after a crash (we had a few OOM-related restarts on May 7 morning that correlated with one drop).
  • We have never observed it on recurring (every-kind) crons, only on --at +Ns --delete-after-run one-shots.

Documented incidents (chronological — incident summary)

All recovered manually. Format: dispatch # | observation time (ET) | drop window | how recovered.

  1. #274 (May 4) — drain-progress auto-poster. ~14 min wait before noticing zero on-disk evidence. Force-recycled, re-fired as #277 which shipped clean.
  2. May 5 daily-file batch — multiple intra-day drops summarized in our daily review: "force-recycle (remove + re-add the cron) usually unsticks. Hard gateway restart guaranteed to fix (resets nextWakeAtMs). I avoided hard restart because it kills the in-flight session."
  3. #286 amie (May 6 ~11:30 ET) — coincided with what looked like a whole-scheduler hang. External watchdog com.manmade.openclaw-cron-doctor recovered it; re-fired as #289 (Amie graduation) which shipped.
  4. #290 (May 6) — reliability-bundle dispatch; silent-dropped. Split into smaller refire pieces (one of which was #292) to reduce silent-drop risk.
  5. #298 / #299 / #300 (May 6, concurrent burst) — three Tier 1 graduation dispatches fired concurrently; gateway restart killed all three mid-flight (closely related but separate failure mode — included for context because we think the at-cron silent-drop and gateway-restart-mid-spawn share a code path around session lifecycle).
  6. #303 marky (May 6, 18:11 → 18:27 ET) — cron 541bd8e7 had runningAtMs set, ran 14 minutes with zero on-disk evidence (no workspace mods, no registry, no router, no sentinel), force-recycled, Marky shipped at 18:55 ET. Fifth silent-drop in 24h, which is what tipped us into "this needs a 90-second auto-recycle rule." This is the canonical incident with the cleanest diagnostic.
  7. May 7‑8 chained dispatches — at least one drop in the #360/#361/#362 watcher chain that fired #361 and #362 in sequence. Recovered by force-recycle.
  8. General class documented as a permanent rule in our MEMORY.md since May 6: "A Cody at-cron with no sentinel files + no workspace dir activity within 90 seconds of runningAtMs is silent-dropped. Force-recycle (delete cron + move queue file back to live + reset DB row to queued)."

We can pull session ids / exact cron ids on request — most are still in the gateway state if it would help.

Mitigations we've put in place (none of them fix the root cause)

  1. Anti-silent-drop sentinels. Every dispatched agent writes data/sentinels/dispatch-<id>-cody.start as its very first action and .done after commit. Lets external watchdogs detect "cron runningAtMs set but no sentinel within 90s" and force-recycle.
  2. Generic stuck-cron doctor (scripts/cron_stale_running_doctor.py, every 5 min): flags recurring crons whose runningAtMs is more than 2 * everyMs in the past. It does not help with at-kind one-shots because they have no everyMs.
  3. External LaunchAgent watchdog outside OpenClaw (com.manmade.openclaw-cron-doctor) — runs every 60s, kicks the gateway via launchctl kickstart -k if 2 named critical recurring crons appear hung. Catches the whole-scheduler-hang class but not single-job silent drops.
  4. Serial-graduation watcher — never fires more than one heavy at-cron concurrently, since concurrency seems to make the bug more frequent.
  5. MEMORY.md permanent rule for the agent itself: "if a cron has no sentinel + no workspace activity 90s past runningAtMs, force-recycle."
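The sentinel rule in items 1 and 5 can be sketched as below. This is an illustration only, not the reporter's actual watchdog: the function names are hypothetical, and it checks just the .start sentinel, leaving out the "no workspace activity" half of the rule. The sentinel path pattern and the 90-second window come from the report.

```python
import tempfile
from pathlib import Path

SENTINEL_DIR = Path("data/sentinels")  # directory named in the report
GRACE_MS = 90_000                      # 90 seconds past runningAtMs


def sentinel_path(dispatch_id: str, suffix: str = "start",
                  root: Path = SENTINEL_DIR) -> Path:
    """Build dispatch-<id>-cody.<suffix> under the sentinel directory."""
    return root / f"dispatch-{dispatch_id}-cody.{suffix}"


def should_force_recycle(dispatch_id: str, running_at_ms: int, now_ms: int,
                         root: Path = SENTINEL_DIR,
                         grace_ms: int = GRACE_MS) -> bool:
    """True when the grace window has passed and no .start sentinel exists."""
    if now_ms - running_at_ms <= grace_ms:
        return False  # still inside the grace window, give the spawn time
    return not sentinel_path(dispatch_id, "start", root).exists()


if __name__ == "__main__":
    # Small demonstration in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        print(should_force_recycle("303", 0, 120_000, root))  # no sentinel yet
        sentinel_path("303", "start", root).touch()
        print(should_force_recycle("303", 0, 120_000, root))  # sentinel present
```

Keeping the decision separate from the recovery action means the same check can drive either a force-recycle or just an alert.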

Possible contributing factor (worth looking at)

A separate but related issue we found in our own watchdog scripts may be relevant to the upstream code path:

  • The disable + enable recycle pattern (openclaw cron disable <id> then openclaw cron enable <id>) left crons disabled forever when the second call failed under gateway congestion, which is exactly the state the recycle is trying to fix. Each saturation event left behind more at-jobs in ~/.openclaw/cron/jobs.json with enabled=false, deleteAfterRun=true, lastRunAt=null, and a scheduled time more than 48h in the past.
  • This snowballed our jobs.json to 342 jobs (1.1 MB) by May 7, at which point openclaw cron list itself slowed to >2s and chat tool calls timed out at 10s. The result was a feedback loop: a slow cron.list congested the gateway, which made the next recycle's enable step fail, which left more jobs disabled, and so on.
  • A manual prune (xargs -P 4 openclaw cron rm) of 209 stale entries shrank jobs.json from 1.1 MB to 458 KB and restored gateway responsiveness.
  • We suspect the silent-drop rate may be elevated when jobs.json is large, because the cron scheduler's wake-loop has more work to do and runningAtMs may be set before the spawn path actually completes. We don't have proof — it's a hypothesis. But pruning brought our drop rate down sharply.
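The stale-entry criteria from the first bullet can be expressed as a filter. This is a minimal sketch under stated assumptions: the on-disk schema of jobs.json is not documented in this report, so the job objects are taken to be dicts, and the keys "id" and "scheduledAtMs" are hypothetical placeholders. Only enabled, deleteAfterRun and lastRunAt are names the report itself uses.

```python
from typing import Any, Iterable, List, Mapping

STALE_AFTER_MS = 48 * 60 * 60 * 1000  # 48 hours, per the report


def stale_at_jobs(jobs: Iterable[Mapping[str, Any]], now_ms: int,
                  stale_after_ms: int = STALE_AFTER_MS) -> List[str]:
    """Return ids of disabled, never-run one-shots scheduled long ago."""
    stale: List[str] = []
    for job in jobs:
        scheduled = job.get("scheduledAtMs")  # hypothetical key name
        if (job.get("enabled") is False
                and job.get("deleteAfterRun") is True
                and job.get("lastRunAt") is None
                and scheduled is not None
                and now_ms - scheduled > stale_after_ms):
            stale.append(job["id"])  # "id" is also an assumed key
    return stale
```

The returned ids could be reviewed by hand and then passed one at a time to openclaw cron rm, which is the prune command the report describes.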

Asks

  1. Is this a known issue? Anything in the gateway issue tracker we should be following?
  2. Where should we add instrumentation? We'd be happy to wire structured-log lines / metrics around the at-job → spawn path if we knew where the handoff lives. The bug looks like the cron-scheduler thinks it spawned the session (sets runningAtMs) but the spawn never lands. A debug log line on each side of that boundary would help us repro on instrumented builds.
  3. Should runningAtMs be set after the spawn completes, not before? From the outside, it currently looks like runningAtMs is set optimistically and never reverted on spawn-failure. If the spawn fails silently for any reason (locked store, congested gateway, killed worker), the cron is stuck "running" forever even though nothing started.
  4. Recommended recovery path? We're force-recycling via cron rm + queue-file replay. If there's a supported way to re-enqueue a specific cron's missed run we'd prefer that over delete-and-replay.
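To make ask 3 concrete, here is a sketch of the "revert on spawn failure" behaviour it proposes. Nothing here reflects OpenClaw's internals: CronRow, fire_once and the spawn callable are hypothetical stand-ins, and the record shape appended to runs is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CronRow:
    """Hypothetical in-memory stand-in for one cron row."""
    running_at_ms: Optional[int] = None
    last_run_at_ms: Optional[int] = None
    runs: List[dict] = field(default_factory=list)


def fire_once(row: CronRow, spawn: Callable[[], str], now_ms: int) -> bool:
    """Mark the row running, and undo the mark if the spawn raises."""
    row.running_at_ms = now_ms
    try:
        session_id = spawn()  # assumed to return a session id on success
    except Exception as exc:
        row.running_at_ms = None  # revert so the row is not stuck "running"
        row.runs.append({"atMs": now_ms, "ok": False, "error": repr(exc)})
        return False
    row.last_run_at_ms = now_ms
    row.runs.append({"atMs": now_ms, "ok": True, "sessionId": session_id})
    return True
```

This only covers spawns that fail by raising. A spawn that never returns or fails without an error, which is what the report describes, would additionally need a timeout around the handoff so the revert branch can still run.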

What we can supply

  • Full cron show <id> --json snapshots from the canonical incident (#303 marky, May 6).
  • Gateway log slice around any incident timestamp.
  • Our queue processor source + the agent's brief template, if the message body shape matters.
  • ~/.openclaw/cron/jobs.json excerpt (sanitized — bodies contain Slack ids).

Just point us at what would be most useful.


Reporter context: This is filed by a real production deployment running OpenClaw to drive a Slack-bot agent fleet for a Canadian DTC brand. We've shipped >380 dispatches through this code path since April. ~8 silent drops over the May 4‑8 window with the rest landing clean — so the bug is real but not majority-rate. Happy to coordinate on a repro environment or run an instrumented build if helpful.

Thanks for OpenClaw. The lobster way. 🦞
