openclaw - Cron: per-job configurable stuck-session threshold (currently hardcoded 120s causes false positives on legitimately slow jobs)

GitHub stats
openclaw/openclaw#73327 · Fetched 2026-04-29 06:20:59
Comments: 1 · Participants: 1 · Timeline: 2 (commented ×1, cross-referenced ×1) · Reactions: 0


Summary

The stuck-session detector emits stuck session: ... age=Ns queueDepth=0 warnings when a session's state=processing exceeds 120 seconds. The threshold appears hardcoded (or at least not user-configurable per-job), which generates false positives for cron jobs that are legitimately long-running by design.

Repro

A cron job that calls a remote service (e.g., a Supabase sync that processes 2k pages) routinely takes 2-3 minutes per run. The job completes successfully (status=ok in ~/.openclaw/cron/runs/<job-id>.jsonl), but the detector logs 1-3 stuck session warnings during the run window.

In my deployment:

~/.openclaw/cron/jobs.json defines 8 jobs. Two of them (GBrain Sync every 15min,
GBrain Daily Report every 24h) hit this regularly. ~12 false-positive stuck
warnings per hour during sync windows.

Sample log:

{"1":"stuck session: sessionId=2072e9d7-9b6b-4a1e-a10d-c8c01af5f84d sessionKey=agent:main:cron:616d2c20-... state=processing age=148s queueDepth=0", ...}

The job completed with status=ok shortly after.
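
For tooling that consumes these warnings, the message can be split into fields with a small parser. This is only a sketch in Python: it assumes the message always follows the key=value layout of the sample above, and the name parse_stuck_warning is hypothetical, not part of openclaw.

```python
import re
from typing import Optional

# Default threshold described in this issue (120 s).
DEFAULT_STUCK_THRESHOLD_S = 120

# Matches key=value pairs such as "state=processing" or "age=148s".
_PAIR = re.compile(r"(\w+)=(\S+)")


def parse_stuck_warning(message: str) -> Optional[dict]:
    """Return the fields of a 'stuck session' warning, or None for other lines."""
    if not message.startswith("stuck session:"):
        return None
    fields = dict(_PAIR.findall(message))
    if "age" not in fields:
        return None  # layout differs from the sample; do not guess
    # "age=148s" -> 148 (seconds)
    fields["age_s"] = int(fields.pop("age").rstrip("s"))
    return fields


sample = (
    "stuck session: sessionId=2072e9d7-9b6b-4a1e-a10d-c8c01af5f84d "
    "sessionKey=agent:main:cron:616d2c20-... state=processing age=148s queueDepth=0"
)
warning = parse_stuck_warning(sample)
print(warning["age_s"], warning["age_s"] > DEFAULT_STUCK_THRESHOLD_S)  # 148 True
```

The parsed age of 148 s is above the 120 s default even though the run finished normally, which is the false positive described here.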

Why this matters

External tooling that monitors stuck session log lines (dashboards, alerting bots, my /gbrain health skill) gets noisy alerts that aren't actionable. Users are forced to either:

  • Accept the noise and lose trust in the detector signal (alert fatigue eventually leads to real stuck sessions being ignored).
  • Build local workarounds (whitelists, classifiers) that mask the upstream signal. This is the wrong direction: suppressing the signal downstream is fragile and hides real bugs when the whitelist drifts.

The clean fix is upstream: let cron/jobs.json declare an expected duration per job.

Proposed shape

Extend cron/jobs.json with an optional stuckThresholdMs per job. The new field below sets 5 minutes (300000 ms) for this job:

{
  "id": "616d2c20-d06d-4116-830b-2b097b6fe819",
  "name": "GBrain Sync",
  "schedule": { "kind": "cron", "expr": "*/15 * * * *" },
  "stuckThresholdMs": 300000,
  "enabled": true
}

Behavior:

  • If stuckThresholdMs is unset → use the existing default (120000 ms / 120s).
  • If set → detector uses the per-job value when computing age > threshold ⇒ stuck.
  • Backward-compatible: existing jobs continue working unchanged.

Optionally, add a global override: gateway.stuckThresholdMs, defaulting to 120000.
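
The lookup described above could be sketched as follows. This is not openclaw's actual detector code: the function names are hypothetical, jobs and gateway settings are modelled as plain dicts, and the precedence (per-job value first, then the optional global override, then the default) is an assumption about how the two settings would combine.

```python
from typing import Optional

DEFAULT_STUCK_THRESHOLD_MS = 120_000  # existing default per this issue


def resolve_stuck_threshold_ms(job: dict, gateway: Optional[dict] = None) -> int:
    """Per-job value wins, then the optional global override, then the default."""
    if job.get("stuckThresholdMs") is not None:
        return job["stuckThresholdMs"]
    if gateway and gateway.get("stuckThresholdMs") is not None:
        return gateway["stuckThresholdMs"]
    return DEFAULT_STUCK_THRESHOLD_MS


def is_stuck(age_ms: int, job: dict, gateway: Optional[dict] = None) -> bool:
    """A processing session counts as stuck once its age exceeds the threshold."""
    return age_ms > resolve_stuck_threshold_ms(job, gateway)


sync_job = {"name": "GBrain Sync", "stuckThresholdMs": 300_000}
legacy_job = {"name": "job without the new field"}

print(is_stuck(148_000, sync_job))    # False
print(is_stuck(148_000, legacy_job))  # True
```

With the per-job value in place, the 148 s run from the sample log no longer trips the detector, while a job without the field behaves exactly as before.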

Why per-job (not global)

A global bump would mask real stucks in fast jobs (e.g., interactive Telegram sessions) where 120s is correctly aggressive. Per-job lets each kind of work have its own SLA.

Related issues

  • #71127: stuck-processing sessions detected but never aborted. Different concern (recovery), but same surface; once aborting is implemented there, this issue's stuckThresholdMs would tune when that abort fires for non-interactive crons.
  • #68620: single hung tool blocking a session for 49 min. Per-job thresholds help here too: a cron with stuckThresholdMs: 600000 makes the abort window explicit.
  • #39141: optional session activity watchdog. Adjacent: this issue is the configuration layer that the watchdog needs.

Acceptance criteria

  1. Schema validation: cron/jobs.json accepts an optional stuckThresholdMs: number per job and validates that it lies in the range [1000, 3600000] ms (1 second to 1 hour).
  2. The detector reads the value at session start and applies it per session.
  3. Default behavior is unchanged when the field is absent.
  4. Doctor / status output surfaces the active threshold per job for transparency.
  5. Migration is a no-op for existing brains.
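
The first criterion could look roughly like this. It is a sketch only: the real schema layer in openclaw may be structured quite differently, and validate_stuck_threshold is a hypothetical name. The bounds are the ones listed above.

```python
from typing import List

MIN_STUCK_THRESHOLD_MS = 1_000      # 1 second
MAX_STUCK_THRESHOLD_MS = 3_600_000  # 1 hour


def validate_stuck_threshold(job: dict) -> List[str]:
    """Return a list of validation errors; an empty list means the job is fine."""
    if "stuckThresholdMs" not in job:
        return []  # the field is optional, so absence is valid
    value = job["stuckThresholdMs"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return ["stuckThresholdMs must be a number"]
    if not MIN_STUCK_THRESHOLD_MS <= value <= MAX_STUCK_THRESHOLD_MS:
        return [
            f"stuckThresholdMs must be within [{MIN_STUCK_THRESHOLD_MS}, "
            f"{MAX_STUCK_THRESHOLD_MS}] ms, got {value}"
        ]
    return []


print(validate_stuck_threshold({"name": "GBrain Sync", "stuckThresholdMs": 300000}))  # []
print(validate_stuck_threshold({"name": "no field set"}))                             # []
```

Returning a list of messages rather than raising keeps existing jobs loading unchanged, which matches the third and fifth criteria.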

Workaround in the meantime

Downstream tooling that needs to filter false positives can correlate stuck session events with cron/runs/<job-id>.jsonl status=ok entries, but this is fragile, and needing it at all indicates that the canonical detector is too aggressive for some jobs.
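
A rough sketch of that correlation, for anyone who needs it before a fix lands. Only the status field is taken from this issue; the startedAtMs and finishedAtMs field names are assumptions about the run records and will likely need adjusting to the real file contents, and is_false_positive is a hypothetical helper.

```python
import json
from typing import Iterable


def is_false_positive(warning_ts_ms: int, run_lines: Iterable[str]) -> bool:
    """True if the warning fell inside a run that finished with status=ok."""
    for line in run_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        run = json.loads(line)
        if run.get("status") != "ok":
            continue
        # Assumed field names; not confirmed by the issue.
        if run["startedAtMs"] <= warning_ts_ms <= run["finishedAtMs"]:
            return True
    return False


# Hypothetical run records; real files may use different field names.
runs = [
    '{"status": "ok", "startedAtMs": 1000000, "finishedAtMs": 1170000}',
    '{"status": "error", "startedAtMs": 2000000, "finishedAtMs": 2400000}',
]

print(is_false_positive(1_148_000, runs))  # True
print(is_false_positive(2_148_000, runs))  # False
```

In practice run_lines would be the open file handle for the job's runs file. A warning can only be classified after the run has finished, which is one reason this approach is fragile.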

Happy to send a PR if the shape is acceptable.

Extended analysis

TL;DR

Implement a per-job stuckThresholdMs configuration in cron/jobs.json to allow for customizable stuck session detection thresholds.

Guidance

  • Review the proposed stuckThresholdMs configuration shape and consider its implementation to address the issue of false positives for long-running cron jobs.
  • Evaluate the trade-offs of implementing a global override (gateway.stuckThresholdMs) versus per-job thresholds.
  • Consider the potential impact on existing jobs and the migration process for introducing the new configuration option.
  • In the meantime, downstream tooling can attempt to filter false positives by correlating stuck session events with cron/runs/<job-id>.jsonl status=ok entries, but this is not a recommended long-term solution.

Notes

The proposed solution requires careful consideration of the trade-offs between per-job and global thresholds, as well as the potential impact on existing jobs. The implementation should ensure backward compatibility and provide a smooth migration path.

Recommendation

Apply the proposed per-job stuckThresholdMs configuration to allow for customizable stuck session detection thresholds, as it provides a more targeted and flexible solution than a global override.
