openclaw - 💡(How to fix) Fix Cron: per-job configurable stuck-session threshold (currently hardcoded 120s causes false positives on legitimately slow jobs) [1 comments, 1 participants]

openclaw · 2026-04-28 06:09:37


GitHub stats

openclaw/openclaw#73327 • Fetched 2026-04-29 06:20:59


Author

durang

Participants

durang

Timeline (top)

commented ×1 · cross-referenced ×1

The stuck-session detector emits stuck session: ... age=Ns queueDepth=0 warnings when a session's state=processing exceeds 120 seconds. The threshold appears hardcoded (or at least not user-configurable per-job), which generates false positives for cron jobs that are legitimately long-running by design.


Summary

Repro

A cron job that calls a remote service (e.g., a Supabase sync that processes 2k pages) routinely takes 2-3 minutes per run. The job completes successfully (status=ok in ~/.openclaw/cron/runs/<job-id>.jsonl), but the detector logs 1-3 stuck session warnings during the run window, because the session stays in state=processing past the 120s threshold.

In my deployment:

~/.openclaw/cron/jobs.json defines 8 jobs. Two of them (GBrain Sync every 15min,
GBrain Daily Report every 24h) hit this regularly. ~12 false-positive stuck
warnings per hour during sync windows.

Sample log:

{"1":"stuck session: sessionId=2072e9d7-9b6b-4a1e-a10d-c8c01af5f84d sessionKey=agent:main:cron:616d2c20-... state=processing age=148s queueDepth=0", ...}

The job completed with status=ok shortly after.

Why this matters

External tooling that monitors stuck session log lines (dashboards, alerting bots, my /gbrain health skill) gets noisy alerts that aren't actionable. Users are forced to either:

1. Accept the noise and lose trust in the detector signal (eventual alert fatigue → real stuck sessions get ignored).
2. Build local workarounds (whitelists, classifiers) that mask the upstream signal. This is the wrong direction: suppressing the signal downstream is fragile and hides real bugs when the whitelist drifts.

The clean fix is upstream: let cron/jobs.json declare an expected duration per job.

Proposed shape

Extend cron/jobs.json with an optional per-job stuckThresholdMs field. In this example the new field gives the job a 5-minute threshold:

{
  "id": "616d2c20-d06d-4116-830b-2b097b6fe819",
  "name": "GBrain Sync",
  "schedule": { "kind": "cron", "expr": "*/15 * * * *" },
  "stuckThresholdMs": 300000,
  "enabled": true
}

Behavior:

If stuckThresholdMs is unset → use the existing default (120000 ms / 120s).
If set → detector uses the per-job value when computing age > threshold ⇒ stuck.
Backward-compatible: existing jobs continue working unchanged.

Optionally a global override: gateway.stuckThresholdMs defaulting to 120000.
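
To make the fallback order concrete, here is a minimal sketch of how the detector could resolve the threshold. I have not checked how the detector is actually structured, so the CronJob and GatewayConfig shapes and the function names are hypothetical; only stuckThresholdMs, gateway.stuckThresholdMs and the 120000 ms default come from this proposal.

```typescript
// Hypothetical config shapes; only the stuckThresholdMs fields come from the proposal.
interface CronJob {
  id: string;
  name: string;
  enabled: boolean;
  stuckThresholdMs?: number; // proposed optional per-job field
}

interface GatewayConfig {
  stuckThresholdMs?: number; // proposed optional global override
}

// Existing default according to this issue (120s).
const DEFAULT_STUCK_THRESHOLD_MS = 120000;

// Resolution order: per-job value, then global override, then the default.
function resolveStuckThresholdMs(
  job: CronJob | undefined,
  gateway: GatewayConfig = {},
): number {
  return job?.stuckThresholdMs ?? gateway.stuckThresholdMs ?? DEFAULT_STUCK_THRESHOLD_MS;
}

// A session counts as stuck only when its age exceeds the resolved threshold.
function isStuck(ageMs: number, thresholdMs: number): boolean {
  return ageMs > thresholdMs;
}

const gbrainSync: CronJob = {
  id: "616d2c20-d06d-4116-830b-2b097b6fe819",
  name: "GBrain Sync",
  enabled: true,
  stuckThresholdMs: 300000,
};

// The 148s session from the sample log is not stuck under the 5min per-job value...
console.log(isStuck(148000, resolveStuckThresholdMs(gbrainSync))); // false
// ...but is flagged when no job config applies and the default is used.
console.log(isStuck(148000, resolveStuckThresholdMs(undefined))); // true
```

Using ?? rather than || means only an absent field falls through to the next level, which keeps existing jobs on the current default.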

Why per-job (not global)

A global bump would mask real stucks in fast jobs (e.g., interactive Telegram sessions) where 120s is correctly aggressive. Per-job lets each kind of work have its own SLA.

Related issues

#71127: stuck-processing sessions are detected but never aborted. A different concern (recovery) on the same surface; once abort is implemented, this issue's stuckThresholdMs would tune when that abort fires for non-interactive crons.
#68620: single hung tool blocking a session for 49 min. Per-job thresholds help here too: a cron with stuckThresholdMs: 600000 makes the abort window explicit.
#39141: optional session activity watchdog. Adjacent; this issue is the configuration layer that watchdog needs.

Acceptance criteria

Schema validation: cron/jobs.json accepts optional stuckThresholdMs: number per job, validates [1000, 3600000] range.
Detector reads the value at session-start, applies per-session.
Default behavior unchanged when field absent.
Doctor / status output surfaces the active threshold per job for transparency.
Migration is no-op for existing brains.
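
The schema-validation criterion could look roughly like the sketch below. This is standalone TypeScript and does not use whatever validation library the project has; validateStuckThresholdMs and the ValidationResult type are hypothetical names, and the bounds are the [1000, 3600000] range from the first criterion.

```typescript
// Bounds taken from the acceptance criteria: 1 second to 1 hour.
const MIN_STUCK_THRESHOLD_MS = 1000;
const MAX_STUCK_THRESHOLD_MS = 3600000;

// Hypothetical result type for illustration.
type ValidationResult =
  | { ok: true; value: number | undefined }
  | { ok: false; error: string };

function validateStuckThresholdMs(raw: unknown): ValidationResult {
  // The field is optional: an absent value is valid and means "use the default".
  if (raw === undefined) {
    return { ok: true, value: undefined };
  }
  if (typeof raw !== "number" || !Number.isFinite(raw)) {
    return { ok: false, error: "stuckThresholdMs must be a finite number" };
  }
  if (raw < MIN_STUCK_THRESHOLD_MS || raw > MAX_STUCK_THRESHOLD_MS) {
    return {
      ok: false,
      error: `stuckThresholdMs must be within [${MIN_STUCK_THRESHOLD_MS}, ${MAX_STUCK_THRESHOLD_MS}]`,
    };
  }
  return { ok: true, value: raw };
}

console.log(validateStuckThresholdMs(300000).ok);   // true
console.log(validateStuckThresholdMs(500).ok);      // false (below 1000)
console.log(validateStuckThresholdMs("300000").ok); // false (not a number)
console.log(validateStuckThresholdMs(undefined).ok); // true (field absent)
```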

Workaround in the meantime

Downstream tooling that needs to filter false positives can correlate stuck session events with cron/runs/<job-id>.jsonl status=ok entries, but this is fragile and signals the canonical detector is too aggressive for some jobs.
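
For reference, the correlation can be sketched as below. The log-line regex follows the sample warning above. The run-record fields startedAtMs and finishedAtMs are assumptions for illustration: I only rely on status=ok existing in cron/runs/<job-id>.jsonl, so the timestamp fields need adjusting to whatever the run log actually records.

```typescript
interface StuckEvent {
  sessionId: string;
  sessionKey: string;
  state: string;
  ageSeconds: number;
  queueDepth: number;
}

// Matches the message format seen in the sample log line.
const STUCK_RE =
  /^stuck session: sessionId=(\S+) sessionKey=(\S+) state=(\S+) age=(\d+)s queueDepth=(\d+)/;

function parseStuckSession(message: string): StuckEvent | null {
  const m = STUCK_RE.exec(message);
  if (!m) return null;
  return {
    sessionId: m[1],
    sessionKey: m[2],
    state: m[3],
    ageSeconds: Number(m[4]),
    queueDepth: Number(m[5]),
  };
}

// Hypothetical run-record shape; only `status` is known from the run log.
interface RunRecord {
  status: string;
  startedAtMs: number;
  finishedAtMs: number;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseRunsJsonl(text: string): RunRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as RunRecord);
}

// A warning is treated as a likely false positive when it was emitted
// while a run that later finished with status=ok was in progress.
function coveredByOkRun(eventAtMs: number, runs: RunRecord[]): boolean {
  return runs.some(
    (r) => r.status === "ok" && r.startedAtMs <= eventAtMs && eventAtMs <= r.finishedAtMs,
  );
}
```

Even in this small form the fragility is visible: the filter depends on the exact log message format and on timestamps lining up between two separate files.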

Happy to send a PR if the shape is acceptable.

extent analysis

TL;DR

Implement a per-job stuckThresholdMs configuration in cron/jobs.json to allow for customizable stuck session detection thresholds.

Guidance

Review the proposed stuckThresholdMs configuration shape and consider its implementation to address the issue of false positives for long-running cron jobs.
Evaluate the trade-offs of implementing a global override (gateway.stuckThresholdMs) versus per-job thresholds.
Consider the potential impact on existing jobs and the migration process for introducing the new configuration option.
In the meantime, downstream tooling can attempt to filter false positives by correlating stuck session events with cron/runs/<job-id>.jsonl status=ok entries, but this is not a recommended long-term solution.

Notes

The proposed solution requires careful consideration of the trade-offs between per-job and global thresholds, as well as the potential impact on existing jobs. The implementation should ensure backward compatibility and provide a smooth migration path.

Recommendation

Apply the proposed per-job stuckThresholdMs configuration to allow for customizable stuck session detection thresholds, as it provides a more targeted and flexible solution than a global override.
