openclaw - ✅(Solved) Fix [Bug]: isolated cron agentTurn can get stuck/rerun stale running state and eventually time out on lightweight ClawHub JSON summarization [1 pull requests, 1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#44541Fetched 2026-04-08 00:45:28
View on GitHub
Comments
1
Participants
1
Timeline
3
Reactions
0
Author
Participants
Timeline (top)
cross-referenced ×2commented ×1

Error Message

FailoverError: LLM request timed out.

Fix Action

Fixed

PR fix notes

PR #44573: Cron: isolate cron run session lanes

Description (problem / solution / changelog)

Summary

  • Problem: isolated cron agentTurn runs generated a per-run session key for persistence, but still executed on the stable job-scoped session key.
  • Why it matters: repeated runs of the same cron job could queue behind stale work on session:agent:main:cron:<job-id>, matching the timeout/stuck symptoms in #44541.
  • What changed: isolated cron execution now passes the per-run ...:run:<sessionId> session key into the CLI/embedded agent runner, and adds a regression test that asserts the execution key is per-run.
  • What did NOT change (scope boundary): this PR does not change model/provider selection, cron timeout policy, or delivery behavior.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #44541
  • Related #29774

User-visible / Behavior Changes

  • Isolated cron runs now execute on a unique per-run session lane instead of reusing the stable job-scoped cron lane.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local test harness
  • Model/provider: mocked
  • Integration/channel (if any): N/A
  • Relevant config (redacted): isolated cron agentTurn

Steps

  1. Create an isolated cron job with payload.kind = "agentTurn".
  2. Execute the job runner twice for the same job.
  3. Inspect the session key passed into the isolated runner.

Expected

  • Each execution uses its own ...:run:<sessionId> session key/lane.

Actual

  • Before this change the runner still used the stable job-scoped session key.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: ran targeted Vitest coverage for isolated cron execution session-key behavior and nearby cron isolated-agent behavior.
  • Edge cases checked: existing message-tool policy and owner-auth isolated cron tests still pass.
  • What you did not verify: live provider-backed cron execution on a machine with model credentials.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commit 1b9e43ce4.
  • Files/config to restore: src/cron/isolated-agent/run.ts
  • Known bad symptoms reviewers should watch for: isolated cron runs unexpectedly sharing the same stable session lane again.

Risks and Mitigations

  • Risk: per-run session-key execution could diverge from persisted cron session bookkeeping.
    • Mitigation: the stable cron key is still persisted alongside the per-run key; the change only affects execution-time lane selection.

Changed files

  • src/cron/isolated-agent.uses-last-non-empty-agent-text-as.test.ts (modified, +1/-1)
  • src/cron/isolated-agent/delivery-dispatch.double-announce.test.ts (modified, +2/-0)
  • src/cron/isolated-agent/delivery-dispatch.ts (modified, +6/-5)
  • src/cron/isolated-agent/run.execution-session-key.test.ts (added, +84/-0)
  • src/cron/isolated-agent/run.interim-retry.test.ts (modified, +6/-0)
  • src/cron/isolated-agent/run.ts (modified, +11/-4)

Code Example

{
  "kind": "agentTurn",
  "message": "使用 clawhub skill,并且只做下面这件事:\n\n1. 执行一次 clawhub explore --limit 10 --json\n2. 只读取返回 JSON 里的 items 前 10 条\n3. 基于每条的 slug 和 summary,整理成 10 行中文摘要\n4. 每行格式固定为:序号. slug - 简短中文描述\n5. 不要读取其他文件,不要运行其他 clawhub 命令,不要搜索,不要展开分析\n6. 不要输出思考、步骤、工具调用、JSON 原文、标题或结尾说明\n7. 只输出最终这 10 行文本,系统会自动投递到 Telegram",
  "timeoutSeconds": 900,
  "lightContext": true
}

---

cron: queued manual run waiting for an execution slot

---

FailoverError: LLM request timed out.

---

lane=session:agent:main:cron:<job-id> durationMs=900028 error="FailoverError: LLM request timed out."

---

cron: clearing stale running marker on startup
RAW_BUFFERClick to expand / collapse

What happened?

An isolated cron job using payload.kind = "agentTurn" appears to get stuck in a stale/running state and often times out, even for a lightweight task.

The task is intentionally simple:

  1. Run clawhub explore --limit 10 --json
  2. Read the first 10 items
  3. Rewrite them into 10 short lines
  4. Deliver to Telegram via normal cron announce

The underlying clawhub command is fast when run directly, and direct Telegram sends also work. But the same logic inside OpenClaw cron repeatedly hangs or times out.

Environment

  • OpenClaw: 2026.3.8
  • Node: v22.22.0
  • Session target: isolated
  • Payload kind: agentTurn
  • Delivery: announce to Telegram
  • Timeout: 900s
  • lightContext: true

Repro config

Cron job payload:

{
  "kind": "agentTurn",
  "message": "使用 clawhub skill,并且只做下面这件事:\n\n1. 执行一次 clawhub explore --limit 10 --json\n2. 只读取返回 JSON 里的 items 前 10 条\n3. 基于每条的 slug 和 summary,整理成 10 行中文摘要\n4. 每行格式固定为:序号. slug - 简短中文描述\n5. 不要读取其他文件,不要运行其他 clawhub 命令,不要搜索,不要展开分析\n6. 不要输出思考、步骤、工具调用、JSON 原文、标题或结尾说明\n7. 只输出最终这 10 行文本,系统会自动投递到 Telegram",
  "timeoutSeconds": 900,
  "lightContext": true
}

Repro steps

  1. Configure an isolated cron job with the payload above
  2. Trigger it manually with openclaw cron run <job-id>
  3. Observe cron state, run history, and logs

Expected behavior

  • The job should finish within a few minutes
  • It should produce 10 short lines
  • It should be delivered through cron announce

Actual behavior

Observed repeatedly:

  • cron list sometimes shows the job as running even after the previous run already finished with error
  • manual runs can get queued behind a stale slot:
    • cron: queued manual run waiting for an execution slot
  • old stale running markers seem to survive until gateway restart
  • after restart, state may clear, but the next run can still get stuck again
  • many runs end as:
    • Error: cron: job execution timed out
  • logs show:
    • FailoverError: LLM request timed out.

Evidence

Timed-out runs:

  • runAtMs: 1773365921979, durationMs: 900004, status: error
  • runAtMs: 1773326195890, durationMs: 900003, status: error
  • runAtMs: 1773324907261, durationMs: 900004, status: error

Example log lines:

cron: queued manual run waiting for an execution slot
FailoverError: LLM request timed out.
lane=session:agent:main:cron:<job-id> durationMs=900028 error="FailoverError: LLM request timed out."

Also observed earlier:

cron: clearing stale running marker on startup

Why this seems like an OpenClaw bug

The same underlying operations succeed outside cron:

  • clawhub explore --limit 10 --json returns quickly when run directly
  • direct Telegram send via openclaw message send succeeds

So this does not look like:

  • ClawHub CLI slowness
  • Telegram delivery failure
  • large output size

It looks more like a cron / isolated agentTurn state-machine or execution-slot bug.

Additional note

A much simpler 3-line version of the same ClawHub task did succeed once, which suggests the cron lane is sensitive in a way that direct command execution is not.

extent analysis

Fix Plan

To resolve the issue of the cron job getting stuck in a stale/running state and timing out, we will implement the following steps:

  • Increase the timeout for the LLM request to prevent timeouts
  • Implement a retry mechanism for the LLM request to handle temporary failures
  • Add logging to track the execution time of the cron job and identify potential bottlenecks

Code Changes

We will modify the cron job payload to include a longer timeout and implement a retry mechanism:

{
  "kind": "agentTurn",
  "message": "...",
  "timeoutSeconds": 1800, // increased timeout
  "lightContext": true,
  "retry": {
    "attempts": 3,
    "delay": 30000 // 30 seconds
  }
}

We will also add logging to track the execution time of the cron job:

const startTime = Date.now();
// execute cron job logic
const endTime = Date.now();
console.log(`Execution time: ${endTime - startTime}ms`);

Infra / Dependency Fixes

We will ensure that the Node.js version is up-to-date and compatible with the OpenClaw version.

Temporary Workaround

As a temporary workaround, we can manually restart the gateway to clear the stale running markers and allow the cron job to run again.

Verification

To verify that the fix worked, we will:

  • Run the cron job manually and check the execution time and output
  • Check the logs for any errors or timeouts
  • Verify that the job completes successfully and produces the expected output

Extra Tips

To prevent similar issues in the future, we can:

  • Monitor the execution time of cron jobs and adjust timeouts accordingly
  • Implement retry mechanisms for critical tasks
  • Regularly update dependencies and ensure compatibility with the OpenClaw version.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  • The job should finish within a few minutes
  • It should produce 10 short lines
  • It should be delivered through cron announce

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix [Bug]: isolated cron agentTurn can get stuck/rerun stale running state and eventually time out on lightweight ClawHub JSON summarization [1 pull requests, 1 comments, 1 participants]