openclaw - ✅(Solved) Fix bug(cron): isolated-job status is non-deterministic when an agent recovers from a tool error [1 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#81514Fetched 2026-05-14 03:31:21
View on GitHub
Comments
1
Participants
2
Timeline
2
Reactions
2
Author
Timeline (top)
commented ×1cross-referenced ×1

The status field returned for openclaw cron runs --id <jobId> for an isolated (payload.kind: agentTurn) job is non-deterministically ok or error across runs with identical input when the agent encounters a recoverable tool error and continues working. Two consecutive runs of the same cron job, with the same prompt and the same underlying failure condition (a Discord message-send returning "Channel is unavailable"), yielded:

RunStatusSame promptSame outcome (Discord post failed, digest written to fallback log, JSON summary emitted)
1okyesyes
2erroryesyes

This makes the cron status field unreliable as a monitoring signal — operators can't trust status: ok to mean "the run did what it was supposed to," nor status: error to mean "intervention required." For long-running orchestrator-style cron jobs (where the agent is expected to handle tool failures and continue gracefully — e.g. fall back to disk logging when an external channel is unreachable), this is a real production-readiness issue.

Error Message

The classifier is sensitive to tool-call ordering within a single agent turn: it walks the payloads in order, finds the last isError: true, and asks "did any successful payload come after it?" If yes → ok. If no → error.

LLMs do not produce deterministic tool-call orderings. Two runs of the same prompt can:

  • Emit write(file) AFTER exec(openclaw message send) → ok
  • Emit write(file) BEFORE exec(openclaw message send), with the final emit being the failed send → error

Both are functionally identical from the agent's perspective (the agent caught the error and fulfilled its objective via the fallback path), but the cron classifier sees them as opposite outcomes.

Why this matters

For monitoring / alerting pipelines that route on status: error:

  • False positives during normal-operation runs that involve any recoverable tool failure
  • Missed signals when the agent happened to order tools differently this run
  • Operators eventually learn to ignore the status field, defeating its purpose

For our use case (a daily canonry orchestrator that posts a digest to Discord and writes a fallback log when Discord is unreachable), the run is successful from a business perspective even if Discord posting fails — the digest is preserved, the operator can read it from disk, and the next day's run will retry. The cron status flipping to error on tool-ordering chance is purely a misclassification.

Suggested fixes

Pick one (or a combination):

A. Trust the agent's final assistant message as the outcome signal

If the agent's final assistant-visible output is non-empty and non-error-shaped, the run is ok. Tool errors during the turn that were followed by any further agent action (text or tool call) are evidence the agent saw the error and continued. This is roughly what the current logic intends, but it should be order-agnostic: was there any successful agent activity at all after the error? rather than "was the final payload non-error?".

Concretely, replace the trailing-payload check with hasAnySuccessfulPayload = params.payloads.some(p => p?.isError !== true && Boolean(p?.text?.trim())). If a successful payload exists anywhere in the turn alongside an error payload, the agent likely handled the error.

B. Add an explicit agent-signaled status sentinel

Let jobs declare a regex / JSON sentinel that the cron runner uses to authoritatively determine status from the final output. E.g.:

Root Cause

In dist/helpers-BalIC4F-.js (the cron payload classifier — equivalent source likely at src/cron/... / extensions/cron/...):

const hasErrorPayload = params.payloads.some(p => p?.isError === true)
const lastErrorPayloadIndex = params.payloads.findLastIndex(p => p?.isError === true)
const hasSuccessfulPayloadAfterLastError = !params.runLevelError && lastErrorPayloadIndex >= 0
  && params.payloads.slice(lastErrorPayloadIndex + 1).some(p => p?.isError !== true && Boolean(p?.text?.trim()))

const hasFatalStructuredErrorPayload = hasErrorPayload && !hasSuccessfulPayloadAfterLastError && !hasPendingPresentationWarning

And in dist/isolated-agent-SKs97XgD.js (the cron isolated-agent runner — finalizeCronRun):

const agentDiagnostics = createCronRunDiagnosticsFromAgentResult(finalRunResult, { finalStatus: hasFatalErrorPayload ? "error" : "ok" })
return prepared.withRunSession({
  status: hasFatalErrorPayload ? "error" : "ok",
  // ...
})

The classifier is sensitive to tool-call ordering within a single agent turn: it walks the payloads in order, finds the last isError: true, and asks "did any successful payload come after it?" If yes → ok. If no → error.

LLMs do not produce deterministic tool-call orderings. Two runs of the same prompt can:

  • Emit write(file) AFTER exec(openclaw message send) → ok
  • Emit write(file) BEFORE exec(openclaw message send), with the final emit being the failed send → error

Both are functionally identical from the agent's perspective (the agent caught the error and fulfilled its objective via the fallback path), but the cron classifier sees them as opposite outcomes.

Fix Action

Fix / Workaround

Workarounds (for users hitting this today)

PR fix notes

PR #81585: fix(cron): make recovered tool errors order agnostic

Description (problem / solution / changelog)

Summary

  • Narrow recovered isolated-cron tool-error handling to exec warnings with final assistant-visible recovery text.
  • Keep trailing exec warnings without recovery text, canvas warnings, and provider errors fatal.
  • Add regression coverage for recovered exec ordering and fatal trailing-error boundaries.

Fixes #81514.

Verification

  • pnpm test src/cron/isolated-agent.helpers.test.ts src/cron/isolated-agent/run.message-tool-policy.test.ts src/agents/pi-embedded-runner/run/payloads.errors.test.ts
  • git diff --check
  • pnpm check:changed

Real behavior proof

  • Behavior addressed: isolated cron outcome classification for recovered exec tool failures after a fallback success, without making unresolved trailing exec, canvas, or provider errors non-fatal.
  • Real environment tested: local OpenClaw checkout on macOS with Node 22, running the real src/cron/isolated-agent/helpers.ts implementation through Node/tsx.
  • Exact steps or command run after this patch: node --import tsx --input-type=module -e 'import { resolveCronPayloadOutcome } from "./src/cron/isolated-agent/helpers.ts"; ...'
  • Evidence after fix:
{
  "recoveredExec": {
    "hasFatalErrorPayload": false,
    "outputText": "Fallback log written successfully."
  },
  "trailingExecNoRecoveryText": {
    "hasFatalErrorPayload": true,
    "embeddedRunError": "⚠️ 🛠️ Exec failed: Discord channel unavailable",
    "outputText": "⚠️ 🛠️ Exec failed: Discord channel unavailable"
  },
  "trailingCanvas": {
    "hasFatalErrorPayload": true,
    "embeddedRunError": "⚠️ 🖼️ Canvas failed",
    "outputText": "⚠️ 🖼️ Canvas failed"
  },
  "trailingProvider": {
    "hasFatalErrorPayload": true,
    "embeddedRunError": "model provider unreachable",
    "outputText": "model provider unreachable"
  }
}
  • Observed result after fix: terminal output showed recoveredExec.hasFatalErrorPayload is false; trailingExecNoRecoveryText, trailingCanvas, and trailingProvider all returned hasFatalErrorPayload: true with the expected embedded error text.
  • What was not tested: a live scheduled isolated cron job against a real Discord channel was not run because this workspace does not have the user's model and channel credentials; the focused tests and direct Node/tsx runtime check covered the classifier behavior.

Changed files

  • src/cron/isolated-agent.helpers.test.ts (modified, +33/-2)
  • src/cron/isolated-agent/helpers.ts (modified, +14/-1)

Code Example

// Run 1 — status: "ok"
{
  "runId": "manual:93753d9f-...:1",
  "status": "ok",
  "durationMs": 642885,
  "summary": "

---

Both runs:
- Completed normally (no abort, no timeout)
- Produced the same final JSON sentinel with `discordPosted: false`
- Wrote the digest to the fallback log file per the prompt's failure-handling rule
- Returned cleanly from the agent loop

## Root cause

In `dist/helpers-BalIC4F-.js` (the cron payload classifier — equivalent source likely at `src/cron/...` / `extensions/cron/...`):

---

And in `dist/isolated-agent-SKs97XgD.js` (the cron isolated-agent runner — `finalizeCronRun`):

---

The classifier is **sensitive to tool-call ordering within a single agent turn**: it walks the payloads in order, finds the last `isError: true`, and asks "did any successful payload come after it?" If yes → `ok`. If no → `error`.

LLMs do not produce deterministic tool-call orderings. Two runs of the same prompt can:
- Emit `write(file)` AFTER `exec(openclaw message send)` → ok
- Emit `write(file)` BEFORE `exec(openclaw message send)`, with the final emit being the failed send → error

Both are functionally identical from the agent's perspective (the agent caught the error and fulfilled its objective via the fallback path), but the cron classifier sees them as opposite outcomes.

## Why this matters

For monitoring / alerting pipelines that route on `status: error`:
- False positives during normal-operation runs that involve any recoverable tool failure
- Missed signals when the agent happened to order tools differently this run
- Operators eventually learn to ignore the status field, defeating its purpose

For our use case (a daily canonry orchestrator that posts a digest to Discord and writes a fallback log when Discord is unreachable), the run is **successful** from a business perspective even if Discord posting fails — the digest is preserved, the operator can read it from disk, and the next day's run will retry. The cron status flipping to `error` on tool-ordering chance is purely a misclassification.

## Suggested fixes

Pick one (or a combination):

### A. Trust the agent's final assistant message as the outcome signal

If the agent's final assistant-visible output is non-empty and non-error-shaped, the run is `ok`. Tool errors during the turn that were followed by *any* further agent action (text or tool call) are evidence the agent saw the error and continued. This is roughly what the current logic intends, but it should be order-agnostic: **was there any successful agent activity at all after the error?** rather than "was the final payload non-error?".

Concretely, replace the trailing-payload check with `hasAnySuccessfulPayload = params.payloads.some(p => p?.isError !== true && Boolean(p?.text?.trim()))`. If a successful payload exists anywhere in the turn alongside an error payload, the agent likely handled the error.

### B. Add an explicit agent-signaled status sentinel

Let jobs declare a regex / JSON sentinel that the cron runner uses to authoritatively determine status from the final output. E.g.:
RAW_BUFFERClick to expand / collapse

Summary

The status field returned for openclaw cron runs --id <jobId> for an isolated (payload.kind: agentTurn) job is non-deterministically ok or error across runs with identical input when the agent encounters a recoverable tool error and continues working. Two consecutive runs of the same cron job, with the same prompt and the same underlying failure condition (a Discord message-send returning "Channel is unavailable"), yielded:

RunStatusSame promptSame outcome (Discord post failed, digest written to fallback log, JSON summary emitted)
1okyesyes
2erroryesyes

This makes the cron status field unreliable as a monitoring signal — operators can't trust status: ok to mean "the run did what it was supposed to," nor status: error to mean "intervention required." For long-running orchestrator-style cron jobs (where the agent is expected to handle tool failures and continue gracefully — e.g. fall back to disk logging when an external channel is unreachable), this is a real production-readiness issue.

Reproduction

Set up an isolated cron job whose agent prompt:

  1. Calls a tool that may fail (in our case, openclaw message send --channel discord against an unreachable channel).
  2. Catches the failure and falls back to a different action (in our case, write to a fallback log file).
  3. Emits a final JSON summary line (a sentinel marking the agent's own assessment of success/failure).

Trigger the job twice in a row via openclaw cron run <jobId>. Inspect with openclaw cron runs --id <jobId>.

You will see two entries with the same summary content (modulo minor variation in the agent's text), but different status values.

Concrete evidence from our session

// Run 1 — status: "ok"
{
  "runId": "manual:93753d9f-...:1",
  "status": "ok",
  "durationMs": 642885,
  "summary": "```json\n{\"reactiveTriggers\": 0, \"rescuedSweeps\": 2, ..., \"discordPosted\": false, \"errors\": [\"Discord channel unavailable\"]}\n```"
}

// Run 2 — status: "error" (same prompt, same Discord failure, same fallback)
{
  "runId": "manual:93753d9f-...:2",
  "status": "error",
  "durationMs": 337476,
  "summary": "```json\n{\"reactiveTriggers\": 0, \"rescuedSweeps\": 0, ..., \"discordPosted\": false, \"errors\": [\"Discord posting failed: channel unavailable...\"]}\n```"
}

Both runs:

  • Completed normally (no abort, no timeout)
  • Produced the same final JSON sentinel with discordPosted: false
  • Wrote the digest to the fallback log file per the prompt's failure-handling rule
  • Returned cleanly from the agent loop

Root cause

In dist/helpers-BalIC4F-.js (the cron payload classifier — equivalent source likely at src/cron/... / extensions/cron/...):

const hasErrorPayload = params.payloads.some(p => p?.isError === true)
const lastErrorPayloadIndex = params.payloads.findLastIndex(p => p?.isError === true)
const hasSuccessfulPayloadAfterLastError = !params.runLevelError && lastErrorPayloadIndex >= 0
  && params.payloads.slice(lastErrorPayloadIndex + 1).some(p => p?.isError !== true && Boolean(p?.text?.trim()))

const hasFatalStructuredErrorPayload = hasErrorPayload && !hasSuccessfulPayloadAfterLastError && !hasPendingPresentationWarning

And in dist/isolated-agent-SKs97XgD.js (the cron isolated-agent runner — finalizeCronRun):

const agentDiagnostics = createCronRunDiagnosticsFromAgentResult(finalRunResult, { finalStatus: hasFatalErrorPayload ? "error" : "ok" })
return prepared.withRunSession({
  status: hasFatalErrorPayload ? "error" : "ok",
  // ...
})

The classifier is sensitive to tool-call ordering within a single agent turn: it walks the payloads in order, finds the last isError: true, and asks "did any successful payload come after it?" If yes → ok. If no → error.

LLMs do not produce deterministic tool-call orderings. Two runs of the same prompt can:

  • Emit write(file) AFTER exec(openclaw message send) → ok
  • Emit write(file) BEFORE exec(openclaw message send), with the final emit being the failed send → error

Both are functionally identical from the agent's perspective (the agent caught the error and fulfilled its objective via the fallback path), but the cron classifier sees them as opposite outcomes.

Why this matters

For monitoring / alerting pipelines that route on status: error:

  • False positives during normal-operation runs that involve any recoverable tool failure
  • Missed signals when the agent happened to order tools differently this run
  • Operators eventually learn to ignore the status field, defeating its purpose

For our use case (a daily canonry orchestrator that posts a digest to Discord and writes a fallback log when Discord is unreachable), the run is successful from a business perspective even if Discord posting fails — the digest is preserved, the operator can read it from disk, and the next day's run will retry. The cron status flipping to error on tool-ordering chance is purely a misclassification.

Suggested fixes

Pick one (or a combination):

A. Trust the agent's final assistant message as the outcome signal

If the agent's final assistant-visible output is non-empty and non-error-shaped, the run is ok. Tool errors during the turn that were followed by any further agent action (text or tool call) are evidence the agent saw the error and continued. This is roughly what the current logic intends, but it should be order-agnostic: was there any successful agent activity at all after the error? rather than "was the final payload non-error?".

Concretely, replace the trailing-payload check with hasAnySuccessfulPayload = params.payloads.some(p => p?.isError !== true && Boolean(p?.text?.trim())). If a successful payload exists anywhere in the turn alongside an error payload, the agent likely handled the error.

B. Add an explicit agent-signaled status sentinel

Let jobs declare a regex / JSON sentinel that the cron runner uses to authoritatively determine status from the final output. E.g.:

payload:
  kind: agentTurn
  message: "..."
  statusSentinel:
    type: jsonField
    path: "$.discordPosted"   # or "$.ok", "$.success", etc.
    okWhen: { equals: true }

When set, the agent's own output is the source of truth. When unset, fall back to current heuristic.

C. Treat tool errors as warning rather than fatal when the agent continues

Introduce a third status — ok-with-warnings — for runs where a tool errored but the agent kept going. Preserves the diagnostic value of the existing classifier without misrouting alerts.

My lean is (A) because it's the smallest change and matches the spirit of the existing heuristic without the order sensitivity. (B) is the most powerful but requires schema changes. (C) is the most truthful but introduces a new state operators have to learn.

Workarounds (for users hitting this today)

  • Inspect summary instead of status if the agent's prompt is known to produce a JSON sentinel.
  • Suppress alerts on transient tool errors at the consumer level (downstream of cron).
  • Re-order the prompt's failure-handling so the last tool call in the recovery path is non-error (e.g., write the fallback file unconditionally at the end of every turn, regardless of whether Discord succeeded). This is brittle but works.

Related files

  • dist/helpers-BalIC4F-.jsresolveCronPayloadOutcome (lines ~140-205)
  • dist/isolated-agent-SKs97XgD.jsfinalizeCronRun (lines ~643-740)
  • dist/server-cron-CM4aws4s.jsexecuteDetachedCronJob (lines ~1242-1290)

Equivalent sources likely under src/cron/ or extensions/cron/ in the repo.

Environment

  • OpenClaw 2026.5.7
  • Linux gateway via systemd user service
  • Job type: isolated (sessionTarget: isolated, payload.kind: agentTurn)
  • Agent: main, model deepseek/deepseek-v4-pro

🤖 Surfaced during canonry orchestrator setup; both runs were the same job at https://github.com/AINYC/canonry — diagnosed by tracing the dist code.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix bug(cron): isolated-job status is non-deterministic when an agent recovers from a tool error [1 pull requests, 1 comments, 2 participants]