openclaw - 💡(How to fix) Fix [Bug]: Cron payload-pinned model fallback chain does not engage on stream-stall failures (provider opens stream, never sends a chunk) [1 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

OpenClaw's cron job documentation (docs/automation/cron-jobs.md) states that configured fallback chains continue to apply when a cron's pinned primary model fails. In practice, the fallback chain is only triggered by clean failures — HTTP error codes, connection refused, auth refusal. Stream-stall failures (provider accepts the connection, sends response headers, but never sends a first chunk) are treated as still-in-progress and the fallback never engages, so the cron just sits until the outer watchdog kills the whole job.

Error Message

OpenClaw's cron job documentation (docs/automation/cron-jobs.md) states that configured fallback chains continue to apply when a cron's pinned primary model fails. In practice, the fallback chain is only triggered by clean failures — HTTP error codes, connection refused, auth refusal. Stream-stall failures (provider accepts the connection, sends response headers, but never sends a first chunk) are treated as still-in-progress and the fallback never engages, so the cron just sits until the outer watchdog kills the whole job. The fallback chain only sees a "failure" signal when a model returns an actual HTTP error, throws a connection exception, or otherwise produces a clean error response that the conversation loop can catch and route to retry. A silent stalled stream produces no such signal until the stream-stale watchdog fires, by which time the cron-level inactivity timer has already won the race.

Root Cause

OpenClaw's cron job documentation (docs/automation/cron-jobs.md) states that configured fallback chains continue to apply when a cron's pinned primary model fails. In practice, the fallback chain is only triggered by clean failures — HTTP error codes, connection refused, auth refusal. Stream-stall failures (provider accepts the connection, sends response headers, but never sends a first chunk) are treated as still-in-progress and the fallback never engages, so the cron just sits until the outer watchdog kills the whole job.

Fix Action

Fixed

Code Example

cron: job execution timed out (last phase: model-call-started)

---

# List all crons with payload.model hard-pinned
python3 -c "
import json
d = json.load(open('~/.openclaw/cron/jobs.json'.replace('~', __import__('os').path.expanduser('~'))))
jobs = d if isinstance(d, list) else d.get('jobs', [])
for j in jobs:
    p = j.get('payload') or {}
    if p.get('model'):
        print(f'{j.get(\"name\")}: model={p[\"model\"]} timeout={p.get(\"timeoutSeconds\")}s fallbacks={p.get(\"fallbacks\")}')"

# Inspect a specific cron's run record
ls -lat ~/.openclaw/cron/runs/<job-id>.jsonl
tail -3 ~/.openclaw/cron/runs/<job-id>.jsonl | python3 -m json.tool
RAW_BUFFERClick to expand / collapse

Summary

OpenClaw's cron job documentation (docs/automation/cron-jobs.md) states that configured fallback chains continue to apply when a cron's pinned primary model fails. In practice, the fallback chain is only triggered by clean failures — HTTP error codes, connection refused, auth refusal. Stream-stall failures (provider accepts the connection, sends response headers, but never sends a first chunk) are treated as still-in-progress and the fallback never engages, so the cron just sits until the outer watchdog kills the whole job.

Reproduction (real production incident, 2026-05-23)

  • OpenClaw version: 2026.5.20 (stock install, macOS 26.5, Node v25.6.0)
  • Cron: Workspace Cleanup (agentId: tenten, schedule 0 4 * * 0 Europe/Athens)
  • Pinned model: minimax-portal/MiniMax-M2.5 (via payload.model)
  • Payload timeout: timeoutSeconds: 300
  • Agent fallback chain configured for TenTen: minimax-portal/MiniMax-M2.7 → opencode-go/mimo-v2.5 → openai/gpt-5.5

The cron fired at 2026-05-23T21:00:32 (runAtMs: 1779584432676). MiniMax-M2.5 was experiencing a transient stream-stall outage at the time (other clients calling the same endpoint observed the same behavior — connection accepted, response headers sent, no chunks ever emitted). The cron's runHeartbeatOnce/agent run waited 1110928 ms (18 min 30 sec) before the outer watchdog terminated it with:

cron: job execution timed out (last phase: model-call-started)

Two issues with the observed behavior:

  1. The payload timeoutSeconds: 300 was never enforced — the call ran 3.7× longer than the configured per-cron timeout before any watchdog fired. (Separate bug? May be related — the per-attempt timeout doesn't appear to apply to stream-stall scenarios either.)
  2. Critically: the agent-level fallback chain (MiniMax-M2.7 → mimo-v2.5 → gpt-5.5) never engaged. All three are healthy providers; if any of them had been tried, the cron would have completed normally. Instead the whole job was killed with no fallback attempt.

Verified independently via direct API tests after the incident:

  • MiniMax-M2.5 (the failing primary): healthy by the time we tested (first chunk in 2s, full response in 3s)
  • MiniMax-M2.7 (first fallback): healthy
  • gpt-5.5 (last fallback): healthy

So the providers themselves recovered. The cron failure was specifically the absence of fallback engagement during the stall window.

Why the fallback doesn't engage

From source inspection of 2026.5.20:

  • The chat-completion client (agent/chat_completion_helpers.py equivalent path) has a stream-stale watchdog that kills connections after N seconds of no chunks (180s threshold observed in similar Hermes path; OpenClaw value not yet pinned).
  • But the cron run's outer watchdog at cron.scheduler level fires on overall job idle, which (in our case) preceded enough stall-watchdog retries to escalate to the next fallback model.
  • Net effect: the cron is killed before the model-level fallback can promote to the next provider in the chain.

The fallback chain only sees a "failure" signal when a model returns an actual HTTP error, throws a connection exception, or otherwise produces a clean error response that the conversation loop can catch and route to retry. A silent stalled stream produces no such signal until the stream-stale watchdog fires, by which time the cron-level inactivity timer has already won the race.

Suggested fix

Treat "no first chunk received within N seconds of stream open" as a failure that triggers the configured fallback chain — analogous to how a clean HTTP 5xx already does. Concretely:

  • When stream.firstChunkAt exceeds a configurable threshold (e.g. 30-60s), abort the in-flight request, mark the current model as a failed candidate, and proceed to the next entry in the fallback chain.
  • This threshold should be smaller than the cron's per-attempt timeout and much smaller than the outer cron idle watchdog, so the chain has time to escalate before the cron is killed.
  • Reusing the existing fallback-decision machinery (the model fallback decision: decision=candidate_failed requested=… candidate=… reason=… next=… log lines already emitted on clean failures) would keep the implementation consistent with the documented behavior.

Impact

Any cron that hard-pins a model in payload.model is vulnerable to this — which is ~92% of crons on the affected install (24 of 26). The blast radius is per-cron rather than systemic, but a single transient provider stall during a cron's firing window produces a fully-missed run with no recovery, even though the fallback chain is documented to handle exactly this case.

Same architectural pattern observed on a separate Hermes setup (different non-OpenClaw codebase, same family of issue) on the same morning — so this isn't OpenClaw-unique, but the OpenClaw docs do promise fallback engagement, which makes the gap noticeable here.

Verification commands

# List all crons with payload.model hard-pinned
python3 -c "
import json
d = json.load(open('~/.openclaw/cron/jobs.json'.replace('~', __import__('os').path.expanduser('~'))))
jobs = d if isinstance(d, list) else d.get('jobs', [])
for j in jobs:
    p = j.get('payload') or {}
    if p.get('model'):
        print(f'{j.get(\"name\")}: model={p[\"model\"]} timeout={p.get(\"timeoutSeconds\")}s fallbacks={p.get(\"fallbacks\")}')"

# Inspect a specific cron's run record
ls -lat ~/.openclaw/cron/runs/<job-id>.jsonl
tail -3 ~/.openclaw/cron/runs/<job-id>.jsonl | python3 -m json.tool

Happy to share full session JSONL / cron run record from the May 23 incident if useful for testing the fix.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING