openclaw - 💡(How to fix) Fix [Bug]: Cron payload-pinned model fallback chain does not engage on stream-stall failures (provider opens stream, never sends a chunk) [1 pull requests]

StepCodex · 2026-05-24T02:37:59Z

[openclaw] OpenClaw's cron job documentation docs/automation/cron-jobs.md states that configured fallback chains continue to apply when a cron's pinned primary… OpenClaw's cron job documentation (`docs/automation/cron-jobs.md`) states that configured fallback chains continue to apply when a cron's pinned primary model fails. In practice, the fallback chain is only triggered by **clean** failures — HTTP error codes, connection refused, auth refusal. **Stream-stall** failures (provider accepts the connection, sends response headers, but never sends a first chunk) are treated as still-in-progress and the fallback never engages, so the cron just sits until the outer watchdog kills the whole job. ## Fixed - Fixed by PR: fix(cron): stop fallback attempts when cron budget is exhausted (https://github.com/openclaw/openclaw/pull/52365) ## Summary OpenClaw's cron job documentation (`docs/automation/cron-jobs.md`) states that configured fallback chains continue to apply when a cron's pinned primary model fails. In practice, the fallback chain is only triggered by **clean** failures — HTTP error codes, connection refused, auth refusal. **Stream-stall** failures (provider accepts the connection, sends response headers, but never sends a first chunk) are treated as still-in-progress and the fallback never engages, so the cron just sits until the outer watchdog kills the whole job. ## Reproduction (real production incident, 2026-05-23) - **OpenClaw version:** 2026.5.20 (stock install, macOS 26.5, Node v25.6.0) - **Cron:** Workspace Cleanup (`agentId: tenten`, schedule `0 4 * * 0` Europe/Athens) - **Pinned model:** `minimax-portal/MiniMax-M2.5` (via `payload.model`) - **Payload timeout:** `timeoutSeconds: 300` - **Agent fallback chain configured for TenTen:** `minimax-portal/MiniMax-M2.7 → opencode-go/mimo-v2.5 → openai/gpt-5.5` The cron fired at 2026-05-23T21:00:32 (`runAtMs: 1779584432676`). MiniMax-M2.5 was experiencing a transient stream-stall outage at the time (other clients calling the same endpoint observed the same behavior — connection accepted, response headers sent, no chunks ever emitted). The cron's `runHeartbeatOnce`/agent run waited 1110928 ms (18 min 30 sec) before the outer watchdog terminated it with: ``` cron: job execution timed out (last phase: model-call-started) ``` Two issues with the observed behavior: 1. The payload `timeoutSeconds: 300` was never enforced — the call ran 3.7× longer than the configured per-cron timeout before any watchdog fired. (Separate bug? May be related — the per-attempt timeout doesn't appear to apply to stream-stall scenarios either.) 2. **Critically:** the agent-level fallback chain (`MiniMax-M2.7 → mimo-v2.5 → gpt-5.5`) never engaged. All three are healthy providers; if any of them had been tried, the cron would have completed normally. Instead the whole job was killed with no fallback attempt. Verified independently via direct API tests after the incident: - MiniMax-M2.5 (the failing primary): healthy by the time we tested (first chunk in 2s, full response in 3s) - MiniMax-M2.7 (first fallback): healthy - gpt-5.5 (last fallback): healthy So the providers themselves recovered. The cron failure was specifically the absence of fallback engagement during the stall window. ## Why the fallback doesn't engage From source inspection of `2026.5.20`: - The chat-completion client (`agent/chat_completion_helpers.py` equivalent path) has a stream-stale watchdog that kills connections after N seconds of no chunks (180s threshold observed in similar Hermes path; OpenClaw value not yet pinned). - But the cron run's outer watchdog at `cron.scheduler` level fires on overall job idle, which (in our case) preceded enough stall-watchdog retries to escalate to the next fallback model. - Net effect: the cron is killed before the model-level fallback can promote to the next provider in the chain. The fallback chain only sees a "failure" signal when a model returns an actual HTTP error, throws a connection exception, or otherwise produces a clean error response that the conversation loop can catch and route to retry. A silent stalled stream produces no such signal until the stream-stale watchdog fires, by which time the cron-level inactivity timer has already won the race. ## Suggested fix Treat "no first chunk received within N seconds of stream open" as a failure that triggers the configured fallback chain — analogous to how a clean HTTP 5xx already does. Concretely: - When `stream.firstChunkAt` exceeds a configurable threshold (e.g. 30-60s), abort the in-flight request, mark the current model as a failed candidate, and proceed to the next entry in the fallback chain. - This threshold should be smaller than the cron's per-attempt timeout and much smaller than the outer cron idle watchdog, so the chain has time to escalate before the cron is killed. - Reusing the existing fallback-decision machinery (the `model fallback decision: decision=candidate_failed requested=… candidate=… reason=… next=

Error Message

OpenClaw's cron job documentation (docs/automation/cron-jobs.md) states that configured fallback chains continue to apply when a cron's pinned primary model fails. In practice, the fallback chain is only triggered by clean failures — HTTP error codes, connection refused, auth refusal. Stream-stall failures (provider accepts the connection, sends response headers, but never sends a first chunk) are treated as still-in-progress and the fallback never engages, so the cron just sits until the outer watchdog kills the whole job. The fallback chain only sees a "failure" signal when a model returns an actual HTTP error, throws a connection exception, or otherwise produces a clean error response that the conversation loop can catch and route to retry. A silent stalled stream produces no such signal until the stream-stale watchdog fires, by which time the cron-level inactivity timer has already won the race.

Root Cause

Code Example

cron: job execution timed out (last phase: model-call-started)

---

# List all crons with payload.model hard-pinned
python3 -c "
import json
d = json.load(open('~/.openclaw/cron/jobs.json'.replace('~', __import__('os').path.expanduser('~'))))
jobs = d if isinstance(d, list) else d.get('jobs', [])
for j in jobs:
    p = j.get('payload') or {}
    if p.get('model'):
        print(f'{j.get(\"name\")}: model={p[\"model\"]} timeout={p.get(\"timeoutSeconds\")}s fallbacks={p.get(\"fallbacks\")}')"

# Inspect a specific cron's run record
ls -lat ~/.openclaw/cron/runs/<job-id>.jsonl
tail -3 ~/.openclaw/cron/runs/<job-id>.jsonl | python3 -m json.tool

Summary

Reproduction (real production incident, 2026-05-23)

OpenClaw version: 2026.5.20 (stock install, macOS 26.5, Node v25.6.0)
Cron: Workspace Cleanup (agentId: tenten, schedule 0 4 * * 0 Europe/Athens)
Pinned model: minimax-portal/MiniMax-M2.5 (via payload.model)
Payload timeout: timeoutSeconds: 300
Agent fallback chain configured for TenTen: minimax-portal/MiniMax-M2.7 → opencode-go/mimo-v2.5 → openai/gpt-5.5

The cron fired at 2026-05-23T21:00:32 (runAtMs: 1779584432676). MiniMax-M2.5 was experiencing a transient stream-stall outage at the time (other clients calling the same endpoint observed the same behavior — connection accepted, response headers sent, no chunks ever emitted). The cron's runHeartbeatOnce/agent run waited 1110928 ms (18 min 30 sec) before the outer watchdog terminated it with:

cron: job execution timed out (last phase: model-call-started)

Two issues with the observed behavior:

The payload timeoutSeconds: 300 was never enforced — the call ran 3.7× longer than the configured per-cron timeout before any watchdog fired. (Separate bug? May be related — the per-attempt timeout doesn't appear to apply to stream-stall scenarios either.)
Critically: the agent-level fallback chain (MiniMax-M2.7 → mimo-v2.5 → gpt-5.5) never engaged. All three are healthy providers; if any of them had been tried, the cron would have completed normally. Instead the whole job was killed with no fallback attempt.

Verified independently via direct API tests after the incident:

MiniMax-M2.5 (the failing primary): healthy by the time we tested (first chunk in 2s, full response in 3s)
MiniMax-M2.7 (first fallback): healthy
gpt-5.5 (last fallback): healthy

So the providers themselves recovered. The cron failure was specifically the absence of fallback engagement during the stall window.

Why the fallback doesn't engage

From source inspection of 2026.5.20:

The chat-completion client (agent/chat_completion_helpers.py equivalent path) has a stream-stale watchdog that kills connections after N seconds of no chunks (180s threshold observed in similar Hermes path; OpenClaw value not yet pinned).
But the cron run's outer watchdog at cron.scheduler level fires on overall job idle, which (in our case) preceded enough stall-watchdog retries to escalate to the next fallback model.
Net effect: the cron is killed before the model-level fallback can promote to the next provider in the chain.

The fallback chain only sees a "failure" signal when a model returns an actual HTTP error, throws a connection exception, or otherwise produces a clean error response that the conversation loop can catch and route to retry. A silent stalled stream produces no such signal until the stream-stale watchdog fires, by which time the cron-level inactivity timer has already won the race.

Suggested fix

Treat "no first chunk received within N seconds of stream open" as a failure that triggers the configured fallback chain — analogous to how a clean HTTP 5xx already does. Concretely:

When stream.firstChunkAt exceeds a configurable threshold (e.g. 30-60s), abort the in-flight request, mark the current model as a failed candidate, and proceed to the next entry in the fallback chain.
This threshold should be smaller than the cron's per-attempt timeout and much smaller than the outer cron idle watchdog, so the chain has time to escalate before the cron is killed.
Reusing the existing fallback-decision machinery (the model fallback decision: decision=candidate_failed requested=… candidate=… reason=… next=… log lines already emitted on clean failures) would keep the implementation consistent with the documented behavior.

Impact

Any cron that hard-pins a model in payload.model is vulnerable to this — which is ~92% of crons on the affected install (24 of 26). The blast radius is per-cron rather than systemic, but a single transient provider stall during a cron's firing window produces a fully-missed run with no recovery, even though the fallback chain is documented to handle exactly this case.

Same architectural pattern observed on a separate Hermes setup (different non-OpenClaw codebase, same family of issue) on the same morning — so this isn't OpenClaw-unique, but the OpenClaw docs do promise fallback engagement, which makes the gap noticeable here.

Verification commands

# List all crons with payload.model hard-pinned
python3 -c "
import json
d = json.load(open('~/.openclaw/cron/jobs.json'.replace('~', __import__('os').path.expanduser('~'))))
jobs = d if isinstance(d, list) else d.get('jobs', [])
for j in jobs:
    p = j.get('payload') or {}
    if p.get('model'):
        print(f'{j.get(\"name\")}: model={p[\"model\"]} timeout={p.get(\"timeoutSeconds\")}s fallbacks={p.get(\"fallbacks\")}')"

# Inspect a specific cron's run record
ls -lat ~/.openclaw/cron/runs/<job-id>.jsonl
tail -3 ~/.openclaw/cron/runs/<job-id>.jsonl | python3 -m json.tool

Happy to share full session JSONL / cron run record from the May 23 incident if useful for testing the fix.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix [Bug]: Cron payload-pinned model fallback chain does not engage on stream-stall failures (provider opens stream, never sends a chunk) [1 pull requests]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed