openclaw - 💡(How to fix) Fix Cron isolated agentTurn: LLM request hangs for full job timeout — no per-request timeout [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#60443Fetched 2026-04-08 02:51:09
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Participants

Isolated agentTurn cron jobs intermittently hang for their entire timeoutSeconds budget on the initial LLM API call, with the Anthropic API never responding. The job timeout is consumed by a single hanging request, leaving no time for the actual task or fallback retries.

Regular interactive sessions on the same profile, same API key, same model work fine simultaneously.

Error Message

Profile anthropic:manual timed out. Trying next account... FailoverError: LLM request timed out. model-fallback: candidate_failed reason=timeout next=none

Root Cause

The timeoutSeconds on the cron payload is used as the overall job budget. When the initial LLM API call hangs (status 408), it consumes the entire budget. There is no separate, shorter per-request timeout that would allow:

  1. Detecting a hung request early (e.g. 30s)
  2. Retrying on the same or fallback model
  3. Leaving time for the actual task execution

The model-fallback code does fire, but only after the full timeout — at which point there's no budget left.

Fix Action

Fix / Workaround

Workaround attempts

Code Example

Profile anthropic:manual timed out. Trying next account...
  FailoverError: LLM request timed out.
  model-fallback: candidate_failed reason=timeout next=none
RAW_BUFFERClick to expand / collapse

Summary

Isolated agentTurn cron jobs intermittently hang for their entire timeoutSeconds budget on the initial LLM API call, with the Anthropic API never responding. The job timeout is consumed by a single hanging request, leaving no time for the actual task or fallback retries.

Regular interactive sessions on the same profile, same API key, same model work fine simultaneously.

Environment

  • OpenClaw: running from source (dev channel, ~2026.3.23)
  • Model: anthropic/claude-opus-4-6 (also reproduces on anthropic/claude-sonnet-4-6)
  • Auth: single profile (anthropic:manual, mode=token)
  • OS: Ubuntu, systemd user service

Symptoms

Auth refresh cron job (agentTurn, isolated, runs 3x daily at 06/14/22:00):

  • When it works: 25-35s total (API responds in ~2s, script runs ~20s)
  • When it fails: Hangs for exactly timeoutSeconds (120s or 240s), then:
    Profile anthropic:manual timed out. Trying next account...
    FailoverError: LLM request timed out.
    model-fallback: candidate_failed reason=timeout next=none

137 timeout events in one log file (~1 week), all on cron runs. Zero on interactive sessions.

Failure pattern (from run history)

Date/TimeDurationStatus
Apr 3 14:00240stimeout (Sonnet)
Apr 3 06:00240stimeout (Opus)
Apr 2 22:00147sOK (Opus)
Apr 2 14:00157sOK (Opus)
Apr 2 06:00120stimeout (Opus)
Apr 1 22:00120stimeout (Opus)
Mar 27-29120s each10 consecutive timeouts

Even "successful" runs sometimes take 147-157s for a job that completes in 25s when the API responds promptly — suggesting the LLM request itself takes 120-130s before finally responding.

Root cause analysis

The timeoutSeconds on the cron payload is used as the overall job budget. When the initial LLM API call hangs (status 408), it consumes the entire budget. There is no separate, shorter per-request timeout that would allow:

  1. Detecting a hung request early (e.g. 30s)
  2. Retrying on the same or fallback model
  3. Leaving time for the actual task execution

The model-fallback code does fire, but only after the full timeout — at which point there's no budget left.

Expected behavior

  • LLM API requests should have a per-request timeout (e.g. 60s) independent of the job timeout
  • If a request hangs, the fallback system should retry with time remaining
  • A job with timeoutSeconds: 240 should not spend all 240s on a single hung API call

Workaround attempts

  • Bumped timeoutSeconds from 120 to 240 — helps marginally but still fails
  • Switched model from Opus to Sonnet — same issue
  • Manual runs of the same script from interactive sessions work consistently in 25s

Related issues

  • #42632 — same symptom (isolated agentTurn timeout on minimal prompt)
  • #40237 — WS self-contention (different root cause but similar presentation)
  • #34644 — feature request for configurable embedded agent LLM-request timeout

Suggested fix

Add a configurable agents.defaults.llmRequestTimeoutSeconds (or per-job payload.llmRequestTimeoutSeconds) that caps individual LLM API calls, defaulting to something like 60-90s. The overall job timeout should be the outer budget, not the per-request timeout.

extent analysis

TL;DR

Implement a configurable per-request timeout for LLM API calls to prevent a single hung request from consuming the entire job timeout budget.

Guidance

  • Introduce a separate timeout for LLM API requests, independent of the overall job timeout, to detect and retry hung requests early.
  • Consider adding a configurable parameter, such as agents.defaults.llmRequestTimeoutSeconds or payload.llmRequestTimeoutSeconds, to control this per-request timeout.
  • Set a reasonable default value for the per-request timeout, such as 60-90 seconds, to balance between allowing sufficient time for requests to complete and preventing excessive timeouts.
  • Review the model-fallback code to ensure it can retry requests with the remaining job timeout budget after a per-request timeout occurs.

Example

No specific code example is provided, as the issue suggests modifying the existing codebase to introduce a configurable per-request timeout.

Notes

The suggested fix aims to address the root cause of the issue by introducing a separate timeout for LLM API requests. However, the optimal value for this timeout may depend on the specific use case and requirements of the application.

Recommendation

Apply a workaround by implementing a configurable per-request timeout for LLM API calls, as this addresses the identified root cause of the issue and allows for more robust handling of hung requests.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  • LLM API requests should have a per-request timeout (e.g. 60s) independent of the job timeout
  • If a request hangs, the fallback system should retry with time remaining
  • A job with timeoutSeconds: 240 should not spend all 240s on a single hung API call

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING