openclaw - 💡(How to fix) Fix Cron isolated agentTurn: LLM request hangs for full job timeout — no per-request timeout [1 participants]

openclaw • 2026-04-03 17:13:02

openclaw/openclaw#60443•Fetched 2026-04-08 02:51:09

Author

JADGardner

Participants

JADGardner

Isolated agentTurn cron jobs intermittently hang for their entire timeoutSeconds budget on the initial LLM API call, with the Anthropic API never responding. The job timeout is consumed by a single hanging request, leaving no time for the actual task or fallback retries.

Regular interactive sessions on the same profile, same API key, same model work fine simultaneously.

Error Message

Profile anthropic:manual timed out. Trying next account...
FailoverError: LLM request timed out.
model-fallback: candidate_failed reason=timeout next=none

Root Cause

The timeoutSeconds on the cron payload is used as the overall job budget. When the initial LLM API call hangs (status 408), it consumes the entire budget. There is no separate, shorter per-request timeout that would allow:

Detecting a hung request early (e.g. 30s)
Retrying on the same or fallback model
Leaving time for the actual task execution

The model-fallback code does fire, but only after the full timeout — at which point there's no budget left.

Environment

OpenClaw: running from source (dev channel, ~2026.3.23)
Model: anthropic/claude-opus-4-6 (also reproduces on anthropic/claude-sonnet-4-6)
Auth: single profile (anthropic:manual, mode=token)
OS: Ubuntu, systemd user service

Symptoms

Auth refresh cron job (agentTurn, isolated, runs 3x daily at 06:00, 14:00, and 22:00):

When it works: 25-35s total (API responds in ~2s, script runs ~20s)

When it fails: Hangs for exactly timeoutSeconds (120s or 240s), then:

Profile anthropic:manual timed out. Trying next account...
FailoverError: LLM request timed out.
model-fallback: candidate_failed reason=timeout next=none

137 timeout events in one log file (~1 week), all on cron runs. Zero on interactive sessions.

Failure pattern (from run history)

Date/Time	Duration	Status
Apr 3 14:00	240s	timeout (Sonnet)
Apr 3 06:00	240s	timeout (Opus)
Apr 2 22:00	147s	OK (Opus)
Apr 2 14:00	157s	OK (Opus)
Apr 2 06:00	120s	timeout (Opus)
Apr 1 22:00	120s	timeout (Opus)
Mar 27-29	120s each	10 consecutive timeouts

Even "successful" runs sometimes take 147-157s for a job that completes in 25s when the API responds promptly — suggesting the LLM request itself takes 120-130s before finally responding.
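The estimate above can be checked with simple arithmetic. This is only a sketch: it assumes the non-LLM part of a slow run costs about the same as a healthy run, and it uses the lower end of the reported 25-35s range as the baseline.

```python
# Assumption: a healthy run takes about 25s end to end (lower end of the
# reported 25-35s range), and the rest of a slow run is spent waiting
# on the LLM request.
BASELINE_S = 25
slow_runs_s = [147, 157]  # durations of the two "OK" runs in the table

extra_wait_s = [total - BASELINE_S for total in slow_runs_s]
print(extra_wait_s)  # [122, 132]
```

The result is roughly in line with the 120-130s figure quoted above.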

Expected behavior

LLM API requests should have a per-request timeout (e.g. 60s) independent of the job timeout
If a request hangs, the fallback system should retry with time remaining
A job with timeoutSeconds: 240 should not spend all 240s on a single hung API call
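The three expectations above can be sketched as follows. This is a minimal illustration, not OpenClaw's implementation: `call_with_fallback`, `AllCandidatesFailed`, and the simulated requests are hypothetical names, and the real project may structure its fallback logic differently. Each attempt is capped by a per-request timeout and by whatever is left of the job budget, so a hung call cannot consume the whole job.

```python
import asyncio
import time


class AllCandidatesFailed(Exception):
    """Raised when every candidate timed out or the job budget ran out."""


async def call_with_fallback(candidates, job_timeout_s, request_timeout_s):
    """Try each candidate in order, capping every attempt separately.

    candidates: list of (name, zero-argument coroutine function) pairs.
    """
    deadline = time.monotonic() + job_timeout_s
    for name, make_request in candidates:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the outer job budget is spent
        # An attempt never gets more than the per-request cap, and never
        # more than what is left of the job budget.
        attempt_timeout = min(request_timeout_s, remaining)
        try:
            result = await asyncio.wait_for(make_request(), attempt_timeout)
            return name, result
        except asyncio.TimeoutError:
            continue  # hung request: move on while budget remains
    raise AllCandidatesFailed("no candidate responded within the job budget")


async def _hung_request():
    await asyncio.sleep(3600)  # simulates an API call that never answers


async def _healthy_request():
    await asyncio.sleep(0.01)
    return "ok"


def demo():
    # Scaled-down numbers so the demo finishes quickly.
    return asyncio.run(call_with_fallback(
        [("primary", _hung_request), ("fallback", _healthy_request)],
        job_timeout_s=2.0,
        request_timeout_s=0.2,
    ))
```

With a single profile and no fallback model, as in this report, the same candidate could be listed more than once so that a retry still happens within the job budget.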

Workaround attempts

Bumped timeoutSeconds from 120 to 240 — helps marginally but still fails
Switched model from Opus to Sonnet — same issue
Manual runs of the same script from interactive sessions work consistently in 25s

Related issues

#42632 — same symptom (isolated agentTurn timeout on minimal prompt)
#40237 — WS self-contention (different root cause but similar presentation)
#34644 — feature request for configurable embedded agent LLM-request timeout

Suggested fix

Add a configurable agents.defaults.llmRequestTimeoutSeconds (or per-job payload.llmRequestTimeoutSeconds) that caps individual LLM API calls, defaulting to something like 60-90s. The overall job timeout should be the outer budget, not the per-request timeout.
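One way the proposed setting could be resolved is sketched below. The key names come from the suggestion above and are proposals, not existing OpenClaw options; the nested-dict layout, the `resolve_request_timeout` helper, and the 60s default are assumptions made for illustration only.

```python
# Hypothetical default, taken from the 60-90s range suggested in the issue.
DEFAULT_LLM_REQUEST_TIMEOUT_S = 60


def resolve_request_timeout(config, payload):
    """Pick the per-request cap: job payload first, then agent defaults."""
    job_timeout = payload["timeoutSeconds"]
    defaults = config.get("agents", {}).get("defaults", {})
    per_request = payload.get(
        "llmRequestTimeoutSeconds",
        defaults.get("llmRequestTimeoutSeconds", DEFAULT_LLM_REQUEST_TIMEOUT_S),
    )
    # The job timeout stays the outer budget, so the cap never exceeds it.
    return min(per_request, job_timeout)


# Example: a 240s job with no overrides gets a 60s per-request cap.
print(resolve_request_timeout({}, {"timeoutSeconds": 240}))
```

Clamping to the job timeout keeps the two values consistent even if someone configures a per-request timeout larger than the job budget.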

extent analysis

TL;DR

Implement a configurable per-request timeout for LLM API calls to prevent a single hung request from consuming the entire job timeout budget.

Guidance

Introduce a separate timeout for LLM API requests, independent of the overall job timeout, to detect and retry hung requests early.
Consider adding a configurable parameter, such as agents.defaults.llmRequestTimeoutSeconds or payload.llmRequestTimeoutSeconds, to control this per-request timeout.
Set a reasonable default value for the per-request timeout, such as 60-90 seconds: long enough for slow but healthy requests to complete, and short enough that a hung request cannot consume the whole job budget.
Review the model-fallback code to ensure it can retry requests with the remaining job timeout budget after a per-request timeout occurs.

Example

No specific code example is provided, as the issue suggests modifying the existing codebase to introduce a configurable per-request timeout.

Notes

The suggested fix aims to address the root cause of the issue by introducing a separate timeout for LLM API requests. However, the optimal value for this timeout may depend on the specific use case and requirements of the application.

Recommendation

Implement a configurable per-request timeout for LLM API calls. This addresses the identified root cause and allows more robust handling of hung requests.
