openclaw - 💡(How to fix) Fix [Enhancement] Embedded-run upstream timeout is hard-coded at ~60s; provider 429/Retry-After is erased [1 comments, 2 participants]

GitHub stats
openclaw/openclaw#75648 · Fetched 2026-05-02 05:32:20
Comments: 1 · Participants: 2 · Timeline events: 8 · Reactions: 2
Timeline (top): mentioned ×3, subscribed ×3, commented ×1, cross-referenced ×1


Summary

In a self-hosted OpenClaw deployment we've observed that the gateway's embedded-run path enforces a hard ~60-second timeout on the upstream provider call, with no configuration knob exposed via openclaw.json, helm chart values, or pod environment variables.

When upstream providers (Anthropic / OpenAI in our case) take longer than 60 s to respond — whether because of silent TPM throttling, a slow vision pathway, or genuinely long generation — the gateway fires its internal timeout and surfaces a generic [agent] embedded run failover decision: … decision=surface_error reason=timeout to the client. Whatever the upstream actually returned (a 429, a Retry-After, a partial stream, an overloaded_error) is erased in the process.

Why this is a problem for downstream operators

  1. No way to lift the ceiling for legitimately long calls. Long-context vision OCR, large completions, or temporarily slow providers all hit the same wall. Self-hosted operators can't tune this for their workload without forking the gateway.

  2. Clients can't implement intelligent back-off. Because the gateway collapses provider rate-limits, slow streams, and stuck calls into a single "timed out at 60 s" surface_error, downstream callers have no information to decide between:

    • retry now (transient slow stream)
    • back off with Retry-After (rate-limit)
    • give up (provider hard failure)

     They all look identical to the client. The result in practice is that clients implement crude time-based back-off and either retry too aggressively (hammering a throttled provider) or too conservatively (leaving capacity on the floor).
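To make the three-way choice above concrete, here is a minimal client-side sketch of how a caller might decide once the gateway passes through a status code and Retry-After value. The function names, the `saw_partial_stream` flag, and the 30-second fallback delay are all assumptions for illustration; nothing here is an existing OpenClaw API. Retry-After handling follows the two forms HTTP allows (delay in seconds or an HTTP-date).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional, Tuple

FALLBACK_DELAY_S = 30.0  # assumed default when the error carries no usable hint


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def decide(status: Optional[int], retry_after: Optional[str],
           saw_partial_stream: bool) -> Tuple[str, float]:
    """Map hypothetical structured error fields to (action, delay_seconds)."""
    if status == 429:
        delay = parse_retry_after(retry_after)
        return ("back_off", delay if delay is not None else FALLBACK_DELAY_S)
    if status is not None and 400 <= status < 500:
        return ("give_up", 0.0)  # treated here as a hard failure
    if saw_partial_stream:
        return ("retry_now", 0.0)  # upstream was progressing, just slowly
    return ("back_off", FALLBACK_DELAY_S)  # no information: today's situation
```

With only `reason=timeout` available, every call falls through to the last branch, which is the crude time-based back-off described above.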

Concrete scenario from production

We hit this hard yesterday processing a 27-minute screen recording for a CXN intake pipeline. The flow is:

  • Transcribe the long audio track with Whisper (works fine via /v1/audio/transcriptions once the body-size limit is raised on the fronting nginx, which is a separate operator-side fix)
  • Send vision frames to /v1/chat/completions with image_url content parts, one per keyframe (1280×720 PNGs, ~50–350 KB)

First ~10 sequential vision calls succeeded in 30–45 s each. From call 11 onwards, every call timed out at exactly the 60 s ceiling, for ~47 calls in a 35-minute window — spaced almost exactly 60 s apart in the gateway logs:

06:25:34   ← first failure
06:26:34   ← +60s
06:27:35   ← +61s
…  (47 entries)

No Overloaded errors and no isError=true entries from upstream during the same window — the gateway timed out before getting any explicit error response from Anthropic. Most likely the upstream silently throttled (TPM-bound after the burst) and the gateway saw it as "slow."
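The near-constant spacing in the log excerpt above can be checked mechanically. The sketch below assumes the failure timestamps have already been extracted as HH:MM:SS strings from a single day, in ascending order; the helper names and the 2-second tolerance are illustrative choices, not part of any gateway tooling.

```python
from datetime import datetime
from typing import List


def gaps_seconds(stamps: List[str]) -> List[int]:
    """Seconds between consecutive HH:MM:SS timestamps (same day, ascending)."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


def looks_like_fixed_timeout(stamps: List[str], expected_s: int = 60,
                             tolerance_s: int = 2) -> bool:
    """True if every gap sits within tolerance of the expected timeout."""
    gaps = gaps_seconds(stamps)
    return bool(gaps) and all(abs(g - expected_s) <= tolerance_s for g in gaps)


stamps = ["06:25:34", "06:26:34", "06:27:35"]
print(gaps_seconds(stamps))              # [60, 61]
print(looks_like_fixed_timeout(stamps))  # True
```

Gaps clustered this tightly around one value suggest a fixed timer rather than variable upstream latency, though the spacing also depends on how quickly the client issues the next call.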

If we'd been able to either (a) raise the timeout to 120–180 s, or (b) see a Retry-After from the provider, the client could have adapted instead of building a 47-frame queue of dead requests.

Asks

In rough priority order:

  1. Make the embedded-run upstream timeout configurable. Either via openclaw.json (e.g. gateway.embeddedRun.upstreamTimeoutMs) or as an env var the operator can set. Keeping 60 s as the default is fine; we just need an override path for vision-heavy / long-context workloads.

  2. Surface upstream 429 / Retry-After as structured fields on the error. When the gateway times out due to provider rate-limit, the client should see something better than reason=timeout — minimum viable: pass through HTTP status code and Retry-After header from the provider response if one was received before the gateway's own timeout fired.

  3. (Stretch) Streaming-aware timeout reset. If the upstream is actually sending content (partial JSON, partial tokens) within the timeout window, reset the clock. A slow-but-progressing upstream is currently killed identically to a stuck one.
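Asks 1 and 3 can be illustrated together. This is a rough Python sketch of the behaviour being requested, not the gateway's actual code: the environment variable name `OPENCLAW_EMBEDDED_RUN_UPSTREAM_TIMEOUT_MS` is hypothetical, and the upstream stream is modelled as a plain async iterator of byte chunks. The idea is that the timer applies to each gap between chunks, so a slow but progressing stream survives while a stalled one is still cut off.

```python
import asyncio
import os
from typing import AsyncIterator, List, Mapping

DEFAULT_TIMEOUT_MS = 60_000  # matches the ~60 s ceiling observed in this issue


def resolve_timeout_ms(env: Mapping[str, str] = os.environ) -> int:
    """Read a hypothetical override; fall back to the default if unset or invalid."""
    raw = env.get("OPENCLAW_EMBEDDED_RUN_UPSTREAM_TIMEOUT_MS")  # assumed name
    try:
        value = int(raw) if raw is not None else DEFAULT_TIMEOUT_MS
    except ValueError:
        return DEFAULT_TIMEOUT_MS
    return value if value > 0 else DEFAULT_TIMEOUT_MS


async def read_with_idle_timeout(chunks: AsyncIterator[bytes],
                                 idle_timeout_s: float) -> List[bytes]:
    """Collect all chunks; raises asyncio.TimeoutError only if the stream stalls.

    The timeout restarts for every chunk, so total duration is unbounded as
    long as each individual gap stays under idle_timeout_s.
    """
    received: List[bytes] = []
    iterator = chunks.__aiter__()
    while True:
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_timeout_s)
        except StopAsyncIteration:
            return received
        received.append(chunk)
```

A caller would pass `resolve_timeout_ms() / 1000` as `idle_timeout_s`. A real implementation would probably also want an overall cap so that a trickling stream cannot hold a connection open indefinitely.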

Operator-side workarounds we're using meanwhile

  • Client-side serialisation (queue-job) — vision calls were already effectively serialised gateway-side, so this just makes the queue visible to us
  • Reduce request rate to ~1 every 60 s (one frame per minute) to give the provider's TPM bucket time to refill
  • Bump client read-timeout above 60 s (so we don't double up on the same wall the gateway is hitting)
  • Exponential back-off on timeout, with a vision_skipped terminal state instead of infinite retry

These work but cap throughput well below what the upstream provider can actually sustain.
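For reference, the last workaround (exponential back-off with a terminal state) might look roughly like the sketch below. It is an illustration under assumptions rather than the reporter's actual client code: the function name, the attempt limit, and the 60-second base delay are made up, and the vision call is represented by any callable that raises `TimeoutError` when the gateway times out.

```python
import time
from typing import Any, Callable, Optional, Tuple


def call_with_backoff(call: Callable[[], Any],
                      max_attempts: int = 4,
                      base_delay_s: float = 60.0,
                      factor: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[str, Optional[Any]]:
    """Run call() until it succeeds or attempts run out.

    Returns ("ok", result) on success or ("vision_skipped", None) once
    max_attempts timeouts have occurred. Only TimeoutError is retried;
    any other exception propagates to the caller.
    """
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return ("ok", call())
        except TimeoutError:
            if attempt == max_attempts:
                break  # terminal state instead of retrying forever
            sleep(delay)
            delay *= factor
    return ("vision_skipped", None)
```

Injecting `sleep` keeps the function testable without real waiting. With the defaults shown, a frame that never succeeds is abandoned after four attempts and 60 + 120 + 240 seconds of waiting.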

Tracking on our side

  • Con-x-ion/openclaw-platform#156 — platform-level tracking issue, will follow this one for resolution
  • Con-x-ion/openclaw-platform#155 — the workload that surfaced this

Environment

  • Gateway: self-hosted via the standard openclaw/openclaw image, built from main branch
  • k3s cluster on Azure
  • Single-replica gateway (per the project's own "exactly 1 replica" constraint until session state is externalised)

Happy to pull more logs or test a patched build if it'd help.

extent analysis

TL;DR

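The gateway's embedded-run path enforces a hard ~60-second timeout on upstream provider calls and reports only reason=timeout, discarding any upstream 429 or Retry-After. The issue asks for the timeout to be made configurable and for upstream error details to be passed through; until then, operators can only work around it on the client side.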

Guidance

  • The issue is likely caused by the gateway's internal timeout firing before the upstream provider responds, resulting in a generic "timeout" error.
  • To verify, check the gateway logs for the exact 60-second timeout pattern and confirm that the upstream provider is not returning an explicit error response within that time frame.
  • A potential workaround is to reduce the request rate to give the provider's TPM bucket time to refill, but this caps throughput below what the upstream provider can sustain.
  • Another possible mitigation is to implement exponential back-off on timeout with a terminal state, but this may not be optimal for all use cases.

Example

The issue includes no code snippet, only a log excerpt; the problem concerns gateway configuration and timeout behaviour.

Notes

The provided information suggests that the issue is specific to the self-hosted OpenClaw deployment and the gateway's embedded-run path. The proposed solutions and workarounds may not be applicable to other environments or configurations.

Recommendation

Until the gateway changes, apply a client-side workaround: reduce the request rate or add exponential back-off with a terminal state. Making the timeout configurable or surfacing upstream error responses would require changes to the gateway itself, since the issue reports that no existing setting controls it.
