openclaw - 💡(How to fix) Fix [Enhancement] Embedded-run upstream timeout is hard-coded at ~60s; provider 429/Retry-After is erased [1 comments, 2 participants]

lesdekock · 2026-05-01T13:09:58Z


In a self-hosted OpenClaw deployment we've observed that the gateway's embedded-run path enforces a hard ~60-second timeout on the upstream provider call, with no configuration knob exposed via openclaw.json, helm chart values, or pod environment variables.

When upstream providers (Anthropic / OpenAI in our case) take longer than 60 s to respond — whether because of silent TPM throttling, a slow vision pathway, or genuinely long generation — the gateway fires its internal timeout and surfaces a generic [agent] embedded run failover decision: … decision=surface_error reason=timeout to the client. Whatever the upstream actually returned (a 429, a Retry-After, a partial stream, an overloaded_error) is erased in the process.

Why this is a problem for downstream operators

1. No way to lift the ceiling for legitimately long calls. Long-context vision OCR, large completions, or temporarily slow providers all hit the same wall. Self-hosted operators can't tune this for their workload without forking the gateway.
2. Clients can't implement intelligent back-off. Because the gateway collapses provider rate-limits, slow streams, and stuck calls into a single "timed out at 60 s" surface_error, downstream callers have no information to decide between:
   - retry now (transient slow stream)
   - back off with Retry-After (rate-limit)
   - give up (provider hard failure)

   They all look identical to the client. The result in practice is that clients implement crude time-based back-off and either retry too aggressively (hammering a throttled provider) or too conservatively (leaving capacity on the floor).
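
To make the three-way decision above concrete, here is a minimal sketch of what a client could do if the gateway error carried structured fields. The `GatewayError` shape (`upstream_status`, `retry_after_s`, `bytes_received`) is hypothetical and not something the gateway emits today, and the fallback delays are placeholder values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_NOW = "retry_now"
    BACK_OFF = "back_off"
    GIVE_UP = "give_up"


@dataclass
class GatewayError:
    # Hypothetical fields: today the client only sees reason=timeout.
    reason: str
    upstream_status: Optional[int] = None   # e.g. 429, if the provider answered
    retry_after_s: Optional[float] = None   # parsed Retry-After, if any
    bytes_received: int = 0                 # > 0 means the stream was progressing


def decide(err: GatewayError) -> tuple[Action, float]:
    """Return the action and the delay in seconds before the next attempt."""
    if err.upstream_status == 429 or err.retry_after_s is not None:
        # Rate-limited: honour the provider's hint, else use a placeholder delay.
        delay = err.retry_after_s if err.retry_after_s is not None else 30.0
        return Action.BACK_OFF, delay
    if err.upstream_status is not None and 400 <= err.upstream_status < 500:
        # Assumption: other 4xx responses are treated as hard failures.
        return Action.GIVE_UP, 0.0
    if err.reason == "timeout" and err.bytes_received > 0:
        # Slow but progressing stream: an immediate retry is reasonable.
        return Action.RETRY_NOW, 0.0
    # No information at all, which is the only case clients can see today.
    return Action.BACK_OFF, 60.0
```

With only `reason=timeout` available, every failure falls through to the last branch, which is exactly the crude time-based back-off described above.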

Concrete scenario from production

We hit this hard yesterday processing a 27-minute screen recording for a CXN intake pipeline. The flow is:

- Transcribe the long audio with Whisper (works fine via /v1/audio/transcriptions once the body-size limit is raised on the fronting nginx, which is a separate operator-side fix)
- Send vision frames to /v1/chat/completions with image_url content parts, one per keyframe (1280×720 PNGs, ~50–350 KB)

First ~10 sequential vision calls succeeded in 30–45 s each. From call 11 onwards, every call timed out at exactly the 60 s ceiling, for ~47 calls in a 35-minute window — spaced almost exactly 60 s apart in the gateway logs:

06:25:34   ← first failure
06:26:34   ← +60s
06:27:35   ← +61s
…  (47 entries)

No Overloaded errors and no isError=true entries from upstream during the same window — the gateway timed out before getting any explicit error response from Anthropic. Most likely the upstream silently throttled (TPM-bound after the burst) and the gateway saw it as "slow."
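
For anyone wanting to check their own logs for the same cadence, a rough sketch follows. It assumes failure timestamps have already been extracted as HH:MM:SS strings from a single day; the function names and the 2-second tolerance are ours, not part of any OpenClaw tooling.

```python
from datetime import datetime


def failure_gaps(timestamps: list[str]) -> list[int]:
    """Seconds between consecutive HH:MM:SS timestamps (same day assumed)."""
    times = [datetime.strptime(t, "%H:%M:%S") for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


def looks_like_fixed_ceiling(gaps: list[int], ceiling_s: int = 60, tolerance_s: int = 2) -> bool:
    """True when every gap sits within tolerance of the suspected timeout."""
    return bool(gaps) and all(abs(g - ceiling_s) <= tolerance_s for g in gaps)


gaps = failure_gaps(["06:25:34", "06:26:34", "06:27:35"])
print(gaps)                            # [60, 61]
print(looks_like_fixed_ceiling(gaps))  # True
```

Gaps that cluster this tightly around one value point at a fixed timer rather than at variable provider latency.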

If we'd been able to either (a) raise the timeout to 120–180 s, or (b) see a Retry-After from the provider, the client could have adapted instead of building a 47-frame queue of dead requests.
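
As an illustration of option (b), this is roughly how a client could turn a passed-through Retry-After value into a wait time. HTTP allows the header to be either a number of seconds or an HTTP-date, so the sketch handles both. `parse_retry_after` is our own helper name, and it assumes the gateway forwards the raw header string unchanged.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header value is unusable."""
    value = value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry immediately", never a negative wait.
    return max(0.0, (when - now).total_seconds())
```

Returning None for garbage input lets the caller fall back to its own back-off schedule instead of crashing on a malformed header.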

Asks

In rough priority order:

1. Make the embedded-run upstream timeout configurable. Either via openclaw.json (e.g. gateway.embeddedRun.upstreamTimeoutMs) or as an env var the operator can set. Keeping 60 s as the default is fine; we just need an override path for vision-heavy / long-context workloads.
2. Surface upstream 429 / Retry-After as structured fields on the error. When the gateway times out due to provider rate-limit, the client should see something better than reason=timeout. Minimum viable: pass through the HTTP status code and Retry-After header from the provider response if one was received before the gateway's own timeout fired.
3. (Stretch) Streaming-aware timeout reset. If the upstream is actually sending content (partial JSON, partial tokens) within the timeout window, reset the clock. A slow-but-progressing upstream is currently killed identically to a stuck one.
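
To illustrate the stretch ask, here is a small asyncio sketch of an idle timeout that restarts on every chunk, as opposed to one wall-clock limit for the whole call. It is not how the gateway is implemented (we have not read that code path); `read_with_idle_timeout` and `slow_stream` are made-up names, and `slow_stream` only stands in for a provider response.

```python
import asyncio
from typing import AsyncIterator


async def read_with_idle_timeout(chunks: AsyncIterator[bytes], idle_timeout_s: float) -> bytes:
    """Collect a stream, failing only if no chunk arrives within idle_timeout_s."""
    received = bytearray()
    iterator = chunks.__aiter__()
    while True:
        try:
            # The timer applies per chunk, so it effectively resets on progress.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_timeout_s)
        except StopAsyncIteration:
            return bytes(received)
        received.extend(chunk)


async def slow_stream(n: int, delay_s: float) -> AsyncIterator[bytes]:
    """Stand-in for an upstream that emits n chunks, one every delay_s seconds."""
    for i in range(n):
        await asyncio.sleep(delay_s)
        yield f"tok{i} ".encode()
```

For example, `asyncio.run(read_with_idle_timeout(slow_stream(10, 0.03), 0.2))` completes even though the whole stream takes longer than 0.2 s, while a stream that stalls for longer than the idle limit raises `asyncio.TimeoutError`. A real implementation would probably still want an overall cap on top of this, so that a trickling upstream cannot hold a request open forever.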

Operator-side workarounds we're using meanwhile

- Client-side serialisation (queue-job): vision calls were already effectively serialised gateway-side, so this just makes the queue visible to us
- Reduce request rate to ~1 every 60 s (one frame per minute) to give the provider's TPM bucket time to refill
- Bump client read-timeout above 60 s (so we don't double up on the same wall the gateway is hitting)
- Exponential back-off on timeout, with a vision_skipped terminal state instead of infinite retry

These work but cap throughput well below what the upstream provider can actually sustain.
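
For reference, the back-off workaround looks roughly like the sketch below. This is a simplified illustration, not our production code: `call_with_backoff`, the attempt count, the delay values, and the use of jitter are all assumptions chosen for the example. Only the vision_skipped terminal state comes from the list above.

```python
import random
import time
from typing import Callable, Optional


def call_with_backoff(
    call: Callable[[], str],
    max_attempts: int = 5,
    base_delay_s: float = 5.0,
    max_delay_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, Optional[str]]:
    """Return ("ok", result), or ("vision_skipped", None) once attempts run out."""
    for attempt in range(max_attempts):
        try:
            return "ok", call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Random jitter keeps queued frames from retrying in lockstep.
            sleep(random.uniform(0, delay))
    # Terminal state: the frame is dropped instead of being retried forever.
    return "vision_skipped", None
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real waiting.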

Tracking on our side

- Con-x-ion/openclaw-platform#156: platform-level tracking issue, will follow this one for resolution
- Con-x-ion/openclaw-platform#155: the workload that surfaced this

Environment

- Gateway: self-hosted via the standard openclaw/openclaw image, built from main branch
- k3s cluster on Azure
- Single-replica gateway (per the project's own "exactly 1 replica" constraint until session state is externalised)

Happy to pull more logs or test a patched build if it'd help.

extent analysis

TL;DR

The gateway's embedded-run path enforces a hard ~60-second timeout on upstream provider calls and discards whatever the provider returned. The issue asks for the timeout to be made configurable and for upstream 429 / Retry-After responses to be surfaced to the client.

Guidance

- The likely cause is the gateway's internal timeout firing before the upstream provider responds, which leaves the client with a generic "timeout" error.
- To verify, check the gateway logs for failures spaced almost exactly 60 s apart and confirm that the upstream provider returned no explicit error response within that window.
- One workaround is to reduce the request rate so the provider's TPM bucket has time to refill, but this caps throughput below what the upstream provider can sustain.
- Another mitigation is exponential back-off on timeout with a terminal state, although it still cannot tell a rate-limit apart from a slow stream.

Example

No code snippet is provided as the issue is more related to configuration and timeout settings.

Notes

The provided information suggests that the issue is specific to the self-hosted OpenClaw deployment and the gateway's embedded-run path. The proposed solutions and workarounds may not be applicable to other environments or configurations.

Recommendation

Until the gateway changes, apply the operator-side workarounds: reduce the request rate and use exponential back-off with a terminal state. Making the timeout configurable or surfacing upstream error responses is not possible from configuration today, according to the report, and would need a change to the gateway itself.
