openclaw - 💡(How to fix) Fix [Feature/Bug]: Expose undici connect.timeout for Ollama provider + make fallback decision consistent on reason=timeout [1 participants]

openclaw/openclaw#68796, fetched 2026-04-19 15:07:19

Summary

Two related asks affecting remote-Ollama setups:

  1. Feature: Expose the undici connect.timeout (TCP connect) so operators running Ollama on a different host (LAN, tailnet) can tolerate slow connection establishment. Today it is left at undici's default of 10 s and cannot be set from openclaw.json.
  2. Bug: For the same reason=timeout from ollama/<model>, the model-fallback/decision sometimes resolves to decision=fallback_model (and recovers via the configured chain) and sometimes to decision=surface_error (no fallback attempted). The chosen path appears non-deterministic for what looks like the same failure mode.

Environment

  • OpenClaw 2026.4.15
  • Host: Linux (Ubuntu), gateway running as systemd user service
  • Ollama: remote, http://192.168.68.82:11434 (LAN)
  • Configured chain (relevant entry):
    • primary: ollama/gemma4-team
    • fallbacks: ["ollama/qwen72b-team"] (and ultimately claude-cli/claude-opus-4-7 from defaults)

Evidence — fallback inconsistency

Same gateway, same target (ollama/gemma4-team at 192.168.68.82:11434), same reason=timeout, two different decisions within ~50 minutes:

Run A (2adb9632-2f8d-4ab0-a1bc-80c4d9107b0e, 01:53 UTC):

```
[agent/embedded] embedded run agent end: ... isError=true model=gemma4-team provider=ollama error=LLM request failed: network connection error. rawError=fetch failed | Connect Timeout Error (attempted address: 192.168.68.82:11434, timeout: 10000ms)
[agent/embedded] embedded run failover decision: ... stage=assistant decision=fallback_model reason=timeout from=ollama/gemma4-team profile=-
[model-fallback/decision] model fallback decision: decision=candidate_failed requested=ollama/gemma4-team candidate=ollama/gemma4-team reason=timeout next=claude-cli/claude-opus-4-7
[model-fallback/decision] model fallback decision: decision=candidate_succeeded requested=ollama/gemma4-team candidate=claude-cli/claude-opus-4-7 reason=unknown next=none
```

Result: recovered via fallback to Opus 4.7. ✅

Run B (a4d41d27-de26-42f2-8f36-5037132eff5c, 02:42 UTC):

```
[agent/embedded] embedded run failover decision: ... stage=assistant decision=surface_error reason=timeout from=ollama/gemma4-team profile=-
```

Result: error surfaced to user, no fallback attempted. ❌

Both runs hit the same Connect Timeout Error to the same address with the same configured chain. The branching point that produces surface_error vs fallback_model is not obvious from the logs.
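To make this kind of inconsistency easier to spot across a larger log slice, the failover lines can be tallied per (from, reason) pair. The following is a minimal sketch, assuming the key=value layout of the log lines quoted in this issue; the function names `parseFailoverLine` and `findInconsistentDecisions` are hypothetical and not part of OpenClaw.

```typescript
// Hypothetical log-analysis helper; not part of OpenClaw.
interface FailoverDecision {
  decision: string;
  reason: string;
  from: string;
}

interface Inconsistency {
  key: string;
  decisions: string[];
}

// Parse one "[agent/embedded] embedded run failover decision: ..." line.
// Returns null for any other line or when a required field is missing.
function parseFailoverLine(line: string): FailoverDecision | null {
  if (line.indexOf("embedded run failover decision:") === -1) return null;
  const fields: Record<string, string> = {};
  const pattern = /(\w+)=(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    fields[match[1]] = match[2];
  }
  const decision = fields["decision"];
  const reason = fields["reason"];
  const from = fields["from"];
  if (!decision || !reason || !from) return null;
  return { decision, reason, from };
}

// Group decisions by "<from> reason=<reason>" and report every group
// that resolved to more than one distinct decision.
function findInconsistentDecisions(lines: string[]): Inconsistency[] {
  const seen: Record<string, string[]> = {};
  for (let i = 0; i < lines.length; i++) {
    const parsed = parseFailoverLine(lines[i]);
    if (parsed === null) continue;
    const key = parsed.from + " reason=" + parsed.reason;
    if (!seen[key]) seen[key] = [];
    if (seen[key].indexOf(parsed.decision) === -1) {
      seen[key].push(parsed.decision);
    }
  }
  return Object.keys(seen)
    .filter((key) => seen[key].length > 1)
    .map((key) => ({ key, decisions: seen[key] }));
}
```

Fed the two failover lines from Run A and Run B, this reports a single group for ollama/gemma4-team with reason=timeout that resolved to both fallback_model and surface_error.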

Evidence — connect.timeout not configurable

dist/undici-global-dispatcher-yJO9KyXW.js builds the global dispatcher with bodyTimeout and headersTimeout honoring the embedded-run timeout (per #63175), but the connect block only sets autoSelectFamily / autoSelectFamilyAttemptTimeout — never a connect.timeout. Effective TCP connect timeout is undici's default (10 s).

For LAN/tailnet Ollama the first request after warm-up of a large model (or transient packet loss) easily exceeds 10 s, surfacing the Connect Timeout Error even though the host is healthy. There is no openclaw.json knob to raise it; only an upstream-source patch helps.
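The missing piece can be sketched as an options builder that adds `timeout` to the existing connect block. This is only an illustration under stated assumptions: `buildDispatcherOptions`, its parameters, and the option interfaces are hypothetical, and the actual code in the dist bundle is not reproduced here. The only undici facts relied on are that `Agent` accepts `bodyTimeout`, `headersTimeout`, and a `connect` object whose `timeout` sets the connect timeout in milliseconds.

```typescript
// Subset of undici Agent options relevant to this issue (declared locally
// so the sketch does not need the undici package to type-check).
interface ConnectOptions {
  timeout?: number;
  autoSelectFamily?: boolean;
  autoSelectFamilyAttemptTimeout?: number;
}

interface DispatcherOptions {
  bodyTimeout: number;
  headersTimeout: number;
  connect: ConnectOptions;
}

// Hypothetical builder: keeps whatever the connect block already sets and
// adds connect.timeout only when a valid value is supplied. When it is
// omitted or invalid, undici's default (10 s) continues to apply.
function buildDispatcherOptions(
  runTimeoutMs: number,
  existingConnect: ConnectOptions,
  connectTimeoutMs?: number,
): DispatcherOptions {
  const connect: ConnectOptions = { ...existingConnect };
  if (
    typeof connectTimeoutMs === "number" &&
    isFinite(connectTimeoutMs) &&
    connectTimeoutMs > 0
  ) {
    connect.timeout = Math.floor(connectTimeoutMs);
  }
  return {
    bodyTimeout: runTimeoutMs,
    headersTimeout: runTimeoutMs,
    connect,
  };
}

// Intended use (requires the undici package, so left as a comment):
//   import { Agent, setGlobalDispatcher } from "undici";
//   setGlobalDispatcher(new Agent(buildDispatcherOptions(120000, existing, 30000)));
```

Leaving `timeout` unset for missing or invalid input keeps today's behavior for anyone who does not opt in.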

Asks

  1. Add a config knob — e.g. models.providers.<id>.connectTimeoutMs or a global infra.net.connectTimeoutMs — that propagates into the undici Agent({ connect: { timeout } }).
  2. Audit the failover decision branch so reason=timeout from=ollama/* is treated consistently — either always fallback_model (preferred) or document the conditions that produce surface_error so operators know when to expect it.
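If both proposed keys were adopted, one plausible precedence is provider-specific value first, then the global value, then undici's default. The sketch below assumes exactly the two key paths named in the first ask; the `ProposedConfig` type and `resolveConnectTimeoutMs` are hypothetical, and neither key exists in openclaw.json today.

```typescript
// Hypothetical config shape mirroring the two keys proposed in this issue.
interface ProposedConfig {
  models?: {
    providers?: Record<string, { connectTimeoutMs?: number }>;
  };
  infra?: {
    net?: { connectTimeoutMs?: number };
  };
}

// undici's default connect timeout, as observed in the error message above.
const DEFAULT_CONNECT_TIMEOUT_MS = 10000;

function isValidTimeout(value: unknown): value is number {
  return typeof value === "number" && isFinite(value) && value > 0;
}

// Precedence: models.providers.<id>.connectTimeoutMs, then
// infra.net.connectTimeoutMs, then the undici default.
function resolveConnectTimeoutMs(config: ProposedConfig, providerId: string): number {
  const perProvider = config.models?.providers?.[providerId]?.connectTimeoutMs;
  const global = config.infra?.net?.connectTimeoutMs;
  const candidates: unknown[] = [perProvider, global];
  for (let i = 0; i < candidates.length; i++) {
    const candidate = candidates[i];
    if (isValidTimeout(candidate)) return candidate;
  }
  return DEFAULT_CONNECT_TIMEOUT_MS;
}
```

Because the dispatcher described above is global, a per-provider value would presumably need a dispatcher scoped to that provider's requests rather than the shared one; the global key is the simpler of the two to wire in.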

Workarounds I rejected

  • Patching dist/undici-global-dispatcher-yJO9KyXW.js locally — reverted on next npm i -g openclaw.
  • Pre-warming the model — masks the timeout but doesn't fix the fallback-policy inconsistency.

Happy to provide a sanitized openclaw.json, more log slices, or test a candidate fix.

extent analysis

TL;DR

To address the inconsistent fallback behavior and non-configurable TCP connect timeout, consider adding a config knob for connectTimeoutMs and auditing the failover decision branch to ensure consistent treatment of reason=timeout errors.

Guidance

  • Add a configuration option, such as models.providers.<id>.connectTimeoutMs or infra.net.connectTimeoutMs, to allow operators to adjust the TCP connect timeout.
  • Review the failover decision logic to identify the conditions that lead to surface_error instead of fallback_model for reason=timeout errors and ensure consistent behavior.
  • Verify that the connectTimeoutMs value is properly propagated to the undici Agent configuration.
  • Test the updated configuration with various network conditions to ensure the fallback policy works as expected.
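One way to make the timeout branch deterministic is to reduce it to a pure function of the failure reason and the untried candidates. This is a sketch of the preferred policy from the issue (always fall back on timeout while a candidate remains), not OpenClaw's actual decision logic, which is not visible in the logs; `planFailover` and its types are hypothetical.

```typescript
type Decision = "fallback_model" | "surface_error";

interface FailoverPlan {
  decision: Decision;
  next: string | null;
}

// Hypothetical policy: reason=timeout always falls back to the first
// candidate in the chain that has not been tried yet. The error is
// surfaced only when the chain is exhausted or the reason is not
// covered by this sketch.
function planFailover(reason: string, chain: string[], tried: string[]): FailoverPlan {
  let next: string | null = null;
  for (let i = 0; i < chain.length; i++) {
    if (tried.indexOf(chain[i]) === -1) {
      next = chain[i];
      break;
    }
  }
  if (reason === "timeout" && next !== null) {
    return { decision: "fallback_model", next };
  }
  return { decision: "surface_error", next: null };
}
```

With this shape, the same inputs always yield the same decision, so a surface_error on reason=timeout can only mean the chain was exhausted.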

Example

No code snippet is provided as the issue requires changes to the underlying configuration and logic.

Notes

The current implementation of the failover decision branch may contain subtle conditions that affect the choice between fallback_model and surface_error. A thorough review of the code and logs is necessary to identify and address these conditions.

Recommendation

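Both asks require changes to the OpenClaw codebase: a connectTimeoutMs option that reaches the undici Agent, and an audit of the failover decision branch. Neither interim workaround the reporter considered is a fix: a local patch to the dist bundle is reverted on the next npm i -g openclaw, and pre-warming the model masks the timeout without resolving the fallback inconsistency.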
