openclaw - ✅(Solved) Fix [Bug]: openai-codex SSE stream begins, but embedded run aborts locally and is surfaced as timeout (408) [1 pull requests, 4 comments, 3 participants]

openclaw · 2026-04-14 13:28:11

openclaw/openclaw#66561 • Fetched 2026-04-15 06:25:39

OpenClaw appears to abort an openai-codex/gpt-5.4 SSE response locally after the upstream response has already started, then surfaces the failure as a generic timeout/failover (aborted:true, status:408, failoverReason:"timeout").

In the traced failing run, mitmproxy captured:

- request sent successfully to https://chatgpt.com/backend-api/codex/responses
- first response byte arrived
- shortly after that, the client side closed the connection

At nearly the same time, OpenClaw logged:

- embedded run timeout
- profile timeout / failover
- `aborted:true`
- `status:408`

This does not look like a true "provider never responded" timeout. It looks more like a local abort / SSE stream handling / embedded-run watchdog issue that is then misclassified as timeout.

Error Message

The failing attempt is surfaced as a generic timeout/failover: `aborted:true`, `status:408`, `failoverReason:"timeout"`.

Root Cause

A successful control case (a separate gpt-5.4-mini capture over SSE in the same environment) is important because it shows:

- mitmproxy itself is not sufficient to explain the failure
- SSE can work normally in the same setup
- this is not simply “all Codex traffic is broken”

Fix Action

Fixed

Fixed by PR: fix(agents): don't misclassify AbortError as timeout in hasTimeoutHint (https://github.com/openclaw/openclaw/pull/66599)

PR fix notes

PR #66599: fix(agents): don't misclassify AbortError as timeout in hasTimeoutHint

Repository: openclaw/openclaw
Author: EronFan
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/66599

Description (problem / solution / changelog)

Summary

Issue: #66561 — openai-codex SSE stream begins, but embedded run aborts locally and is surfaced as timeout (408)

Root Cause: hasTimeoutHint() in src/agents/failover-error.ts was incorrectly classifying AbortError as a timeout based on message patterns (e.g. "stream aborted" matching timeout patterns). Upstream had already responded (first byte at 12:03:01.557) but the client aborted, and the error was misclassified as a 408 timeout, triggering incorrect failover logic.

Fix: Add an explicit AbortError check in hasTimeoutHint() to return false immediately, preventing message-based timeout pattern matching from incorrectly classifying AbortError as a timeout.

// Before
if (readErrorName(err) === "TimeoutError") {
  return true;
}
const message = getErrorMessage(err);
return Boolean(message && isTimeoutErrorMessage(message));

// After
if (readErrorName(err) === "TimeoutError") {
  return true;
}
// AbortError is a distinct error type (cancellation), not a timeout.
// Don't classify AbortErrors as timeouts based on message patterns alone.
if (readErrorName(err) === "AbortError") {
  return false;
}
const message = getErrorMessage(err);
return Boolean(message && isTimeoutErrorMessage(message));

Testing:

failover-error.test.ts — 51 tests pass ✓
isbillingerrormessage.test.ts — pass ✓
llm-idle-timeout.test.ts — pass ✓

Closes #66561

Changed files

src/agents/failover-error.ts (modified, +5/-0)
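
To illustrate the effect of the check described in the PR notes, here is a minimal, self-contained sketch. The helper names `readErrorName`, `getErrorMessage`, and `isTimeoutErrorMessage` come from the PR, but their bodies below are hypothetical stand-ins, and the timeout pattern list is an assumption made only for this example; the real implementations in `src/agents/failover-error.ts` may differ.

```typescript
// Hypothetical stand-ins for helpers named in the PR. The real versions live
// in src/agents/failover-error.ts and may behave differently.
function readErrorName(err: unknown): string | undefined {
  if (typeof err === "object" && err !== null && "name" in err) {
    const name = (err as { name?: unknown }).name;
    return typeof name === "string" ? name : undefined;
  }
  return undefined;
}

function getErrorMessage(err: unknown): string | undefined {
  if (err instanceof Error) return err.message;
  return typeof err === "string" ? err : undefined;
}

// Assumed pattern list, for illustration only.
function isTimeoutErrorMessage(message: string): boolean {
  return /timed? ?out|stream aborted/i.test(message);
}

function hasTimeoutHint(err: unknown): boolean {
  if (readErrorName(err) === "TimeoutError") return true;
  // Cancellation is not a timeout, regardless of what the message says.
  if (readErrorName(err) === "AbortError") return false;
  const message = getErrorMessage(err);
  return Boolean(message && isTimeoutErrorMessage(message));
}

// Small helper to build errors with a specific name.
function makeError(name: string, message: string): Error {
  const err = new Error(message);
  err.name = name;
  return err;
}

const abortResult = hasTimeoutHint(makeError("AbortError", "stream aborted"));
const timeoutResult = hasTimeoutHint(makeError("TimeoutError", "request timed out"));
const plainResult = hasTimeoutHint(new Error("request timed out"));
console.log({ abortResult, timeoutResult, plainResult });
```

In this sketch an `AbortError` is never reported as a timeout hint, while a `TimeoutError` and a plain error with a timeout-like message still are.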

Code Example

### 1. mitmproxy timing for the failing Codex request

Captured timing from the failing trace:

- first request byte: `2026-04-14 12:03:00.918`
- first response byte: `2026-04-14 12:03:01.557`
- client connection closed: `2026-04-14 12:03:02.215`

Key point:
- **first upstream response byte arrived before the client connection was closed**
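
The intervals between these timestamps can be checked with a short sketch. The field names are hypothetical, and appending `Z` to the timestamps is an assumption made only so that `Date.parse` accepts them; the differences do not depend on the time zone.

```typescript
// Timestamps copied from the mitmproxy trace in the issue.
const trace = {
  firstRequestByte: "2026-04-14 12:03:00.918",
  firstResponseByte: "2026-04-14 12:03:01.557",
  clientConnectionClosed: "2026-04-14 12:03:02.215",
};

// Convert "YYYY-MM-DD hh:mm:ss.mmm" to epoch milliseconds.
// Treating the value as UTC is an assumption; only differences are used.
function toEpochMs(timestamp: string): number {
  return Date.parse(timestamp.replace(" ", "T") + "Z");
}

const requestToFirstByteMs =
  toEpochMs(trace.firstResponseByte) - toEpochMs(trace.firstRequestByte);
const firstByteToCloseMs =
  toEpochMs(trace.clientConnectionClosed) - toEpochMs(trace.firstResponseByte);
const requestToCloseMs =
  toEpochMs(trace.clientConnectionClosed) - toEpochMs(trace.firstRequestByte);

console.log({ requestToFirstByteMs, firstByteToCloseMs, requestToCloseMs });
```

With the trace values this gives 639 ms from the first request byte to the first response byte, and 658 ms from the first response byte to the client closing the connection.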

### 2. Matching OpenClaw log correlation

Same failing run:
- `runId=fb87df63-e856-4f02-bf57-1cf0f18a3dd6`

Correlated gateway log lines:

- `2026-04-14T12:03:02.205Z`
  - `embedded run timeout: runId=fb87df63-e856-4f02-bf57-1cf0f18a3dd6 ... timeoutMs=120000`

- immediately after:
  - `Profile openai-codex:... timed out. Trying next account...`

- failover decision contained:
  - `provider:"openai-codex"`
  - `model:"gpt-5.4"`
  - `failoverReason:"timeout"`
  - `timedOut:true`
  - `aborted:true`
  - `status:408`
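
When correlating gateway log lines with a proxy trace by `runId`, a small key/value extractor can help. This is a sketch only: `parseLogFields` is a hypothetical helper, not part of OpenClaw, and the sample line is copied from the issue with its literal `...` elision left in place.

```typescript
// Extract key=value tokens from a log line. Tokens without "=" (including the
// literal "..." elision in the sample) are ignored.
function parseLogFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pattern = /(\w+)=(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    fields[match[1]] = match[2];
  }
  return fields;
}

const sampleLine =
  "embedded run timeout: runId=fb87df63-e856-4f02-bf57-1cf0f18a3dd6 ... timeoutMs=120000";
const fields = parseLogFields(sampleLine);
const timeoutMs = Number(fields["timeoutMs"]);

console.log(fields["runId"], timeoutMs);
```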

### 3. Control case in the same environment

Separate mitmproxy capture for a successful streamed request:
- model: `gpt-5.4-mini`
- full SSE lifecycle observed, including events such as:
  - `response.output_item.added`
  - `response.function_call_arguments.done`
  - `response.output_item.done`
  - `response.completed`

This control case suggests:
- proxying itself is not the root cause
- SSE handling can succeed normally in the same setup
- the failing run is not explained by “no provider response” or “proxy universally breaks streams”

### 4. What has already been ruled out

Already checked / excluded as primary explanation:
- wrong endpoint
- wrong model selection
- missing initial request
- generic provider outage
- generic mitmproxy breakage
- earlier OAuth scope bug / 401/403 class of issue

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Steps to reproduce

1. Configure OpenClaw with openai-codex/gpt-5.4 as an active runtime model using the chatgpt.com/backend-api/codex/responses path over SSE.
2. Put mitmproxy (or equivalent HTTP inspection) in front of the Codex traffic.
3. Trigger a normal assistant run in a real chat session.
4. Capture:
   - OpenClaw gateway logs
   - mitmproxy request timing
   - failover decision logs
5. Observe a failing run where:
   - the upstream Codex request is sent successfully
   - mitmproxy sees first response bytes arrive
   - the client connection closes almost immediately afterward
   - OpenClaw surfaces the run as timeout / failover

Concrete traced example:

runId=fb87df63-e856-4f02-bf57-1cf0f18a3dd6
first request byte: 2026-04-14 12:03:00.918
first response byte: 2026-04-14 12:03:01.557
client connection closed: 2026-04-14 12:03:02.215
matching OpenClaw timeout log: 2026-04-14T12:03:02.205Z

Expected behavior

If upstream response bytes have already begun arriving, OpenClaw should not classify the attempt as a plain timeout unless there is a clearly separate post-start timeout condition.

At minimum, OpenClaw should distinguish between:

- no upstream response at all
- upstream response started, then local abort
- upstream response started, then parser / stream handling failure
- upstream explicit error event

The surfaced failure should reflect that the stream had already started.
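
The four cases above could be modeled roughly as follows. This is a sketch under stated assumptions: every type, field, and label name here is hypothetical and not taken from OpenClaw's code, and the order of the checks is one possible choice.

```typescript
// Hypothetical observation of one attempt; not OpenClaw's actual types.
type AttemptObservation = {
  firstByteReceived: boolean;
  upstreamErrorEvent: boolean;
  streamHandlingFailed: boolean;
  abortedLocally: boolean;
};

type FailureKind =
  | "no_upstream_response"
  | "started_then_local_abort"
  | "started_then_stream_failure"
  | "upstream_error_event"
  | "unclassified";

function classifyFailure(o: AttemptObservation): FailureKind {
  // An explicit error event from upstream is the most specific signal.
  if (o.upstreamErrorEvent) return "upstream_error_event";
  // Only a run with no response bytes at all counts as upstream silence.
  if (!o.firstByteReceived) return "no_upstream_response";
  if (o.streamHandlingFailed) return "started_then_stream_failure";
  if (o.abortedLocally) return "started_then_local_abort";
  return "unclassified";
}

// Values loosely modeled on the traced run. That stream handling did not fail
// is an assumption; the issue does not establish it either way.
const tracedRun = classifyFailure({
  firstByteReceived: true,
  upstreamErrorEvent: false,
  streamHandlingFailed: false,
  abortedLocally: true,
});
const silentRun = classifyFailure({
  firstByteReceived: false,
  upstreamErrorEvent: false,
  streamHandlingFailed: false,
  abortedLocally: true,
});
console.log(tracedRun, silentRun);
```

Under these assumptions the traced run would be labeled `started_then_local_abort` rather than being reported as upstream silence.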

Actual behavior

About 1.3 s after the request was sent, and approximately 700 ms after the first response byte arrived, the client (OpenClaw) reset the connection. OpenClaw treated the attempt as a timeout/failover even though upstream response bytes had already begun arriving.

Observed runtime behavior:

- embedded run timeout
- profile timeout / failover
- failover decision includes:
  - `failoverReason:"timeout"`
  - `timedOut:true`
  - `aborted:true`
  - `status:408`

This makes the failure look like upstream silence, even though mitmproxy shows the upstream response had already started.

OpenClaw version

2026.4.12

Operating system

Ubuntu 24.04

Install method

Global npm / OpenClaw CLI install with gateway running as a service

Model

openai-codex/gpt-5.4

Provider / routing chain

openclaw -> mitmproxy -> openai-codex

Additional provider/model setup details

- Codex traffic is routed to https://chatgpt.com/backend-api/codex/responses
- transport is SSE, not WebSocket
- request body explicitly contained `"model":"gpt-5.4"`
- OAuth / provider health was already separately validated
- this is not the earlier 401/403 scope problem
- a separate control capture with gpt-5.4-mini succeeded normally via SSE in the same environment

mitmproxy was installed for debugging purposes only; the problem occurs without it as well.

Impact and severity

Severity: High

Why:

- the failure is surfaced as a misleading timeout even though upstream had already started responding
- debugging becomes much harder because logs imply “provider silent timeout”
- failover can obscure the original defect
- this can degrade reliability for Codex-backed sessions and create false attribution to upstream provider instability

Practical impact:

- missed or delayed replies
- unnecessary failover to other models/providers
- confusing evidence chain unless packet/proxy traces are available
- high time cost to diagnose because standard logs do not reflect the actual transport sequence

Additional information

This appears distinct from, but related to, existing OpenClaw bug classes around:

- timeout misclassification
- AbortError handling
- failover surfacing
- cooldown / rate-limit misclassification

The key difference here is stronger transport evidence:

The upstream response had already started, but the run was still surfaced as timeout.

So this report is specifically about:

- partial-start SSE stream
- local connection close / abort
- timeout classification that no longer matches the observed transport reality

A secondary, separate observation from the same broader debugging session is that later fallback-stage Anthropic responses may also be classified misleadingly in timeout terms when the provider-side semantic error is actually overloaded_error. But that is not the main bug in this report.

extent analysis

TL;DR

The most likely fix involves modifying OpenClaw's timeout handling and SSE stream processing to correctly distinguish between true timeouts and local aborts when the upstream response has already started.

Guidance

- Review OpenClaw's SSE handling code: Investigate how OpenClaw manages Server-Sent Events (SSE) streams, particularly focusing on how it handles the arrival of the first response byte and subsequent connection closure.
- Adjust timeout classification logic: Modify the logic that classifies failures as timeouts to account for cases where the upstream response has started but the connection is closed locally, ensuring that such cases are not misclassified as timeouts.
- Implement distinct failure modes: Enhance OpenClaw to report failures in a way that distinguishes between different failure modes, such as "no upstream response," "upstream response started then local abort," and "upstream explicit error event."
- Verify with mitmproxy and logs: Use mitmproxy to capture the request and response timing and correlate it with OpenClaw's logs to ensure that the changes correctly handle the SSE stream and timeout classification.

Example

The PR notes above show the concrete change to hasTimeoutHint() in src/agents/failover-error.ts. More broadly, the adjustment would likely involve checking the state of the SSE stream when the connection closes and updating the failure classification logic accordingly.

Notes

The solution requires access to OpenClaw's source code to modify the SSE handling and timeout classification logic.
The changes should ensure that OpenClaw correctly handles partial-start SSE streams and local connection closures without misclassifying them as timeouts.

Recommendation

Modify OpenClaw's timeout handling and SSE stream processing logic so that failures are classified correctly when the upstream response has started but the connection is closed locally. This is recommended because it addresses the misclassification directly instead of working around it.
