openclaw - ✅(Solved) Fix [Bug]: openai-codex SSE stream begins, but embedded run aborts locally and is surfaced as timeout (408) [1 pull requests, 4 comments, 3 participants]

openclaw/openclaw#66561 (fetched 2026-04-15 06:25:39)

OpenClaw appears to abort an openai-codex/gpt-5.4 SSE response locally after the upstream response has already started, then surfaces the failure as a generic timeout/failover (aborted:true, status:408, failoverReason:"timeout").

In the traced failing run, mitmproxy captured:

  • request sent successfully to https://chatgpt.com/backend-api/codex/responses
  • first response byte arrived
  • shortly after that, the client side closed the connection

At nearly the same time, OpenClaw logged:

  • embedded run timeout
  • profile timeout / failover
  • aborted:true
  • status:408

This does not look like a true "provider never responded" timeout. It looks more like a local abort / SSE stream handling / embedded-run watchdog issue that is then misclassified as timeout.

Fix Action

Fixed

PR fix notes

PR #66599: fix(agents): don't misclassify AbortError as timeout in hasTimeoutHint

Description (problem / solution / changelog)

Summary

Issue: #66561 — openai-codex SSE stream begins, but embedded run aborts locally and is surfaced as timeout (408)

Root Cause: hasTimeoutHint() in src/agents/failover-error.ts was incorrectly classifying AbortError as a timeout based on message patterns (e.g. "stream aborted" matching timeout patterns). Upstream had already responded (first byte at 12:03:01.557) but the client aborted, and the error was misclassified as a 408 timeout, triggering incorrect failover logic.

Fix: Add an explicit AbortError check in hasTimeoutHint() to return false immediately, preventing message-based timeout pattern matching from incorrectly classifying AbortError as a timeout.

// Before
if (readErrorName(err) === "TimeoutError") {
  return true;
}
const message = getErrorMessage(err);
return Boolean(message && isTimeoutErrorMessage(message));

// After
if (readErrorName(err) === "TimeoutError") {
  return true;
}
// AbortError is a distinct error type (cancellation), not a timeout.
// Don't classify AbortErrors as timeouts based on message patterns alone.
if (readErrorName(err) === "AbortError") {
  return false;
}
const message = getErrorMessage(err);
return Boolean(message && isTimeoutErrorMessage(message));

Testing:

  • failover-error.test.ts — 51 tests pass ✓
  • isbillingerrormessage.test.ts — pass ✓
  • llm-idle-timeout.test.ts — pass ✓

Closes #66561

Changed files

  • src/agents/failover-error.ts (modified, +5/-0)
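
The before/after excerpt can be exercised in isolation with the sketch below. The helper names `readErrorName`, `getErrorMessage`, and `isTimeoutErrorMessage` come from the PR, but their bodies here are hypothetical stand-ins, and the regular expression is an assumption chosen only to reproduce the "stream aborted" match that the PR describes. It is not OpenClaw's actual pattern list.

```typescript
// Hypothetical stand-ins for the helpers named in the PR; bodies are assumptions.
function readErrorName(err: unknown): string | undefined {
  if (typeof err === "object" && err !== null && "name" in err) {
    const name = (err as { name?: unknown }).name;
    return typeof name === "string" ? name : undefined;
  }
  return undefined;
}

function getErrorMessage(err: unknown): string | undefined {
  if (err instanceof Error) return err.message;
  return typeof err === "string" ? err : undefined;
}

// Assumed pattern: deliberately loose so that "stream aborted" matches.
function isTimeoutErrorMessage(message: string): boolean {
  return /timed? ?out|deadline exceeded|aborted/i.test(message);
}

// Logic as shown in the "Before" excerpt.
function hasTimeoutHintBefore(err: unknown): boolean {
  if (readErrorName(err) === "TimeoutError") return true;
  const message = getErrorMessage(err);
  return Boolean(message && isTimeoutErrorMessage(message));
}

// Logic as shown in the "After" excerpt: AbortError short-circuits to false.
function hasTimeoutHintAfter(err: unknown): boolean {
  if (readErrorName(err) === "TimeoutError") return true;
  if (readErrorName(err) === "AbortError") return false;
  const message = getErrorMessage(err);
  return Boolean(message && isTimeoutErrorMessage(message));
}

// Small helper to build errors with a specific name (illustrative only).
function makeError(name: string, message: string): Error {
  const err = new Error(message);
  err.name = name;
  return err;
}

const abort = makeError("AbortError", "stream aborted");
const timeout = makeError("TimeoutError", "request failed");
const generic = makeError("Error", "request timed out");

console.log("AbortError before/after:", hasTimeoutHintBefore(abort), hasTimeoutHintAfter(abort));
console.log("TimeoutError after:", hasTimeoutHintAfter(timeout));
console.log("generic 'timed out' after:", hasTimeoutHintAfter(generic));
```

Under these assumptions the AbortError is the only case whose classification changes. A named TimeoutError and a generic error with a timeout-like message are still reported as timeouts.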

Logs and evidence

### 1. mitmproxy timing for the failing Codex request

Captured timing from the failing trace:

- first request byte: `2026-04-14 12:03:00.918`
- first response byte: `2026-04-14 12:03:01.557`
- client connection closed: `2026-04-14 12:03:02.215`

Key point:
- **first upstream response byte arrived before the client connection was closed**

### 2. Matching OpenClaw log correlation

Same failing run:
- `runId=fb87df63-e856-4f02-bf57-1cf0f18a3dd6`

Correlated gateway log lines:

- `2026-04-14T12:03:02.205Z`
  - `embedded run timeout: runId=fb87df63-e856-4f02-bf57-1cf0f18a3dd6 ... timeoutMs=120000`

- immediately after:
  - `Profile openai-codex:... timed out. Trying next account...`

- failover decision contained:
  - `provider:"openai-codex"`
  - `model:"gpt-5.4"`
  - `failoverReason:"timeout"`
  - `timedOut:true`
  - `aborted:true`
  - `status:408`
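
The intervals implied by the two sets of timestamps above can be worked out with a short sketch. It assumes the mitmproxy timestamps and the gateway log share the same clock and can all be read as UTC. The report treats them as directly comparable but does not state the mitmproxy time zone. The variable names are illustrative.

```typescript
// Timestamps copied from the trace; reading all of them as UTC is an assumption.
const trace = {
  firstRequestByte: "2026-04-14T12:03:00.918Z",
  firstResponseByte: "2026-04-14T12:03:01.557Z",
  clientClosed: "2026-04-14T12:03:02.215Z",
  timeoutLogged: "2026-04-14T12:03:02.205Z",
};

// Value of timeoutMs from the "embedded run timeout" log line.
const configuredTimeoutMs = 120000;

function msBetween(startIso: string, endIso: string): number {
  return Date.parse(endIso) - Date.parse(startIso);
}

const timeToFirstByteMs = msBetween(trace.firstRequestByte, trace.firstResponseByte);
const firstByteToCloseMs = msBetween(trace.firstResponseByte, trace.clientClosed);
const requestToTimeoutLogMs = msBetween(trace.firstRequestByte, trace.timeoutLogged);
const timeoutLogToCloseMs = msBetween(trace.timeoutLogged, trace.clientClosed);

console.log({
  timeToFirstByteMs,
  firstByteToCloseMs,
  requestToTimeoutLogMs,
  timeoutLogToCloseMs,
  configuredTimeoutMs,
});
```

On these numbers the HTTP request had been in flight for roughly 1.3 s when the timeout line was logged, and the connection closed about 10 ms later. Whether the 120000 ms budget was measured from an earlier point in the run is not visible from this trace.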

### 3. Control case in the same environment

Separate mitmproxy capture for a successful streamed request:
- model: `gpt-5.4-mini`
- full SSE lifecycle observed, including events such as:
  - `response.output_item.added`
  - `response.function_call_arguments.done`
  - `response.output_item.done`
  - `response.completed`

This control case suggests:
- proxying itself is not the root cause
- SSE handling can succeed normally in the same setup
- the failing run is not explained by “no provider response” or “proxy universally breaks streams”

### 4. What has already been ruled out

Already checked / excluded as primary explanation:
- wrong endpoint
- wrong model selection
- missing initial request
- generic provider outage
- generic mitmproxy breakage
- earlier OAuth scope bug / 401/403 class of issue

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

  1. Configure OpenClaw with openai-codex/gpt-5.4 as an active runtime model using the chatgpt.com/backend-api/codex/responses path over SSE.
  2. Put mitmproxy (or equivalent HTTP inspection) in front of the Codex traffic.
  3. Trigger a normal assistant run in a real chat session.
  4. Capture:
    • OpenClaw gateway logs
    • mitmproxy request timing
    • failover decision logs
  5. Observe a failing run where:
    • the upstream Codex request is sent successfully
    • mitmproxy sees first response bytes arrive
    • the client connection closes almost immediately afterward
    • OpenClaw surfaces the run as timeout / failover

Concrete traced example:

  • runId=fb87df63-e856-4f02-bf57-1cf0f18a3dd6
  • first request byte: 2026-04-14 12:03:00.918
  • first response byte: 2026-04-14 12:03:01.557
  • client connection closed: 2026-04-14 12:03:02.215
  • matching OpenClaw timeout log: 2026-04-14T12:03:02.205Z

Expected behavior

If upstream response bytes have already begun arriving, OpenClaw should not classify the attempt as a plain timeout unless there is a clearly separate post-start timeout condition.

At minimum, OpenClaw should distinguish between:

  • no upstream response at all
  • upstream response started, then local abort
  • upstream response started, then parser / stream handling failure
  • upstream explicit error event

The surfaced failure should reflect that the stream had already started.
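
The four cases listed above could be modelled as a small discriminated classification. The sketch below is one possible shape, not OpenClaw's actual data model: every type, field, and function name in it is hypothetical.

```typescript
// All names here are hypothetical; they only illustrate the four requested cases.
type AttemptFailureKind =
  | "no_upstream_response"
  | "local_abort_after_start"
  | "stream_failure_after_start"
  | "upstream_error_event";

interface AttemptObservation {
  // Number of response bytes seen before the attempt ended.
  responseBytesReceived: number;
  // True when the client side cancelled the request (for example via an AbortError).
  locallyAborted: boolean;
  // Set when the provider sent an explicit error event, e.g. "overloaded_error".
  upstreamErrorEvent?: string;
}

function classifyAttemptFailure(obs: AttemptObservation): AttemptFailureKind {
  // An explicit provider error is the most specific signal, so check it first.
  if (obs.upstreamErrorEvent !== undefined) return "upstream_error_event";
  // Nothing arrived at all: the only case that resembles a plain timeout.
  if (obs.responseBytesReceived === 0) return "no_upstream_response";
  // Bytes arrived, then the client cancelled.
  if (obs.locallyAborted) return "local_abort_after_start";
  // Bytes arrived and nobody cancelled: parsing or stream handling failed.
  return "stream_failure_after_start";
}

// The traced run: response bytes had arrived, then the client closed the connection.
const tracedRun = classifyAttemptFailure({ responseBytesReceived: 1, locallyAborted: true });
console.log(tracedRun);
```

Checking the explicit error event first is a design choice in this sketch. It keeps a provider-reported error such as `overloaded_error` from being hidden behind an abort that happened afterwards.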

Actual behavior

Very shortly after the request was successfully made, and approximately 700 ms after the first response byte came in (658 ms in the traced example), the client (OpenClaw) resets the connection. OpenClaw then treats the attempt as a timeout/failover even though upstream response bytes had already begun arriving.

Observed runtime behavior:

  • embedded run timeout
  • profile timeout / failover
  • failover decision includes:
    • failoverReason:"timeout"
    • timedOut:true
    • aborted:true
    • status:408

This makes the failure look like upstream silence, even though mitmproxy shows the upstream response had already started.

OpenClaw version

2026.4.12

Operating system

Ubuntu 24.04

Install method

Global npm / OpenClaw CLI install with gateway running as a service

Model

openai-codex/gpt-5.4

Provider / routing chain

openclaw -> mitmproxy -> openai-codex

Additional provider/model setup details

  • Codex traffic is routed to:
    • https://chatgpt.com/backend-api/codex/responses
  • transport is SSE, not WebSocket
  • request body explicitly contained:
    • "model":"gpt-5.4"
  • OAuth / provider health was already separately validated
  • this is not the earlier 401/403 scope problem
  • a separate control capture with gpt-5.4-mini succeeded normally via SSE in the same environment

That successful control case is important because it shows:

  • mitmproxy itself is not sufficient to explain the failure
  • SSE can work normally in the same setup
  • this is not simply “all Codex traffic is broken”

mitmproxy was installed for debugging purposes only. The problem occurs without it as well.

Impact and severity

Severity: High

Why:

  • the failure is surfaced as a misleading timeout even though upstream had already started responding
  • debugging becomes much harder because logs imply “provider silent timeout”
  • failover can obscure the original defect
  • this can degrade reliability for Codex-backed sessions and create false attribution to upstream provider instability

Practical impact:

  • missed or delayed replies
  • unnecessary failover to other models/providers
  • confusing evidence chain unless packet/proxy traces are available
  • high time cost to diagnose because standard logs do not reflect the actual transport sequence

Additional information

This appears distinct from, but related to, existing OpenClaw bug classes around:

  • timeout misclassification
  • AbortError handling
  • failover surfacing
  • cooldown / rate-limit misclassification

The key difference here is stronger transport evidence:

The upstream response had already started, but the run was still surfaced as timeout.

So this report is specifically about:

  • partial-start SSE stream
  • local connection close / abort
  • timeout classification that no longer matches the observed transport reality

A secondary, separate observation from the same broader debugging session is that later fallback-stage Anthropic responses may also be classified misleadingly in timeout terms when the provider-side semantic error is actually overloaded_error. But that is not the main bug in this report.

extent analysis

TL;DR

The most likely fix involves modifying OpenClaw's timeout handling and SSE stream processing to correctly distinguish between true timeouts and local aborts when the upstream response has already started.

Guidance

  1. Review OpenClaw's SSE handling code: Investigate how OpenClaw manages Server-Sent Events (SSE) streams, particularly focusing on how it handles the arrival of the first response byte and subsequent connection closure.
  2. Adjust timeout classification logic: Modify the logic that classifies failures as timeouts to account for cases where the upstream response has started but the connection is closed locally, ensuring that such cases are not misclassified as timeouts.
  3. Implement distinct failure modes: Enhance OpenClaw to report failures in a way that distinguishes between different failure modes, such as "no upstream response," "upstream response started then local abort," and "upstream explicit error event."
  4. Verify with mitmproxy and logs: Use mitmproxy to capture the request and response timing and correlate it with OpenClaw's logs to ensure that the changes correctly handle the SSE stream and timeout classification.
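
The verification in step 4 can be partly automated. The sketch below flags a failover decision that reports a timeout while the proxy trace shows response bytes arriving before the client closed the connection. The field names on `FailoverDecision` mirror the keys in the logged failover decision, but `ProxyTiming` and `timeoutAfterStreamStart` are hypothetical, and the timestamps are assumed to share one clock.

```typescript
// Field names follow the logged failover decision; the interface itself is illustrative.
interface FailoverDecision {
  failoverReason: string;
  timedOut: boolean;
  aborted: boolean;
  status: number;
}

// Hypothetical container for timestamps read off a proxy capture (ISO 8601 strings).
interface ProxyTiming {
  firstResponseByte?: string;
  clientClosed?: string;
}

// Returns true when a "timeout" decision coincides with a stream that had already started.
// A true result means "worth a closer look", not "definitely misclassified": a genuine
// post-start timeout would also be flagged.
function timeoutAfterStreamStart(decision: FailoverDecision, timing: ProxyTiming): boolean {
  if (decision.failoverReason !== "timeout") return false;
  if (timing.firstResponseByte === undefined || timing.clientClosed === undefined) return false;
  return Date.parse(timing.firstResponseByte) < Date.parse(timing.clientClosed);
}

// Values from the traced failing run; reading the proxy timestamps as UTC is an assumption.
const tracedDecision: FailoverDecision = {
  failoverReason: "timeout",
  timedOut: true,
  aborted: true,
  status: 408,
};
const tracedTiming: ProxyTiming = {
  firstResponseByte: "2026-04-14T12:03:01.557Z",
  clientClosed: "2026-04-14T12:03:02.215Z",
};

console.log(timeoutAfterStreamStart(tracedDecision, tracedTiming));
```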

Example

PR #66599 implements the second guidance item: hasTimeoutHint() in src/agents/failover-error.ts now returns false for an AbortError before any message-based timeout pattern matching runs. The remaining items would likely involve checking the state of the SSE stream when the connection closes and recording that state in the failure classification.

Notes

  • The solution requires access to OpenClaw's source code to modify the SSE handling and timeout classification logic.
  • The changes should ensure that OpenClaw correctly handles partial-start SSE streams and local connection closures without misclassifying them as timeouts.

Recommendation

Correct OpenClaw's timeout classification so that a failure is not reported as a timeout when the upstream response has started but the connection is closed locally. This approach is recommended because it addresses the root cause directly: an AbortError being misclassified as a timeout.
