openclaw - ✅(Solved) Fix [Bug]: openai-codex/gpt-5.4 returns Cloudflare HTML and gets misclassified as rate_limit / DNS, leaving TUI stuck [2 pull requests, 3 comments, 3 participants]

GitHub stats
openclaw/openclaw#67517 (fetched 2026-04-17 08:30:25)
Comments: 3
Participants: 3
Timeline events: 13
Reactions: 0
Timeline (top): referenced ×5, commented ×3, cross-referenced ×2, labeled ×2

openai-codex/gpt-5.4 stalls indefinitely in the TUI. Logs show Cloudflare / HTML responses, but OpenClaw misclassifies the failure as "API rate limit reached" or "DNS lookup for provider endpoint failed", or surfaces the raw <html>... body as the error.

Error Message

warn errors:

  • HTML error truncated

04:16:12+00:00 warn agent/embedded {"event":"embedded_run_agent_end","isError":true, "error":"API rate limit reached","failoverReason":"rate_limit",
04:16:22+00:00 warn agent/embedded {"event":"embedded_run_agent_end","isError":true, "error":"API rate limit reached","failoverReason":"rate_limit",
04:16:33+00:00 warn agent/embedded {"event":"embedded_run_agent_end","isError":true, "error":"DNS lookup for provider endpoint failed",
04:16:49+00:00 warn errors Long error truncated: <html>...
04:16:49+00:00 warn agent/embedded {"event":"embedded_run_agent_end","isError":true, "error":"<html>...","model":"gpt-5.4","provider":"openai-codex"} embedded run agent end

Root Cause

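classifyFailoverSignal ran its text-pattern classifiers on raw error messages without first checking whether the message was an HTML page, so keywords in a Cloudflare/CDN error body (for example "Rate limit exceeded") were read as structured API errors. In addition, classifyProviderRuntimeFailureKind checked for HTML only on status 403, and isHtmlErrorResponse required an inferrable HTTP status >= 400, so raw HTML without a status prefix fell through to the DNS matcher.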

Fix Action

Fixed

PR fix notes

PR #67642: fix(agents): classify Cloudflare/CDN HTML error pages as transport failures

Description (problem / solution / changelog)

Summary

  • Problem: When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This causes incorrect failover behavior (profile rotation instead of clean retry/fallback) and leaves the TUI stuck.
  • Root cause: classifyFailoverSignal runs text-pattern classifiers on raw error messages without first checking whether the message is an HTML page. HTML error pages from CDNs like Cloudflare often contain rate-limit or error keywords in their human-readable body text, which pattern matchers incorrectly classify as structured API errors. Additionally, classifyProviderRuntimeFailureKind only checked for HTML responses on status 403 (auth_html_403), missing non-403 HTML pages entirely.
  • Fix:
    1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly.
    2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning a new "upstream_html" kind for non-403 statuses with a clear user-facing message: "The provider returned an HTML error page instead of an API response."
    3. Regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML preservation, and JSON rate-limit correctness.
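The ordering change in fix item 1 can be sketched as follows. This is a minimal illustration, not the code from errors.ts: looksLikeHtmlPage, matchesRateLimitText, classifyNaive, and classifyWithHtmlGuard are hypothetical names, and the HTML heuristic is an assumption.

```typescript
// Hypothetical, simplified stand-ins; not the actual OpenClaw implementation.
type FailoverReason = "rate_limit" | "timeout" | "unknown";

// Assumed heuristic: a body that opens with an HTML doctype or <html> tag is an HTML page.
function looksLikeHtmlPage(message: string): boolean {
  return /^\s*(?:<!doctype html|<html[\s>])/i.test(message);
}

// Naive keyword matcher of the kind that misfires on CDN error pages.
function matchesRateLimitText(message: string): boolean {
  return /rate limit/i.test(message);
}

// Pattern matching only: keywords anywhere in the message decide the result.
function classifyNaive(message: string): FailoverReason {
  return matchesRateLimitText(message) ? "rate_limit" : "unknown";
}

// HTML check first: an HTML page is treated as a transport failure,
// whatever keywords its human-readable body happens to contain.
function classifyWithHtmlGuard(message: string): FailoverReason {
  if (looksLikeHtmlPage(message)) return "timeout";
  return classifyNaive(message);
}

const cloudflarePage =
  "<html><head><title>502 Bad Gateway</title></head>" +
  "<body><h1>Rate limit exceeded</h1><p>cloudflare</p></body></html>";
const jsonRateLimit =
  '{"error":{"type":"rate_limit_error","message":"Rate limit exceeded"}}';

console.log(classifyNaive(cloudflarePage));         // "rate_limit"
console.log(classifyWithHtmlGuard(cloudflarePage)); // "timeout"
console.log(classifyWithHtmlGuard(jsonRateLimit));  // "rate_limit"
```

The last call shows the property the regression tests also guard: a genuine JSON rate-limit error is still classified as rate_limit.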

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Fixes #67517

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts (modified, +4/-4)
  • src/agents/pi-embedded-helpers/errors.ts (modified, +34/-7)
  • src/agents/pi-embedded-helpers/provider-error-patterns.test.ts (modified, +71/-1)

PR #67762: fix(agents): classify raw HTML error responses even without leading HTTP status prefix

Description (problem / solution / changelog)

Summary

Problem: openai-codex runs whose upstream returns a raw HTML Cloudflare/CDN error page (without a leading HTTP status prefix) are surfaced to the user as:

LLM request failed: DNS lookup for the provider endpoint failed.

Even though host DNS is healthy, short live probes succeed, and the gateway logs clearly show rawError=<html>.... The message is misleading, and a user seeing it has no reason to suspect a CDN/gateway issue upstream.

Why: Follow-up to merged #67642 (Cloudflare HTML misclassification). That PR taught classifyProviderRuntimeFailureKind to return upstream_html for HTML bodies — but only when an HTTP status >= 400 could be inferred. When providers forward a raw <html>...</html> body without a leading status prefix and callers don't pass an explicit status, the classifier silently fails over to isDnsTransportErrorMessage (which substring-matches dns anywhere in the body) and produces a DNS-lookup message.

What changed: Relaxed isHtmlErrorResponse in src/agents/pi-embedded-helpers/errors.ts to trust strong HTML markers (<!doctype html>/<html> start + </html> close) even when no status can be inferred. The pre-existing < 400 guard still fires whenever a status IS inferred — so status-prefixed payloads keep their current semantics.
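A rough sketch of the relaxed check, assuming the behaviour described above. The function names follow the PR text, but the bodies here are illustrative guesses (hasStrongHtmlMarkers is a hypothetical helper), not the code in errors.ts.

```typescript
// Illustrative re-creation; the real implementations may differ.
function extractLeadingHttpStatus(message: string): number | null {
  const match = /^\s*(\d{3})\b/.exec(message);
  return match ? Number(match[1]) : null;
}

// Hypothetical helper: "strong" markers are an HTML opening at the start
// of the body plus a closing </html> at the end.
function hasStrongHtmlMarkers(body: string): boolean {
  const trimmed = body.trim().toLowerCase();
  const opens = trimmed.startsWith("<!doctype html") || trimmed.startsWith("<html");
  return opens && trimmed.endsWith("</html>");
}

function isHtmlErrorResponse(message: string, explicitStatus?: number): boolean {
  const leading = extractLeadingHttpStatus(message);
  const inferred = explicitStatus ?? leading ?? undefined;
  // Drop a leading status prefix (if any) before looking for HTML markers.
  const body = leading === null ? message : message.replace(/^\s*\d{3}\s*/, "");
  if (!hasStrongHtmlMarkers(body)) return false;
  // The sub-400 veto applies only when a status is actually known.
  if (typeof inferred === "number" && inferred < 400) return false;
  return true;
}

console.log(isHtmlErrorResponse("<html><body>blocked</body></html>"));      // true
console.log(isHtmlErrorResponse("502 <html><body>bad gateway</body></html>")); // true
console.log(isHtmlErrorResponse("200 <html><body>ok</body></html>"));       // false
console.log(isHtmlErrorResponse("<html><body>ok</body></html>", 200));      // false
```

The design choice is that a missing status no longer counts as evidence against an HTML error, while a known sub-400 status still does.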

What did NOT change:

  • classifyFailoverSignal (failover/timeout gate at 408/499/5xx) — intentionally left as is.
  • formatTransportErrorCopy substring dns match — the HTML branch now runs first so the fall-through isn't reachable for HTML bodies; tightening the DNS regex would be a separate cleanup.
  • src/shared/assistant-error-format.ts:isCloudflareOrHtmlErrorPage — similar gap but broader blast radius, out of scope.

Change Type

Bug fix.

Scope

  • agents (src/agents)
  • plugins / providers
  • channels
  • gateway
  • cli / web-ui / apps

Linked Issue

Closes #67712. Follow-up to #67642 (same classification pipeline, same reviewer surface).

Root Cause

isHtmlErrorResponse at src/agents/pi-embedded-helpers/errors.ts:339-354 required an inferrable HTTP status code >= 400 to classify raw HTML as an HTML error.

Classification chain for an openai-codex HTML response without a leading status prefix:

  1. extractLeadingHttpStatus("<html>...") returns null (no 3-digit prefix).
  2. inferred is undefined.
  3. if (typeof inferred !== "number" || inferred < 400) return false; — bails early.
  4. classifyProviderRuntimeFailureKind at line 820 does not return "upstream_html".
  5. Falls through to isDnsTransportErrorMessage(message) (line 831). That uses DNS_ERROR_RE with \bdns\b, which matches Cloudflare challenge bodies that reference DNS in their body copy.
  6. Returns kind "dns".
  7. formatAssistantErrorText has no explicit "dns" branch, falls through to formatTransportErrorCopy(raw) at line 971, matches lower.includes("dns"), returns "LLM request failed: DNS lookup for the provider endpoint failed."

The status-gate was reasonable for status-prefixed payloads (e.g. 200 {"...":"..."}) where we don't want sub-400 responses to be flagged as HTML errors. It was overly strict for raw HTML bodies where the HTML markers themselves are strong enough evidence that the upstream is misbehaving.
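The fall-through in steps 1 to 6 can be reproduced in a few lines. This is a sketch under stated assumptions: classify, leadingStatus, and looksLikeHtml are hypothetical simplifications, and the DNS pattern is assumed apart from the \bdns\b alternative that the PR text confirms.

```typescript
type FailureKind = "upstream_html" | "dns" | "unknown";

// Assumed pattern: only the \bdns\b alternative is confirmed by the PR text.
const DNS_ERROR_RE = /\bdns\b|\bENOTFOUND\b/i;

function leadingStatus(message: string): number | undefined {
  const m = /^\s*(\d{3})\b/.exec(message);
  return m ? Number(m[1]) : undefined;
}

function looksLikeHtml(message: string): boolean {
  return /<html[\s>]/i.test(message) && /<\/html>\s*$/i.test(message);
}

// requireStatus = true models the old gate (status must be known and >= 400);
// requireStatus = false models the relaxed gate (unknown status is accepted).
function classify(message: string, requireStatus: boolean): FailureKind {
  const status = leadingStatus(message);
  const statusAllows = requireStatus
    ? typeof status === "number" && status >= 400
    : status === undefined || status >= 400;
  if (statusAllows && looksLikeHtml(message)) return "upstream_html";
  // Only reached when the HTML branch did not fire.
  if (DNS_ERROR_RE.test(message)) return "dns";
  return "unknown";
}

const challenge =
  "<html><body>Check your DNS settings or contact the site owner.</body></html>";

console.log(classify(challenge, true));  // "dns" (old gate: misleading)
console.log(classify(challenge, false)); // "upstream_html" (relaxed gate)
console.log(classify("getaddrinfo ENOTFOUND api.example.com", false)); // "dns"
```

Genuine DNS transport errors such as ENOTFOUND keep their classification because they never match the HTML check.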

Regression Test Plan

  • Coverage level: Unit tests in the owning helper test file, plus end-to-end user-message test in formatassistanterrortext.test.ts.
  • Target tests added:
    • provider-error-patterns.test.ts — new describe block "Raw HTML error pages without a leading HTTP status (#67712)" with 5 cases:
      1. Raw Cloudflare challenge HTML classifies as upstream_html.
      2. HTML body that mentions DNS does NOT classify as dns (the reported regression).
      3. Error:-prefixed raw HTML still classifies as upstream_html.
      4. Plain DNS transport errors (ENOTFOUND) still classify as dns (negative guard).
      5. Explicit sub-400 status still vetoes HTML classification (preserves status-gate semantics when the status is known).
    • pi-embedded-helpers.formatassistanterrortext.test.ts — end-to-end: raw HTML body containing the substring "DNS" returns the upstream-HTML user copy, not the DNS copy.
  • Existing coverage preserved: All #67517 (Cloudflare HTML with status) and existing DNS/transport tests keep passing.

User-visible Changes

openai-codex runs that receive a raw HTML Cloudflare/CDN response now surface the upstream-HTML message:

The provider returned an HTML error page instead of an API response. This usually means a CDN or gateway (e.g. Cloudflare) blocked the request. Retry in a moment or check provider status.

Instead of the misleading:

LLM request failed: DNS lookup for the provider endpoint failed.
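How the choice between the two messages depends on branch order can be sketched like this. formatErrorCopy is a hypothetical stand-in for the formatting path, and the generic fallback string is a placeholder; only the two quoted messages come from the PR.

```typescript
type FailureKind = "upstream_html" | "dns" | "unknown";

const UPSTREAM_HTML_COPY =
  "The provider returned an HTML error page instead of an API response. " +
  "This usually means a CDN or gateway (e.g. Cloudflare) blocked the request. " +
  "Retry in a moment or check provider status.";
const DNS_COPY = "LLM request failed: DNS lookup for the provider endpoint failed.";

// Hypothetical formatter: branch on the classified kind first, and use the
// substring match on the raw text only as a fallback.
function formatErrorCopy(kind: FailureKind, raw: string): string {
  if (kind === "upstream_html") return UPSTREAM_HTML_COPY;
  if (raw.toLowerCase().includes("dns")) return DNS_COPY;
  return "LLM request failed."; // placeholder generic copy (assumption)
}

const rawHtml = "<html><body>Please check your DNS configuration.</body></html>";

// With the body classified as upstream_html, the DNS substring is never consulted.
console.log(formatErrorCopy("upstream_html", rawHtml) === UPSTREAM_HTML_COPY); // true
// Without that classification, the substring fallback yields the DNS copy.
console.log(formatErrorCopy("unknown", rawHtml) === DNS_COPY); // true
```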

Diagram

N/A.

Security Impact

  • Adds or changes permissions/capabilities? No.
  • Reads, writes, or persists secrets? No.
  • Opens new network endpoints or outbound calls? No.
  • Changes code execution boundaries (sandbox, exec, MCP)? No.
  • Widens data visibility or cross-user scope? No.

Repro + Verification

  • Environment: OpenClaw 2026.4.14 (323493f), macOS 15.6.1 arm64, Node 24.12.0.
  • Steps:
    1. Configure openai-codex/gpt-5.2 as primary model.
    2. Trigger a run while the upstream returns an HTML Cloudflare/CDN error body (raw, no status prefix).
    3. Observe user-facing message and gateway logs.
  • Expected (after fix): User sees the upstream-HTML message; logs still show rawError=<html>....
  • Actual (before fix): User sees "LLM request failed: DNS lookup for the provider endpoint failed." even though DNS is healthy.

Evidence

Failing before:

  • classifyProviderRuntimeFailureKind({message: "<html>...</html>"}) returned "dns" (when body mentioned DNS) or "unknown" otherwise.
  • formatAssistantErrorText(raw) returned the DNS copy.

Passing after:

  • pnpm test src/agents/pi-embedded-helpers/provider-error-patterns.test.ts src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts → all green (47 + 43 tests).
  • pnpm test src/agents/pi-embedded-helpers/ src/agents/pi-embedded-helpers.*.test.ts broader sweep → all green (128 tests for the sanitize/format surface).

Human Verification

  • Verified: targeted tests for provider-error-patterns.test.ts, formatassistanterrortext.test.ts, and sanitizeuserfacingtext.test.ts; no touched-file changes show up in pnpm tsgo or pnpm lint output.
  • Not verified: full pnpm tsgo / pnpm lint / pnpm build are clean on upstream main — local runs surface pre-existing failures in extensions/discord/src/monitor/gateway-plugin.*, extensions/qa-lab/src/scenario-runtime-api.test.ts, and a @clawdbot/lobster/core rolldown resolution error, all unrelated to this change.
  • Edge cases considered: status-prefixed HTML (preserved), sub-400 statuses (preserved veto), Error:-prefixed HTML, plain DNS messages (still classify as dns).

Review Conversations

  • All Greptile findings addressed
  • All ChatGPT Codex findings addressed

Compatibility / Migration

  • Backward compatible: Yes. All previously-classified HTML+status cases still classify identically; only previously-unclassified raw-HTML-without-status cases change.
  • Config changes: None.
  • Migration required: None.

Risks and Mitigations

  • Risk: A non-error payload that legitimately starts with <html> and closes with </html> could now be classified as upstream_html when previously it would fall through.
    • Mitigation: The < 400 status veto still fires whenever a status can be inferred, so explicit 200 <html>... responses remain unclassified. Status-less raw HTML arriving through error-paths inside the agent runtime is already an exceptional signal — treating it as upstream HTML is accurate.
  • Risk: Downstream consumers of "dns" classification may have relied on HTML bodies to reach them for retry logic.
    • Mitigation: Our #67642 already diverted status-prefixed HTML to upstream_html; this PR simply extends that to the status-less case. No new semantic class is introduced.

AI-assisted: This PR was developed with AI assistance.

Changed files

  • src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts (modified, +17/-0)
  • src/agents/pi-embedded-helpers/errors.ts (modified, +14/-5)
  • src/agents/pi-embedded-helpers/provider-error-patterns.test.ts (modified, +43/-0)

Code Example

warn errors:
- HTML error truncated
- Cloudflare 521 response

04:16:12+00:00 warn agent/embedded {"event":"embedded_run_agent_end","isError":true,
"error":"API rate limit reached","failoverReason":"rate_limit",
"model":"gpt-5.4","provider":"openai-codex"} embedded run agent end

04:16:22+00:00 warn agent/embedded {"event":"embedded_run_agent_end","isError":true,
"error":"API rate limit reached","failoverReason":"rate_limit",
"model":"gpt-5.4","provider":"openai-codex"} embedded run agent end

04:16:33+00:00 warn agent/embedded {"event":"embedded_run_agent_end","isError":true,
"error":"DNS lookup for provider endpoint failed",
"model":"gpt-5.4","provider":"openai-codex"} embedded run agent end

04:16:49+00:00 warn errors Long error truncated: <html>...
04:16:49+00:00 warn agent/embedded {"event":"embedded_run_agent_end","isError":true,
"error":"<html>...","model":"gpt-5.4","provider":"openai-codex"} embedded run agent end

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

openai-codex/gpt-5.4 stalls indefinitely in the TUI. Logs show Cloudflare / HTML responses, but OpenClaw misclassifies the failure as API rate limit reached, DNS lookup for provider endpoint failed, or raw <html>....

Steps to reproduce

  1. Authenticate Codex OAuth:
     codex login --device-auth
     openclaw models auth login --provider openai-codex
  2. Set model to openai-codex/gpt-5.4.
  3. Start a normal TUI/session prompt.
  4. Observe the session stall/noodle without a real response.
  5. Check logs, which show Cloudflare / HTML responses and inconsistent classification (rate_limit, DNS, raw <html>).

Expected behavior

Cloudflare / HTML upstream failures should be classified consistently as upstream HTTP/service errors and fail cleanly, not as DNS or rate-limit errors, and should not leave the TUI session stuck.

Actual behavior

The same underlying upstream failure appears to be mapped inconsistently:

  • API rate limit reached
  • DNS lookup for provider endpoint failed
  • raw <html>...

OpenClaw version

First observed on 2026.4.14. Downgraded to 2026.4.12 (1c0672b) and the same behavior persisted.

Operating system

macOS

Install method

npm global install (updated via OpenClaw tooling, downgraded with npm command)

Model

gpt-5.4

Provider / routing chain

openai-codex/gpt-5.4 via Codex OAuth (not openai/gpt-5.4 API key route)

Additional provider/model setup details

  • codex login --device-auth
  • openclaw models auth login --provider openai-codex
  • Default model set to openai-codex/gpt-5.4
  • Disabling cron/channels/memory-core reduced background traffic but did not fix the single-request repro

Impact and severity

High. Blocks normal interactive use of openai-codex/gpt-5.4 in the TUI and leaves sessions stuck.

Additional information

  • I first noticed this after updating to 2026.4.14.
  • Downgrading to 2026.4.12 did not clear it.
  • This makes it unclear whether this is a pure 4.14 regression vs. an upstream/provider-side or persisted local-state issue.
  • Another user in the Discord help thread reported the same symptom.

extent analysis

TL;DR

The issue can be mitigated by improving error handling for Cloudflare/HTML responses in OpenClaw to consistently classify them as upstream HTTP/service errors.

Guidance

  • Review the OpenClaw code to identify where the error classification is happening and why it's inconsistent for Cloudflare/HTML responses.
  • Consider adding specific handling for Cloudflare error codes (e.g., 521) to classify them as upstream service errors.
  • Verify that the openclaw models auth login and codex login commands are correctly configured and authenticated.
  • Investigate if the issue persists when using a different model or provider to isolate if it's specific to openai-codex/gpt-5.4.

Example

No code snippet is provided as the issue requires a deeper understanding of the OpenClaw codebase and its error handling mechanisms.

Notes

The issue's root cause is unclear due to the inconsistent classification of errors and the persistence of the issue after downgrading OpenClaw. Further investigation into the OpenClaw code and its interaction with the openai-codex/gpt-5.4 model is necessary.

Recommendation

Apply a workaround by improving error handling in OpenClaw for Cloudflare/HTML responses, as this is likely to mitigate the issue and provide a more consistent user experience.
