openclaw - ✅(Solved) Fix openai-codex HTML/rate-limit responses are sometimes surfaced as 'DNS lookup for the provider endpoint failed' [1 pull requests, 2 comments, 2 participants]

GitHub stats: openclaw/openclaw#67712, fetched 2026-04-17 08:29:38. Comments: 2. Participants: 2. Timeline: 3 (commented ×2, cross-referenced ×1). Reactions: 0.

openai-codex/* real runs can fail with HTML/rate-limit/challenge-like upstream responses, but OpenClaw sometimes surfaces the failure as:

LLM request failed: DNS lookup for the provider endpoint failed.

On this machine, host DNS appears healthy, short live probes can succeed, and the misleading DNS message shows up alongside raw HTML responses and explicit rate-limit paths.

Error Message

embedded run agent end: runId=50e0f85d-da79-41e9-b472-713e48425eeb isError=true model=gpt-5.2 provider=openai-codex error=LLM request failed: DNS lookup for the provider endpoint failed. rawError=<html> <head> <meta name="viewport" content="width=device-width, initial-scale=1" /> ...

embedded run agent end: ... provider=openai-codex error=⚠️ API rate limit reached. Please try again later. rawError=<html> <head> <meta name="viewport" content="width=device-width, initial-scale=1" /> ...

And error truncation logs show:

Long error truncated: <html>

If the upstream returns HTML/challenge/rate-limit content, surface a more accurate provider/transport error instead of a DNS failure.

Root Cause

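This does not look like a pure host DNS problem: api.openai.com and chatgpt.com both resolve on the host, short live probes can succeed, failed real runs include rawError=<html>..., and nearby failures are also classified as rate_limit.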

Fix Action

Fixed

PR fix notes

PR #67762: fix(agents): classify raw HTML error responses even without leading HTTP status prefix

Description (problem / solution / changelog)

Summary

Problem: openai-codex runs whose upstream returns a raw HTML Cloudflare/CDN error page (without a leading HTTP status prefix) are surfaced to the user as:

LLM request failed: DNS lookup for the provider endpoint failed.

Even though host DNS is healthy, short live probes succeed, and the gateway logs clearly show rawError=<html>.... The message is misleading, and a user seeing it has no reason to suspect a CDN/gateway issue upstream.

Why: Follow-up to merged #67642 (Cloudflare HTML misclassification). That PR taught classifyProviderRuntimeFailureKind to return upstream_html for HTML bodies, but only when an HTTP status >= 400 could be inferred. When providers forward a raw <html>...</html> body without a leading status prefix and callers don't pass an explicit status, the classifier silently falls through to isDnsTransportErrorMessage (which substring-matches dns anywhere in the body) and produces a DNS-lookup message.

What changed: Relaxed isHtmlErrorResponse in src/agents/pi-embedded-helpers/errors.ts to trust strong HTML markers (<!doctype html>/<html> start + </html> close) even when no status can be inferred. The pre-existing < 400 guard still fires whenever a status IS inferred — so status-prefixed payloads keep their current semantics.
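The relaxed gate can be sketched roughly as follows. This is a minimal illustration, not the code from errors.ts: the function names mirror those mentioned in the PR, but the bodies, the regular expressions, and the hasStrongHtmlMarkers helper are assumptions made for the example.

```typescript
// Hypothetical stand-in for the helper named in the PR: reads a leading
// three-digit HTTP status, e.g. "502 <html>..." -> 502, otherwise null.
function extractLeadingHttpStatus(message: string): number | null {
  const match = /^\s*(?:Error:\s*)?(\d{3})\b/i.exec(message);
  return match ? Number(match[1]) : null;
}

// Assumed marker patterns: a doctype or <html> opener at the start of the
// body and a closing </html> at the end.
const HTML_START_RE = /^(?:<!doctype html|<html)\b/i;
const HTML_END_RE = /<\/html>\s*$/i;

// Hypothetical helper: strips an optional "Error:" prefix and an optional
// leading status, then checks both markers.
function hasStrongHtmlMarkers(message: string): boolean {
  const body = message.replace(/^\s*(?:Error:\s*)?(?:\d{3}\b\s*)?/i, "");
  return HTML_START_RE.test(body) && HTML_END_RE.test(body);
}

// Sketch of the relaxed check: a known status below 400 still vetoes,
// but a missing status no longer does.
function isHtmlErrorResponse(message: string, status?: number): boolean {
  const inferred = status ?? extractLeadingHttpStatus(message) ?? undefined;
  if (typeof inferred === "number" && inferred < 400) return false;
  return hasStrongHtmlMarkers(message);
}
```

Under these assumptions, a status-less `<html>...</html>` body returns true, while `200 <html>...</html>` or an explicit status of 200 still returns false, which is the veto the PR says it preserves.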

What did NOT change:

  • classifyFailoverSignal (failover/timeout gate at 408/499/5xx) — intentionally left as is.
  • formatTransportErrorCopy substring dns match — the HTML branch now runs first so the fall-through isn't reachable for HTML bodies; tightening the DNS regex would be a separate cleanup.
  • src/shared/assistant-error-format.ts:isCloudflareOrHtmlErrorPage — similar gap but broader blast radius, out of scope.

Change Type

Bug fix.

Scope

  • agents (src/agents)
  • plugins / providers
  • channels
  • gateway
  • cli / web-ui / apps

Linked Issue

Closes #67712. Follow-up to #67642 (same classification pipeline, same reviewer surface).

Root Cause

isHtmlErrorResponse at src/agents/pi-embedded-helpers/errors.ts:339-354 required an inferrable HTTP status code >= 400 to classify raw HTML as an HTML error.

Classification chain for an openai-codex HTML response without a leading status prefix:

  1. extractLeadingHttpStatus("<html>...") returns null (no 3-digit prefix).
  2. inferred is undefined.
  3. if (typeof inferred !== "number" || inferred < 400) return false; — bails early.
  4. classifyProviderRuntimeFailureKind at line 820 does not return "upstream_html".
  5. Falls through to isDnsTransportErrorMessage(message) (line 831). That uses DNS_ERROR_RE with \bdns\b, which matches Cloudflare challenge bodies that reference DNS in their body copy.
  6. Returns kind "dns".
  7. formatAssistantErrorText has no explicit "dns" branch, falls through to formatTransportErrorCopy(raw) at line 971, matches lower.includes("dns"), returns "LLM request failed: DNS lookup for the provider endpoint failed."

The status-gate was reasonable for status-prefixed payloads (e.g. 200 {"...":"..."}) where we don't want sub-400 responses to be flagged as HTML errors. It was overly strict for raw HTML bodies where the HTML markers themselves are strong enough evidence that the upstream is misbehaving.
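The fall-through in the chain above can be reproduced with a small model of the classifier. Everything here is a simplified assumption: classifyFailure, the two gate functions, and the exact DNS pattern are illustrative names and values, and the only detail taken from the PR is that DNS_ERROR_RE contains \bdns\b.

```typescript
type FailureKind = "upstream_html" | "dns" | "unknown";
type HtmlGate = (message: string) => boolean;

// Illustrative pattern: \bdns\b comes from the PR text, ENOTFOUND is an
// assumed extra alternative for plain resolver errors.
const DNS_ERROR_RE = /\bdns\b|\bENOTFOUND\b/i;

// Models steps 1-3: for a body with no leading status, the pre-fix gate
// always bailed, whatever the body contained.
const preFixGateForStatuslessBody: HtmlGate = () => false;

// Models the post-fix gate for the same input: an <html> or doctype opener
// plus a closing </html> is treated as sufficient evidence.
const postFixGate: HtmlGate = (message) =>
  /^\s*(?:<!doctype html|<html)\b[\s\S]*<\/html>\s*$/i.test(message);

// The HTML branch runs first; the DNS pattern is only consulted afterwards.
function classifyFailure(message: string, isHtml: HtmlGate): FailureKind {
  if (isHtml(message)) return "upstream_html";
  if (DNS_ERROR_RE.test(message)) return "dns";
  return "unknown";
}
```

With a hypothetical challenge body such as `<html><body><p>DNS points to prohibited IP</p></body></html>`, the pre-fix gate yields "dns" and the post-fix gate yields "upstream_html". A plain `getaddrinfo ENOTFOUND api.openai.com` message yields "dns" under both gates.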

Regression Test Plan

  • Coverage level: Unit tests in the owning helper test file, plus end-to-end user-message test in formatassistanterrortext.test.ts.
  • Target tests added:
    • provider-error-patterns.test.ts — new describe block "Raw HTML error pages without a leading HTTP status (#67712)" with 5 cases:
      1. Raw Cloudflare challenge HTML classifies as upstream_html.
      2. HTML body that mentions DNS does NOT classify as dns (the reported regression).
      3. Error:-prefixed raw HTML still classifies as upstream_html.
      4. Plain DNS transport errors (ENOTFOUND) still classify as dns (negative guard).
      5. Explicit sub-400 status still vetoes HTML classification (preserves status-gate semantics when the status is known).
    • pi-embedded-helpers.formatassistanterrortext.test.ts — end-to-end: raw HTML body containing the substring "DNS" returns the upstream-HTML user copy, not the DNS copy.
  • Existing coverage preserved: All #67517 (Cloudflare HTML with status) and existing DNS/transport tests keep passing.

User-visible Changes

openai-codex runs that receive a raw HTML Cloudflare/CDN response now surface the upstream-HTML message:

The provider returned an HTML error page instead of an API response. This usually means a CDN or gateway (e.g. Cloudflare) blocked the request. Retry in a moment or check provider status.

Instead of the misleading:

LLM request failed: DNS lookup for the provider endpoint failed.

Diagram

N/A.

Security Impact

  • Adds or changes permissions/capabilities? No.
  • Reads, writes, or persists secrets? No.
  • Opens new network endpoints or outbound calls? No.
  • Changes code execution boundaries (sandbox, exec, MCP)? No.
  • Widens data visibility or cross-user scope? No.

Repro + Verification

  • Environment: OpenClaw 2026.4.14 (323493f), macOS 15.6.1 arm64, Node 24.12.0.
  • Steps:
    1. Configure openai-codex/gpt-5.2 as primary model.
    2. Trigger a run while the upstream returns an HTML Cloudflare/CDN error body (raw, no status prefix).
    3. Observe user-facing message and gateway logs.
  • Expected (after fix): User sees the upstream-HTML message; logs still show rawError=<html>....
  • Actual (before fix): User sees "LLM request failed: DNS lookup for the provider endpoint failed." even though DNS is healthy.

Evidence

Failing before:

  • classifyProviderRuntimeFailureKind({message: "<html>...</html>"}) returned "dns" (when body mentioned DNS) or "unknown" otherwise.
  • formatAssistantErrorText(raw) returned the DNS copy.

Passing after:

  • pnpm test src/agents/pi-embedded-helpers/provider-error-patterns.test.ts src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts → all green (47 + 43 tests).
  • pnpm test src/agents/pi-embedded-helpers/ src/agents/pi-embedded-helpers.*.test.ts broader sweep → all green (128 tests for the sanitize/format surface).

Human Verification

  • Verified: targeted tests for provider-error-patterns.test.ts, formatassistanterrortext.test.ts, and sanitizeuserfacingtext.test.ts; no touched-file changes show up in pnpm tsgo or pnpm lint output.
  • Not verified: full pnpm tsgo / pnpm lint / pnpm build are clean on upstream main — local runs surface pre-existing failures in extensions/discord/src/monitor/gateway-plugin.*, extensions/qa-lab/src/scenario-runtime-api.test.ts, and a @clawdbot/lobster/core rolldown resolution error, all unrelated to this change.
  • Edge cases considered: status-prefixed HTML (preserved), sub-400 statuses (preserved veto), Error:-prefixed HTML, plain DNS messages (still classify as dns).

Review Conversations

  • All Greptile findings addressed
  • All ChatGPT Codex findings addressed

Compatibility / Migration

  • Backward compatible: Yes. All previously-classified HTML+status cases still classify identically; only previously-unclassified raw-HTML-without-status cases change.
  • Config changes: None.
  • Migration required: None.

Risks and Mitigations

  • Risk: A non-error payload that legitimately starts with <html> and closes with </html> could now be classified as upstream_html when previously it would fall through.
    • Mitigation: The < 400 status veto still fires whenever a status can be inferred, so explicit 200 <html>... responses remain unclassified. Status-less raw HTML arriving through error-paths inside the agent runtime is already an exceptional signal — treating it as upstream HTML is accurate.
  • Risk: Downstream consumers of "dns" classification may have relied on HTML bodies to reach them for retry logic.
    • Mitigation: Our #67642 already diverted status-prefixed HTML to upstream_html; this PR simply extends that to the status-less case. No new semantic class is introduced.

AI-assisted: This PR was developed with AI assistance.

Changed files

  • src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts (modified, +17/-0)
  • src/agents/pi-embedded-helpers/errors.ts (modified, +14/-5)
  • src/agents/pi-embedded-helpers/provider-error-patterns.test.ts (modified, +43/-0)


Summary

openai-codex/* real runs can fail with HTML/rate-limit/challenge-like upstream responses, but OpenClaw sometimes surfaces the failure as:

LLM request failed: DNS lookup for the provider endpoint failed.

On this machine, host DNS appears healthy, short live probes can succeed, and the misleading DNS message shows up alongside raw HTML responses and explicit rate-limit paths.

Environment

  • OpenClaw: 2026.4.14 (323493f)
  • OS: macOS 15.6.1 (arm64)
  • Node: 24.12.0
  • Gateway: local loopback, reachable
  • Primary model: openai-codex/gpt-5.2

What I Verified

From the same machine where the gateway is running:

  • openclaw status --all shows gateway reachable and Telegram healthy
  • nslookup api.openai.com resolves normally
  • nslookup chatgpt.com resolves normally
  • openclaw models status --probe --json can show short live probes succeeding for:
    • openai-codex/gpt-5.2
    • google/gemini-2.0-flash

Successful Probe Example

From openclaw models status --probe --json:

{
  "provider": "openai-codex",
  "model": "openai-codex/gpt-5.2",
  "status": "ok",
  "latencyMs": 10591
}

Failed Real Run Examples

From openclaw status --all / gateway logs:

embedded run agent end: runId=50e0f85d-da79-41e9-b472-713e48425eeb isError=true model=gpt-5.2 provider=openai-codex error=LLM request failed: DNS lookup for the provider endpoint failed. rawError=<html> <head> <meta name="viewport" content="width=device-width, initial-scale=1" /> ...

Nearby runs also show:

embedded run agent end: ... provider=openai-codex error=⚠️ API rate limit reached. Please try again later. rawError=<html> <head> <meta name="viewport" content="width=device-width, initial-scale=1" /> ...

And error truncation logs show:

Long error truncated: <html>
  <head>
    <meta name="viewport" content="width=device-width, initial-scale=1" />

Why This Looks Wrong

This does not look like a pure host DNS problem because:

  • api.openai.com and chatgpt.com both resolve on the host
  • short live probes can succeed
  • failed real runs include rawError=<html>...
  • nearby failures are also classified as rate_limit

So the likely bug is that intermittent upstream HTML/challenge/rate-limit responses on the openai-codex/* path are being misclassified and surfaced as DNS lookup for the provider endpoint failed.
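To check a log for this mismatch, one option is to scan the "embedded run agent end:" lines and flag entries whose user-facing error mentions a DNS lookup while rawError begins with HTML. The sketch below assumes the line layout shown in the examples above (error= followed by rawError= at the end of the line). parseRunEndLine and isSuspectDnsLabel are hypothetical helpers, not part of OpenClaw.

```typescript
interface RunEndRecord {
  provider: string | null;
  error: string;
  rawError: string;
}

// Hypothetical parser for lines shaped like the gateway log examples.
// "isError=" and "rawError=" use a capital E, so the case-sensitive
// "error=" match only hits the user-facing error field.
function parseRunEndLine(line: string): RunEndRecord | null {
  if (line.indexOf("embedded run agent end:") !== 0) return null;
  const match = /\berror=([\s\S]*?)\s+rawError=([\s\S]*)$/.exec(line);
  if (!match) return null;
  const provider = /\bprovider=(\S+)/.exec(line);
  return {
    provider: provider ? provider[1] ?? null : null,
    error: match[1] ?? "",
    rawError: match[2] ?? "",
  };
}

// Flags the pattern described in this issue: a DNS-lookup message paired
// with a raw body that starts like an HTML page.
function isSuspectDnsLabel(record: RunEndRecord): boolean {
  const looksHtml = /^\s*(?:<!doctype html|<html)\b/i.test(record.rawError);
  return looksHtml && /dns lookup/i.test(record.error);
}
```

Applied to the first failed run above, the record would be flagged. The rate-limit run would not, because its error text does not mention a DNS lookup.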

Expected

If the upstream returns HTML/challenge/rate-limit content, surface a more accurate provider/transport error instead of a DNS failure.

Actual

Some openai-codex/* failures are surfaced as:

LLM request failed: DNS lookup for the provider endpoint failed.

even when the raw response appears to be HTML and DNS on the host is healthy.

extent analysis

TL;DR

Upstream HTML, challenge, or rate-limit responses on the openai-codex/* path were being misclassified as DNS lookup failures. The fix is to classify raw HTML bodies as upstream HTML before the DNS check runs, which PR #67762 does for bodies without a leading HTTP status.

Guidance

  • Per PR #67762, the misclassification originates in isHtmlErrorResponse (src/agents/pi-embedded-helpers/errors.ts), which required an inferrable HTTP status >= 400. The HTML check must succeed before the DNS pattern is consulted.
  • The report is against OpenClaw 2026.4.14 (323493f). Check whether your build includes PR #67762, and update once a version that includes it is available.
  • Inspect the rawError field in failed run logs to confirm that the responses are HTML or rate-limit related. A rawError that starts with <html> points to error classification rather than a genuine DNS problem.
  • Rate-limit responses already have their own user-facing message ("API rate limit reached. Please try again later."). Confirm that HTML-bodied rate-limit responses reach that path rather than the DNS copy.

Example

The adjustment amounts to checking for HTML content before determining the error type. PR #67762 implements this by having isHtmlErrorResponse trust strong HTML markers (<!doctype html>/<html> start plus </html> close) when no HTTP status can be inferred, while a known status below 400 still vetoes the classification.

Notes

The solution assumes that the openclaw software has the capability to inspect and handle different types of error responses. If the software does not have this capability, a feature request or a custom solution might be necessary.

Recommendation

Apply the fix from PR #67762 rather than a local workaround: it corrects the classification at its source, so users see the upstream-HTML message instead of the misleading DNS message.
