openclaw - ✅(Solved) Fix LLM error messages are over-normalized: raw error details lost in logs [2 pull requests, 2 comments, 3 participants]

zwj0117 · 2026-03-21T02:45:20Z

# PR #1: fix(agents): surface raw LLM errors in embedded run consoleMessage

- Repository: jepson-liu/openclaw
- Author: jepson-liu
- State: closed | merged: False
- Link: https://github.com/jepson-liu/openclaw/pull/1

## Description (problem / solution / changelog)

## Summary

- Problem: multiple transport/provider failures were normalized to `LLM request timed out.`, which hid actionable diagnostics in operator logs.
- Why it matters: incident triage could not distinguish timeout vs connection refused/reset/DNS-like failures from `consoleMessage` alone.
- What changed: `handleAgentEnd` now appends a redacted `rawError=...` suffix to `consoleMessage` only when it differs from the friendly error text.
- What did NOT change (scope boundary): no changes to timeout classification patterns, provider retry policy, or provider-specific timeout configuration.

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor required for the fix
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Fixes #51387
- Related #N/A

## User-visible / Behavior Changes

- Error-facing lifecycle event text remains friendly (for example `LLM request timed out.`).
- Operator `consoleMessage` now includes redacted `rawError` details when friendly normalization would otherwise mask the root cause.

## Security Impact (required)

- New permissions/capabilities? (`No`)
- Secrets/tokens handling changed? (`No`)
- New/changed network calls? (`No`)
- Command/tool execution surface changed? (`No`)
- Data access scope changed? (`No`)
- If any `Yes`, explain risk + mitigation: `N/A`

## Repro + Verification

### Environment

- OS: Linux
- Runtime/container: local dev shell
- Model/provider: mocked assistant error payloads in unit tests
- Integration/channel (if any): N/A
- Relevant config (redacted): default test config

### Steps

1. Trigger `handleAgentEnd` with assistant `stopReason: "error"` and a raw error matching timeout normalization (for example `ECONNREFUSED`).
2. Inspect logged metadata and `consoleMessage`.
3. Repeat with sensitive token-like content in raw error.

### Expected

- `error` keeps friendly normalized copy.
- `consoleMessage` includes `rawError=...` when different from friendly copy.
- Sensitive values remain redacted.

### Actual

- Matches expected behavior; regression tests pass.

## Evidence

- [x] Failing test/log before + passing after
- [x] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

## Human Verification (required)

- Verified scenarios:
  - Overloaded provider payload keeps friendly message and appends redacted raw payload in `consoleMessage`.
  - Timeout-normalized `ECONNREFUSED` now surfaces original transport hint via `rawError` suffix.
  - Sensitive `x-api-key` value is redacted in appended `rawError`.
- Edge cases checked:
  - No suffix when raw preview equals friendly text.
  - Control characters still sanitized in console-facing fields.
- What you did **not** verify:
  - Live channel end-to-end runs against external providers.

## Review Conversations

- [x] I replied to or resolved every bot review conversation I addressed in this PR.
- [x] I left unresolved only the conversations that still need reviewer or maintainer judgment.

## Compatibility / Migration

- Backward compatible? (`Yes`)
- Config/env changes? (`No`)
- Migration needed? (`No`)
- If yes, exact upgrade steps: `N/A`

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly: revert this PR commit on `fix/51387-llm-error-log-raw-preview`.
- Files/config to restore: `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts`, `src/agents/pi-embedded-error-observation.ts`, and related test updates.
- Known bad symptoms reviewers should watch for: duplicate/noisy `consoleMessage` suffixes or unexpected unredacted values.

## Risks and Mitigations

- Risk: appended `rawError` increases log verbosity.
  - Mitigation: suffix is conditional (only when it adds signal) and length-limited via `RAW_ERROR_PREVIEW_MAX_CHARS`.
- Risk: sensitive content leakage through raw append path.
  - Mitigation: reuse existing `buildApiErrorObservationFields` redaction and console sanitization before appending.

## Changed files

- `src/agents/pi-embedded-error-observation.ts` (modified, +1/-1)
- `src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts` (modified, +38/-1)
- `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts` (modified, +9/-1)

---

# PR #51419: fix(agent): clarify embedded transport errors

- Repository: openclaw/openclaw
- Author: scoootscooob
- State: closed | merged: True
- Link: https://github.com/openclaw/openclaw/pull/51419



GitHub stats

openclaw/openclaw#51387 • Fetched 2026-04-08 01:11:53

Timeline (top): commented ×2, cross-referenced ×2, closed ×1, locked ×1

Error Message

When an LLM request fails, formatAssistantErrorText() normalizes many different errors into a single generic message such as "LLM request timed out." The raw error message is never logged, which makes it impossible to diagnose the actual failure.


Problem

Error patterns that all map to "LLM request timed out"

The ERROR_PATTERNS.timeout array matches 15+ patterns:

timeout, timed out
service unavailable
connection error, network error
fetch failed, socket hang up
ECONNREFUSED, ECONNRESET, ECONNABORTED
ETIMEDOUT, ENETUNREACH, EHOSTUNREACH
And more...

These represent very different failure modes (real timeout vs. connection refused vs. network error), but users and operators only see "LLM request timed out."
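To make the collapse concrete, here is a minimal TypeScript sketch. The pattern list is abridged from the one above, and `TIMEOUT_PATTERNS` and `formatErrorText` are hypothetical stand-ins, not OpenClaw's actual `ERROR_PATTERNS.timeout` array or `formatAssistantErrorText()` implementation.

```typescript
// Hypothetical, abridged stand-in for the timeout pattern list described above.
const TIMEOUT_PATTERNS: RegExp[] = [
  /timeout|timed out/i,
  /service unavailable/i,
  /connection error|network error/i,
  /fetch failed|socket hang up/i,
  /ECONNREFUSED|ECONNRESET|ECONNABORTED/i,
  /ETIMEDOUT|ENETUNREACH|EHOSTUNREACH/i,
];

// Hypothetical formatter: any match yields the same friendly text.
function formatErrorText(raw: string): string {
  return TIMEOUT_PATTERNS.some((pattern) => pattern.test(raw))
    ? "LLM request timed out."
    : raw;
}

// Three unrelated failure modes...
const samples: string[] = [
  "connect ECONNREFUSED 10.0.0.5:443",
  "socket hang up",
  "503 Service Unavailable",
];

// ...all end up as one indistinguishable message.
const formatted: string[] = samples.map(formatErrorText);
const distinctMessages = new Set(formatted);
console.log(distinctMessages.size);
```

With only the formatted text in the log, an operator cannot tell a refused connection from an upstream 503.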

Impact

In our deployment using a custom provider (custom-idealab-alibaba-inc-com), we see frequent "LLM request timed out" errors in gateway logs. Some occur within 0.4 seconds of the request starting — clearly not a 30-second timeout. Without the raw error, we cannot determine whether the issue is:

An actual timeout
A connection reset
A DNS failure
A TLS error
The request being aborted by something else

Where the raw error is lost

In handleAgentEnd(), the error flows through:

1. lastAssistant.errorMessage (raw) → formatAssistantErrorText() → safeErrorText (normalized)
2. safeErrorText is what gets logged via consoleMessage
3. buildApiErrorObservationFields() further redacts the raw error

The raw errorMessage is never emitted to any log output.

Suggested fix

Log the raw error alongside the formatted one — at minimum in debug/warn level:

embedded run agent end: runId=... error=LLM request timed out. rawError=<original error>

Consider differentiating error categories — instead of mapping everything to "timed out", use distinct user-facing messages:
- "LLM request timed out (no response within Xs)"
- "LLM request failed: connection error"
- "LLM request failed: service unavailable"

Make the LLM request timeout configurable per provider in openclaw.json:

{
  "models": {
    "providers": {
      "my-provider": {
        "requestTimeoutMs": 120000
      }
    }
  }
}

Currently the timeout is hardcoded at 30 seconds (3e4 in GatewayClient constructor).

Environment

OpenClaw version: 2026.3.13
Provider: custom Anthropic-compatible proxy (anthropic-messages API)
Model: claude-opus-4-6
Channel: DingTalk

Extended analysis

Fix Plan

To address the issue, we will implement the following steps:

Log the raw error alongside the formatted one
Differentiate error categories
Make the LLM request timeout configurable

Step 1: Log Raw Error

Modify the handleAgentEnd() function to log the raw error:

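// Log the normalized text together with the raw provider error.
// Redact secrets from the raw value before emitting it outside of local debugging.
console.log(`embedded run agent end: runId=${runId} error=${safeErrorText} rawError=${lastAssistant.errorMessage}`);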
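Logging the raw error verbatim can leak credentials, which is why PR #1 describes a redacted, length-limited `rawError=` suffix that is appended only when it differs from the friendly text. The sketch below illustrates that idea under stated assumptions: `redactSecrets`, `sanitizeForConsole`, `buildConsoleMessage`, the redaction regex, and the 200-character cap are all hypothetical. The actual change reuses `buildApiErrorObservationFields` and `RAW_ERROR_PREVIEW_MAX_CHARS`, whose implementation and value are not shown in this page.

```typescript
// Hypothetical cap; the real RAW_ERROR_PREVIEW_MAX_CHARS value is not shown in the source.
const MAX_PREVIEW_CHARS = 200;

// Hypothetical, deliberately narrow redaction: masks values that follow API-key style names.
function redactSecrets(text: string): string {
  return text.replace(
    /(x-api-key|api[_-]?key)(["']?\s*[:=]\s*["']?)([^\s"',;]+)/gi,
    "$1$2[REDACTED]",
  );
}

// Replace control characters so one error cannot spill across several log lines.
function sanitizeForConsole(text: string): string {
  return text.replace(/[\u0000-\u001f\u007f]+/g, " ").trim();
}

function buildConsoleMessage(runId: string, friendly: string, raw?: string): string {
  const base = `embedded run agent end: runId=${runId} error=${friendly}`;
  if (!raw) return base;
  const preview = sanitizeForConsole(redactSecrets(raw)).slice(0, MAX_PREVIEW_CHARS);
  // Append the suffix only when it adds information beyond the friendly text.
  return preview && preview !== friendly ? `${base} rawError=${preview}` : base;
}

console.log(
  buildConsoleMessage(
    "run-1",
    "LLM request timed out.",
    "connect ECONNREFUSED 10.0.0.5:443 x-api-key: sk-12345",
  ),
);
```

Redacting before truncating matters here: truncating first could cut a secret in half and let the remainder slip past the pattern.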

Step 2: Differentiate Error Categories

Update the formatAssistantErrorText() function to use distinct user-facing messages:

function formatAssistantErrorText(errorMessage) {
  // Compare case-insensitively so "Timeout" and "Service Unavailable" also match.
  const text = String(errorMessage ?? '').toLowerCase();
  if (text.includes('timeout') || text.includes('timed out')) {
    // "Xs" is a placeholder for the configured timeout in seconds.
    return 'LLM request timed out (no response within Xs)';
  } else if (text.includes('connection error')) {
    return 'LLM request failed: connection error';
  } else if (text.includes('service unavailable')) {
    return 'LLM request failed: service unavailable';
  } else {
    return 'LLM request failed: unknown error';
  }
}
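Plain substring checks like the ones above miss transport codes such as `ECONNREFUSED` that appear in the issue's pattern list. One possible alternative is an ordered rule table, sketched below in TypeScript. The `ErrorCategory` type, `RULES`, `MESSAGES`, and `classifyLlmError` are hypothetical names, and the grouping of patterns into categories is an assumption; only the patterns and the message wording come from the issue.

```typescript
type ErrorCategory = "timeout" | "connection" | "unavailable" | "unknown";

// Ordered rules: the first matching rule wins. Patterns are taken from the issue's list.
const RULES: Array<{ category: ErrorCategory; pattern: RegExp }> = [
  {
    category: "connection",
    pattern:
      /ECONNREFUSED|ECONNRESET|ECONNABORTED|ENETUNREACH|EHOSTUNREACH|socket hang up|fetch failed|connection error|network error/i,
  },
  { category: "unavailable", pattern: /service unavailable/i },
  { category: "timeout", pattern: /ETIMEDOUT|timeout|timed out/i },
];

// User-facing text per category, following the wording suggested in the issue.
const MESSAGES: Record<ErrorCategory, string> = {
  timeout: "LLM request timed out",
  connection: "LLM request failed: connection error",
  unavailable: "LLM request failed: service unavailable",
  unknown: "LLM request failed: unknown error",
};

function classifyLlmError(raw: string): ErrorCategory {
  return RULES.find((rule) => rule.pattern.test(raw))?.category ?? "unknown";
}

console.log(MESSAGES[classifyLlmError("connect ECONNREFUSED 127.0.0.1:443")]);
```

Keeping the rules in a table means a new category is one added entry, and the rule order makes the precedence between overlapping patterns explicit.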

Step 3: Make LLM Request Timeout Configurable

Add a requestTimeoutMs field to the provider configuration in openclaw.json:

{
  "models": {
    "providers": {
      "my-provider": {
        "requestTimeoutMs": 120000
      }
    }
  }
}

Then, update the GatewayClient constructor to use the configurable timeout:

function GatewayClient(provider) {
  // Fall back to the existing 30-second default when the provider sets no timeout.
  const requestTimeoutMs = provider.requestTimeoutMs || 30000;
  // ...
}
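The lookup from `openclaw.json` to a concrete timeout could be sketched as below. Note that `requestTimeoutMs` is the field proposed in the issue, not a confirmed OpenClaw option, and PR #1 explicitly leaves provider timeout configuration unchanged. The interfaces and `resolveRequestTimeoutMs` are hypothetical.

```typescript
// Hypothetical shape of the relevant slice of openclaw.json.
interface ProviderConfig {
  requestTimeoutMs?: number;
}
interface OpenClawConfig {
  models?: { providers?: Record<string, ProviderConfig> };
}

// The 30-second default cited in the issue.
const DEFAULT_REQUEST_TIMEOUT_MS = 30000;

function resolveRequestTimeoutMs(config: OpenClawConfig, providerId: string): number {
  const value = config.models?.providers?.[providerId]?.requestTimeoutMs;
  // Accept only finite, positive numbers; anything else falls back to the default.
  return typeof value === "number" && Number.isFinite(value) && value > 0
    ? value
    : DEFAULT_REQUEST_TIMEOUT_MS;
}

const exampleConfig: OpenClawConfig = {
  models: { providers: { "my-provider": { requestTimeoutMs: 120000 } } },
};

console.log(resolveRequestTimeoutMs(exampleConfig, "my-provider"));
console.log(resolveRequestTimeoutMs(exampleConfig, "other-provider"));
```

Validating the value at lookup time keeps a typo such as a negative number or a string in the config from silently disabling the timeout.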

Verification

To verify the fix, check the logs for the raw error messages and ensure that the error categories are correctly differentiated. Additionally, test the configurable timeout by setting different values in openclaw.json and verifying that the timeout is applied correctly.

Extra Tips

Consider adding more error categories and user-facing messages to improve error handling and diagnosis.
Review the ERROR_PATTERNS array to ensure that it covers all possible error scenarios.
Test the fix thoroughly to ensure that it does not introduce any new issues or regressions.
