openclaw - ✅(Solved) Fix LLM error messages are over-normalized: raw error details lost in logs [2 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#51387 · Fetched 2026-04-08 01:11:53
Comments: 2 · Participants: 3 · Timeline events: 7 · Reactions: 0
Timeline (top): commented ×2, cross-referenced ×2, closed ×1, locked ×1

Error Message

When an LLM request fails, formatAssistantErrorText() normalizes many different errors into a single generic message such as "LLM request timed out." The raw error message is never logged, which makes the actual failure impossible to diagnose.

Error patterns that all map to "LLM request timed out"

  • timeout, timed out
  • connection error, network error
  • ECONNREFUSED, ECONNRESET, ETIMEDOUT, and more

These represent very different failure modes (real timeout vs. connection refused vs. network error), but users and operators only see "LLM request timed out."

In our deployment using a custom provider (custom-idealab-alibaba-inc-com), we see frequent "LLM request timed out" errors in gateway logs. Some occur within 0.4 seconds of the request starting, which is clearly not a 30-second timeout. Without the raw error, we cannot determine whether the issue is:

  • An actual timeout
  • A connection reset
  • A DNS failure
  • A TLS error
  • The request being aborted by something else

Where the raw error is lost

In handleAgentEnd(), the error flows through:

  1. lastAssistant.errorMessage (raw) → formatAssistantErrorText() → safeErrorText (normalized)
  2. safeErrorText is what gets logged via consoleMessage
  3. buildApiErrorObservationFields() further redacts the raw error

Suggested fix

  1. Log the raw error alongside the formatted one, at minimum at debug/warn level: embedded run agent end: runId=... error=LLM request timed out. rawError=<original error>
  2. Consider differentiating error categories. Instead of mapping everything to "timed out", use distinct user-facing messages:
    • "LLM request timed out (no response within Xs)"
    • "LLM request failed: connection error"
    • "LLM request failed: service unavailable"

Fix Action

Fixed

PR fix notes

PR #1: fix(agents): surface raw LLM errors in embedded run consoleMessage

Description (problem / solution / changelog)

Summary

  • Problem: multiple transport/provider failures were normalized to "LLM request timed out.", which hid actionable diagnostics in operator logs.
  • Why it matters: incident triage could not distinguish timeout vs connection refused/reset/DNS-like failures from consoleMessage alone.
  • What changed: handleAgentEnd now appends a redacted rawError=... suffix to consoleMessage only when it differs from the friendly error text.
  • What did NOT change (scope boundary): no changes to timeout classification patterns, provider retry policy, or provider-specific timeout configuration.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Fixes #51387
  • Related #N/A

User-visible / Behavior Changes

  • Error-facing lifecycle event text remains friendly (for example "LLM request timed out.").
  • Operator consoleMessage now includes redacted rawError details when friendly normalization would otherwise mask the root cause.
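
The conditional suffix described here can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `buildConsoleMessage` and its parameters are hypothetical names, and the raw error is assumed to have been redacted and sanitized before it is passed in.

```typescript
// Hypothetical helper: builds the operator-facing console line.
// `redactedRawError` is assumed to be redacted and sanitized by the caller.
function buildConsoleMessage(
  runId: string,
  friendlyError: string,
  redactedRawError: string | undefined,
): string {
  const base = `embedded run agent end: runId=${runId} error=${friendlyError}`;
  const raw = redactedRawError?.trim();
  // Append the suffix only when it adds information beyond the friendly text.
  if (!raw || raw === friendlyError.trim()) {
    return base;
  }
  return `${base} rawError=${raw}`;
}
```

With a raw value such as `connect ECONNREFUSED 127.0.0.1:443` the line gains a `rawError=` suffix. When the raw text equals the friendly text, the line is left unchanged, which avoids duplicate output.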

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: local dev shell
  • Model/provider: mocked assistant error payloads in unit tests
  • Integration/channel (if any): N/A
  • Relevant config (redacted): default test config

Steps

  1. Trigger handleAgentEnd with assistant stopReason: "error" and a raw error matching timeout normalization (for example ECONNREFUSED).
  2. Inspect logged metadata and consoleMessage.
  3. Repeat with sensitive token-like content in raw error.

Expected

  • error keeps friendly normalized copy.
  • consoleMessage includes rawError=... when different from friendly copy.
  • Sensitive values remain redacted.

Actual

  • Matches expected behavior; regression tests pass.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:
    • Overloaded provider payload keeps friendly message and appends redacted raw payload in consoleMessage.
    • Timeout-normalized ECONNREFUSED now surfaces original transport hint via rawError suffix.
    • Sensitive x-api-key value is redacted in appended rawError.
  • Edge cases checked:
    • No suffix when raw preview equals friendly text.
    • Control characters still sanitized in console-facing fields.
  • What you did not verify:
    • Live channel end-to-end runs against external providers.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR commit on fix/51387-llm-error-log-raw-preview.
  • Files/config to restore: src/agents/pi-embedded-subscribe.handlers.lifecycle.ts, src/agents/pi-embedded-error-observation.ts, and related test updates.
  • Known bad symptoms reviewers should watch for: duplicate/noisy consoleMessage suffixes or unexpected unredacted values.

Risks and Mitigations

  • Risk: appended rawError increases log verbosity.
    • Mitigation: suffix is conditional (only when it adds signal) and length-limited via RAW_ERROR_PREVIEW_MAX_CHARS.
  • Risk: sensitive content leakage through raw append path.
    • Mitigation: reuse existing buildApiErrorObservationFields redaction and console sanitization before appending.
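
Both mitigations can be illustrated with a small sketch. The real redaction rules in buildApiErrorObservationFields and the actual value of RAW_ERROR_PREVIEW_MAX_CHARS are not shown on this page, so the limit, the `SECRET_PATTERN` regex, and the `buildRawErrorPreview` helper below are assumptions for illustration. The pattern only covers simple `key: value` and `key=value` forms.

```typescript
// Assumed value for illustration; the real RAW_ERROR_PREVIEW_MAX_CHARS is not shown here.
const RAW_ERROR_PREVIEW_MAX_CHARS = 200;

// Hypothetical redaction rule: mask values that follow common credential keys.
const SECRET_PATTERN = /\b(x-api-key|api[_-]?key|token)(["']?\s*[:=]\s*["']?)([^\s"',;]+)/gi;

function buildRawErrorPreview(rawError: string): string {
  const redacted = rawError.replace(SECRET_PATTERN, "$1$2[REDACTED]");
  // Replace control characters (including newlines) so the preview stays on one log line.
  const singleLine = redacted.replace(/[\u0000-\u001f\u007f]+/g, " ").trim();
  // Truncate long previews to bound log verbosity.
  return singleLine.length > RAW_ERROR_PREVIEW_MAX_CHARS
    ? `${singleLine.slice(0, RAW_ERROR_PREVIEW_MAX_CHARS)}…`
    : singleLine;
}
```

Redacting before truncating matters: if the order were reversed, a secret cut in half at the length limit could slip past the pattern.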

Changed files

  • src/agents/pi-embedded-error-observation.ts (modified, +1/-1)
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts (modified, +38/-1)
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts (modified, +9/-1)

PR #51419: fix(agent): clarify embedded transport errors

Description (problem / solution / changelog)

Summary

Describe the problem and fix in 2-5 bullets:

  • Problem: embedded agent failures were collapsing many transport/network errors into the same "LLM request timed out." message.
  • Why it matters: operators could not distinguish connection refused, DNS failures, and interrupted sockets from real timeouts, even in lifecycle logs.
  • What changed: added safe transport-specific error copy for common network failures and included the already-redacted raw error preview in the lifecycle console message.
  • What did NOT change (scope boundary): this does not expose raw provider errors to chat, change request timeout configuration, or widen any logging surface beyond the existing redacted observation fields.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #51387
  • Related #51308
  • Related #51323

User-visible / Behavior Changes

  • Embedded agent failures now distinguish common transport failures such as connection refused, DNS lookup failures, interrupted sockets, and generic network errors instead of flattening them all to "LLM request timed out."
  • Lifecycle console logging now appends the existing redacted rawErrorPreview for operator diagnosis.
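
A transport-specific classifier along these lines might look like the sketch below. The PR's real matchers and wording are not reproduced on this page, so `TRANSPORT_RULES`, `describeTransportError`, and every message string are illustrative assumptions.

```typescript
// Illustrative matchers and copy; the PR's real patterns and wording may differ.
type TransportRule = { pattern: RegExp; message: string };

const TRANSPORT_RULES: TransportRule[] = [
  { pattern: /ECONNREFUSED|connection refused/i, message: "LLM request failed: connection refused." },
  { pattern: /ENOTFOUND|EAI_AGAIN|getaddrinfo/i, message: "LLM request failed: DNS lookup failed." },
  { pattern: /ECONNRESET|ECONNABORTED|socket hang up/i, message: "LLM request failed: connection interrupted." },
  { pattern: /ENETUNREACH|EHOSTUNREACH|network error|fetch failed/i, message: "LLM request failed: network error." },
  { pattern: /ETIMEDOUT|timed out|timeout/i, message: "LLM request timed out." },
];

function describeTransportError(rawError: string): string | undefined {
  // First matching rule wins, so more specific rules are listed first.
  for (const rule of TRANSPORT_RULES) {
    if (rule.pattern.test(rawError)) return rule.message;
  }
  return undefined; // not a recognized transport failure
}
```

Returning `undefined` for unrecognized input lets a caller fall through to other classification paths instead of forcing a transport label.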

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm test wrapper
  • Model/provider: synthetic embedded-agent error fixtures
  • Integration/channel (if any): embedded agent lifecycle logging
  • Relevant config (redacted): none required

Steps

  1. Reproduce formatAssistantErrorText() with transport-style failures such as ECONNREFUSED, ENOTFOUND, and socket hang up.
  2. Run the embedded lifecycle handler with an assistant error and inspect the emitted lifecycle metadata.
  3. Run targeted tests covering formatter, sanitization, and lifecycle logging.

Expected

  • Connection-refused, DNS, and interrupted-connection failures get distinct safe copy.
  • Lifecycle console output preserves the redacted raw preview for diagnosis.

Actual

  • Before this patch, the same failures were normalized to "LLM request timed out." and the console line omitted the redacted raw preview.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • pnpm test -- src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts
    • pnpm test -- src/agents/pi-embedded-helpers.sanitizeuserfacingtext.test.ts
    • pnpm test -- src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts
    • pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
  • Edge cases checked:
    • rate limit and overload paths still take precedence over transport classification
    • lifecycle console output stays sanitized while appending redacted raw previews
  • What you did not verify:
    • full repo test suite
    • live provider traffic against a real external endpoint

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commit 9521f48b85456728519c3cf48a9e3ef7820dae1c
  • Files/config to restore:
    • src/agents/pi-embedded-helpers/errors.ts
    • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts
  • Known bad symptoms reviewers should watch for:
    • transport failures unexpectedly falling through to generic raw HTTP formatting
    • lifecycle console messages including unsanitized content

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: a transport-specific matcher could accidentally steal a more specific classification.
    • Mitigation: transport classification runs after rate-limit/overload classification, and targeted tests cover the new branches.
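
The precedence rule in this mitigation can be sketched as an ordered list of classifiers. The classifier names, patterns, and messages below are hypothetical. Only the ordering (rate limit and overload before transport) comes from the PR description.

```typescript
type Classifier = (rawError: string) => string | undefined;

// Hypothetical classifiers; patterns and copy are assumptions for illustration.
const classifyRateLimit: Classifier = (raw) =>
  /rate limit|\b429\b/i.test(raw) ? "LLM provider rate limit reached." : undefined;
const classifyOverload: Classifier = (raw) =>
  /overloaded/i.test(raw) ? "LLM provider is overloaded." : undefined;
const classifyTransport: Classifier = (raw) =>
  /ECONNRESET|socket hang up|fetch failed/i.test(raw)
    ? "LLM request failed: connection interrupted."
    : undefined;

// Order encodes precedence: transport runs last so it cannot shadow the others.
const CLASSIFIERS: Classifier[] = [classifyRateLimit, classifyOverload, classifyTransport];

function classifyError(rawError: string): string {
  for (const classify of CLASSIFIERS) {
    const message = classify(rawError);
    if (message !== undefined) return message;
  }
  return "LLM request failed.";
}
```

An error text that mentions both a rate limit and `fetch failed` resolves to the rate-limit message, which is the behavior the targeted tests are meant to protect.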

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts (modified, +21/-0)
  • src/agents/pi-embedded-helpers.sanitizeuserfacingtext.test.ts (modified, +8/-0)
  • src/agents/pi-embedded-helpers/errors.ts (modified, +60/-0)
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts (modified, +6/-4)
  • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts (modified, +3/-1)

Code Example

embedded run agent end: runId=... error=LLM request timed out. rawError=<original error>

---

{
  "models": {
    "providers": {
      "my-provider": {
        "requestTimeoutMs": 120000
      }
    }
  }
}

Problem

When an LLM request fails, formatAssistantErrorText() normalizes many different errors into a single generic message such as "LLM request timed out." The raw error message is never logged, which makes the actual failure impossible to diagnose.

Error patterns that all map to "LLM request timed out"

The ERROR_PATTERNS.timeout array matches 15+ patterns:

  • timeout, timed out
  • service unavailable
  • connection error, network error
  • fetch failed, socket hang up
  • ECONNREFUSED, ECONNRESET, ECONNABORTED
  • ETIMEDOUT, ENETUNREACH, EHOSTUNREACH
  • And more...

These represent very different failure modes (real timeout vs. connection refused vs. network error), but users and operators only see "LLM request timed out."
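
To make the collapse concrete, here is a minimal sketch using a subset of the patterns listed above. The real ERROR_PATTERNS.timeout array and its matching logic are not reproduced on this page, so `TIMEOUT_PATTERNS` and `normalizeError` are simplified stand-ins.

```typescript
// Subset of the patterns listed above; the real array has more entries.
const TIMEOUT_PATTERNS: RegExp[] = [
  /timeout|timed out/i,
  /service unavailable/i,
  /connection error|network error/i,
  /fetch failed|socket hang up/i,
  /ECONNREFUSED|ECONNRESET|ECONNABORTED/i,
  /ETIMEDOUT|ENETUNREACH|EHOSTUNREACH/i,
];

// Simplified stand-in for the over-normalizing behavior described in the issue.
function normalizeError(rawError: string): string {
  return TIMEOUT_PATTERNS.some((p) => p.test(rawError))
    ? "LLM request timed out."
    : rawError;
}

const samples = [
  "connect ECONNREFUSED 10.0.0.5:443",
  "socket hang up",
  "request timed out after 30000ms",
];
// Every sample maps to the same text, so the root cause is lost.
const normalized = samples.map(normalizeError);
```

A refused connection, a dropped socket, and a genuine timeout all come out as the same string, which matches the 0.4-second "timeouts" reported in the Impact section.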

Impact

In our deployment using a custom provider (custom-idealab-alibaba-inc-com), we see frequent "LLM request timed out" errors in gateway logs. Some occur within 0.4 seconds of the request starting — clearly not a 30-second timeout. Without the raw error, we cannot determine whether the issue is:

  • An actual timeout
  • A connection reset
  • A DNS failure
  • A TLS error
  • The request being aborted by something else

Where the raw error is lost

In handleAgentEnd(), the error flows through:

  1. lastAssistant.errorMessage (raw) → formatAssistantErrorText() → safeErrorText (normalized)
  2. safeErrorText is what gets logged via consoleMessage
  3. buildApiErrorObservationFields() further redacts the raw error

The raw errorMessage is never emitted to any log output.
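
The flow can be modeled with a simplified sketch. The function and field names follow the description above, but the bodies are stand-ins written for illustration, not the project's code, and the `AssistantMessage` shape is an assumption.

```typescript
// Assumed shape for illustration; the real type has more fields.
interface AssistantMessage {
  stopReason: string;
  errorMessage?: string;
}

// Stand-in for the real formatter: collapses transport errors to one message.
function formatAssistantErrorText(raw: string): string {
  return /ECONNREFUSED|ECONNRESET|socket hang up|timed out/i.test(raw)
    ? "LLM request timed out."
    : "LLM request failed.";
}

// Simplified model of the pre-fix flow; returns the line that would be logged.
function handleAgentEnd(runId: string, lastAssistant: AssistantMessage): string {
  const safeErrorText = formatAssistantErrorText(lastAssistant.errorMessage ?? "");
  // Only the normalized text is used from here on; the raw message is dropped.
  return `embedded run agent end: runId=${runId} error=${safeErrorText}`;
}

const consoleMessage = handleAgentEnd("run-1", {
  stopReason: "error",
  errorMessage: "connect ECONNREFUSED 10.0.0.5:443",
});
```

In this model `consoleMessage` contains no trace of `ECONNREFUSED`, which is the diagnostic gap the issue reports.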

Suggested fix

  1. Log the raw error alongside the formatted one — at minimum in debug/warn level:

    embedded run agent end: runId=... error=LLM request timed out. rawError=<original error>
  2. Consider differentiating error categories — instead of mapping everything to "timed out", use distinct user-facing messages:

    • "LLM request timed out (no response within Xs)"
    • "LLM request failed: connection error"
    • "LLM request failed: service unavailable"
  3. Make the LLM request timeout configurable per provider in openclaw.json:

    {
      "models": {
        "providers": {
          "my-provider": {
            "requestTimeoutMs": 120000
          }
        }
      }
    }

    Currently the timeout is hardcoded at 30 seconds (3e4 in GatewayClient constructor).

Environment

  • OpenClaw version: 2026.3.13
  • Provider: custom Anthropic-compatible proxy (anthropic-messages API)
  • Model: claude-opus-4-6
  • Channel: DingTalk

Extended analysis

Fix Plan

To address the issue, we will implement the following steps:

  • Log the raw error alongside the formatted one
  • Differentiate error categories
  • Make the LLM request timeout configurable

Step 1: Log Raw Error

Modify the handleAgentEnd() function to log the raw error:

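// Sketch: pass the raw text through the existing redaction before logging it in production.
console.log(`embedded run agent end: runId=${runId} error=${safeErrorText} rawError=${lastAssistant?.errorMessage ?? ''}`);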

Step 2: Differentiate Error Categories

Update the formatAssistantErrorText() function to use distinct user-facing messages:

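function formatAssistantErrorText(errorMessage) {
  // Compare case-insensitively and tolerate a missing message.
  const text = (errorMessage ?? '').toLowerCase();
  // "Xs" is a placeholder for the configured timeout in seconds.
  if (text.includes('timeout') || text.includes('timed out') || text.includes('etimedout')) {
    return 'LLM request timed out (no response within Xs)';
  } else if (text.includes('connection error') || text.includes('network error')) {
    return 'LLM request failed: connection error';
  } else if (text.includes('service unavailable')) {
    return 'LLM request failed: service unavailable';
  } else {
    return 'LLM request failed: unknown error';
  }
}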

Step 3: Make LLM Request Timeout Configurable

Add a requestTimeoutMs field to the provider configuration in openclaw.json:

{
  "models": {
    "providers": {
      "my-provider": {
        "requestTimeoutMs": 120000
      }
    }
  }
}

Then, update the GatewayClient constructor to use the configurable timeout:

function GatewayClient(provider) {
  // Use ?? so that only a missing value falls back to the 30-second default.
  const requestTimeoutMs = provider?.requestTimeoutMs ?? 30000;
  // ...
}
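
A slightly more defensive variant reads the value from the parsed config and rejects invalid numbers. Note that requestTimeoutMs is the field proposed in the issue, and neither PR above claims to have added it. The config types and the `resolveRequestTimeoutMs` helper below are hypothetical.

```typescript
// Hypothetical config types mirroring the JSON shape proposed in the issue.
interface ProviderConfig {
  requestTimeoutMs?: number;
}
interface OpenClawConfig {
  models?: { providers?: Record<string, ProviderConfig> };
}

// Matches the 30-second default cited in the issue.
const DEFAULT_REQUEST_TIMEOUT_MS = 30000;

// Hypothetical resolver: falls back to the default for missing or invalid values.
function resolveRequestTimeoutMs(config: OpenClawConfig, providerId: string): number {
  const value = config.models?.providers?.[providerId]?.requestTimeoutMs;
  return typeof value === "number" && Number.isFinite(value) && value > 0
    ? value
    : DEFAULT_REQUEST_TIMEOUT_MS;
}
```

Validating the value guards against a typo such as a negative number or a string silently disabling the timeout.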

Verification

To verify the fix, check the logs for the raw error messages and ensure that the error categories are correctly differentiated. Additionally, test the configurable timeout by setting different values in openclaw.json and verifying that the timeout is applied correctly.

Extra Tips

  • Consider adding more error categories and user-facing messages to improve error handling and diagnosis.
  • Review the ERROR_PATTERNS array to ensure that it covers all possible error scenarios.
  • Test the fix thoroughly to ensure that it does not introduce any new issues or regressions.
