openclaw - ✅(Solved) Fix [Feature]: Capture upstream LLM request-id/trace-id in diagnostic events and error logs [1 pull requests, 1 comments, 2 participants]

openclaw • 2026-04-13 07:42:54

GitHub stats

openclaw/openclaw#65787•Fetched 2026-04-14 05:40:18

Author

eightHundreds

Participants

eightHundreds

Lidang-Jiang

Timeline (top)

commented ×1, cross-referenced ×1

Capture upstream LLM API response headers (request-id, x-request-id, trace-id) and propagate them through diagnostic events, OTEL spans, and error logs.

Error Message

lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

Root Cause

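Upstream LLM providers return request-id / x-request-id / trace-id response headers. These headers are available in buildManagedResponse() but are discarded: they are not recorded in logs, not attached to FailoverError, not included in DiagnosticUsageEvent, and not exposed to OTEL spans or plugin diagnostic listeners.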

Fix Action

Fix proposed (the linked PR was open and not merged when this page was fetched)

Proposed fix in PR: feat: surface upstream request ids in diagnostics (https://github.com/openclaw/openclaw/pull/66286)

PR fix notes

PR #66286: feat: surface upstream request ids in diagnostics

Repository: openclaw/openclaw
Author: Lidang-Jiang
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/66286

Description (problem / solution / changelog)

Summary

extract upstream request ids from OpenAI and Anthropic transport responses
thread the id through failover metadata, embedded agent metadata, and model.usage diagnostic events
record openclaw.upstream_request_id on diagnostics-otel spans and add regression coverage for transport/error/diagnostic paths

Closes #65787.
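The span-attribute step in the summary can be sketched as follows. This is a minimal illustration, not the code from the PR: `SpanLike` is a hypothetical structural stand-in for an OpenTelemetry span, and `recordUpstreamRequestId` is an assumed helper name. Only the attribute key `openclaw.upstream_request_id` comes from the PR description.

```typescript
// Hypothetical stand-in for an OpenTelemetry span; only setAttribute is used here.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
}

// Assumed helper name. Records the ID only when one was captured,
// so spans for requests without an upstream ID stay unchanged.
function recordUpstreamRequestId(span: SpanLike, upstreamRequestId?: string): void {
  if (upstreamRequestId) {
    span.setAttribute("openclaw.upstream_request_id", upstreamRequestId);
  }
}

// Usage with a fake span that stores attributes in a Map.
const recordedAttributes = new Map<string, string | number | boolean>();
const fakeSpan: SpanLike = {
  setAttribute(key, value) {
    recordedAttributes.set(key, value);
    return fakeSpan;
  },
};

recordUpstreamRequestId(fakeSpan, "req_example"); // "req_example" is a placeholder ID
recordUpstreamRequestId(fakeSpan, undefined); // no-op: nothing to record
```

Keeping the attribute optional means existing dashboards that do not know about the key are unaffected.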

AI Assistance

AI-assisted: Codex
Testing: fully tested locally on targeted lanes

<details> <summary>Before</summary>

$ node --import tsx -e "import { describeFailoverError } from './src/agents/failover-error.js'; const described=describeFailoverError({status:429,message:'request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry'}); console.log(JSON.stringify(described, null, 2));"
{
  "message": "request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry",
  "reason": "rate_limit",
  "status": 429
}

</details> <details> <summary>After</summary>

$ node --import tsx -e "import { describeFailoverError } from './src/agents/failover-error.js'; const described=describeFailoverError({status:429,message:'request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry'}); console.log(JSON.stringify(described, null, 2));"
{
  "message": "request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry",
  "reason": "rate_limit",
  "status": 429,
  "upstreamRequestId": "20260303141547610b7f574d1b44cb"
}

$ node scripts/run-vitest.mjs run --config test/vitest/vitest.unit.config.ts src/agents/anthropic-transport-stream.test.ts src/agents/openai-transport-stream.test.ts src/agents/openai-transport-stream.request-id.test.ts src/agents/failover-error.test.ts src/agents/transport-stream-shared.test.ts src/agents/pi-embedded-runner/run/helpers.test.ts src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts src/infra/diagnostic-events.test.ts
EXIT:0

$ node scripts/run-vitest.mjs run --config test/vitest/vitest.extensions.config.ts extensions/diagnostics-otel/src/service.test.ts
EXIT:0

</details>
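The Before/After output above suggests that the request ID can be recovered from the error message text when the provider embeds it there. A rough sketch of that idea, assuming a simple regular expression; the helper name and pattern are illustrative and are not taken from failover-error.ts:

```typescript
// Hypothetical pattern: matches "request_id: <id>", "request-id=<id>", "requestId: <id>", etc.
const REQUEST_ID_PATTERN = /\brequest[_-]?id[:=]\s*([A-Za-z0-9_-]+)/i;

// Assumed helper name; the real describeFailoverError logic may differ.
function extractRequestIdFromMessage(message: string): string | undefined {
  const match = REQUEST_ID_PATTERN.exec(message);
  return match ? match[1] : undefined;
}

const sampleMessage =
  "request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry";

console.log(extractRequestIdFromMessage(sampleMessage));
// prints "20260303141547610b7f574d1b44cb"
```

Message parsing is a fallback at best; when response headers are available they are the more reliable source for the ID.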

Test plan

targeted transport/failover/diagnostic unit tests
targeted diagnostics-otel extension tests

Changed files

extensions/diagnostics-otel/src/service.test.ts (modified, +29/-0)
extensions/diagnostics-otel/src/service.ts (modified, +3/-0)
src/agents/anthropic-transport-stream.test.ts (modified, +54/-0)
src/agents/anthropic-transport-stream.ts (modified, +34/-3)
src/agents/failover-error.test.ts (modified, +28/-0)
src/agents/failover-error.ts (modified, +71/-0)
src/agents/openai-transport-stream.request-id.test.ts (added, +189/-0)
src/agents/openai-transport-stream.ts (modified, +93/-7)
src/agents/pi-embedded-runner/run.ts (modified, +4/-0)
src/agents/pi-embedded-runner/run/helpers.test.ts (modified, +33/-1)
src/agents/pi-embedded-runner/run/helpers.ts (modified, +4/-1)
src/agents/pi-embedded-runner/types.ts (modified, +1/-0)
src/agents/transport-stream-shared.test.ts (modified, +26/-0)
src/agents/transport-stream-shared.ts (modified, +49/-0)
src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts (modified, +45/-2)
src/auto-reply/reply/agent-runner.ts (modified, +4/-0)
src/infra/diagnostic-events.test.ts (modified, +17/-0)
src/infra/diagnostic-events.ts (modified, +1/-0)

Code Example

lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

---

13:40:10 error diagnostic lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

---

13:40:10 error diagnostic lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out." upstreamRequestId=req_01J5K...


Summary

Capture upstream LLM API response headers (request-id, x-request-id, trace-id) and propagate them through diagnostic events, OTEL spans, and error logs.

Problem to solve

When an LLM request times out or fails, the lane task error log only contains:

lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

Upstream LLM providers (Anthropic, OpenAI, etc.) return request-id / x-request-id / trace-id headers that are essential for cross-referencing failures with provider-side diagnostics. Currently these headers are available in buildManagedResponse() but discarded — they are not recorded in logs, not attached to FailoverError, not included in DiagnosticUsageEvent, and not exposed to OTEL spans or plugin diagnostic listeners.

This makes it very difficult to troubleshoot intermittent timeouts or provider-side errors, since there is no correlation key to take to the provider's support/dashboard.

Proposed solution

Provider transport streams (anthropic-transport-stream.ts, openai-transport-stream.ts, etc.): extract request-id / x-request-id from SDK response headers and attach to the output metadata.
FailoverError: add an optional upstreamRequestId field so timeout/failure logs carry the correlation ID.
DiagnosticUsageEvent (model.usage): add an optional upstreamRequestId field.
agent-runner.ts: pass the captured ID when emitting model.usage events.
Error logging (command-queue.ts): include upstreamRequestId in lane task error when available.
diagnostics-otel: record the ID as a span attribute on openclaw.model.usage.

This approach requires no changes to the Plugin SDK transport abstraction, StreamFn signature, or buildGuardedModelFetch return type — it stays within each provider's stream handler and the existing diagnostic event pipeline.
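The FailoverError and error-logging steps of the proposal could look roughly like this. It is a sketch under stated assumptions: the class shape, the options argument, and `formatLaneTaskError` are hypothetical and do not reproduce OpenClaw's actual failover-error.ts or command-queue.ts.

```typescript
// Assumed class shape; the real FailoverError carries more fields.
class FailoverError extends Error {
  readonly upstreamRequestId?: string;

  constructor(message: string, options: { upstreamRequestId?: string } = {}) {
    super(message);
    this.name = "FailoverError";
    this.upstreamRequestId = options.upstreamRequestId;
  }
}

// Hypothetical formatter mirroring the "lane task error" log line from the issue.
// The ID is read structurally so any error object carrying the field is supported.
function formatLaneTaskError(lane: string, durationMs: number, error: Error): string {
  const base = `lane task error: lane=${lane} durationMs=${durationMs} error="${error.name}: ${error.message}"`;
  const id = (error as { upstreamRequestId?: unknown }).upstreamRequestId;
  return typeof id === "string" && id.length > 0 ? `${base} upstreamRequestId=${id}` : base;
}

// "req_example" is a placeholder ID for illustration.
const withId = new FailoverError("LLM request timed out.", { upstreamRequestId: "req_example" });
const withoutId = new FailoverError("LLM request timed out.");

console.log(formatLaneTaskError("main", 25736, withId));
console.log(formatLaneTaskError("main", 25736, withoutId));
```

Because the field is optional, the log line for errors without a captured ID is identical to the current format.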

Alternatives considered

Intercepting at buildGuardedModelFetch level: would require changing the fetch wrapper return type from standard Response to a custom object, breaking the contract with Anthropic/OpenAI SDKs that expect a standard fetch response.
Exposing raw HTTP Response to plugins via SDK: high-cost change that breaks the intentional transport abstraction boundary. The diagnostic event approach gives plugins the correlation ID without leaking HTTP internals.
Plugin-side interception: third-party plugins currently cannot access raw HTTP responses by design. Even provider plugins only get the abstract StreamFn, not transport details.

Impact

Affected: anyone debugging LLM request failures, timeouts, or rate limiting — operators, self-hosters, and plugin authors using OTEL
Severity: medium — not a blocker, but significantly increases debugging time
Frequency: every time an LLM request fails or times out (intermittent but recurring)
Consequence: without upstream request IDs, operators cannot correlate OpenClaw errors with provider-side logs/dashboards, leading to blind troubleshooting and slower incident resolution

Evidence/examples

Current error log with no upstream correlation:

13:40:10 error diagnostic lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

Desired error log:

13:40:10 error diagnostic lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out." upstreamRequestId=req_01J5K...

Anthropic API returns request-id header; OpenAI returns x-request-id. Both are accessible from their respective SDK response objects (stream.response.headers).
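A possible way to read these headers into one canonical value is sketched below. The header names come from the issue text; the lookup order, the constant, and the helper name are assumptions. The sketch relies on the standard Fetch `Headers` class, whose `get()` lookup is case-insensitive.

```typescript
// Header names from the issue; the priority order is an assumption.
const UPSTREAM_REQUEST_ID_HEADERS = ["request-id", "x-request-id", "trace-id"] as const;

// Assumed helper name. Returns the first non-empty header value, or undefined.
function readUpstreamRequestId(headers: Headers): string | undefined {
  for (const name of UPSTREAM_REQUEST_ID_HEADERS) {
    const value = headers.get(name)?.trim();
    if (value) return value;
  }
  return undefined;
}

// Placeholder IDs for illustration only.
const anthropicStyle = new Headers({ "request-id": "req_example_a" });
const openaiStyle = new Headers({ "X-Request-Id": "req_example_b" });

console.log(readUpstreamRequestId(anthropicStyle)); // prints "req_example_a"
console.log(readUpstreamRequestId(openaiStyle)); // prints "req_example_b"
console.log(readUpstreamRequestId(new Headers())); // prints undefined
```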

Additional information

The minimax-vlm.ts file already demonstrates the pattern of extracting Trace-Id from response headers for error messages — this proposal generalizes that approach across all providers.
Change is additive (all new fields are optional) and backward-compatible with existing plugin diagnostic listeners.
Provider-specific header names can be mapped to a single canonical upstreamRequestId field.
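To illustrate why an optional field is backward-compatible for diagnostic listeners, here is a small sketch. The event shape is deliberately simplified and the emitter is hypothetical; the real DiagnosticUsageEvent and event pipeline in OpenClaw contain more than is shown here.

```typescript
// Simplified, assumed event shape with the new optional field.
interface DiagnosticUsageEvent {
  type: "model.usage";
  provider: string;
  upstreamRequestId?: string;
}

type UsageListener = (event: DiagnosticUsageEvent) => void;

// Hypothetical in-memory emitter standing in for the diagnostic event pipeline.
const usageListeners: UsageListener[] = [];
function emitUsage(event: DiagnosticUsageEvent): void {
  for (const listener of usageListeners) listener(event);
}

// A listener written before the field existed keeps working unchanged.
const seenProviders: string[] = [];
usageListeners.push((event) => {
  seenProviders.push(event.provider);
});

// A newer listener opts in to the correlation ID when it is present.
const seenRequestIds: string[] = [];
usageListeners.push((event) => {
  if (event.upstreamRequestId) seenRequestIds.push(event.upstreamRequestId);
});

emitUsage({ type: "model.usage", provider: "anthropic", upstreamRequestId: "req_example" });
emitUsage({ type: "model.usage", provider: "openai" }); // no ID captured for this request
```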

extent analysis

TL;DR

Extract and propagate upstream LLM API response headers (request-id, x-request-id, trace-id) through diagnostic events, OTEL spans, and error logs to enable correlation with provider-side diagnostics.

Guidance

Modify provider transport streams (e.g., anthropic-transport-stream.ts, openai-transport-stream.ts) to extract request-id / x-request-id from SDK response headers and attach to output metadata.
Update FailoverError to include an optional upstreamRequestId field for timeout/failure logs.
Add an optional upstreamRequestId field to DiagnosticUsageEvent and pass the captured ID when emitting model.usage events in agent-runner.ts.
Include upstreamRequestId in lane task error logs when available, and record it as a span attribute on openclaw.model.usage in diagnostics-otel.

Example

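// In anthropic-transport-stream.ts
// Response headers are a Fetch Headers object, so read them with .get()
// rather than bracket access (which would yield undefined).
const requestId = stream.response.headers.get('request-id');
// Attach requestId to output metadata

// In failover-error.ts
interface FailoverError {
  // ...
  upstreamRequestId?: string;
}

// In agent-runner.ts (constructor shape is illustrative)
const usageEvent = new DiagnosticUsageEvent({
  // ...
  upstreamRequestId: capturedRequestId,
});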

Notes

This solution assumes that the request-id / x-request-id headers are accessible from the SDK response objects. The proposed changes are additive and backward-compatible with existing plugin diagnostic listeners.

Recommendation

Apply the proposed change to extract and propagate upstream request IDs from LLM API response headers: it gives operators a correlation key to take to the provider when troubleshooting intermittent timeouts or provider-side errors.
