openclaw - ✅ (Solved) Fix [Feature]: Capture upstream LLM request-id/trace-id in diagnostic events and error logs [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#65787 (fetched 2026-04-14 05:40:18)
Comments: 1 · Participants: 2 · Timeline events: 2 (commented ×1, cross-referenced ×1) · Reactions: 0

Capture upstream LLM API response headers (request-id, x-request-id, trace-id) and propagate them through diagnostic events, OTEL spans, and error logs.

Error Message

lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

Root Cause

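Upstream LLM providers return request-id / x-request-id / trace-id response headers, but OpenClaw discards them: they are available in buildManagedResponse() yet are never recorded in logs, attached to FailoverError, included in DiagnosticUsageEvent, or exposed to OTEL spans and plugin diagnostic listeners.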

Fix Action

Fixed

PR fix notes

PR #66286: feat: surface upstream request ids in diagnostics

Description (problem / solution / changelog)

Summary

  • extract upstream request ids from OpenAI and Anthropic transport responses
  • thread the id through failover metadata, embedded agent metadata, and model.usage diagnostic events
  • record openclaw.upstream_request_id on diagnostics-otel spans and add regression coverage for transport/error/diagnostic paths
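The last bullet can be pictured with a minimal sketch. It assumes a simplified model.usage event shape and a stand-in span interface; `ModelUsageEvent`, `SpanLike`, and `recordUsageOnSpan` are hypothetical names used for illustration only, not the code from the PR. Only the attribute key `openclaw.upstream_request_id` comes from the summary above.

```typescript
// Hypothetical event shape; the real DiagnosticUsageEvent may carry more fields.
interface ModelUsageEvent {
  type: "model.usage";
  model: string;
  upstreamRequestId?: string; // optional, so existing listeners keep working
}

// Minimal stand-in for an OTEL span; only setAttribute is modelled here.
interface SpanLike {
  setAttribute(key: string, value: string): void;
}

// Record the upstream id on the span only when the provider supplied one.
function recordUsageOnSpan(span: SpanLike, event: ModelUsageEvent): void {
  if (event.upstreamRequestId) {
    span.setAttribute("openclaw.upstream_request_id", event.upstreamRequestId);
  }
}

// Usage with an in-memory span that just collects attributes.
const attributes: Record<string, string> = {};
const fakeSpan: SpanLike = {
  setAttribute: (key, value) => {
    attributes[key] = value;
  },
};
recordUsageOnSpan(fakeSpan, {
  type: "model.usage",
  model: "example-model",
  upstreamRequestId: "req_example",
});
console.log(attributes);
```

Keeping the field optional means events emitted without an id leave the span untouched, which matches the additive intent of the change.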

Closes #65787.

AI Assistance

  • AI-assisted: Codex
  • Testing: fully tested locally on targeted lanes
<details> <summary>Before</summary>
$ node --import tsx -e "import { describeFailoverError } from './src/agents/failover-error.js'; const described=describeFailoverError({status:429,message:'request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry'}); console.log(JSON.stringify(described, null, 2));"
{
  "message": "request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry",
  "reason": "rate_limit",
  "status": 429
}
</details> <details> <summary>After</summary>
$ node --import tsx -e "import { describeFailoverError } from './src/agents/failover-error.js'; const described=describeFailoverError({status:429,message:'request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry'}); console.log(JSON.stringify(described, null, 2));"
{
  "message": "request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry",
  "reason": "rate_limit",
  "status": 429,
  "upstreamRequestId": "20260303141547610b7f574d1b44cb"
}

$ node scripts/run-vitest.mjs run --config test/vitest/vitest.unit.config.ts src/agents/anthropic-transport-stream.test.ts src/agents/openai-transport-stream.test.ts src/agents/openai-transport-stream.request-id.test.ts src/agents/failover-error.test.ts src/agents/transport-stream-shared.test.ts src/agents/pi-embedded-runner/run/helpers.test.ts src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts src/infra/diagnostic-events.test.ts
EXIT:0

$ node scripts/run-vitest.mjs run --config test/vitest/vitest.extensions.config.ts extensions/diagnostics-otel/src/service.test.ts
EXIT:0
</details>
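The Before/After output suggests that the id can be recovered from the error message text when no header is available. How the PR actually does this is not shown here; the following is a minimal sketch that assumes a simple regular expression, with `REQUEST_ID_PATTERN` and `extractUpstreamRequestId` as hypothetical names.

```typescript
// Assumed pattern: "request_id: <token>" or "request-id=<token>" inside the message.
const REQUEST_ID_PATTERN = /\brequest[_-]id\s*[:=]\s*([A-Za-z0-9_-]+)/i;

// Return the first id-looking token that follows a request-id label, if any.
function extractUpstreamRequestId(message: string): string | undefined {
  const match = REQUEST_ID_PATTERN.exec(message);
  return match?.[1];
}

const sampleMessage =
  "request failed: RESOURCE_EXHAUSTED (request_id: 20260303141547610b7f574d1b44cb) please retry";
console.log(extractUpstreamRequestId(sampleMessage));
console.log(extractUpstreamRequestId("LLM request timed out."));
```

The first call yields the id from the sample message; the second yields `undefined` because the timeout message carries no id.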

Test plan

  • targeted transport/failover/diagnostic unit tests
  • targeted diagnostics-otel extension tests

Changed files

  • extensions/diagnostics-otel/src/service.test.ts (modified, +29/-0)
  • extensions/diagnostics-otel/src/service.ts (modified, +3/-0)
  • src/agents/anthropic-transport-stream.test.ts (modified, +54/-0)
  • src/agents/anthropic-transport-stream.ts (modified, +34/-3)
  • src/agents/failover-error.test.ts (modified, +28/-0)
  • src/agents/failover-error.ts (modified, +71/-0)
  • src/agents/openai-transport-stream.request-id.test.ts (added, +189/-0)
  • src/agents/openai-transport-stream.ts (modified, +93/-7)
  • src/agents/pi-embedded-runner/run.ts (modified, +4/-0)
  • src/agents/pi-embedded-runner/run/helpers.test.ts (modified, +33/-1)
  • src/agents/pi-embedded-runner/run/helpers.ts (modified, +4/-1)
  • src/agents/pi-embedded-runner/types.ts (modified, +1/-0)
  • src/agents/transport-stream-shared.test.ts (modified, +26/-0)
  • src/agents/transport-stream-shared.ts (modified, +49/-0)
  • src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts (modified, +45/-2)
  • src/auto-reply/reply/agent-runner.ts (modified, +4/-0)
  • src/infra/diagnostic-events.test.ts (modified, +17/-0)
  • src/infra/diagnostic-events.ts (modified, +1/-0)

Code Example

lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

---

13:40:10 error diagnostic lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

---

13:40:10 error diagnostic lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out." upstreamRequestId=req_01J5K...

Summary

Capture upstream LLM API response headers (request-id, x-request-id, trace-id) and propagate them through diagnostic events, OTEL spans, and error logs.

Problem to solve

When an LLM request times out or fails, the lane task error log only contains:

lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

Upstream LLM providers (Anthropic, OpenAI, etc.) return request-id / x-request-id / trace-id headers that are essential for cross-referencing failures with provider-side diagnostics. Currently these headers are available in buildManagedResponse() but discarded — they are not recorded in logs, not attached to FailoverError, not included in DiagnosticUsageEvent, and not exposed to OTEL spans or plugin diagnostic listeners.

This makes it very difficult to troubleshoot intermittent timeouts or provider-side errors, since there is no correlation key to take to the provider's support/dashboard.

Proposed solution

  1. Provider transport streams (anthropic-transport-stream.ts, openai-transport-stream.ts, etc.): extract request-id / x-request-id from SDK response headers and attach to the output metadata.
  2. FailoverError: add an optional upstreamRequestId field so timeout/failure logs carry the correlation ID.
  3. DiagnosticUsageEvent (model.usage): add an optional upstreamRequestId field.
  4. agent-runner.ts: pass the captured ID when emitting model.usage events.
  5. Error logging (command-queue.ts): include upstreamRequestId in lane task error when available.
  6. diagnostics-otel: record the ID as a span attribute on openclaw.model.usage.

This approach requires no changes to the Plugin SDK transport abstraction, StreamFn signature, or buildGuardedModelFetch return type — it stays within each provider's stream handler and the existing diagnostic event pipeline.
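Steps 2 and 5 can be sketched together. The class and formatter below are illustrative assumptions: `FailoverErrorSketch` and `formatLaneTaskError` are hypothetical names, and the real FailoverError and command-queue.ts logging may be structured differently. Only the log line layout is taken from the issue.

```typescript
// Hypothetical error class carrying the optional correlation id.
class FailoverErrorSketch extends Error {
  readonly upstreamRequestId?: string;

  constructor(message: string, options: { upstreamRequestId?: string } = {}) {
    super(message);
    this.name = "FailoverError";
    this.upstreamRequestId = options.upstreamRequestId;
  }
}

// Build the lane task error line, appending the id only when it is known.
function formatLaneTaskError(
  lane: string,
  durationMs: number,
  error: Error & { upstreamRequestId?: string },
): string {
  const base = `lane task error: lane=${lane} durationMs=${durationMs} error="${error.name}: ${error.message}"`;
  return error.upstreamRequestId
    ? `${base} upstreamRequestId=${error.upstreamRequestId}`
    : base;
}

const withId = new FailoverErrorSketch("LLM request timed out.", {
  upstreamRequestId: "req_example",
});
console.log(formatLaneTaskError("main", 25736, withId));
console.log(formatLaneTaskError("main", 25736, new FailoverErrorSketch("LLM request timed out.")));
```

When the id is absent the line is identical to the current log format, so existing log parsers would not be affected.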

Alternatives considered

  • Intercepting at buildGuardedModelFetch level: would require changing the fetch wrapper return type from standard Response to a custom object, breaking the contract with Anthropic/OpenAI SDKs that expect a standard fetch response.
  • Exposing raw HTTP Response to plugins via SDK: high-cost change that breaks the intentional transport abstraction boundary. The diagnostic event approach gives plugins the correlation ID without leaking HTTP internals.
  • Plugin-side interception: third-party plugins currently cannot access raw HTTP responses by design. Even provider plugins only get the abstract StreamFn, not transport details.

Impact

  • Affected: anyone debugging LLM request failures, timeouts, or rate limiting — operators, self-hosters, and plugin authors using OTEL
  • Severity: medium — not a blocker, but significantly increases debugging time
  • Frequency: every time an LLM request fails or times out (intermittent but recurring)
  • Consequence: without upstream request IDs, operators cannot correlate OpenClaw errors with provider-side logs/dashboards, leading to blind troubleshooting and slower incident resolution

Evidence/examples

Current error log with no upstream correlation:

13:40:10 error diagnostic lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out."

Desired error log:

13:40:10 error diagnostic lane task error: lane=main durationMs=25736 error="FailoverError: LLM request timed out." upstreamRequestId=req_01J5K...

Anthropic API returns request-id header; OpenAI returns x-request-id. Both are accessible from their respective SDK response objects (stream.response.headers).

Additional information

  • The minimax-vlm.ts file already demonstrates the pattern of extracting Trace-Id from response headers for error messages — this proposal generalizes that approach across all providers.
  • Change is additive (all new fields are optional) and backward-compatible with existing plugin diagnostic listeners.
  • Provider-specific header names can be mapped to a single canonical upstreamRequestId field.
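The mapping from provider-specific headers to one canonical field might look like the sketch below. It assumes the SDK exposes a standard fetch `Headers` object (available globally in Node 18+); the header names come from the issue, while their priority order and the names `REQUEST_ID_HEADERS` and `readUpstreamRequestId` are assumptions for illustration.

```typescript
// Candidate header names; the priority order is an assumption.
const REQUEST_ID_HEADERS = ["request-id", "x-request-id", "trace-id"] as const;

// Return the first non-empty candidate header as the canonical upstreamRequestId.
function readUpstreamRequestId(headers: Headers): string | undefined {
  for (const name of REQUEST_ID_HEADERS) {
    const value = headers.get(name)?.trim();
    if (value) return value;
  }
  return undefined;
}

// Header lookups are case-insensitive, so "Trace-Id" is found via "trace-id".
const anthropicLike = new Headers({ "request-id": "req_example_a" });
const openaiLike = new Headers({ "x-request-id": "req_example_b" });
const traceLike = new Headers({ "Trace-Id": "trace_example_c" });
console.log(
  readUpstreamRequestId(anthropicLike),
  readUpstreamRequestId(openaiLike),
  readUpstreamRequestId(traceLike),
);
```

A single ordered list keeps the provider stream handlers free of per-provider branching, at the cost of fixing one precedence when a response carries more than one of these headers.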

Extended analysis

TL;DR

Extract and propagate upstream LLM API response headers (request-id, x-request-id, trace-id) through diagnostic events, OTEL spans, and error logs to enable correlation with provider-side diagnostics.

Guidance

  • Modify provider transport streams (e.g., anthropic-transport-stream.ts, openai-transport-stream.ts) to extract request-id / x-request-id from SDK response headers and attach to output metadata.
  • Update FailoverError to include an optional upstreamRequestId field for timeout/failure logs.
  • Add an optional upstreamRequestId field to DiagnosticUsageEvent and pass the captured ID when emitting model.usage events in agent-runner.ts.
  • Include upstreamRequestId in lane task error logs when available, and record it as a span attribute on openclaw.model.usage in diagnostics-otel.

Example

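// In anthropic-transport-stream.ts
// Sketch: assumes stream.response is a standard fetch Response, whose Headers
// object is read with .get() (bracket access would return undefined).
const requestId = stream.response.headers.get('request-id') ?? undefined;
// Attach requestId to output metadata

// In failover-error.ts
interface FailoverError {
  // ...
  upstreamRequestId?: string; // optional, so existing callers are unaffected
}

// In agent-runner.ts
// Sketch: the exact way DiagnosticUsageEvent is constructed is an assumption.
const usageEvent = new DiagnosticUsageEvent({
  // ...
  upstreamRequestId: capturedRequestId,
});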

Notes

This solution assumes that the request-id / x-request-id headers are accessible from the SDK response objects. The proposed changes are additive and backward-compatible with existing plugin diagnostic listeners.

Recommendation

Apply the proposed change to extract and propagate upstream request IDs from LLM API response headers; it gives operators a concrete correlation key for troubleshooting intermittent timeouts and provider-side errors. PR #66286 implements this and closes the issue.
