openclaw - 💡(How to fix) Fix [Bug]: diagnostics-otel spans don't share evt.trace.traceId — orphan root traces, broken Cloud Logging trace links [1 pull requests, 3 comments, 2 participants]

openclaw/openclaw#75174 · Fetched 2026-05-01 05:37:20

The bundled diagnostics-otel plugin creates OTel spans via tracer.startSpan(name, opts, undefined), falling back to context.active() for parent context. For non-HTTP-driven runs, and for any event dispatched through the async setImmediate drain in diagnostic-events.ts, context.active() is empty by the time the listener fires — so each span becomes a fresh root with its own trace_id, disconnected from evt.trace.traceId (which the same plugin correctly uses for log records). One logical run produces N orphan one-span traces instead of one trace with N child spans, and Cloud Logging entries that reference evt.trace.traceId resolve to 404 in Cloud Trace.


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

  1. Run OpenClaw 2026.4.24 with diagnostics.otel.enabled: true, traces: true, exporting to any OTLP-compatible collector that forwards to a backend with trace lookup (we use OTel Collector → GCP Cloud Trace).
  2. Trigger a run through any non-HTTP entrypoint (e.g. cron-fired Read HEARTBEAT.md self-prompt). An inbound HTTP request without a traceparent header reproduces the same behavior.
  3. Capture a run.started or run.completed log record emitted via recordLogRecord and note its trace.traceId field.
  4. Look up that trace_id in the trace backend.
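Step 2 depends on whether the inbound request carries a traceparent header, so it can help to build and check that header by hand when reproducing both cases. The following is a minimal sketch of the W3C Trace Context version 00 header format; buildTraceparent and parseTraceparent are hypothetical helper names and are not part of OpenClaw.

```javascript
const { randomBytes } = require("node:crypto");

// W3C Trace Context, version 00:
//   "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: build a sampled (flags = 01) traceparent value.
function buildTraceparent(
  traceId = randomBytes(16).toString("hex"),
  parentId = randomBytes(8).toString("hex"),
) {
  return `00-${traceId}-${parentId}-01`;
}

// Hypothetical helper: parse a header value. Returns null when the value is
// missing, malformed, or uses an all-zero ID (invalid per the spec).
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header ?? "");
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

console.log(buildTraceparent("72e161a7da11405afe23fe2cf4bcf16d", "00f067aa0ba902b7"));
```

Sending one request with such a header and one without gives a direct comparison between the correlated and the fragmented case.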

Expected behavior

The trace backend returns a single trace with openclaw.run plus any openclaw.model.call / openclaw.tool.execution children, all sharing evt.trace.traceId. This matches what recordLogRecord does for the same run via contextForTraceContext(evt.trace) — log records are correctly trace-correlated, so spans should be too.

Actual behavior

Trace backend returns 404 for evt.trace.traceId. The corresponding spans exist, but each lives in its own one-span trace under a fresh trace_id. Concrete example from a prod run on 2026-04-30T05:16Z:

  • Logged evt.trace.traceId: 72e161a7da11405afe23fe2cf4bcf16d → Cloud Trace 404
  • Five separate orphan traces from the same run, each one span:
    • ed49a96b4f235cfbd5cfccefde6a4489 → openclaw.run
    • d08ba1e878b25132f5e82ebb3e299b66 → openclaw.model.call
    • cae39655833069db39241ee1749d339d → openclaw.tool.execution
    • f1119dadeb2d687c946c671cc227c37a → openclaw.model.call
    • c48bb7cf625ff34f1b30017b06e5e8ff → openclaw.model.usage

24-hour fleet sample: 1391 run.started log records vs. 11 openclaw.run spans in Cloud Trace (the 11 are from inbound HTTP requests where an upstream traceparent happened to set context.active() before the bus emit).
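The fragmentation in the prod example can be checked mechanically by grouping exported spans by trace ID and comparing against the logged ID. This is only a sketch: the { traceId, name } record shape and the summarizeCorrelation helper are assumptions for illustration, and the IDs are the ones quoted above.

```javascript
// Trace ID taken from the log record (evt.trace.traceId).
const loggedTraceId = "72e161a7da11405afe23fe2cf4bcf16d";

// Spans as they might be exported from the trace backend (assumed shape).
const spans = [
  { traceId: "ed49a96b4f235cfbd5cfccefde6a4489", name: "openclaw.run" },
  { traceId: "d08ba1e878b25132f5e82ebb3e299b66", name: "openclaw.model.call" },
  { traceId: "cae39655833069db39241ee1749d339d", name: "openclaw.tool.execution" },
  { traceId: "f1119dadeb2d687c946c671cc227c37a", name: "openclaw.model.call" },
  { traceId: "c48bb7cf625ff34f1b30017b06e5e8ff", name: "openclaw.model.usage" },
];

// Hypothetical helper: group spans by trace ID and report how many share the
// logged ID versus how many sit alone in a one-span trace.
function summarizeCorrelation(expectedTraceId, spanRecords) {
  const byTrace = new Map();
  for (const { traceId, name } of spanRecords) {
    if (!byTrace.has(traceId)) byTrace.set(traceId, []);
    byTrace.get(traceId).push(name);
  }
  const orphanTraces = [...byTrace.entries()].filter(
    ([id, names]) => id !== expectedTraceId && names.length === 1,
  ).length;
  return {
    traceCount: byTrace.size,
    correlatedSpans: (byTrace.get(expectedTraceId) ?? []).length,
    orphanTraces,
  };
}

const summary = summarizeCorrelation(loggedTraceId, spans);
console.log(summary); // { traceCount: 5, correlatedSpans: 0, orphanTraces: 5 }
```

For a correctly correlated run the same check would report one trace, five correlated spans, and no orphans.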

OpenClaw version

2026.4.24

Operating system

Linux (Debian 12, Docker container on GCE)

Install method

ghcr.io/openclaw/openclaw:2026.4.24 base image

Additional provider/model setup details

Trace pipeline: OpenClaw diagnostics-otel → OTel Collector (host) at host.docker.internal:4318 (http/protobuf) → googlecloud exporter → Cloud Trace. Pipeline verified healthy with synthetic OTLP traces (round-trip <1 minute). Issue is at span creation, not export.

Logs, screenshots, and evidence

Source-level analysis from the bundled output of 2026.4.24:

1) diagnostic-events-Cz86_awm.js queues most event types via setImmediate:

const ASYNC_DIAGNOSTIC_EVENT_TYPES = new Set([
  "tool.execution.started",
  "tool.execution.completed",
  "tool.execution.error",
  "exec.process.completed",
  "model.call.started",
  "model.call.completed",
  "model.call.error",
  "log.record"
]);

scheduleAsyncDiagnosticDrain registers one setImmediate(...) per drain cycle; by the time the drain runs, the originating call stack's AsyncLocalStorage frame is gone.

2) extensions/diagnostics-otel/index.js, recordRunCompleted and friends:

const recordRunCompleted = (evt) => {
  // ...
  const span = spanWithDuration("openclaw.run", spanAttrs, evt.durationMs, { endTimeMs: evt.ts });
  // ...
};
const spanWithDuration = (name, attributes, durationMs, options = {}) => {
  // ...
  const parentContext = "parentContext" in options ? options.parentContext ?? void 0 : void 0;
  return tracer.startSpan(name, { /* ... */ }, parentContext);
};

parentContext is always undefined — the bus event's evt.trace is never passed in. tracer.startSpan(name, opts, undefined) falls back to context.active().

3) Same plugin, recordLogRecord — does the right thing for log records:

addTraceAttributes(attributes, evt.trace);
const logContext = contextForTraceContext(evt.trace);
if (logContext) logRecord.context = logContext;
otelLogger.emit(logRecord);

This asymmetry (logs use evt.trace, spans don't) is the bug.
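The difference between the two code paths can be illustrated with a toy model. Everything below is a stand-in and not the OpenTelemetry API: activeContext models context.active(), contextForTrace models contextForTraceContext(evt.trace), and toyTracer.startSpan imitates only the parent-resolution behavior described above (explicit context if given, otherwise the active one, otherwise a fresh trace ID).

```javascript
const { randomBytes } = require("node:crypto");

// Toy "context": undefined when empty, else an object carrying a traceId.
let activeContext; // stand-in for context.active(); empty inside the drain

// Stand-in for contextForTraceContext(evt.trace).
const contextForTrace = (trace) =>
  trace?.traceId ? { traceId: trace.traceId } : undefined;

const toyTracer = {
  // No explicit context => fall back to the active one; if that is empty
  // too, the span becomes a root with a newly generated trace ID.
  startSpan(name, _opts, ctx) {
    const resolved = ctx ?? activeContext;
    const traceId = resolved?.traceId ?? randomBytes(16).toString("hex");
    return { name, traceId, isRoot: resolved === undefined };
  },
};

const evt = { trace: { traceId: "72e161a7da11405afe23fe2cf4bcf16d" } };

// Span path as shipped: parent context is undefined.
const orphan = toyTracer.startSpan("openclaw.run", {}, undefined);

// Log-record path applied to spans: context rebuilt from evt.trace.
const correlated = toyTracer.startSpan("openclaw.run", {}, contextForTrace(evt.trace));

console.log(orphan.isRoot, orphan.traceId === evt.trace.traceId);
console.log(correlated.isRoot, correlated.traceId === evt.trace.traceId);
```

In this model the first span is a root under a random trace ID, while the second inherits evt.trace.traceId, which is the behavior the log-record path already has.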

Impact and severity

  • Affected: any deployment using diagnostics.otel for trace export with workflows that fire runs outside an inbound HTTP request frame (cron, scheduler, programmatic invocation), or inbound HTTP requests without an upstream traceparent. Also any deployment that relies on Cloud Logging↔Cloud Trace correlation via the evt.trace.traceId field.
  • Severity: medium. Spans are still emitted and Cloud Trace receives them, but trace topology is fragmented (one trace per span instead of one per run) and log↔trace correlation is broken, which defeats the main observability use case.
  • Frequency: always for affected entrypoints. We see ~99% of runs fragmented in our 24h fleet sample (1391 logged runs, 11 correlated traces).
  • Consequence: every log entry containing a trace_id field is a broken link to Cloud Trace; runs are not findable as a unit; child spans cannot be related to their parent run.
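The ~99% figure follows directly from the two counts in the fleet sample, assuming each correlated trace corresponds to exactly one logged run:

```javascript
const loggedRuns = 1391;   // run.started log records in the 24 h sample
const correlatedRuns = 11; // runs whose trace was found in Cloud Trace

// Share of runs whose spans did not land under the logged trace ID.
const fragmentedShare = (loggedRuns - correlatedRuns) / loggedRuns;
const fragmentedPct = (fragmentedShare * 100).toFixed(1);

console.log(`${fragmentedPct}% of runs fragmented`); // 99.2% of runs fragmented
```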

Additional information

Suggested fix (one line): mirror what recordLogRecord does — pass contextForTraceContext(evt.trace) as parentContext to spanWithDuration from each record* function:

const recordRunCompleted = (evt) => {
  // ...
  const parentContext = contextForTraceContext(evt.trace);
  const span = spanWithDuration("openclaw.run", spanAttrs, evt.durationMs, {
    endTimeMs: evt.ts,
    parentContext,  // <-- new
  });
  // ...
};

spanWithDuration already supports a parentContext option but no caller currently sets it.

Workaround currently deployed in our fleet: a separate plugin patches globalThis.__openclawDiagnosticEventsState.listeners to wrap each listener invocation in context.with(reconstructedCtx, () => listener(evt)), where reconstructedCtx is built from evt.trace. This makes context.active() carry the run's correlation context for the duration of the listener call, so the bundled tracer.startSpan fallback resolves correctly. Implementation: https://github.com/SnappStats/openclaw-agents/pull/213. We'd much rather drop the workaround once this is fixed upstream.
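The shape of that workaround can be sketched without the OTel packages. In the sketch below, Node's AsyncLocalStorage stands in for OTel's context storage (als.run plays the role of context.with, activeTraceId the role of reading context.active()). The listener collection is a hypothetical Set of callbacks, since the real layout of __openclawDiagnosticEventsState.listeners is not shown in the issue.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");

// Stand-in for OTel's context storage (an assumption for this sketch).
const als = new AsyncLocalStorage();
const activeTraceId = () => als.getStore()?.traceId;

// Wrap a listener so it runs inside a context rebuilt from evt.trace.
// Events without trace information are passed through unchanged.
function withEventContext(listener) {
  return (evt) => {
    const traceId = evt?.trace?.traceId;
    if (!traceId) return listener(evt);
    return als.run({ traceId }, () => listener(evt));
  };
}

// Hypothetical listener set; the listener reports what it sees as "active".
const listeners = new Set([
  (evt) => ({ type: evt.type, sawTraceId: activeTraceId() }),
]);
const wrapped = [...listeners].map(withEventContext);

const evt = {
  type: "run.completed",
  trace: { traceId: "72e161a7da11405afe23fe2cf4bcf16d" },
};
const results = wrapped.map((listener) => listener(evt));

console.log(results[0].sawTraceId); // 72e161a7da11405afe23fe2cf4bcf16d
console.log(activeTraceId());       // undefined (context ends with the call)
```

The context is scoped to the listener call only, which matches the description above: the fallback inside the listener resolves to the run's trace, and nothing leaks to unrelated code afterwards.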

Extent analysis

TL;DR

Passing contextForTraceContext(evt.trace) as parentContext to spanWithDuration should fix the issue with fragmented traces in Cloud Trace.

Guidance

  • Identify all record* functions in the diagnostics-otel plugin that create spans and modify them to pass contextForTraceContext(evt.trace) as parentContext to spanWithDuration.
  • spanWithDuration already accepts a parentContext option that no caller currently sets, so only the call sites should need to change.
  • Test the changes with non-HTTP entrypoints and with inbound HTTP requests that lack an upstream traceparent to ensure that traces are no longer fragmented.
  • Monitor Cloud Trace to confirm that log entries containing a trace_id field now resolve to their corresponding traces.

Notes

The suggested fix is based on the provided code snippets and may require additional modifications to ensure correct functionality. It is recommended to thoroughly test the changes before deploying them to production.

Recommendation

Apply the suggested fix by passing contextForTraceContext(evt.trace) as parentContext to spanWithDuration to resolve the issue with fragmented traces in Cloud Trace. This fix should provide correct trace correlation and resolve the broken links between log entries and Cloud Trace.
