openclaw - 💡(How to fix) Fix Feature: Per-API-call latency metrics in diagnostics.otel [1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#59930Fetched 2026-04-08 02:38:42
View on GitHub
Comments
1
Participants
1
Timeline
3
Reactions
0
Author
Participants
Timeline (top)
closed ×1commented ×1locked ×1

The built-in diagnostics.otel module tracks openclaw.run.duration_ms which measures the entire agent turn (potentially 10+ tool loop iterations). There is no per-API-call latency metric, making it impossible to distinguish between slow LLM responses vs slow tool execution vs framework overhead.

Root Cause

When investigating slow oncall agent responses, we found:

  • Some sessions had 10-14s per tool roundtrip (vs normal ~2s)
  • Root cause analysis required manually parsing session JSONL timestamps
  • Gateway logs showed Slack WebSocket instability during the same period, but we couldn't determine how much delay was from Slack WS vs Claude API response time
  • The existing openclaw.run.duration_ms histogram gave us only one data point (342s total), not the per-call breakdown

Code Example

// Line 1467
const promptStartedAt = Date.now();

// Line 1670  
log.debug(`embedded run prompt end: runId=... durationMs=${Date.now() - promptStartedAt}`);
RAW_BUFFERClick to expand / collapse

Summary

The built-in diagnostics.otel module tracks openclaw.run.duration_ms which measures the entire agent turn (potentially 10+ tool loop iterations). There is no per-API-call latency metric, making it impossible to distinguish between slow LLM responses vs slow tool execution vs framework overhead.

Problem

When investigating slow oncall agent responses, we found:

  • Some sessions had 10-14s per tool roundtrip (vs normal ~2s)
  • Root cause analysis required manually parsing session JSONL timestamps
  • Gateway logs showed Slack WebSocket instability during the same period, but we couldn't determine how much delay was from Slack WS vs Claude API response time
  • The existing openclaw.run.duration_ms histogram gave us only one data point (342s total), not the per-call breakdown

Current State

The timing code already exists in src/agents/pi-embedded-runner/run/attempt.ts:

// Line 1467
const promptStartedAt = Date.now();

// Line 1670  
log.debug(`embedded run prompt end: runId=... durationMs=${Date.now() - promptStartedAt}`);

But:

  1. It only writes to log.debug() — filtered out at default log level
  2. It is not emitted as a diagnostic event → diagnostics-otel cannot export it
  3. The agent_end hook receives durationMs but that's per-attempt, not surfaced as a metric

Requested Metrics

1. Per-API-call latency (openclaw.prompt.duration_ms)

  • Type: Histogram
  • Attributes: openclaw.channel, openclaw.provider, openclaw.model, openclaw.sessionKey
  • What it measures: Time from prompt assembly start to LLM response complete (the existing promptStartedAtDate.now() in attempt.ts)
  • Emit point: attempt.ts:1668 (the finally block after prompt execution)

2. Slack WebSocket health

  • openclaw.slack.ws_pong_timeout_total (Counter) — pong wasn't received within timeout
  • openclaw.slack.ws_reconnect_total (Counter) — WebSocket reconnection attempts
  • Emit point: extensions/slack/src/monitor/provider.ts where reconnectAttempts is tracked

3. Handle tool.loop diagnostic event

  • tool.loop is already defined in src/infra/diagnostic-events.ts (lines 137-148) but the OTel service in extensions/diagnostics-otel/src/service.ts does not handle it (silently dropped)
  • This would give per-tool-loop iteration metrics

Suggested Implementation

Minimal change (~20 lines):

  1. In attempt.ts:1668, add emitDiagnosticEvent({ type: "prompt.completed", durationMs, provider, model, sessionKey })
  2. In diagnostics-otel/src/service.ts, handle "prompt.completed" → record to a new histogram
  3. In extensions/slack/src/monitor/provider.ts, emit diagnostic events for pong timeout and reconnect

Environment

  • OpenClaw 2026.4.1 (da64a97)
  • Provider: Anthropic Claude Opus 4.6
  • Channel: Slack (Socket Mode)

extent analysis

TL;DR

To address the issue of slow oncall agent responses, implement per-API-call latency metrics and Slack WebSocket health metrics by adding diagnostic events and handling them in the OTel service.

Guidance

  • Add a diagnostic event for prompt completion in attempt.ts to measure per-API-call latency, including attributes such as openclaw.channel, openclaw.provider, openclaw.model, and openclaw.sessionKey.
  • Handle the new "prompt.completed" event in diagnostics-otel/src/service.ts to record a histogram for per-API-call latency.
  • Emit diagnostic events for Slack WebSocket pong timeout and reconnect attempts in extensions/slack/src/monitor/provider.ts to track WebSocket health.
  • Update the OTel service to handle the tool.loop diagnostic event to provide per-tool-loop iteration metrics.

Example

// In attempt.ts:1668
emitDiagnosticEvent({
  type: "prompt.completed",
  durationMs: Date.now() - promptStartedAt,
  provider: provider,
  model: model,
  sessionKey: sessionKey
});

Notes

The suggested implementation requires minimal changes (~20 lines) and focuses on adding diagnostic events and handling them in the OTel service to provide the required metrics.

Recommendation

Apply the suggested implementation to add per-API-call latency metrics and Slack WebSocket health metrics, which will help in identifying and addressing the root cause of slow oncall agent responses.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING