openclaw - Feature: Per-API-call latency metrics in diagnostics.otel [1 comment, 1 participant]

zzc2019 · 2026-04-02T22:08:54Z



GitHub stats

openclaw/openclaw#59930 • Fetched 2026-04-08 02:38:42
Author: zzc2019
Participants: zzc2019
Timeline: closed ×1, commented ×1, locked ×1

Summary

The built-in `diagnostics.otel` module tracks `openclaw.run.duration_ms`, which measures the **entire agent turn** (potentially 10+ tool loop iterations). There is no per-API-call latency metric, so slow LLM responses cannot be told apart from slow tool execution or framework overhead.

Problem

When investigating slow oncall agent responses, we found:

- Some sessions had **10-14s per tool roundtrip** (vs normal ~2s)
- Root cause analysis required manually parsing session JSONL timestamps
- Gateway logs showed Slack WebSocket instability during the same period, but we couldn't determine how much delay was from Slack WS vs Claude API response time
- The existing `openclaw.run.duration_ms` histogram gave us only one data point (342s total), not the per-call breakdown
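For context, the manual workaround described above can be approximated with a few lines of code. This is only a sketch: the issue does not show the session JSONL schema, so the ISO-8601 `timestamp` field and the `gapsBetweenEntriesMs` helper are assumptions made for illustration.

```typescript
// Assumption: every non-empty JSONL line is an object with an ISO-8601
// `timestamp` field. The real session log schema may differ.
interface TimestampedEntry {
  timestamp: string;
}

// Returns the elapsed milliseconds between each pair of consecutive entries.
function gapsBetweenEntriesMs(jsonl: string): number[] {
  const times = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => Date.parse((JSON.parse(line) as TimestampedEntry).timestamp));

  // Entry i + 1 minus entry i, for every adjacent pair.
  return times.slice(1).map((time, index) => time - times[index]);
}
```

Unusually large gaps point at the slow roundtrips, but this approach cannot say whether the time was spent in the LLM call, the tool, or the transport, which is the gap the requested metrics are meant to close.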

Current State

The **timing code already exists** in `src/agents/pi-embedded-runner/run/attempt.ts`:

// Line 1467
const promptStartedAt = Date.now();

// Line 1670  
log.debug(`embedded run prompt end: runId=... durationMs=${Date.now() - promptStartedAt}`);

But:

1. It only writes to `log.debug()`, which is filtered out at the default log level
2. It is **not** emitted as a diagnostic event, so `diagnostics-otel` cannot export it
3. The `agent_end` hook receives `durationMs`, but that is per-attempt and is not surfaced as a metric
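One way to close these gaps is to emit the duration as a structured event from the place where the debug log is written today. The sketch below is illustrative only. The issue names `emitDiagnosticEvent` but does not show its signature, so the in-memory event bus, the `PromptCompletedEvent` shape, and the `runPromptWithTiming` helper are all hypothetical stand-ins rather than the project's actual API.

```typescript
// Hypothetical event shape. Field names follow the issue's suggestion; the
// real type in src/infra/diagnostic-events.ts may differ.
interface PromptCompletedEvent {
  type: "prompt.completed";
  durationMs: number;
  provider: string;
  model: string;
  sessionKey: string;
}

type Listener = (event: PromptCompletedEvent) => void;

// Minimal in-memory stand-in for the project's diagnostic event bus.
const listeners: Listener[] = [];

function onDiagnosticEvent(listener: Listener): void {
  listeners.push(listener);
}

function emitDiagnosticEvent(event: PromptCompletedEvent): void {
  for (const listener of listeners) listener(event);
}

// Wraps a single prompt call and reports its latency. The `finally` block
// runs whether the call resolves or rejects, so failed calls are timed too.
async function runPromptWithTiming<T>(
  meta: { provider: string; model: string; sessionKey: string },
  callModel: () => Promise<T>,
): Promise<T> {
  const promptStartedAt = Date.now();
  try {
    return await callModel();
  } finally {
    emitDiagnosticEvent({
      type: "prompt.completed",
      durationMs: Date.now() - promptStartedAt,
      ...meta,
    });
  }
}
```

Emitting from `finally` mirrors the emit point proposed in the issue and keeps timeouts and provider errors visible in the latency data instead of dropping them.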

Requested Metrics

1. Per-API-call latency (`openclaw.prompt.duration_ms`)

- **Type**: Histogram
- **Attributes**: `openclaw.channel`, `openclaw.provider`, `openclaw.model`, `openclaw.sessionKey`
- **What it measures**: Time from prompt assembly start to LLM response complete (the existing `promptStartedAt` → `Date.now()` in `attempt.ts`)
- **Emit point**: `attempt.ts:1668` (the `finally` block after prompt execution)
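The recording side of this metric could look roughly like the sketch below. `HistogramLike` only mirrors the `record(value, attributes)` shape of an OpenTelemetry histogram instrument, and an in-memory implementation is used so the example runs without the SDK. The event shape is hypothetical. Note that the attribute list includes `openclaw.channel` while the event suggested later in the issue does not carry a channel, so this sketch assumes a `channel` field is added to the event.

```typescript
type Attributes = Record<string, string>;

// Structural stand-in for an OpenTelemetry histogram instrument.
interface HistogramLike {
  record(value: number, attributes?: Attributes): void;
}

// In-memory implementation so the sketch runs without the OTel SDK.
class InMemoryHistogram implements HistogramLike {
  readonly points: { value: number; attributes: Attributes }[] = [];

  record(value: number, attributes: Attributes = {}): void {
    this.points.push({ value, attributes });
  }
}

// Hypothetical event shape; `channel` is an assumed addition.
interface PromptCompleted {
  type: "prompt.completed";
  durationMs: number;
  channel: string;
  provider: string;
  model: string;
  sessionKey: string;
}

// Maps one event onto one histogram data point using the requested
// attribute names.
function recordPromptDuration(histogram: HistogramLike, event: PromptCompleted): void {
  histogram.record(event.durationMs, {
    "openclaw.channel": event.channel,
    "openclaw.provider": event.provider,
    "openclaw.model": event.model,
    "openclaw.sessionKey": event.sessionKey,
  });
}
```

One design point worth weighing: if session keys are effectively unique, `openclaw.sessionKey` may produce a high number of distinct attribute combinations, which some metric backends handle poorly.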

2. Slack WebSocket health

- **`openclaw.slack.ws_pong_timeout_total`** (Counter): a pong was not received within the timeout
- **`openclaw.slack.ws_reconnect_total`** (Counter): WebSocket reconnection attempts
- **Emit point**: `extensions/slack/src/monitor/provider.ts`, where `reconnectAttempts` is tracked
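As a rough sketch of how these two counters might be driven, the code below maps hypothetical Slack WebSocket events onto counter increments. The event type names (`slack.ws.pong_timeout`, `slack.ws.reconnect`) and the `attempt` field are invented for illustration, since the issue does not show the code in `provider.ts`. `CounterLike` only mirrors the `add(value, attributes)` shape of an OpenTelemetry counter.

```typescript
// Hypothetical diagnostic events for Slack Socket Mode health.
type SlackWsEvent =
  | { type: "slack.ws.pong_timeout" }
  | { type: "slack.ws.reconnect"; attempt: number };

// Structural stand-in for an OpenTelemetry counter instrument.
interface CounterLike {
  add(value: number, attributes?: Record<string, string>): void;
}

// In-memory implementation so the sketch runs without the OTel SDK.
class InMemoryCounter implements CounterLike {
  total = 0;

  add(value: number): void {
    this.total += value;
  }
}

interface SlackWsCounters {
  pongTimeout: CounterLike; // openclaw.slack.ws_pong_timeout_total
  reconnect: CounterLike; // openclaw.slack.ws_reconnect_total
}

// Increments the matching counter by one for each event received.
function recordSlackWsEvent(event: SlackWsEvent, counters: SlackWsCounters): void {
  switch (event.type) {
    case "slack.ws.pong_timeout":
      counters.pongTimeout.add(1);
      break;
    case "slack.ws.reconnect":
      counters.reconnect.add(1);
      break;
  }
}
```

Counting each reconnect event, rather than exporting the current `reconnectAttempts` value, keeps the metric monotonic, which is what a counter expects.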

3. Handle `tool.loop` diagnostic event

- `tool.loop` is already defined in `src/infra/diagnostic-events.ts` (lines 137-148), but the OTel service in `extensions/diagnostics-otel/src/service.ts` does not handle it (silently dropped)
- This would give per-tool-loop iteration metrics
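A common way to keep event types from being silently dropped is an exhaustive `switch` over the event union, so the compiler flags any type that has no handler. The sketch below shows that pattern. The event shapes are hypothetical: the issue does not show the fields of `tool.loop`, and `run.completed` is just a placeholder for whatever events the service already handles.

```typescript
// Hypothetical shapes; the real union lives in src/infra/diagnostic-events.ts.
type DiagnosticEvent =
  | { type: "run.completed"; durationMs: number }
  | { type: "tool.loop"; toolName: string; iteration: number };

// Per-tool iteration counts, standing in for a real metric instrument.
const toolLoopCounts = new Map<string, number>();

// Fails to compile if a member of the union is not handled above the
// default branch, and throws at runtime for unexpected values.
function assertNever(value: never): never {
  throw new Error(`Unhandled diagnostic event: ${JSON.stringify(value)}`);
}

function handleDiagnosticEvent(event: DiagnosticEvent): void {
  switch (event.type) {
    case "run.completed":
      // Existing handling would go here.
      return;
    case "tool.loop":
      toolLoopCounts.set(event.toolName, (toolLoopCounts.get(event.toolName) ?? 0) + 1);
      return;
    default:
      assertNever(event);
  }
}
```

With this structure, removing the `tool.loop` case (or adding a new event type to the union without a handler) becomes a type error rather than a silent drop.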

Suggested Implementation

Minimal change (~20 lines):

1. In `attempt.ts:1668`, add `emitDiagnosticEvent({ type: "prompt.completed", durationMs, provider, model, sessionKey })`
2. In `diagnostics-otel/src/service.ts`, handle `"prompt.completed"` → record to a new histogram
3. In `extensions/slack/src/monitor/provider.ts`, emit diagnostic events for pong timeout and reconnect

Environment

- OpenClaw 2026.4.1 (da64a97)
- Provider: Anthropic Claude Opus 4.6
- Channel: Slack (Socket Mode)

extent analysis

TL;DR

To diagnose slow on-call agent responses, add per-API-call latency metrics and Slack WebSocket health metrics by emitting diagnostic events and handling them in the OTel service.

Guidance

- Add a diagnostic event for prompt completion in `attempt.ts` to measure per-API-call latency, including attributes such as `openclaw.channel`, `openclaw.provider`, `openclaw.model`, and `openclaw.sessionKey`.
- Handle the new `"prompt.completed"` event in `diagnostics-otel/src/service.ts` to record a histogram for per-API-call latency.
- Emit diagnostic events for Slack WebSocket pong timeout and reconnect attempts in `extensions/slack/src/monitor/provider.ts` to track WebSocket health.
- Update the OTel service to handle the `tool.loop` diagnostic event to provide per-tool-loop iteration metrics.

Example

// In attempt.ts:1668 (the finally block after prompt execution).
// Assumes promptStartedAt, provider, model, and sessionKey are in scope here.
emitDiagnosticEvent({
  type: "prompt.completed",
  durationMs: Date.now() - promptStartedAt,
  provider,
  model,
  sessionKey,
});

Notes

The suggested implementation requires minimal changes (~20 lines) and focuses on adding diagnostic events and handling them in the OTel service to provide the required metrics.

Recommendation

Apply the suggested implementation to add per-API-call latency metrics and Slack WebSocket health metrics. These make it possible to tell whether slow on-call agent responses come from the LLM call, tool execution, or the Slack connection.
