openclaw - 💡(How to fix) Fix Pulse Health Dashboard (OpenClaw #50371) [1 participants]

GitHub stats
openclaw/openclaw#52693 · Fetched 2026-04-08 01:20:15
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


Pulse Health Dashboard (OpenClaw #50371)

Problem

OpenClaw has no centralized observability for agent health. Problems are detected only after failures (e.g., context overflow, tool crashes), so proactive monitoring is needed.

Proposed Solution

Integrate OpenTelemetry auto-instrumentation + simple HTTP dashboard (Pulse).

Architecture

OpenTelemetry Auto-Instrumentation:

  • Use opentelemetry-instrument wrapper for Python agent harness
  • Export to OTLP collector (in-process or separate)
  • Metrics: memory usage, tool latency, error counts, restart frequency
  • Traces: tool call lifecycle (including sub-agents)
  • Logs: structured JSON with context
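The "structured JSON with context" log format is not specified in the issue. A minimal sketch using only the standard library might look like the following; the field names, the `context` extra key, and the `openclaw.agent` logger name are assumptions made for illustration, not part of OpenClaw.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}
        # and those keys are merged into the top-level JSON object.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def get_logger(name: str = "openclaw.agent") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

A call such as `get_logger().info("tool call finished", extra={"context": {"tool": "search", "latency_ms": 42}})` would then emit one JSON line carrying the tool name and latency alongside the message.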

Pulse Dashboard:

  • Flask/Express app serving /metrics (Prometheus format)
  • /health endpoint with liveness/readiness probes
  • Simple UI (React or Svelte) showing:
    • Current memory usage (MB) and token count
    • Tool latency histogram (last 1h)
    • Error rate per tool
    • Restart count trend

Alerting (AlertPipe):

  • Thresholds:
    • Memory > 80% → warning
    • Error rate >5% (last 5min) → critical
    • No heartbeats for 2min → down
  • Send alerts to Discord via message tool
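The three thresholds above can be sketched as a pure function that takes a snapshot of readings and returns the alerts that fire. The `Snapshot` fields and the `evaluate` name are assumptions for illustration; the AlertPipe rule engine itself is not described in this issue, and delivery to Discord is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snapshot:
    """Hypothetical readings Pulse would collect for one agent."""
    memory_pct: float               # memory in use, as a percentage
    calls_last_5min: int            # tool calls in the last 5 minutes
    errors_last_5min: int           # failed tool calls in the same window
    seconds_since_heartbeat: float  # time since the last heartbeat


def evaluate(snapshot: Snapshot,
             memory_pct: float = 80,
             error_rate_pct: float = 5,
             heartbeat_interval_sec: float = 120) -> List[Tuple[str, str]]:
    """Return (severity, reason) pairs for every breached threshold."""
    alerts: List[Tuple[str, str]] = []
    if snapshot.seconds_since_heartbeat >= heartbeat_interval_sec:
        alerts.append(
            ("down", f"no heartbeat for {snapshot.seconds_since_heartbeat:.0f}s")
        )
    if snapshot.calls_last_5min > 0:  # avoid dividing by zero when idle
        rate = 100.0 * snapshot.errors_last_5min / snapshot.calls_last_5min
        if rate > error_rate_pct:
            alerts.append(("critical", f"error rate {rate:.1f}% over the last 5 min"))
    if snapshot.memory_pct > memory_pct:
        alerts.append(("warning", f"memory at {snapshot.memory_pct:.0f}%"))
    return alerts
```

The default arguments mirror the values in the proposed `observability.json`, so the same function can be driven from the loaded configuration. Keeping evaluation separate from delivery makes the rules easy to unit-test without a Discord connection.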

Implementation Steps

  1. Add opentelemetry-api, opentelemetry-sdk, opentelemetry-instrumentation-requests to pyproject.toml
  2. Create src/observability/otel_setup.py to initialize tracer/metrics
  3. Wrap agent entry: opentelemetry-instrument python -m openclaw ...
  4. Create /pulse Flask app with /metrics endpoint
  5. Deploy Pulse alongside gateway (systemd service openclaw-pulse)
  6. Add AlertPipe rule engine (blocked on #50365)

Configuration

~/.openclaw/observability.json:

{
  "metrics_interval_sec": 10,
  "trace_sampling_ratio": 0.1,
  "alerts": {
    "memory_pct": 80,
    "error_rate_pct": 5,
    "heartbeat_interval_sec": 120
  }
}
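A loader for this file could be sketched as follows. The fallback to defaults when the file or a key is missing, and the two validation checks, are assumptions; the issue only specifies the file location and the keys shown above.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Defaults copied from the proposed observability.json.
DEFAULTS: Dict[str, Any] = {
    "metrics_interval_sec": 10,
    "trace_sampling_ratio": 0.1,
    "alerts": {
        "memory_pct": 80,
        "error_rate_pct": 5,
        "heartbeat_interval_sec": 120,
    },
}


def load_config(path: str = "~/.openclaw/observability.json") -> Dict[str, Any]:
    """Load the config, using defaults for a missing file or missing keys."""
    file = Path(path).expanduser()
    user = json.loads(file.read_text()) if file.exists() else {}
    config = {**DEFAULTS, **user}
    # Merge the nested "alerts" object key by key instead of replacing it.
    config["alerts"] = {**DEFAULTS["alerts"], **user.get("alerts", {})}
    if config["metrics_interval_sec"] <= 0:
        raise ValueError("metrics_interval_sec must be positive")
    if not 0.0 <= config["trace_sampling_ratio"] <= 1.0:
        raise ValueError("trace_sampling_ratio must be between 0 and 1")
    return config
```

Merging `alerts` separately means a user file that overrides only `memory_pct` still inherits the other two thresholds.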

Alternatives Considered

  • Manual logs only: Too reactive; no real-time visibility
  • Full Prometheus stack: Overkill; Pulse aims for simplicity

References

Related Issues

  • Blocks: OpenClaw #50372 (AlertPipe)
  • Depends on: None — can be implemented independently

extent analysis

Fix Plan

To integrate OpenTelemetry auto-instrumentation and a simple HTTP dashboard (Pulse) for proactive monitoring of OpenClaw agent health, follow these steps:

  • Step 1: Add dependencies
    • Add opentelemetry-api, opentelemetry-sdk, and opentelemetry-instrumentation-requests to pyproject.toml:
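    [tool.poetry.dependencies]
    opentelemetry-api = "^1.13.0"
    opentelemetry-sdk = "^1.13.0"
    # Contrib instrumentation packages are published as beta releases (0.NNbN),
    # so there is no 0.23.0; pin the beta paired with the SDK version above.
    opentelemetry-instrumentation-requests = "0.34b0"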
  • Step 2: Initialize tracer and metrics
    • Create src/observability/otel_setup.py to initialize the tracer and metrics:
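    from opentelemetry import metrics, trace
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import (
        ConsoleSpanExporter,
        SimpleSpanProcessor,
    )
    
    # Traces: print each finished span to stdout (swap in an OTLP exporter later).
    provider = TracerProvider()
    processor = SimpleSpanProcessor(ConsoleSpanExporter())
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)
    
    # Metrics: export to stdout every 10 s, matching metrics_interval_sec.
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))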
  • Step 3: Wrap agent entry
    • Wrap the agent entry with opentelemetry-instrument:
    opentelemetry-instrument python -m openclaw ...
  • Step 4: Create Pulse app
    • Create a Flask app for the Pulse dashboard with a /metrics endpoint:
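    from flask import Flask, Response, jsonify
    from prometheus_client import CONTENT_TYPE_LATEST, generate_latest
    
    app = Flask(__name__)
    
    @app.route("/metrics")
    def metrics():
        # generate_latest() returns bytes; send them with the Prometheus content type.
        return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)
    
    @app.route("/health")
    def health():
        # Minimal liveness probe; readiness checks can be added here.
        return jsonify(status="ok")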
  • Step 5: Deploy Pulse
    • Deploy the Pulse app alongside the gateway as a systemd service openclaw-pulse.

Verification

To verify that the fix worked, check the following:

  • The Pulse dashboard is accessible and displays the expected metrics (memory usage, tool latency, error counts, restart frequency).
  • The /metrics endpoint returns Prometheus-formatted metrics.
  • The /health endpoint returns a successful response.
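The endpoint checks above can be scripted. The following is a rough sketch using only the standard library; the function names are hypothetical, and the "# TYPE" test is a heuristic for recognizing Prometheus text output, not a full format validation.

```python
import urllib.request
from typing import Callable, Dict, Tuple


def http_get(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Fetch a URL and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def check_endpoints(
    base_url: str,
    fetch: Callable[[str], Tuple[int, str]] = http_get,
) -> Dict[str, bool]:
    """Return a pass/fail flag for /metrics and /health."""
    results: Dict[str, bool] = {}
    for path in ("/metrics", "/health"):
        try:
            status, body = fetch(base_url.rstrip("/") + path)
        except OSError:  # connection refused, timeout, or HTTP 4xx/5xx
            results[path] = False
            continue
        ok = status == 200
        if path == "/metrics":
            # Heuristic: Prometheus text output normally carries "# TYPE" lines.
            ok = ok and "# TYPE" in body
        results[path] = ok
    return results
```

Calling `check_endpoints("http://localhost:8000")` would return a dictionary such as `{"/metrics": True, "/health": True}` when both checks pass; the address here is a placeholder for wherever Pulse is actually deployed. The `fetch` parameter exists so the check can be exercised in tests without a running server.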

Extra Tips

  • Make sure to configure the observability.json file with the correct settings for metrics interval, trace sampling ratio, and alerts.
  • Use a tool like curl to test the /metrics and /health endpoints.
  • Consider adding additional metrics and alerts to the Pulse dashboard as needed.
