openclaw - Proposal: native /health/deep probe endpoint for external model-contract verification

GitHub stats
openclaw/openclaw#70191 · Fetched 2026-04-23 07:28:03
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Proposal: native /health/deep probe endpoint

Filing as a design discussion ahead of a PR. We've been running a local overlay of this feature against openclaw for the past ~6 weeks (openclaw/openclaw isn't forked — we patch via Nix overlay in our infra repo) and would like to upstream it, but there are design calls where your preference matters before I cut the PR.

Use case

We run validate-flows.sh as an external probe against the gateway on canon. Its S21 probe ("model contract") verifies that the gateway's currently-active model matches what our config-patcher wrote. Without a cheap in-process introspection endpoint, we either:

  1. Tail gateway logs for model-decision lines (fragile — depends on log verbosity, prone to missing entries during rotation)
  2. Invoke a real zero-temp inference and inspect the response (expensive, authenticates against upstream providers, drains budget, slow)
  3. Stop at "gateway is up" granularity (/health / /ready) — insufficient for detecting model-routing drift

A cheap JSON-returning probe at /health/deep that reports the currently active model, provider, and fallbacks resolves this. It is auth-gated (same policy as /ready details: localhost direct OR bearer token) and does NOT invoke inference.
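
To make the S21 use case concrete, here is a minimal sketch of the comparison an external probe could run on the parsed /health/deep body. The field names come from the example response in this issue; `ExpectedContract` and `checkModelContract` are hypothetical names for illustration and are not part of validate-flows.sh.

```typescript
// Subset of the /health/deep payload, taken from the example response in this issue.
interface DeepHealthReport {
  status: "ok" | "degraded" | "fail";
  modelId: string;
  provider: string;
  fallbacks: string[];
}

// Hypothetical: what the config-patcher is expected to have written.
interface ExpectedContract {
  modelId: string;
  provider: string;
}

// Returns human-readable mismatches; an empty list means the contract holds.
function checkModelContract(
  report: DeepHealthReport,
  expected: ExpectedContract,
): string[] {
  const problems: string[] = [];
  if (report.status !== "ok") {
    problems.push(`status is "${report.status}", expected "ok"`);
  }
  if (report.modelId !== expected.modelId) {
    problems.push(`modelId is "${report.modelId}", expected "${expected.modelId}"`);
  }
  if (report.provider !== expected.provider) {
    problems.push(`provider is "${report.provider}", expected "${expected.provider}"`);
  }
  return problems;
}
```

A probe would fetch the endpoint, parse the JSON, and fail the check if the returned list is non-empty. No inference call is involved at any point.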

What the overlay currently returns

{
  "status": "ok",              // "ok" | "degraded" | "fail"
  "modelId": "amazon-bedrock/us.anthropic.claude-sonnet-4-6",
  "provider": "amazon-bedrock",
  "fallbacks": ["amazon-bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"],
  "latencyMs": 2,
  "timestamp": "2026-04-22T05:03:30Z",
  "gatewayUptime": "11m27s",   // human-readable
  "gatewayUptimeSeconds": 687, // integer (machine-consumable)
  "recentErrorCount": 0,       // 1-hour rolling count, capped
  "recentErrorTrackingWired": false, // toggle true when integrated
  "lastRoutingDecision": null, // {modelId, provider, fallbackReason, timestamp}
  "lastRoutingDecisionWired": false
}

status semantics:

  • "ok" — config loads with a resolvable default model
  • "degraded" — resolved provider isn't in models.providers (fallback will be exercised on first inference)
  • "fail" — config unreadable OR default model missing entirely
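
The three status values above can be sketched as a single pure function. This is an illustration only: `GatewayConfigView` is a hypothetical, simplified stand-in for the real config type (which is not shown in this issue), and deriving the provider from the prefix of the model ID is an assumption based on the example payload.

```typescript
type DeepHealthStatus = "ok" | "degraded" | "fail";

// Hypothetical, simplified view of the gateway config.
interface GatewayConfigView {
  defaultModel?: string; // e.g. "amazon-bedrock/us.anthropic.claude-sonnet-4-6"
  models?: { providers?: Record<string, unknown> };
}

// Pass null to model "config unreadable".
function resolveStatus(config: GatewayConfigView | null): DeepHealthStatus {
  // "fail": config unreadable OR default model missing entirely.
  if (config === null || !config.defaultModel) return "fail";

  // Assumption: the provider is the segment before the first "/" in the model ID.
  const id = config.defaultModel;
  const slash = id.indexOf("/");
  const provider = slash === -1 ? id : id.slice(0, slash);

  // "degraded": the resolved provider is not listed in models.providers.
  const providers = config.models?.providers ?? {};
  const known = Object.prototype.hasOwnProperty.call(providers, provider);
  return known ? "ok" : "degraded";
}
```

Keeping this logic free of I/O means it can be unit-tested without starting a gateway, and the handler only has to catch the config-load failure and pass null.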

recentErrorTrackingWired / lastRoutingDecisionWired are escape hatches: the overlay exposes the field shape but returns 0 / null because it can't reach the inference-dispatch code path from an HTTP handler patch. Native upstream can flip these to true once the relevant emitters call recordGatewayError() / recordRoutingDecision().
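
A rough sketch of how the two recorders and the *Wired flags could fit together once native code can reach the emitters. Only `recordGatewayError()`, `recordRoutingDecision()`, `gatewayErrorTimestamps`, and the field names come from this issue; `MAX_TRACKED_ERRORS`, `snapshotEscapeHatchFields`, and the function signatures are assumptions, and the issue does not state the actual cap.

```typescript
const ERROR_WINDOW_MS = 60 * 60 * 1000; // 1-hour rolling window, per the field comment
const MAX_TRACKED_ERRORS = 1000; // hypothetical cap; the issue only says "capped"

interface RoutingDecision {
  modelId: string;
  provider: string;
  fallbackReason: string | null;
  timestamp: string;
}

// Both flags are false in the overlay; native code would set them to true
// once the emitters actually call the recorders below.
const recentErrorTrackingWired: boolean = true;
const lastRoutingDecisionWired: boolean = true;

// Module-level state: only valid for a single-worker gateway.
let gatewayErrorTimestamps: number[] = [];
let lastRoutingDecision: RoutingDecision | null = null;

function recordGatewayError(now: number = Date.now()): void {
  gatewayErrorTimestamps.push(now);
  // Drop the oldest entry once the cap is exceeded.
  if (gatewayErrorTimestamps.length > MAX_TRACKED_ERRORS) gatewayErrorTimestamps.shift();
}

function recordRoutingDecision(decision: RoutingDecision): void {
  lastRoutingDecision = decision;
}

// Builds the four escape-hatch fields of the /health/deep payload.
function snapshotEscapeHatchFields(now: number = Date.now()) {
  // Prune timestamps that have fallen out of the rolling window.
  gatewayErrorTimestamps = gatewayErrorTimestamps.filter((t) => now - t < ERROR_WINDOW_MS);
  return {
    recentErrorCount: recentErrorTrackingWired ? gatewayErrorTimestamps.length : 0,
    recentErrorTrackingWired,
    lastRoutingDecision: lastRoutingDecisionWired ? lastRoutingDecision : null,
    lastRoutingDecisionWired,
  };
}
```

With the flags set to false the snapshot returns 0 / null regardless of internal state, which matches what the overlay returns today, so consumers can distinguish "no errors" from "not measured".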

Auth model

Same as /ready details today — canRevealReadinessDetails():

  • localhost direct (127.0.0.1 / ::1) bypasses auth
  • Remote callers require bearer token
  • 401 on unauth returns no body (avoids leaking model IDs to scans)
  • All expensive work runs AFTER the 401-return (no timing oracle)
  • Error messages have absolute paths redacted before returning (avoids filesystem reconnaissance behind the auth gate)
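
The ordering described above (gate first, expensive work second, redaction on the error path) can be sketched as follows. This is not the overlay's code: `canRevealDetails` is a stand-in for `canRevealReadinessDetails()`, whose real signature is not shown in this issue, and `ProbeRequest`, `ProbeResponse`, `handleDeepHealth`, and `redactAbsolutePaths` are hypothetical names. The redaction regex only covers POSIX-style paths, and returning HTTP 200 with status "fail" on error is an assumption.

```typescript
// Hypothetical request view; a real handler would read these from the HTTP request.
interface ProbeRequest {
  remoteAddress?: string; // socket peer address
  authorization?: string; // raw Authorization header, if any
}

interface ProbeResponse {
  status: number;
  body: unknown;
}

// Assumption: the IPv4-mapped form is included because dual-stack sockets can report it.
const LOOPBACK = new Set(["127.0.0.1", "::1", "::ffff:127.0.0.1"]);

// Stand-in for canRevealReadinessDetails().
function canRevealDetails(req: ProbeRequest, expectedToken: string): boolean {
  if (req.remoteAddress !== undefined && LOOPBACK.has(req.remoteAddress)) return true;
  const prefix = "Bearer ";
  const header = req.authorization ?? "";
  if (!expectedToken || !header.startsWith(prefix)) return false;
  // A production gate should use a constant-time comparison here.
  return header.slice(prefix.length) === expectedToken;
}

// Replaces POSIX-style absolute paths so error text does not reveal filesystem layout.
function redactAbsolutePaths(message: string): string {
  return message.replace(/(^|[\s'"(=])\/[\w.-]+(?:\/[\w.-]+)*/g, "$1<redacted-path>");
}

function handleDeepHealth(
  req: ProbeRequest,
  expectedToken: string,
  buildReport: () => unknown, // the comparatively expensive part (config load, etc.)
): ProbeResponse {
  // Gate first: the 401 path returns no body and does no further work.
  if (!canRevealDetails(req, expectedToken)) return { status: 401, body: null };
  try {
    return { status: 200, body: buildReport() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { status: 200, body: { status: "fail", error: redactAbsolutePaths(message) } };
  }
}
```

Because `buildReport` is only invoked after the gate passes, an unauthenticated caller sees the same cheap code path regardless of config state, which is the property the "no timing oracle" bullet is after.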

Prior art (our overlay)

~260-line Nix overlay against src/gateway/server-http.ts + src/gateway/control-ui-routing.ts: nix/patches/health-deep-endpoint.patch

It has been running continuously on our canon instance since Wave 5 I-track shipped. Tracking issues (our side): markthebest12/openclaw-infra#803, #1026, #1065.

Design questions before PR

Before I adapt the overlay to upstream code style and open a PR, a few calls where upstream preference dominates:

  1. Path. We picked /health/deep. Alternatives: extend /ready (nested object under details); new /observability/* tree; /diag/model. Preference?

  2. Auth model. We reuse canRevealReadinessDetails(). OK to keep, or would you want a separate probe-auth policy (e.g., bearer-only, no localhost bypass)?

  3. Field naming. Dual-encoding gatewayUptime (string) + gatewayUptimeSeconds (integer) is a concession — we originally wanted one field. Machine consumers need integer; we couldn't break the human-readable one. Comfortable with that, or prefer single integer + format-on-consume?

  4. *Wired: false escape hatches. We exposed recentErrorCount + lastRoutingDecision before they're fully wired to inference-dispatch/error-emitter paths, specifically so downstream consumers could code against the final shape while internal wiring happens incrementally. Acceptable pattern, or would you prefer we hold the fields back until fully wired?

  5. Error tracking storage. The overlay uses a module-level let gatewayErrorTimestamps: number[] = [] (single-worker gateway only). Do you have multi-worker plans that would require SQLite/Redis/IPC from the start?

  6. Gateway-start structured log. Related deferral — we wanted to emit a structured gateway.started log entry on gateway init (modelId, provider, authMode) for correlation with probe results. Couldn't reach init path from the overlay. Would you take that as part of the same PR or prefer separate?
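
For question 3, this is roughly what the "single integer + format-on-consume" alternative would ask of consumers. `formatUptime` is a hypothetical helper; the issue only shows the "11m27s" form, so the hour component below is an assumption about how longer uptimes would be rendered.

```typescript
// Renders an integer gatewayUptimeSeconds value in the compact form shown
// in the example payload ("11m27s"). The "h" component is an assumption.
function formatUptime(totalSeconds: number): string {
  const s = Math.max(0, Math.floor(totalSeconds));
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = s % 60;

  let out = "";
  if (hours > 0) out += `${hours}h`;
  if (hours > 0 || minutes > 0) out += `${minutes}m`;
  return out + `${seconds}s`;
}

console.log(formatUptime(687)); // prints "11m27s", matching the example payload
```

If the string field were dropped, any consumer that wants the human-readable form would need an equivalent of this helper, which is the cost the dual encoding avoids.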

Happy to cut a PR straight against your preferences if any of these are already decided. Otherwise, let me know which direction to go on each and I'll adapt.

cc: field shape suggestions, escape-hatch approach, auth-gate reuse — anything worth pushing back on before code lands is cheap to change now.

Extent analysis

TL;DR

Implement a native /health/deep probe endpoint with the proposed design, addressing the outlined design questions and preferences.

Guidance

  1. Path selection: Consider extending /ready with a nested object under details for consistency, or introduce a new /observability/* tree for better organization.
  2. Auth model review: Evaluate the reuse of canRevealReadinessDetails() for auth, ensuring it aligns with the project's security requirements and consider a separate probe-auth policy if needed.
  3. Field naming and encoding: Decide on a consistent approach for gatewayUptime, either using a single integer field with format-on-consume or maintaining dual encoding for human-readable and machine-consumable formats.
  4. Escape hatches and field exposure: Determine the best approach for exposing fields like recentErrorCount and lastRoutingDecision before they are fully wired, considering the trade-offs between early adoption and completeness.
  5. Error tracking storage and gateway-start structured log: Plan for potential multi-worker scenarios and decide on the storage approach for error tracking, as well as whether to include a structured gateway.started log entry in the same PR.

Example

The issue is a design discussion, so there is no implementation code to show. The JSON that the overlay currently returns from /health/deep is:

{
  "status": "ok",
  "modelId": "amazon-bedrock/us.anthropic.claude-sonnet-4-6",
  "provider": "amazon-bedrock",
  "fallbacks": ["amazon-bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"],
  "latencyMs": 2,
  "timestamp": "2026-04-22T05:03:30Z",
  "gatewayUptime": "11m27s",
  "gatewayUptimeSeconds": 687,
  "recentErrorCount": 0,
  "recentErrorTrackingWired": false,
  "lastRoutingDecision": null,
  "lastRoutingDecisionWired": false
}

Notes

The implementation should consider the project's existing architecture, security requirements, and potential future developments, such as multi-worker support and expanded observability features.

Recommendation

Apply the proposed design for the /health/deep endpoint, addressing the outlined design questions and preferences to ensure a consistent and secure implementation that aligns with the project's goals.
