claude-code - 💡(How to fix) Fix Feature Request: Inference-health signal for power-user workflows (extends #42796) [1 participants]

claude-code · 2026-04-14 18:04:50

anthropics/claude-code#48042•Fetched 2026-04-15 06:34:57

Author

gfrom085

Participants

gfrom085

Timeline (top)

labeled ×2

Request for an observability signal (endpoint or metric class) that allows power-user workflows to distinguish platform-side inference variance from prompt/workflow-side issues without burning quota on deterministic canary testing.

Extends the technical exchange in #42796 (stellaraccident, closed). The resolution there addressed UI-layer product changes (redact-thinking-2026-02-12 header, adaptive thinking default, effort=85 default) and their documented opt-outs. This issue surfaces a substrate-layer observability need from a distinct user segment — one that persists independent of, and orthogonal to, those product-layer resolutions.

Root Cause

That resolution addresses product-layer observability gaps and user-side tuning. This issue is orthogonal: it surfaces a substrate-layer observability need — a signal that lets power-user workflows verify inference stability without relying on output quality as proxy. The ask holds regardless of whether any degradation claim is true, because the workaround cost (deterministic canaries burning quota) exists either way.

Summary

Problem

Certain Claude Code workflows operate intentionally close to the model's effective attention budget: high-context orchestration, multi-hop reasoning chains, implicit-context-heavy prompts, custom harnesses that saturate thinking capacity by design.

For these workflows:

Platform-side variance (effort defaults, capacity management, KV-cache quantization tier, routing changes — whether publicly acknowledged or opaque) produces disproportionate impact relative to standard-usage workflows.
Without a signal, the user cannot distinguish between:
- A prompt/workflow design issue on their end
- Platform-side variance of any cause
- Normal stochasticity
The current workaround — running deterministic canaries (modular exponentiation with baseline ranges, anchor-recall tests, semantic discrimination checks) as preflight diagnostics before work sessions — consumes non-trivial quota from the workloads the harness was built to do.
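A minimal sketch of one such preflight canary (the modular-exponentiation check), to make the quota cost concrete. The harness itself is not shown in the issue, so every name here is hypothetical: `ask_model` stands in for whatever function sends a prompt to the model and returns its text reply, and the baseline range is an assumed value the user would calibrate from earlier sessions.

```python
import random
from typing import Callable, List, Tuple


def make_modexp_canary(seed: int) -> Tuple[str, int]:
    """Build a reproducible modular-exponentiation prompt and its exact answer."""
    rng = random.Random(seed)  # fixed seed, so the same problem is generated every run
    base = rng.randrange(2, 1000)
    exponent = rng.randrange(50, 500)
    modulus = rng.randrange(1000, 10000)
    prompt = f"Compute {base}^{exponent} mod {modulus}. Reply with the integer only."
    return prompt, pow(base, exponent, modulus)


def run_canaries(ask_model: Callable[[str], str], seeds: List[int]) -> float:
    """Return the fraction of canaries the model answered exactly right.

    Each call to ask_model spends real quota, which is the overhead
    the issue describes.
    """
    passed = 0
    for seed in seeds:
        prompt, expected = make_modexp_canary(seed)
        reply = ask_model(prompt).strip()
        if reply.isdigit() and int(reply) == expected:
            passed += 1
    return passed / len(seeds)


def within_baseline(pass_rate: float, low: float, high: float) -> bool:
    """Compare today's pass rate against a range calibrated on earlier sessions."""
    return low <= pass_rate <= high
```

A pass rate outside the calibrated range suggests something changed, but it cannot say whether the cause is the platform, the prompt, or ordinary sampling noise, which is the gap the issue asks to close.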

The existing user-side toggles (/effort high/max, CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING, CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000, showThinkingSummaries: true) address tuning — they let the user configure their input. They do not address observability of platform-side state, which is what determines whether the configured inputs are processed with the same computational quality today as yesterday.

This is not a claim of silent degradation. It is a request for observability that would allow users to verify platform stability on their own, rather than inferring it from output quality.

Proposal

Either of the following would eliminate the diagnostic overhead:

Option A: /v1/inference-health endpoint. Any structured signal would do. Even an opaque {"status": "nominal"} response would eliminate the LLM-based self-diagnostic loop entirely. Richer signals (current quantization tier, context pressure level, capacity flag) would enable more precise architectural adaptation but are not required for the primary use case.
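A sketch of how a harness might consume Option A if it existed. The endpoint is only proposed in this issue and is not part of any published API; the URL, the unauthenticated GET, and the choice to treat every failure as "not nominal" are all assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical URL: the endpoint is a proposal, not an existing API.
HEALTH_URL = "https://api.example.invalid/v1/inference-health"


def parse_health(body: str) -> bool:
    """Return True only for a JSON object whose "status" field is "nominal"."""
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(payload, dict) and payload.get("status") == "nominal"


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the hypothetical endpoint; any network or decoding failure counts as not nominal."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except (OSError, ValueError):  # URLError is an OSError; bad UTF-8 is a ValueError
        return False
```

A preflight step would call `fetch_health()` once before a high-effort session and skip the canary suite whenever it returns True.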

Option B: Tagged research token bucket. A separately-metered quota class for developers whose workflows depend on measurable inference stability. Not a free tier, but a scoped budget (e.g., capped monthly, tagged request metadata) to run canaries and baseline checks without cannibalizing production quota.
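To illustrate the accounting Option B implies, here is a client-side sketch of a capped monthly budget with tagged requests. The issue does not specify how the bucket would be metered or tagged, so the `ResearchBudget` class, the `quota_class` metadata key, and the cap value are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ResearchBudget:
    """Tracks spend against a capped monthly allowance (hypothetical model)."""
    monthly_cap_tokens: int
    used_tokens: int = 0

    def try_spend(self, tokens: int) -> bool:
        """Reserve tokens if the cap allows it; return False without spending otherwise."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used_tokens + tokens > self.monthly_cap_tokens:
            return False
        self.used_tokens += tokens
        return True

    def reset(self) -> None:
        """Start a new metering period."""
        self.used_tokens = 0


def tag_request(metadata: Dict[str, str], quota_class: str = "research") -> Dict[str, str]:
    """Return a copy of the request metadata marked with the assumed quota-class tag."""
    return {**metadata, "quota_class": quota_class}
```

In this model a canary run first calls `try_spend` with its estimated token cost and is only sent, tagged via `tag_request`, when the reservation succeeds, so diagnostics can never draw on production quota.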

Either resolves the bottleneck. Option A offers tighter integration with the existing API surface; Option B offers looser coupling (no infrastructure internals exposed).

Alternative approaches considered

Using /bug for diagnostics (already offered by maintainers in #42796). The /bug flow is post-hoc by construction: it requires the user to first experience anomalous output, then recognize it as an anomaly rather than a prompt/workflow issue, then generate and send a transcript. For preflight observability (deciding whether to start a high-effort session at all), it doesn't close the loop. It is useful as a reactive feedback channel, not as a verification mechanism before work begins.

Using Anthropic's published evals as baseline. Running published retrieval/reasoning evals locally provides a useful third-party benchmark reference, but it answers a different question: "how does the model compare to the benchmark today?" rather than "is my current session processed with the same baseline as my last session?" Complementary, not substitutable.

Response-header variant of Option A. Instead of a separate /v1/inference-health endpoint, an X-Inference-Health: nominal header on every API response would provide continuous in-band signal with zero extra latency and zero extra quota. Tighter coupling to the response flow (every call carries the status), looser political boundary (no dedicated endpoint to version or document as contract). Potentially cheaper than a separate endpoint if the underlying semantics can be kept simple.
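A sketch of the in-band variant, assuming the proposed X-Inference-Health header were present on responses. The header does not exist today; the function names and the policy of treating a missing header as "unknown" rather than as a failure are assumptions.

```python
from typing import Mapping, Optional

# Header name as proposed in the issue; not part of any existing API.
HEADER_NAME = "X-Inference-Health"


def health_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the normalized header value, or None when the header is absent.

    HTTP header names are case-insensitive, so the lookup ignores case.
    """
    wanted = HEADER_NAME.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            return value.strip().lower()
    return None


def should_continue(headers: Mapping[str, str]) -> bool:
    """Keep working unless the platform explicitly reports a non-nominal state."""
    status = health_from_headers(headers)
    return status is None or status == "nominal"
```

Because the check runs on headers the client already receives, it adds no extra request and no extra quota, which is the advantage the issue attributes to this variant.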

Relationship to #42796

stellaraccident's issue proposed exposing thinking token counts per request for engineering observability. It was closed after bcherny's substantive technical response: disclosure of two product changes (adaptive thinking default on Feb 9, effort=85 default on Mar 3) with documented opt-outs; clarification that redact-thinking-2026-02-12 is a UI-only header (with showThinkingSummaries: true opt-out); and Anthropic's evaluation that these changes did not cause the regression pattern reported there.

#42796's substrate: enterprise engineering workflows, thinking-token observability
This issue's substrate: cognitive-orchestration harness for atypical cognition, inference-stability observability

Different observables, adjacent user segment (workflows saturating effective capacity), convergent ask (reduce opacity to enable user-side verification).

Acknowledgments

Aware of the public discourse around the February updates and Anthropic's public technical responses: adaptive thinking default with CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING opt-out, effort=85 default with /effort high/max to increase, redact-thinking-2026-02-12 as UI-only with showThinkingSummaries: true opt-out, transcript-based diagnostics via /bug, and Anthropic's own evaluation that none of these changes caused the regression pattern in #42796.

This request is orthogonal. It asks for a verification mechanism — a signal that would support Anthropic's position by letting users empirically confirm inference stability on their own, rather than infer it from degraded output. The observability gap exists whether or not any degradation claim is true.

Rationale / extended context

Narrative context, architectural details of the harness, use case background, and reasoning for why this bottleneck matters for a specific solo-developer segment: https://gist.github.com/gfrom085/962575f77a4460b7113e6c3207784082

The gist is included as rationale. The technical ask above stands independently of it.

extent analysis

TL;DR

Implementing a /v1/inference-health endpoint or a tagged research token bucket would provide the observability signal needed to distinguish platform-side inference variance from prompt/workflow-side issues.

Guidance

Consider implementing Option A — /v1/inference-health endpoint: This would provide a structured signal, such as {"status": "nominal"}, to indicate platform-side inference health.
Alternatively, consider Option B — Tagged research token bucket: This would provide a separately-metered quota class for developers to run canaries and baseline checks without cannibalizing production quota.
Evaluate the trade-offs between the two options, considering factors such as integration with the existing API surface, coupling, and infrastructure internals exposure.
Review the provided rationale and use case background in the gist to understand the context and importance of this observability need.

Example

No code snippet is provided as the issue focuses on the need for an observability signal rather than a specific implementation.

Notes

The proposed solutions aim to address the substrate-layer observability need, which is orthogonal to the product-layer observability gaps addressed in #42796. The request is not about claiming silent degradation but rather about providing a verification mechanism for users to confirm inference stability.

Recommendation

Apply Option A — /v1/inference-health endpoint as it offers tighter integration with the existing API surface and provides a clear, structured signal for users to verify platform-side inference health.
