litellm - 💡(How to fix) Fix [Bug]: /health endpoint fails for reasoning models — max_tokens=1 ping consumed by reasoning_tokens [1 participants]

litellm, 2026-05-01 16:20:34


BerriAI/litellm#26987 (fetched 2026-05-02 05:28:10)


Author

mayazbay


/health reports a model in unhealthy_endpoints with BadRequestError: Could not finish the message because max_tokens or model output limit was reached, even though the same model returns valid completions on /v1/chat/completions immediately after.

Root Cause

The internal health check sends its probe with max_tokens: 1, regardless of the per-model max_tokens set in model_list. Reasoning models (gpt-5.5, deepseek-v4-pro, grok-4.20-reasoning, sonnet-4-5-thinking with extended thinking, etc.) spend reasoning tokens before producing visible output. With max_tokens=1, the model exhausts its budget on internal reasoning before emitting any visible token, and the probe fails with the BadRequestError above.
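The budget arithmetic behind this can be illustrated with a toy model. This is only a sketch: it assumes reasoning tokens are charged against max_tokens before any visible token is produced, and the names simulate_probe and ProbeResult as well as the figure of 200 reasoning tokens are hypothetical, not taken from LiteLLM or any provider.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool
    visible_tokens: int
    detail: str


def simulate_probe(max_tokens: int, reasoning_tokens: int, visible_needed: int = 1) -> ProbeResult:
    """Toy model (assumption): reasoning tokens are spent from max_tokens first."""
    remaining = max(0, max_tokens - reasoning_tokens)
    if remaining < visible_needed:
        # Nothing is left for visible output, so the probe is reported as failed.
        return ProbeResult(False, 0, "max_tokens or model output limit was reached")
    return ProbeResult(True, visible_needed, "ok")


if __name__ == "__main__":
    # 200 reasoning tokens is an illustrative number, not a measurement.
    print(simulate_probe(max_tokens=1, reasoning_tokens=200))
    print(simulate_probe(max_tokens=4096, reasoning_tokens=200))
```

Under this model a non-reasoning model (reasoning_tokens=0) passes the max_tokens=1 probe, while any model that reasons first cannot.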




Reproduction (LiteLLM proxy v1.x, latest as of 2026-04-30)

config.yaml:

model_list:
- model_name: gpt-5.5
  litellm_params:
    model: gpt-5.5
    api_key: os.environ/OPENAI_API_KEY
    max_tokens: 16384
    timeout: 240

curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/health
# → unhealthy_endpoints[gpt-5.5]: "Could not finish the message because max_tokens or model output limit was reached"

curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" -H 'Content-Type: application/json' \
     -d '{"model":"gpt-5.5","messages":[{"role":"user","content":"reply OK"}],"max_tokens":4096}' \
     http://localhost:4000/v1/chat/completions
# → {"choices":[{"message":{"content":"OK", ...}}], usage.completion_tokens_details.reasoning_tokens > 0}

Why this matters

Routing strategy simple-shuffle and most ops dashboards interpret unhealthy_count > 0 as a deploy-RED. False negatives on reasoning models force operators to ignore /health, which defeats its purpose.
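Until the probe itself is fixed, a dashboard could at least separate this failure class from genuine outages. The sketch below assumes the /health response carries an unhealthy_endpoints list whose entries have model and error fields; that shape, the helper name split_unhealthy, and the sample payload are assumptions for illustration, so check them against a real response before relying on this.

```python
from typing import Any

# Error text quoted in this issue; other providers may word it differently.
TOKEN_LIMIT_MARKERS = ("max_tokens or model output limit was reached",)


def split_unhealthy(health: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (suspected_false_negatives, other_failures) as lists of model names."""
    suspected: list[str] = []
    other: list[str] = []
    for entry in health.get("unhealthy_endpoints", []):
        model = str(entry.get("model", "<unknown>"))
        error = str(entry.get("error", ""))
        if any(marker in error for marker in TOKEN_LIMIT_MARKERS):
            suspected.append(model)
        else:
            other.append(model)
    return suspected, other


# Hypothetical payload in the assumed shape.
SAMPLE = {
    "healthy_endpoints": [],
    "unhealthy_endpoints": [
        {
            "model": "gpt-5.5",
            "error": "BadRequestError: Could not finish the message because "
                     "max_tokens or model output limit was reached",
        },
        {"model": "example-model", "error": "Connection error."},
    ],
    "unhealthy_count": 2,
}

if __name__ == "__main__":
    suspected, other = split_unhealthy(SAMPLE)
    print("suspected false negatives:", suspected)
    print("other failures:", other)
```

This is a triage aid only: it matches on error text, so it hides nothing and fixes nothing on the proxy side.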

Workarounds tried (don't work)

health_check_params: {max_tokens: 4096} in litellm_params → LiteLLM passes it through to the provider as a real param; OpenAI rejects with Unknown parameter: 'health_check_params'. Doesn't appear in current docs as a router option.
Bumping max_tokens in litellm_params to 16384 — health-check ignores this and uses 1.

Suggested fix

One of:

(a) Add a documented per-model field health_check_params (or similar) that ONLY applies to the internal /health probe and is stripped before forwarding to the provider.
(b) Have the health check default to a higher max_tokens (e.g. 256) when the model is known to be reasoning-class (gpt-5.5, o1*, deepseek-r*, sonnet-thinking, grok-*-reasoning). This is heuristic but covers the major cases.
(c) Skip the max_tokens=1 ping for any model whose litellm_params.max_tokens >= 4096, and use a bigger probe that respects the model's reasoning budget.

(a) is the cleanest. (b) is the lowest-friction. Either closes the false-negative class without operators having to invent workarounds.
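Option (a) could be sketched roughly as follows. This is not LiteLLM's implementation: split_params is a hypothetical helper, health_check_params is the field name proposed in this issue rather than an existing option, and the default of max_tokens=1 for the probe simply mirrors the behavior reported above.

```python
from typing import Any

HEALTH_CHECK_KEY = "health_check_params"  # field name proposed in option (a), hypothetical


def split_params(litellm_params: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (provider_params, probe_params) without mutating the input.

    provider_params: what would be forwarded on normal requests (proxy-only key stripped).
    probe_params: what the /health probe would send (default max_tokens=1, then overrides).
    """
    provider_params = {k: v for k, v in litellm_params.items() if k != HEALTH_CHECK_KEY}
    overrides = litellm_params.get(HEALTH_CHECK_KEY) or {}
    # Later keys win: the probe default replaces the per-model max_tokens,
    # and explicit health-check overrides replace the default.
    probe_params = {**provider_params, "max_tokens": 1, **overrides}
    return provider_params, probe_params


if __name__ == "__main__":
    params = {
        "model": "gpt-5.5",
        "max_tokens": 16384,
        "timeout": 240,
        "health_check_params": {"max_tokens": 4096},
    }
    provider, probe = split_params(params)
    print(provider)  # no health_check_params key
    print(probe)     # max_tokens comes from the health-check override
```

The point of the split is that the proxy-only key never reaches the provider, which is what made the attempted workaround fail with Unknown parameter.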

Environment

LiteLLM proxy (latest as of 2026-04-30)
Models tested: gpt-5.5, deepseek-v4-pro, grok-4.20-0309-reasoning — all show the same false-negative on /health while succeeding on /v1/chat/completions

Receipts

Real /v1/chat/completions responses show usage.completion_tokens_details.reasoning_tokens > 0 confirming the model consumed reasoning tokens before producing the visible content.
Codified downstream as ceo-hierarchy AP-4 in our skill substrate (Nous AGaaS).

Filed by: Madi Ayazbay (Nous AGaaS); session-id s108-mac-21655-20260430T2221.

extent analysis

TL;DR

The health check for reasoning models can be fixed by introducing a per-model health_check_params field or defaulting to a higher max_tokens value for known reasoning models.

Guidance

Introduce a health_check_params field in litellm_params that only applies to the internal /health probe and is stripped before forwarding to the provider.
Consider defaulting to a higher max_tokens value (e.g., 256) for known reasoning models like gpt-5.5 and deepseek-v4-pro.
Alternatively, skip the max_tokens=1 ping for models with litellm_params.max_tokens >= 4096 and use a bigger probe that respects the model's reasoning budget.

Example

The issue includes no code for the fix itself; the problem lies in health-check configuration and model behavior.
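As a rough illustration of options (b) and (c), a probe-budget selector might look like the following. The function probe_max_tokens is hypothetical, the name patterns are adapted from the list in the issue (with sonnet-thinking loosened to a glob), and reusing 256 for option (c) is an assumption, since the issue does not say how large the bigger probe should be.

```python
import fnmatch
from typing import Optional

# Name patterns adapted from the issue's list; "*sonnet*thinking*" is a loosened guess.
REASONING_PATTERNS = ("gpt-5.5", "o1*", "deepseek-r*", "*sonnet*thinking*", "grok-*-reasoning")

DEFAULT_PROBE_TOKENS = 1       # behavior reported in the issue
REASONING_PROBE_TOKENS = 256   # example value from option (b)
LARGE_BUDGET_THRESHOLD = 4096  # threshold from option (c)


def probe_max_tokens(model: str, configured_max_tokens: Optional[int] = None) -> int:
    """Pick a max_tokens value for the health probe (heuristic sketch)."""
    # Option (b): known reasoning-class model names get a larger probe.
    if any(fnmatch.fnmatchcase(model, pattern) for pattern in REASONING_PATTERNS):
        return REASONING_PROBE_TOKENS
    # Option (c): a large configured budget hints at a reasoning model.
    if configured_max_tokens is not None and configured_max_tokens >= LARGE_BUDGET_THRESHOLD:
        return REASONING_PROBE_TOKENS
    return DEFAULT_PROBE_TOKENS


if __name__ == "__main__":
    for name, budget in [
        ("gpt-5.5", 16384),
        ("grok-4.20-0309-reasoning", None),
        ("deepseek-v4-pro", None),
        ("deepseek-v4-pro", 16384),
    ]:
        print(name, budget, probe_max_tokens(name, budget))
```

Note that deepseek-v4-pro, one of the models tested in the issue, matches none of the name patterns and is caught only through its configured budget, which shows why a name-based heuristic alone is fragile.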

Notes

The suggested fixes aim to address the false-negative issue for reasoning models without requiring significant changes to the existing infrastructure. However, the optimal solution may depend on the specific requirements and constraints of the LiteLLM proxy and the models being used.

Recommendation

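Adopt suggested fix (a): introduce a health_check_params field in litellm_params so that health-check parameters can be configured per model. This gives the most flexibility and control over health-check behavior across models. Because the field is currently forwarded to the provider and rejected (see the workarounds tried), this requires a change in LiteLLM itself rather than in operator configuration.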
