litellm - 💡(How to fix) Fix [Bug]: /health endpoint fails for reasoning models — max_tokens=1 ping consumed by reasoning_tokens [1 participants]

BerriAI/litellm#26987 (fetched 2026-05-02 05:28:10)

Summary

/health reports a model in unhealthy_endpoints with BadRequestError: Could not finish the message because max_tokens or model output limit was reached, even though the same model returns valid completions on /v1/chat/completions immediately after.

Root cause

The internal health-check probes with max_tokens: 1 regardless of the per-model max_tokens set in model_list. For reasoning models (gpt-5.5, deepseek-v4-pro, grok-4.20-reasoning, sonnet-4-5-thinking with extended thinking, etc.) the model spends reasoning tokens before producing visible output. With max_tokens=1, the model exhausts its budget on internal reasoning and returns the BadRequestError above.
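
The budget arithmetic behind this can be sketched with a toy model. This is not LiteLLM or provider code: `simulate_probe` and `ProbeResult` are hypothetical names, the token counts are made up, and the only assumption carried over from the issue is that hidden reasoning and visible output draw from the same max_tokens budget.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    reasoning_tokens: int  # tokens spent on hidden reasoning
    visible_tokens: int    # tokens left over for the visible answer
    ok: bool               # True if the visible answer fit in the budget


def simulate_probe(max_tokens: int, reasoning_needed: int, answer_needed: int = 1) -> ProbeResult:
    """Toy model: reasoning is paid for first, the answer gets what remains."""
    reasoning = min(reasoning_needed, max_tokens)
    visible = min(answer_needed, max_tokens - reasoning)
    return ProbeResult(reasoning, visible, ok=visible >= answer_needed)


# A probe with max_tokens=1 leaves nothing for the answer once reasoning starts.
print(simulate_probe(max_tokens=1, reasoning_needed=40))
# A larger probe budget leaves room for at least one visible token.
print(simulate_probe(max_tokens=256, reasoning_needed=40))
```

A model that needs no reasoning tokens passes the same probe with max_tokens=1, which is consistent with the failure being specific to reasoning models.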

Reproduction (LiteLLM proxy v1.x, latest as of 2026-04-30)

config.yaml:

model_list:
- model_name: gpt-5.5
  litellm_params:
    model: gpt-5.5
    api_key: os.environ/OPENAI_API_KEY
    max_tokens: 16384
    timeout: 240

curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/health
# → unhealthy_endpoints[gpt-5.5]: "Could not finish the message because max_tokens or model output limit was reached"

curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" -H 'Content-Type: application/json' \
     -d '{"model":"gpt-5.5","messages":[{"role":"user","content":"reply OK"}],"max_tokens":4096}' \
     http://localhost:4000/v1/chat/completions
# → {"choices":[{"message":{"content":"OK", ...}}], usage.completion_tokens_details.reasoning_tokens > 0}

Why this matters

The simple-shuffle routing strategy and most ops dashboards treat unhealthy_count > 0 as a red deploy status. False negatives on reasoning models (healthy deployments reported as unhealthy) force operators to ignore /health, which defeats its purpose.
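
Until the probe itself is fixed, a dashboard could at least separate this failure class from real outages. The sketch below is a minimal illustration, not a LiteLLM feature: `split_unhealthy` is a hypothetical helper, and the report shape (an `unhealthy_endpoints` list of objects with `model` and `error` keys) is an assumption that should be checked against the actual /health response.

```python
from typing import Any

# Substring taken from the error message quoted in this issue.
BUDGET_ERROR_MARKER = "max_tokens or model output limit was reached"


def split_unhealthy(report: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (suspected probe-budget failures, other failures) by model name."""
    suspected: list[str] = []
    other: list[str] = []
    for entry in report.get("unhealthy_endpoints", []):
        model = str(entry.get("model", "<unknown>"))
        error = str(entry.get("error", ""))
        if BUDGET_ERROR_MARKER in error:
            suspected.append(model)
        else:
            other.append(model)
    return suspected, other


# Illustrative report; the second entry is invented for contrast.
sample = {
    "unhealthy_count": 2,
    "unhealthy_endpoints": [
        {"model": "gpt-5.5",
         "error": "BadRequestError: Could not finish the message because "
                  "max_tokens or model output limit was reached"},
        {"model": "some-other-model", "error": "AuthenticationError: invalid api key"},
    ],
}
suspected, other = split_unhealthy(sample)
print("suspected probe-budget failures:", suspected)
print("other failures:", other)
```

String matching on an error message is brittle, so this only narrows down which alerts deserve a manual completion check.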

Workarounds tried (don't work)

  • health_check_params: {max_tokens: 4096} in litellm_params → LiteLLM passes it through to the provider as a real param; OpenAI rejects with Unknown parameter: 'health_check_params'. Doesn't appear in current docs as a router option.
  • Bumping max_tokens in litellm_params to 16384 — health-check ignores this and uses 1.

Suggested fix

One of:

  • (a) Add a documented per-model field health_check_params (or similar) that ONLY applies to the internal /health probe and is stripped before forwarding to the provider.
  • (b) Have the health-check default to a higher max_tokens (e.g. 256) when the provider is known to be a reasoning-class model (gpt-5.5, o1*, deepseek-r*, sonnet-thinking, grok-*-reasoning). This is heuristic but covers the major cases.
  • (c) Skip the max_tokens=1 ping for any model whose litellm_params.max_tokens >= 4096, and use a bigger probe that respects the model's reasoning budget.

(a) is the cleanest. (b) is the lowest-friction. Either closes the false-negative class without operators having to invent workarounds.
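
The three options can be sketched as one selection function. This is a hedged illustration of the proposal, not LiteLLM's implementation: `split_health_params`, `probe_max_tokens`, the pattern list, and the reuse of 256 for option (c) are all assumptions made for the example.

```python
import fnmatch
from typing import Any

# Hypothetical name patterns for reasoning-class models, loosely following (b).
REASONING_PATTERNS = ["gpt-5.5", "o1*", "deepseek-r*", "*thinking*", "grok-*-reasoning"]
DEFAULT_PROBE_TOKENS = 1      # probe budget described in this issue
REASONING_PROBE_TOKENS = 256  # example value from option (b)


def split_health_params(litellm_params: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Option (a): separate probe-only overrides from params sent to the provider."""
    provider_params = {k: v for k, v in litellm_params.items() if k != "health_check_params"}
    probe_overrides = dict(litellm_params.get("health_check_params") or {})
    return provider_params, probe_overrides


def probe_max_tokens(litellm_params: dict[str, Any]) -> int:
    """Pick max_tokens for the probe: explicit override first, then heuristics."""
    _, overrides = split_health_params(litellm_params)
    if "max_tokens" in overrides:  # option (a): per-model override
        return int(overrides["max_tokens"])
    model = str(litellm_params.get("model", ""))
    if any(fnmatch.fnmatchcase(model, p) for p in REASONING_PATTERNS):  # option (b)
        return REASONING_PROBE_TOKENS
    if int(litellm_params.get("max_tokens") or 0) >= 4096:  # option (c)
        return REASONING_PROBE_TOKENS
    return DEFAULT_PROBE_TOKENS


params = {"model": "gpt-5.5", "max_tokens": 16384,
          "health_check_params": {"max_tokens": 4096}}
provider_params, _ = split_health_params(params)
print(sorted(provider_params))   # health_check_params is no longer forwarded
print(probe_max_tokens(params))  # the explicit override wins
```

Checking the explicit override first keeps (a) authoritative, with (b) and (c) acting only as fallbacks for models that have no per-model setting.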

Environment

  • LiteLLM proxy (latest as of 2026-04-30)
  • Models tested: gpt-5.5, deepseek-v4-pro, grok-4.20-0309-reasoning — all show the same false-negative on /health while succeeding on /v1/chat/completions

Receipts

  • Real /v1/chat/completions responses show usage.completion_tokens_details.reasoning_tokens > 0 confirming the model consumed reasoning tokens before producing the visible content.
  • Codified downstream as ceo-hierarchy AP-4 in our skill substrate (Nous AGaaS).
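
The reasoning_tokens check from the first receipt can be reproduced on any saved response body. A minimal sketch, assuming the response has been parsed into a dict; `reasoning_tokens` is a hypothetical helper and the token counts in the sample are made up.

```python
from typing import Any


def reasoning_tokens(response: dict[str, Any]) -> int:
    """Return usage.completion_tokens_details.reasoning_tokens, or 0 if absent."""
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return int(details.get("reasoning_tokens") or 0)


# Illustrative response body; only the field path comes from the issue.
sample_response = {
    "choices": [{"message": {"content": "OK"}}],
    "usage": {
        "completion_tokens": 45,
        "completion_tokens_details": {"reasoning_tokens": 44},
    },
}
print(reasoning_tokens(sample_response))
```

A value above zero on a successful completion indicates the model spent tokens on reasoning before answering, which is the budget the max_tokens=1 probe does not allow for.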

Filed by: Madi Ayazbay (Nous AGaaS); session-id s108-mac-21655-20260430T2221.

extent analysis

TL;DR

The health check for reasoning models can be fixed by introducing a per-model health_check_params field or defaulting to a higher max_tokens value for known reasoning models.

Guidance

  • Introduce a health_check_params field in litellm_params that only applies to the internal /health probe and is stripped before forwarding to the provider.
  • Consider defaulting to a higher max_tokens value (e.g., 256) for known reasoning models like gpt-5.5 and deepseek-v4-pro.
  • Alternatively, skip the max_tokens=1 ping for models with litellm_params.max_tokens >= 4096 and use a bigger probe that respects the model's reasoning budget.

Example

No code snippet is provided as the issue is more related to configuration and model behavior.

Notes

The suggested fixes aim to address the false-negative issue for reasoning models without requiring significant changes to the existing infrastructure. However, the optimal solution may depend on the specific requirements and constraints of the LiteLLM proxy and the models being used.

Recommendation

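Adopt suggested fix (a): add a health_check_params field in litellm_params for per-model configuration of the health-check probe. This gives the most flexibility and control over health-check behavior across models. Note that (a) is a proposed change to LiteLLM, not a workaround operators can apply today: per the issue, setting health_check_params currently forwards it to the provider, which rejects it as an unknown parameter.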
