vllm: [Bug]: reasoning_effort passed to MistralCommonTokenizer.apply_chat_template breaks Mistral Small 4 chat completions on vLLM 0.18.0 [2 comments, 2 participants]

GitHub stats
vllm-project/vllm#38560 (fetched 2026-04-08 01:53:21)
Comments: 2 · Participants: 2 · Timeline events: 6 · Reactions: 0
Timeline (top): commented ×2, subscribed ×2, labeled ×1, mentioned ×1


Your current environment

Environment

  • vLLM: 0.18.0 (e.g. Docker image vllm/vllm-openai or internal vllm-audio:v0.18.0)
  • Model: mistralai/Mistral-Small-4-119B-2603 (or equivalent weights served with Mistral tokenizer path)
  • Hardware / stack: (fill in: GPU type, CUDA, --tensor-parallel-size, etc.)

🐛 Describe the bug

Problem

POST /v1/chat/completions fails with HTTP 400 and:

ValueError: Kwargs ['reasoning_effort'] are not supported by MistralCommonTokenizer.apply_chat_template.

The failure occurs in vLLM's Mistral chat rendering path, e.g.:

  File ".../vllm/entrypoints/openai/chat_completion/serving.py", line 209, in render_chat_request
    return await self.openai_serving_render.render_chat(request)
  ...
  File ".../vllm/renderers/mistral.py", line 125, in render_messages_async
    prompt_raw = await self._apply_chat_template_async(
  ...
  File ".../vllm/renderers/mistral.py", line 34, in safe_apply_chat_template
    return tokenizer.apply_chat_template(messages, **kwargs)
  ...
  File ".../transformers/tokenization_mistral_common.py", line 1432, in apply_chat_template
    raise ValueError(
ValueError: Kwargs ['reasoning_effort'] are not supported by MistralCommonTokenizer.apply_chat_template.

Reproduction

Important: The request body does not need to include reasoning_effort. A minimal call is enough to trigger the error (LiteLLM is not required; we reproduced with direct HTTP to vLLM). Example:

curl -sS "http://<vllm-host>:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-id>",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 16,
    "temperature": 0.0
  }'

Actual: HTTP 400, error message as above.

Expected: Request accepted and completion generated (or a clear validation error only if the client sends unsupported fields).

Context

The Hugging Face chat template for Mistral Small 4 uses reasoning_effort in Jinja for [MODEL_SETTINGS], but MistralCommonTokenizer.apply_chat_template in Transformers rejects reasoning_effort as an unsupported kwarg when vLLM forwards it into that API. We confirmed the same failure when bypassing LiteLLM, so this is not proxy-specific.

What would help

vLLM should not pass reasoning_effort (and any other unsupported kwargs) into MistralCommonTokenizer.apply_chat_template, or it should align with Transformers / tokenizer behavior for Mistral 4-style templates. If there is an intended flag or server-side default for Mistral Small 4 + OpenAI chat API, documenting it would help.


extent analysis

Fix Plan

To resolve the issue, the vLLM code needs to stop forwarding unsupported kwargs, such as reasoning_effort, to MistralCommonTokenizer.apply_chat_template.

Here are the steps:

  • Locate the call chain in vllm/renderers/mistral.py: per the traceback, render_messages_async calls _apply_chat_template_async, and tokenizer.apply_chat_template is ultimately invoked from safe_apply_chat_template.
  • Filter out unsupported kwargs at one of these points, before they reach apply_chat_template.
  • Example code:
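# in vllm/renderers/mistral.py (sketch; the real signature and surrounding code may differ)
# Kwargs that MistralCommonTokenizer.apply_chat_template is known to reject.
UNSUPPORTED_TEMPLATE_KWARGS = frozenset({"reasoning_effort"})

async def render_messages_async(self, messages, **kwargs):  # must be async to use await
    # ...
    # Drop rejected kwargs rather than allow-listing sampling parameters such as
    # max_tokens or temperature, which are not chat template kwargs.
    filtered_kwargs = {k: v for k, v in kwargs.items() if k not in UNSUPPORTED_TEMPLATE_KWARGS}
    prompt_raw = await self._apply_chat_template_async(messages, **filtered_kwargs)
    # ...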

Alternatively, safe_apply_chat_template could catch the ValueError that apply_chat_template raises for unsupported kwargs and retry without them. Catching and ignoring the exception alone is not enough, because no prompt would be rendered.
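The catch-and-retry alternative could look roughly like the sketch below. It is an illustration only, not vLLM's code: apply_with_retry and fake_apply are hypothetical names, fake_apply stands in for tokenizer.apply_chat_template, and the regular expression assumes the error message keeps the exact format shown in the traceback, which makes this approach more brittle than filtering kwargs up front.

```python
import re


def apply_with_retry(apply_fn, messages, **kwargs):
    """Call apply_fn; on an 'unsupported kwargs' ValueError, retry once without them."""
    try:
        return apply_fn(messages, **kwargs)
    except ValueError as exc:
        # Assumes the message format from the traceback:
        # "Kwargs ['name', ...] are not supported by ..."
        match = re.search(r"Kwargs \[(.*?)\] are not supported", str(exc))
        if match is None:
            raise  # unrelated ValueError: do not mask it
        rejected = set(re.findall(r"'([^']+)'", match.group(1)))
        remaining = {k: v for k, v in kwargs.items() if k not in rejected}
        if len(remaining) == len(kwargs):
            raise  # nothing could be dropped, so a retry would fail the same way
        return apply_fn(messages, **remaining)


def fake_apply(messages, **kwargs):
    # Hypothetical stand-in for tokenizer.apply_chat_template.
    bad = [k for k in kwargs if k == "reasoning_effort"]
    if bad:
        raise ValueError(
            f"Kwargs {bad} are not supported by "
            "MistralCommonTokenizer.apply_chat_template."
        )
    return f"{len(messages)} message(s), kwargs={sorted(kwargs)}"


result = apply_with_retry(
    fake_apply,
    [{"role": "user", "content": "hi"}],
    reasoning_effort="high",
    tools=None,
)
```

Here the first call fails on reasoning_effort and the retry succeeds with only tools forwarded. Unrelated ValueErrors are re-raised unchanged so that genuine template errors stay visible.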

Verification

To verify the fix, you can retry the curl command that previously triggered the error:

curl -sS "http://<vllm-host>:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-id>",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 16,
    "temperature": 0.0
  }'

The request should now be accepted and a completion generated without returning a 400 error.
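For a scripted check, the same request can be sent from Python using only the standard library. This is a minimal sketch: build_chat_request and chat_status are hypothetical helper names, and the host and model placeholders must be replaced with the values of your deployment.

```python
import json
import urllib.error
import urllib.request


def build_chat_request(base_url, model):
    """Build the same minimal request as the curl command (no reasoning_effort)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "hi"}],
        "max_tokens": 16,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_status(base_url, model, timeout=30):
    """Send the request and return (status_code, body_text)."""
    request = build_chat_request(base_url, model)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; return them so the caller can inspect.
        return err.code, err.read().decode("utf-8")
```

Calling chat_status("http://<vllm-host>:8000", "<served-model-id>") should return status 200 after the fix; before it, the status is 400 and the body contains the reasoning_effort message.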

Extra Tips

  • Update the vLLM documentation to reflect any change in which kwargs are forwarded to MistralCommonTokenizer.apply_chat_template.
  • Consider validating kwargs before calling apply_chat_template, so that a client who explicitly sends an unsupported field gets a clear error message instead of the raw tokenizer exception.
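One way to read the validation tip, in line with the issue's expected behavior (a validation error only if the client sends unsupported fields), is sketched below. The function name select_template_kwargs, the split between client kwargs and server defaults, and the supported set are all assumptions made for illustration; they do not describe how vLLM structures its request handling.

```python
def select_template_kwargs(client_kwargs, server_defaults, supported):
    """Merge template kwargs, rejecting only unsupported ones the client sent.

    client_kwargs:   kwargs the client explicitly put in the request
    server_defaults: kwargs the server would add on its own
    supported:       names the tokenizer accepts (hypothetical; must be maintained)
    """
    rejected = sorted(k for k in client_kwargs if k not in supported)
    if rejected:
        raise ValueError(
            f"Unsupported chat template kwargs for this model: {rejected}. "
            f"Supported: {sorted(supported)}."
        )
    # Unsupported server-side defaults are dropped silently: the client did
    # not ask for them, so they should not cause a 400.
    merged = {k: v for k, v in server_defaults.items() if k in supported}
    merged.update(client_kwargs)
    return merged


# A request without reasoning_effort passes even if the server has a default.
ok = select_template_kwargs({}, {"reasoning_effort": None}, {"tools"})
```

With this split, the minimal request from the reproduction is accepted, while a request that explicitly sets reasoning_effort fails with a message naming the field and listing the supported alternatives.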
