vllm - ✅(Solved) Fix [Bug]: Kimi-K2.5 outputs only '!!!!!!!!!!' in reasoning field, content is always null [1 pull request, 6 comments, 5 participants]

vllm · 2026-03-11 09:16:57


vllm-project/vllm#36763•Fetched 2026-04-08 00:34:58


Root Cause

Analysis — two possible root causes

Fix Action

Fix / Workaround

This bug was fixed by Moonshot in tokenizer_config.json:

Kimi-K2: commit 94a4053
Kimi-K2.5: commit 0102674

Related issues

HF #52 — content: null, </think> never emitted
HF #18 — random (no content) in responses and tool calls
vLLM PR #33248 — workaround patch (with reported data loss)
vLLM #33654 — empty content when max_tokens is too low
vLLM #35718 — incoherent output after extended runtime

PR fix notes

PR #33248: Kimi K2.5 model generates "(no content)" placeholder in tool call responses

Repository: vllm-project/vllm
Author: shivamashtikar
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/33248

Description (problem / solution / changelog)

Purpose

Fixes the issue where Kimi K2.5 model generates "(no content)" placeholder text in responses when making tool calls without accompanying text content. This placeholder leaks through to clients in both non-streaming and streaming modes.

Key Changes:

kimi_k2_reasoning_parser.py:
- Add _clean_content() method to strip "(no content)" placeholder
- Filter placeholder from extract_reasoning() output
- Filter placeholder from extract_reasoning_streaming() DeltaMessage results
- Return None when content becomes empty after cleaning
- Preserve reasoning and tool_calls when content is filtered
kimi_k2_tool_parser.py:
- Add _clean_content() method to strip "(no content)" placeholder
- Add _strip_all_tool_markers() to prevent marker leakage into content
- Filter placeholder from all extraction methods (non-tool, tool call, error paths)
- Increase buffer_max_size from 1024 to 4096 to handle larger tool arguments
- Add regex pattern to strip thinking blocks that may leak into content
- Improve streaming content handling:
  - Extract and return content before tool section markers
  - Return None instead of empty DeltaMessage for cleaned content
  - Strip all markers from post-section content
- Fix IndexError by initializing prev_tool_call_arr before returning
- Strip whitespace from tool_call_portion for better regex matching
- Fix array bounds checking in tool call state management
- Add deferred_section_exit handling in exception handler
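
The placeholder filtering described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name clean_content and the constant PLACEHOLDER are hypothetical, and the only behavior taken from the description is "strip the placeholder, return None when nothing is left".

```python
from typing import Optional

# Hypothetical constant; the PR description only names the literal text.
PLACEHOLDER = "(no content)"


def clean_content(content: Optional[str]) -> Optional[str]:
    """Remove the placeholder and collapse empty results to None."""
    if content is None:
        return None
    cleaned = content.replace(PLACEHOLDER, "").strip()
    # An empty string after cleaning is reported as "no content at all".
    return cleaned or None


print(clean_content("(no content)"))
print(clean_content("Checking the weather. (no content)"))
```

Note that calling strip() on every streaming delta would also drop meaningful leading or trailing spaces, so a streaming variant would need more care than this sketch shows.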

Before this fix:

{"choices": [{"message": {"content": "(no content)", "tool_calls": [...]}}]}

After this fix:

{"choices": [{"message": {"content": null, "tool_calls": [...]}}]}

Test Plan

To test this fix, run the following tests with Kimi K2/K2.5 model using tool calls:

Test 1: Non-streaming tool call without content

python -c "
from vllm import LLM
# Hugging Face repo id taken from the issue; trust_remote_code and any
# parallelism settings depend on your deployment.
llm = LLM(model='moonshotai/Kimi-K2.5', trust_remote_code=True)
# LLM.chat() takes plain dicts for messages and tools
messages = [
    {'role': 'user', 'content': 'What is the weather in Tokyo?'},
]
tools = [
    {'type': 'function', 'function': {
        'name': 'get_weather',
        'description': 'Get weather for a location',
        'parameters': {
            'type': 'object',
            'properties': {
                'location': {'type': 'string'}
            },
            'required': ['location']
        }
    }}
]
outputs = llm.chat(messages, tools=tools)
print(outputs[0].outputs[0].text)
# Should NOT contain '(no content)' in the text
"

Test 2: Streaming tool call without content

python -c "
# Streaming is exposed by the OpenAI-compatible server rather than LLM.chat(),
# so this check assumes a vLLM server is already running on localhost:8000
# with --reasoning-parser kimi_k2 --tool-call-parser kimi_k2.
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
tools = [
    {'type': 'function', 'function': {
        'name': 'get_weather',
        'description': 'Get weather for a location',
        'parameters': {
            'type': 'object',
            'properties': {
                'location': {'type': 'string'}
            },
            'required': ['location']
        }
    }}
]
stream = client.chat.completions.create(
    model='moonshotai/Kimi-K2.5',
    messages=[{'role': 'user', 'content': 'What is the weather in Tokyo?'}],
    tools=tools,
    temperature=0.0,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Verify no chunk contains '(no content)'
    assert '(no content)' not in (delta.content or ''), f'Found placeholder in delta: {delta}'
    print(delta)
"

Test 3: Existing tests should still pass

pytest tests/tool_parsers/test_kimi_k2_tool_parser.py -v
pytest tests/reasoning/test_kimi_k2_reasoning_parser.py -v

Test Result

Expected Results:

Non-streaming test:
- Response content should be null or empty string (not "(no content)")
- Tool calls should be properly extracted
- No placeholder text appears in the response
Streaming test:
- No delta chunk contains "(no content)" in content
- Empty content chunks are skipped (return None)
- Tool calls are properly parsed and returned
- Content before tool markers is correctly extracted
Existing tests:
- All existing parser tests should continue to pass
- No regression in functionality

Manual Testing Observations:

Verified that "(no content)" placeholder is filtered from all response paths
Confirmed tool markers and thinking blocks do not leak into content
Streaming responses no longer emit empty chunks
Parser handles edge cases without errors


Changed files

vllm/reasoning/kimi_k2_reasoning_parser.py (modified, +47/-2)
vllm/tool_parsers/kimi_k2_tool_parser.py (modified, +132/-27)



Your current environment

Model: moonshotai/Kimi-K2.5
Parsers: --reasoning-parser kimi_k2 --tool-call-parser kimi_k2
vLLM image tested: cu130-nightly and cu130-nightly-fff3711a244dd9e2915323e31c20768d922e90b5
GPUs: 8xB300 Nvidia

🐛 Describe the bug

The model responds to requests but produces entirely unusable output: the reasoning field contains only repeated exclamation marks in a loop, and content is null. The finish_reason is length, meaning the model consumes the entire token budget generating ! characters.

{
  "content": null,
  "reasoning": " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
}
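
A regression check for this symptom could look roughly like the sketch below. It assumes the response choice has been converted to a plain dict with the content and reasoning fields shown above; the helper name is_degenerate and the min_run threshold are hypothetical choices for illustration.

```python
def is_degenerate(choice: dict, min_run: int = 16) -> bool:
    """Flag a choice whose content is null and whose reasoning is one repeated character."""
    message = choice.get("message", {})
    content = message.get("content")
    reasoning = (message.get("reasoning") or "").strip()
    # A long run made of a single distinct character, e.g. only '!'.
    single_char_run = len(reasoning) >= min_run and len(set(reasoning)) == 1
    return content is None and single_char_run


bad_choice = {
    "finish_reason": "length",
    "message": {"content": None, "reasoning": " " + "!" * 62},
}
print(is_degenerate(bad_choice))
```

A healthy response (non-null content, or reasoning made of ordinary text) is not flagged, so the helper can be dropped into a smoke test after each redeploy.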

Steps to reproduce

Via curl:

curl https://<host>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "how are you?"}],
    "chat_template_kwargs": {"thinking": false},
    "max_tokens": 1024,
    "temperature": 0.6
  }'

Via Python client:

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

Same result in both cases.

What was tried without success

Tuning hyperparameters (temperature, top_p, top_k, repetition_penalty)
Disabling thinking via chat_template_kwargs: {"thinking": false}
Switching vLLM image (cu130-nightly-fff3711a244dd9e2915323e31c20768d922e90b5)

Analysis — two possible root causes

1. The `</think>` token is never emitted

Kimi-K2.5 runs in Thinking mode by default. The expected flow is <think>...</think> followed by the final response. If </think> is never generated, the kimi_k2 reasoning parser absorbs everything into reasoning_content and content remains null. The !!!! output confirms the model enters an incoherent state from the very first generated token.

This is consistent with HF discussion #52 (Model not generating </think> token).
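
To see why a missing </think> leaves content null, consider this simplified splitter. It is a sketch of the general technique only, not the kimi_k2 parser's actual implementation; the function name split_reasoning is hypothetical.

```python
from typing import Optional, Tuple

THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(model_output: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, content) by splitting on the end-of-thinking marker."""
    text = model_output
    if text.startswith(THINK_START):
        text = text[len(THINK_START):]
    if THINK_END not in text:
        # No end marker: everything is treated as reasoning, content stays None.
        return (text or None), None
    reasoning, _, content = text.partition(THINK_END)
    return (reasoning or None), (content or None)


print(split_reasoning(" !!!!!!!!"))
print(split_reasoning("<think>plan the reply</think>I am fine."))
```

With output that never contains the end marker, the whole string lands in the reasoning slot, which matches the shape of the response reported in this issue.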

2. Chat template incompatible with vLLM — `add_generation_prompt` silently dropped

Kimi's chat template uses **kwargs in apply_chat_template. vLLM, by a deliberate safety decision (PR #25794), only injects explicitly declared arguments — so add_generation_prompt is silently discarded.

The result is that the prompt is truncated before the assistant turn start token (<|im_assistant|>assistant<|im_middle|>), causing the model to go off the rails from the first token.
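
The mechanism can be reproduced in isolation. The sketch below is a toy model of the interaction, not vLLM's or Moonshot's code: render_chat stands in for a template function that accepts options only through **kwargs, and keep_declared_kwargs stands in for a caller that forwards only explicitly declared parameters. Both names and the bracketed role format are hypothetical.

```python
import inspect

ASSISTANT_PREFIX = "<|im_assistant|>assistant<|im_middle|>"


def render_chat(messages, **kwargs):
    """Toy renderer that hides its options behind **kwargs."""
    prompt = "".join(f"[{m['role']}]{m['content']}" for m in messages)
    if kwargs.get("add_generation_prompt", False):
        prompt += ASSISTANT_PREFIX
    return prompt


def keep_declared_kwargs(func, candidate_kwargs):
    """Keep only keyword arguments that func names explicitly in its signature."""
    declared = {
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    }
    return {k: v for k, v in candidate_kwargs.items() if k in declared}


requested = {"add_generation_prompt": True}
messages = [{"role": "user", "content": "how are you?"}]

unfiltered = render_chat(messages, **requested)
filtered = render_chat(messages, **keep_declared_kwargs(render_chat, requested))
print(unfiltered)
print(filtered)
```

Because add_generation_prompt is not a named parameter of render_chat, the filter drops it without any error, and the second prompt ends right after the user turn.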

This bug was fixed by Moonshot in tokenizer_config.json:

Kimi-K2: commit 94a4053
Kimi-K2.5: commit 0102674

Reference: vLLM Blog — Kimi-K2 Accuracy (Oct. 2025)

Investigation leads

Check the deployed snapshot: the tokenizer_config.json must be from a commit after 94a4053 (Kimi-K2) / 0102674 (Kimi-K2.5).
Test without --reasoning-parser kimi_k2: if content becomes non-null, the issue is in the parser.
Provide an explicit chat template via --chat-template ./chat_template.jinja using the up-to-date official template.
Apply PR #33248 in diagnostic mode to confirm the root cause (note: data loss reported in that PR).
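
For the snapshot and template checks above, one quick sanity test is to confirm that the prompt actually sent to the model opens an assistant turn after the last user message. The sketch below assumes you can obtain the rendered prompt string from your deployment (for example from request logs, which is deployment-specific); the helper name has_generation_prompt is hypothetical.

```python
ASSISTANT_PREFIX = "<|im_assistant|>assistant<|im_middle|>"


def has_generation_prompt(rendered_prompt: str, last_user_text: str) -> bool:
    """True if an assistant-turn start token appears after the last user message."""
    prefix_pos = rendered_prompt.rfind(ASSISTANT_PREFIX)
    user_pos = rendered_prompt.rfind(last_user_text)
    # rfind returns -1 when the substring is missing, so a missing prefix fails.
    return prefix_pos > user_pos


good = "how are you?" + ASSISTANT_PREFIX
truncated = "how are you?"
print(has_generation_prompt(good, "how are you?"))
print(has_generation_prompt(truncated, "how are you?"))
```

Comparing positions rather than testing the very end of the string leaves room for templates that append further tokens after the assistant prefix.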

Related issues

HF #52 — content: null, </think> never emitted
HF #18 — random (no content) in responses and tool calls
vLLM PR #33248 — workaround patch (with reported data loss)
vLLM #33654 — empty content when max_tokens is too low
vLLM #35718 — incoherent output after extended runtime


extent analysis

Fix Plan

To address the issue of the model producing unusable output with repeated exclamation marks and null content, we need to ensure that the deployed snapshot is up-to-date and the chat template is correctly configured.

Step 1: Update the Deployed Snapshot

Verify that the tokenizer_config.json is from a commit after 94a4053 (Kimi-K2) or 0102674 (Kimi-K2.5).

Step 2: Provide an Explicit Chat Template

Use the up-to-date official chat template via --chat-template ./chat_template.jinja.

Step 3: Test Without Reasoning Parser

Run the model without --reasoning-parser kimi_k2 to check if content becomes non-null.

Step 4: Apply Diagnostic Mode Patch (Optional)

Apply PR #33248 in diagnostic mode to confirm the root cause.

Example Code Changes

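To pass the template per call with the offline API, read the file and hand its contents to LLM.chat (llm is assumed to be an existing vllm.LLM instance and messages a list of chat messages):

from pathlib import Path

# Read the up-to-date official template from disk
chat_template = Path('chat_template.jinja').read_text()

# Override the model's built-in template for this call only
outputs = llm.chat(messages, chat_template=chat_template)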

Alternatively, you can use the --chat-template flag when running the model:

--chat-template ./chat_template.jinja

Verification

After applying the fixes, verify that the model produces coherent output with non-null content and the reasoning field contains the expected output.

Extra Tips

Set max_tokens high enough for the model to finish its reasoning and emit an answer; vLLM #33654 reports empty content when max_tokens is too low.
Sampling hyperparameters (temperature, top_p, top_k, repetition_penalty) did not change the behavior in this report, so rule out the tokenizer config and chat template before tuning them.
Refer to the official documentation and discussion forums for further guidance on troubleshooting and optimizing the model's performance.
