vllm - ✅(Solved) Fix [Bug]: Kimi-K2.5 outputs only '!!!!!!!!!!' in reasoning field, content is always null [1 pull requests, 6 comments, 5 participants]

GitHub stats
vllm-project/vllm#36763 · Fetched 2026-04-08 00:34:58
Comments: 6 · Participants: 5 · Timeline events: 21 · Reactions: 6
Timeline (top): subscribed ×7, commented ×6, cross-referenced ×5, mentioned ×2

Root Cause

Analysis — two possible root causes

Fix Action

Fix / Workaround

This bug was fixed by Moonshot in tokenizer_config.json:

  • Kimi-K2: commit 94a4053

  • Kimi-K2.5: commit 0102674

Related issues

  • HF #52: content is null, </think> is never emitted

  • HF #18: random (no content) in responses and tool calls

  • vLLM PR #33248: workaround patch (with reported data loss)

  • vLLM #33654: empty content when max_tokens is too low

  • vLLM #35718: incoherent output after extended runtime

PR fix notes

PR #33248: Kimi K2.5 model generates "(no content)" placeholder in tool call responses

Description (problem / solution / changelog)

Purpose

Fixes the issue where Kimi K2.5 model generates "(no content)" placeholder text in responses when making tool calls without accompanying text content. This placeholder leaks through to clients in both non-streaming and streaming modes.

Key Changes:

  1. kimi_k2_reasoning_parser.py:
    • Add _clean_content() method to strip "(no content)" placeholder
    • Filter placeholder from extract_reasoning() output
    • Filter placeholder from extract_reasoning_streaming() DeltaMessage results
    • Return None when content becomes empty after cleaning
    • Preserve reasoning and tool_calls when content is filtered
  2. kimi_k2_tool_parser.py:
    • Add _clean_content() method to strip "(no content)" placeholder
    • Add _strip_all_tool_markers() to prevent marker leakage into content
    • Filter placeholder from all extraction methods (non-tool, tool call, error paths)
    • Increase buffer_max_size from 1024 to 4096 to handle larger tool arguments
    • Add regex pattern to strip thinking blocks that may leak into content
    • Improve streaming content handling:
      • Extract and return content before tool section markers
      • Return None instead of empty DeltaMessage for cleaned content
      • Strip all markers from post-section content
    • Fix IndexError by initializing prev_tool_call_arr before returning
    • Strip whitespace from tool_call_portion for better regex matching
    • Fix array bounds checking in tool call state management
    • Add deferred_section_exit handling in exception handler
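
The placeholder-stripping behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `_clean_content()` implementation: the function name `clean_content` and the constant `PLACEHOLDER` are hypothetical, and only the "strip the placeholder, return None when nothing is left" behaviour is taken from the description.

```python
from typing import Optional

# Placeholder string named in the PR description.
PLACEHOLDER = "(no content)"


def clean_content(content: Optional[str]) -> Optional[str]:
    """Strip the placeholder and return None if nothing meaningful remains.

    Hypothetical helper: it mirrors the behaviour the PR describes,
    not the code the PR actually adds.
    """
    if content is None:
        return None
    cleaned = content.replace(PLACEHOLDER, "").strip()
    # An empty string becomes None so clients see `content: null`.
    return cleaned or None


print(clean_content("(no content)"))
print(clean_content("Checking the weather. (no content)"))
```

Note that a blanket `replace` like this would also delete a legitimate occurrence of the string "(no content)" in model output, which is a trade-off of this kind of filter.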

Before this fix:

{ "choices": [{ "message": { "content": "(no content)", "tool_calls": [...] } }] }

After this fix:

{ "choices": [{ "message": { "content": null, "tool_calls": [...] } }] }

Test Plan

To test this fix, run the following tests against a Kimi K2/K2.5 model using tool calls:

Test 1: Non-streaming tool call without content

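python -c "
# Assumption: a vLLM OpenAI-compatible server is already running on localhost:8000 with
# --reasoning-parser kimi_k2 --tool-call-parser kimi_k2 --enable-auto-tool-choice,
# because the parsers under test run in the server, not in offline LLM.chat().
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
messages = [
    {'role': 'user', 'content': 'What is the weather in Tokyo?'},
]
tools = [
    {'type': 'function', 'function': {
        'name': 'get_weather',
        'description': 'Get weather for a location',
        'parameters': {
            'type': 'object',
            'properties': {
                'location': {'type': 'string'}
            },
            'required': ['location']
        }
    }}
]
response = client.chat.completions.create(
    model='moonshotai/Kimi-K2.5', messages=messages, tools=tools)
message = response.choices[0].message
print(message.content, message.tool_calls)
# content should NOT contain '(no content)'
assert '(no content)' not in (message.content or '')
"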

Test 2: Streaming tool call without content

python -c "
# Assumption: a vLLM OpenAI-compatible server is already running on localhost:8000 with
# --reasoning-parser kimi_k2 --tool-call-parser kimi_k2 --enable-auto-tool-choice.
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
messages = [
    {'role': 'user', 'content': 'What is the weather in Tokyo?'},
]
tools = [
    {'type': 'function', 'function': {
        'name': 'get_weather',
        'description': 'Get weather for a location',
        'parameters': {
            'type': 'object',
            'properties': {
                'location': {'type': 'string'}
            },
            'required': ['location']
        }
    }}
]
stream = client.chat.completions.create(
    model='moonshotai/Kimi-K2.5', messages=messages, tools=tools,
    temperature=0.0, max_tokens=256, stream=True)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Verify no chunk contains '(no content)'
    assert '(no content)' not in (delta.content or ''), f'Found placeholder in delta: {delta}'
    print(delta)
"

Test 3: Existing tests should still pass

pytest tests/tool_parsers/test_kimi_k2_tool_parser.py -v
pytest tests/reasoning/test_kimi_k2_reasoning_parser.py -v

Test Result

Expected Results:

  1. Non-streaming test:
    • Response content should be null or empty string (not "(no content)")
    • Tool calls should be properly extracted
    • No placeholder text appears in the response
  2. Streaming test:
    • No delta chunk contains "(no content)" in content
    • Empty content chunks are skipped (return None)
    • Tool calls are properly parsed and returned
    • Content before tool markers is correctly extracted
  3. Existing tests:
    • All existing parser tests should continue to pass
    • No regression in functionality

Manual Testing Observations:

  • Verified that "(no content)" placeholder is filtered from all response paths
  • Confirmed tool markers and thinking blocks do not leak into content
  • Streaming responses no longer emit empty chunks
  • Parser handles edge cases without errors


Changed files

  • vllm/reasoning/kimi_k2_reasoning_parser.py (modified, +47/-2)
  • vllm/tool_parsers/kimi_k2_tool_parser.py (modified, +132/-27)


Your current environment

  • Model: moonshotai/Kimi-K2.5
  • Parsers: --reasoning-parser kimi_k2 --tool-call-parser kimi_k2
  • vLLM image tested: cu130-nightly and cu130-nightly-fff3711a244dd9e2915323e31c20768d922e90b5
  • GPUs: 8xB300 Nvidia

🐛 Describe the bug

The model responds to requests but produces entirely unusable output: the reasoning field contains only repeated exclamation marks in a loop, and content is null. The finish_reason is length, meaning the model consumes the entire token budget generating ! characters.

{
  "content": null,
  "reasoning": " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
}
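
To spot this failure mode automatically (for example in a smoke test after deployment), a check along the following lines may help. It is only a sketch: the helper name `is_degenerate` is hypothetical, and it assumes the response message has already been converted to a plain dict with the `content` and `reasoning` keys shown above.

```python
def is_degenerate(message: dict, finish_reason: str) -> bool:
    """Return True when the response matches the pattern reported in this issue:
    null content, reasoning made up only of '!' (plus whitespace), and the
    token budget exhausted (finish_reason == 'length').
    """
    reasoning = message.get("reasoning") or ""
    # Ignore whitespace, then require that '!' is the only character left.
    chars = set(reasoning) - set(" \n\t")
    only_bangs = chars == {"!"}
    return message.get("content") is None and only_bangs and finish_reason == "length"


broken = {"content": None, "reasoning": " !!!!!!!!!!!!"}
print(is_degenerate(broken, "length"))
```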

Steps to reproduce

Via curl:

curl https://<host>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "how are you?"}],
    "chat_template_kwargs": {"thinking": false},
    "max_tokens": 1024,
    "temperature": 0.6
  }'

Via Python client:

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

Same result in both cases.


What was tried without success

  • Tuning hyperparameters (temperature, top_p, top_k, repetition_penalty)
  • Disabling thinking via chat_template_kwargs: {"thinking": false}
  • Switching vLLM image (cu130-nightly-fff3711a244dd9e2915323e31c20768d922e90b5)

Analysis — two possible root causes

1. The </think> token is never emitted

Kimi-K2.5 runs in Thinking mode by default. The expected flow is <think>...</think> followed by the final response. If </think> is never generated, the kimi_k2 reasoning parser absorbs everything into reasoning_content and content remains null. The !!!! output confirms the model enters an incoherent state from the very first generated token.

This is consistent with HF discussion #52 (Model not generating </think> token).
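
The effect of a missing </think> can be illustrated with a simplified splitter. This is a toy model of the behaviour described above, not vLLM's kimi_k2 reasoning parser; the function name `split_reasoning` is hypothetical.

```python
from typing import Optional, Tuple

THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(output: str) -> Tuple[Optional[str], Optional[str]]:
    """Split raw model output into (reasoning, content).

    If the closing tag never appears, everything is treated as reasoning
    and content stays None, which is the symptom reported in this issue.
    """
    text = output.removeprefix(THINK_START)  # requires Python 3.9+
    if THINK_END not in text:
        return text or None, None
    reasoning, _, content = text.partition(THINK_END)
    return reasoning or None, content or None


print(split_reasoning("<think>greeting</think>I'm fine, thanks."))
print(split_reasoning(" !!!!!!!!"))
```

The second call shows why a model that loops on "!" without ever closing the thinking block ends up with all output in the reasoning field.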

2. Chat template incompatible with vLLM — add_generation_prompt silently dropped

Kimi's chat template uses **kwargs in apply_chat_template. vLLM, by a deliberate safety decision (PR #25794), only injects explicitly declared arguments — so add_generation_prompt is silently discarded.

The result is that the prompt is truncated before the assistant turn start token (<|im_assistant|>assistant<|im_middle|>), causing the model to go off the rails from the first token.
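
The mechanism described above can be reproduced in a few lines. The sketch below is a toy model, not vLLM's or Kimi's actual code: `apply_chat_template` is a stand-in that hides its options behind **kwargs, `declared_kwargs` is a hypothetical filter that forwards only explicitly named parameters, and the bracketed role format is invented for illustration. Only the assistant start token is taken from the issue text.

```python
import inspect

ASSISTANT_START = "<|im_assistant|>assistant<|im_middle|>"


def apply_chat_template(messages, **kwargs):
    """Toy stand-in for a tokenizer method that accepts options only via **kwargs."""
    prompt = "".join(f"[{m['role']}]{m['content']}" for m in messages)
    if kwargs.get("add_generation_prompt", False):
        prompt += ASSISTANT_START
    return prompt


def declared_kwargs(fn, kwargs):
    """Keep only kwargs that are explicitly named parameters of fn.

    Names that would be swallowed by **kwargs are not considered declared.
    """
    named = {
        name
        for name, p in inspect.signature(fn).parameters.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    return {k: v for k, v in kwargs.items() if k in named}


messages = [{"role": "user", "content": "how are you?"}]
requested = {"add_generation_prompt": True}

# add_generation_prompt is not a named parameter, so the filter drops it.
forwarded = declared_kwargs(apply_chat_template, requested)
prompt = apply_chat_template(messages, **forwarded)
print(forwarded)
print(prompt.endswith(ASSISTANT_START))
```

Because the option is dropped without an error, the only visible symptom is a prompt that stops before the assistant turn.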

This bug was fixed by Moonshot in tokenizer_config.json (Kimi-K2: commit 94a4053; Kimi-K2.5: commit 0102674).

Reference: vLLM Blog — Kimi-K2 Accuracy (Oct. 2025)


Investigation leads

  1. Check the deployed snapshot: the tokenizer_config.json must be from a commit after 94a4053 (Kimi-K2) / 0102674 (Kimi-K2.5).
  2. Test without --reasoning-parser kimi_k2: if content becomes non-null, the issue is in the parser.
  3. Provide an explicit chat template via --chat-template ./chat_template.jinja using the up-to-date official template.
  4. Apply PR #33248 in diagnostic mode to confirm the root cause (note: data loss reported in that PR).
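
One way to support leads 1 and 3 is to render the prompt locally and confirm that the assistant turn is actually opened. The check below is a sketch under assumptions: the helper name `has_assistant_turn` is hypothetical, and it only inspects a rendered prompt string for the assistant start token quoted in the analysis above.

```python
ASSISTANT_START = "<|im_assistant|>assistant<|im_middle|>"


def has_assistant_turn(prompt: str, last_user_text: str) -> bool:
    """Return True if the assistant start marker appears after the last user message."""
    user_pos = prompt.rfind(last_user_text)
    if user_pos == -1:
        return False
    return prompt.find(ASSISTANT_START, user_pos + len(last_user_text)) != -1


# Hand-written example strings; a real prompt would come from the chat template.
good = "...how are you?<|im_end|>" + ASSISTANT_START
truncated = "...how are you?<|im_end|>"
print(has_assistant_turn(good, "how are you?"))
print(has_assistant_turn(truncated, "how are you?"))
```

With Hugging Face transformers, the rendered prompt can typically be obtained via `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`; whether this matches what the server sends to the model depends on the template and kwargs the server actually uses.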

Related issues

  • HF #52: content is null, </think> is never emitted
  • HF #18: random (no content) in responses and tool calls
  • vLLM PR #33248: workaround patch (with reported data loss)
  • vLLM #33654: empty content when max_tokens is too low
  • vLLM #35718: incoherent output after extended runtime


extent analysis

Fix Plan

To address the issue of the model producing unusable output with repeated exclamation marks and null content, we need to ensure that the deployed snapshot is up-to-date and the chat template is correctly configured.

Step 1: Update the Deployed Snapshot

Verify that the deployed tokenizer_config.json comes from commit 94a4053 (Kimi-K2) or 0102674 (Kimi-K2.5), or from a later commit, since those commits contain Moonshot's fix.

Step 2: Provide an Explicit Chat Template

Use the up-to-date official chat template via --chat-template ./chat_template.jinja.

Step 3: Test Without Reasoning Parser

Run the model without --reasoning-parser kimi_k2 to check if content becomes non-null.

Step 4: Apply Diagnostic Mode Patch (Optional)

Apply PR #33248 in diagnostic mode to confirm the root cause. Treat it as a diagnostic only: data loss has been reported with that PR.

Example Code Changes

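To pass the chat template explicitly in offline inference, read the template file and hand its text to LLM.chat():

from pathlib import Path
from vllm import LLM

# Read the up-to-date official template from the working directory
chat_template = Path('chat_template.jinja').read_text()

# Pass the template text per call; trust_remote_code is assumed to be needed for Kimi's custom tokenizer
llm = LLM(model='moonshotai/Kimi-K2.5', trust_remote_code=True)
messages = [{'role': 'user', 'content': 'how are you?'}]
outputs = llm.chat(messages, chat_template=chat_template)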

Alternatively, you can use the --chat-template flag when running the model:

--chat-template ./chat_template.jinja

Verification

After applying the fixes, rerun the reproduction request and verify that content is non-null and that the reasoning field no longer consists of repeated ! characters.

Extra Tips

  • Ensure that the max_tokens parameter is set to a sufficient value to allow the model to generate coherent output.
  • Sampling hyperparameters (temperature, top_p, top_k, repetition_penalty) did not resolve the issue for the reporter, so rule out the snapshot and chat template before tuning them.
  • Refer to the official documentation and discussion forums for further guidance on troubleshooting and optimizing the model's performance.
