vllm - 💡(How to fix) Fix [Bug]: When using streaming tool calls in kimi-k2.5, only the content before the tool call can be obtained [1 participants]

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

I deployed kimi-k2.5 across two machines with vLLM v0.18.0. With streaming enabled, the response contains only the content that precedes the tool call; nothing after it is returned, and the request eventually ends with stop_reason:'length'. In the debug logs, the delta_text values printed by the tool parser concatenate into complete content, but more than thirty thousand characters appear between <|tool_calls_section_begin|> and <|tool_calls_section_end|>. For almost every delta in that span the parser prints 'Not enough token', and none of that content is returned to the client. The configuration and some results follow.

---------------------------------------------------------configuration----------------------------------------------------
export HCCL_IF_IP=<IP>
export GLOO_SOCKET_IFNAME="bond0"
export TP_SOCKET_IFNAME="bond0"
export HCCL_SOCKET_IFNAME="bond0"
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=5
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=1024
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export HCCL_OP_EXPANSION_MODE="AIV"

vllm serve /mnt/sdc/Kimi-K2.5/weight/Kimi-K2.5-w4a8 \
--host 0.0.0.0 \
--port 8088 \
--seed 1024 \
--served-model-name kimi_k2.5 \
--allowed-local-media-path / \
--quantization ascend \
--trust-remote-code \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address <IP> \
--data-parallel-rpc-port <port> \
--enable-expert-parallel \
--async-scheduling \
--mm-encoder-tp-mode 'data' \
--mm_processor_cache_type="shm" \
--max-num-seqs 128 \
--max-model-len 65536 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8,16,32,64,128,196], "cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"multistream_overlap_shared_expert":true}'

--------------------------------------------------some results -------------------------------------------------
....................
(APIServer pid=282) DEBUG 04-28 07:40:13 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: 第一章
(APIServer pid=282) DEBUG 04-28 07:40:13 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [42969]
(APIServer pid=282) DEBUG 04-28 07:40:13 [tool_parsers/kimi_k2_tool_parser.py:281] No tool call tokens found!
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: 和
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [488]
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:281] No tool call tokens found!
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: 第二章
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [44754]
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:281] No tool call tokens found!
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: 。
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [292]
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:281] No tool call tokens found!
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: <|tool_calls_section_begin|>
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [163595]
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:239] Entering tool section
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:331] In tool section before first tool, suppressing:
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: <|tool_call_begin|>
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [163597]
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:370] Starting on a new tool 0
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: 1
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [16]
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:466] Not enough token
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: c
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [66]
(APIServer pid=282) DEBUG 04-28 07:40:14 [tool_parsers/kimi_k2_tool_parser.py:466] Not enough token
..........................
(APIServer pid=282) DEBUG 04-28 07:43:34 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text: <|tool_calls_section_end|>
(APIServer pid=282) DEBUG 04-28 07:43:34 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [163596]
(APIServer pid=282) DEBUG 04-28 07:43:34 [tool_parsers/kimi_k2_tool_parser.py:337] Generating text content! skipping tool parsing.
(APIServer pid=282) DEBUG 04-28 07:43:34 [tool_parsers/kimi_k2_tool_parser.py:211] delta_text:
(APIServer pid=282) DEBUG 04-28 07:43:34 [tool_parsers/kimi_k2_tool_parser.py:212] delta_token_ids: [163586]
(APIServer pid=282) DEBUG 04-28 07:43:34 [tool_parsers/kimi_k2_tool_parser.py:337] Generating text content! skipping tool parsing.
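
To quantify what the log excerpt shows, the full debug log could be tallied with a short script. The following is a minimal sketch, assuming the log keeps the line layout seen above with one entry per line; the function name `summarize` is made up for illustration and is not part of vLLM.

```python
import re
from collections import Counter

# Captures the message that follows "[tool_parsers/kimi_k2_tool_parser.py:<line>]".
LOG_RE = re.compile(r"\[tool_parsers/kimi_k2_tool_parser\.py:\d+\]\s?(.*)")

SECTION_BEGIN = "<|tool_calls_section_begin|>"
SECTION_END = "<|tool_calls_section_end|>"


def summarize(lines):
    """Count parser messages and measure the text generated inside the tool section."""
    messages = Counter()
    in_section = False
    section_chars = 0
    for line in lines:
        match = LOG_RE.search(line)
        if match is None:
            continue  # not a kimi_k2 tool parser line
        message = match.group(1)
        if message.startswith("delta_text:"):
            text = message[len("delta_text:"):]
            if text.startswith(" "):
                text = text[1:]  # drop the separator space added by the logger
            marker = text.strip()
            if marker == SECTION_BEGIN:
                in_section = True
            elif marker == SECTION_END:
                in_section = False
            elif in_section:
                section_chars += len(text)
        elif not message.startswith("delta_token_ids:"):
            messages[message] += 1  # e.g. "Not enough token"
    return messages, section_chars
```

Calling `summarize(open(path, encoding="utf-8"))` on the server log (the path is whatever file the debug output was redirected to) returns the per-message counts and the number of characters seen between the two section markers, which can be compared with the "more than thirty thousand" figure reported above.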

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Increase the token limit or adjust the parsing logic to handle large tool call sections.

Guidance

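The reported stop_reason:'length' suggests the request ran into a length limit, so check the request's max_tokens alongside --max-model-len (65536 here). --max-num-batched-tokens (8192 here) is a per-step scheduling budget and is unlikely to limit how long a single output can be.
Investigate why tool_parsers/kimi_k2_tool_parser.py logs 'Not enough token' (line 466) for almost every delta after <|tool_call_begin|>, and why no tool call delta is emitted for the 30,000+ characters between <|tool_calls_section_begin|> and <|tool_calls_section_end|>.
HCCL_BUFFSIZE is unlikely to be the cause: it sizes HCCL communication buffers, whereas the missing text is already visible in the APIServer process's tool parser logs and is withheld there.
The vllm serve command shown does not include --enable-auto-tool-choice or --tool-call-parser, although the logs show the kimi_k2 tool parser running; confirm the exact tool-calling flags that were used.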
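
To confirm on the client side what the stream actually carries, the chunks can be reduced to three things: the text content, any tool call argument fragments, and the final finish reason. The following is a minimal sketch, assuming each chunk is the decoded JSON of one `data:` line in the OpenAI-compatible chat completions streaming format; the function name `summarize_stream` is hypothetical.

```python
def summarize_stream(chunks):
    """Accumulate text content, tool call argument fragments, and the finish reason."""
    content = []
    tool_args = {}  # tool call index -> list of argument fragments
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content.append(delta["content"])
            for call in delta.get("tool_calls") or []:
                fragment = (call.get("function") or {}).get("arguments") or ""
                tool_args.setdefault(call.get("index", 0), []).append(fragment)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return {
        "content": "".join(content),
        "tool_arguments": {i: "".join(parts) for i, parts in tool_args.items()},
        "finish_reason": finish_reason,
    }
```

If the behaviour described in this report is reproduced, the result would be expected to show non-empty `content`, an empty `tool_arguments` mapping, and a `finish_reason` of `length`.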

Example

No specific fix can be shown here, since it would require modifying the existing tool_parsers/kimi_k2_tool_parser.py logic.

Notes

The issue seems to be related to the token limit and parsing logic, but without more information about the tool_parsers/kimi_k2_tool_parser.py code, it's difficult to provide a more specific solution.
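
One hypothesis consistent with the log is that the parser withholds output until it has seen a complete tool call header, and that the header never completes. The toy model below illustrates only that buffering pattern. It is not vLLM's parser code; the `<|tool_call_argument_begin|>` marker and the `functions.<name>:<index>` ID used in the usage note are assumptions about the Kimi K2 tool call format that do not appear in the logs above, and the class name is made up.

```python
TOOL_CALL_BEGIN = "<|tool_call_begin|>"
ARGUMENT_BEGIN = "<|tool_call_argument_begin|>"  # assumed marker, not shown in the logs


class ToyHeaderBuffer:
    """Toy model: withhold streamed text until a complete tool call header arrives."""

    def __init__(self):
        self.buffer = ""
        self.header = None

    def feed(self, delta_text):
        """Return the text to emit for this delta, or None while still buffering."""
        if self.header is not None:
            return delta_text  # header known: argument text streams through
        self.buffer += delta_text
        if ARGUMENT_BEGIN not in self.buffer:
            return None  # analogous to the "Not enough token" state
        head, _, rest = self.buffer.partition(ARGUMENT_BEGIN)
        self.header = head.replace(TOOL_CALL_BEGIN, "")
        return rest
```

Feeding the deltas from the log (`<|tool_call_begin|>`, `1`, `c`, and so on) keeps returning `None` for as long as the argument marker is absent, whereas a header such as `functions.get_weather:0` followed by the marker would let later deltas through. Whether this matches the real parser's behaviour would need to be checked against its source.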

Recommendation

As a workaround to try, increase --max-num-batched-tokens from 8192 to a higher value, such as 16384 or 32768, and see if it resolves the issue. --max-model-len is already 65536 in this configuration, so those values would lower it rather than raise it.
