vllm - 💡(How to fix) Fix [Bug]: 0.17.0rc1 deploying GLM-4.7 on A2, tool calls break after enabling MTP [1 comments, 1 participants]

vllm2026-03-23 03:53:02

GitHub stats

vllm-project/vllm#37846•Fetched 2026-04-08 01:17:43

Author

samsuzhang

Participants

samsuzhang

Timeline (top)

commented ×1 · cross-referenced ×1 · labeled ×1

Your current environment

0.17.0rc1, A2 910B1 through B3, GLM-4.7, HDK 25.2.3

🐛 Describe the bug

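Problem 1:
· With MTP disabled, tool calls work normally. With MTP3 enabled, tool calls always produce malformed JSON. Error message: JSON parsing failed, reporting a missing }
· We tried the fix from the upstream issue and verified that it works: GitHub Issue #34449: [Bug]: GLM-5-FP8 malformed tool calls
· Root cause analysis: when MTP (Multi-Token Prediction) is enabled, vLLM predicts multiple tokens in parallel. The GLM-series tool parser, however, uses partial_json_parser to autocomplete incomplete JSON, which leads to the following:
  - The autocomplete result does not match the actual output: with MTP generating in parallel, token boundaries can get scrambled.
  - remaining_call is computed incorrectly: subtracting the already-sent content from the autocompleted full JSON can yield duplicated, truncated, or malformed JSON.
  - The JSON finally sent to the client is wrong, so client-side parsing fails.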
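The diffing hazard described in the root cause analysis can be illustrated with a small, self-contained sketch. This is not vLLM's tool parser and it does not use partial_json_parser: `autocomplete`, `stream_with_autocomplete_diff`, and `stream_raw_text_only` are hypothetical names, and the completer is a toy that only closes open strings, objects, and arrays. It shows the diffing problem in isolation and does not reproduce MTP itself.

```python
import json


def autocomplete(partial: str) -> str:
    """Toy completer (an assumption): close an open string, then open {...}/[...]."""
    closers = []
    in_string = False
    escaped = False
    for ch in partial:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    return partial + ('"' if in_string else "") + "".join(reversed(closers))


def stream_with_autocomplete_diff(snapshots):
    """Hazardous: diff the autocompleted JSON against what was already sent."""
    sent = ""
    for partial in snapshots:
        completed = autocomplete(partial)
        sent += completed[len(sent):]
    return sent


def stream_raw_text_only(snapshots):
    """Safer: only ever send text the model has actually generated."""
    sent = ""
    for partial in snapshots:
        sent += partial[len(sent):]
    return sent


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Two snapshots of the accumulated arguments text. The second jumps ahead by
# several tokens, as can happen when more than one token is accepted per step.
snapshots = ['{"city": "Bei', '{"city": "Beijing", "unit": "c"}']

broken = stream_with_autocomplete_diff(snapshots)
intact = stream_raw_text_only(snapshots)
print(is_valid_json(broken), is_valid_json(intact))
```

In the hazardous variant the synthetic closing quote and brace are sent to the client before the model has produced them, so every later delta is cut at the wrong offset. Sending only raw generated text avoids that, at the cost of the client seeing incomplete JSON until the call finishes.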

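Problem 2:
· After fixing Problem 1, tool calls still fail intermittently, with the same symptom: JSON parsing fails, reporting a missing }
· According to user feedback (user base of 1000+; on Monday morning about 10 users reported the problem within 3 hours, so it is fairly serious), the problem essentially did not occur before MTP was enabled and only started appearing afterwards
· The error is probabilistic and can be worked around by retrying, but it badly hurts the user experience
· The current model is GLM-4.7, W8A8-quantized with the latest official msmodelslim tool, with the Float MTP weights spliced in manually; the deployment uses mooncake V1 with PD disaggregation (separate prefill and decode)
· Root cause analysis: this feels more like a model precision issue?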

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the JSON parsing errors when MTP is enabled, we need to modify the tool parser to handle the parallel prediction output correctly.

Step-by-Step Solution:

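1. Update how the tool parser uses partial_json_parser: the issue locates the problem in the GLM tool parser's autocomplete step, so change that step to account for several tokens arriving at once when MTP is enabled. One option is to buffer the output and reassemble the JSON objects before sending them.
2. Implement a retry mechanism: for the probabilistic errors, retry the request a limited number of times. The reporter notes that retrying works around the error but hurts the user experience, so treat this as a stopgap.
3. Investigate model precision: the reporter suspects a precision problem in the W8A8-quantized model with manually spliced Float MTP weights. Consider adjusting the model precision or exploring alternative models to reduce the probabilistic errors.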

Example Code Snippet (Python):

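import json


class MTPJsonParser:
    """Buffers streamed text until it forms one complete JSON document."""

    def __init__(self):
        self.buffer = []

    def parse(self, output):
        # Append the new chunk, then try to parse everything buffered so far.
        self.buffer.append(output)
        try:
            json_output = json.loads(''.join(self.buffer))
        except json.JSONDecodeError:
            # Still incomplete: keep buffering and signal "not ready yet".
            return None
        self.buffer = []
        return json_output

    def retry_parse(self, generate, max_retries=3):
        # `generate` is a caller-supplied callable that re-runs the request
        # and returns the full tool-call text. Re-parsing the same string
        # cannot succeed, so each attempt starts from an empty buffer and
        # freshly generated output.
        for _ in range(max_retries):
            self.buffer = []
            parsed_output = self.parse(generate())
            if parsed_output is not None:
                return parsed_output
        # If all retries fail, clear the buffer and raise an error
        self.buffer = []
        raise ValueError("Failed to parse JSON output after retries")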

Verification

To verify the fix, test the tool calls with MTP enabled and disabled, ensuring that the JSON output is correctly parsed in both cases. Monitor the error rates and user feedback to confirm that the probabilistic errors are significantly reduced.
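A minimal sketch of how the comparison could be measured, assuming the tool-call argument strings from each configuration have already been collected. The function name `summarize_tool_call_arguments` is hypothetical, and the sample strings are made-up placeholders that only show the calling convention; they are not measurements from the issue.

```python
import json
from collections import Counter


def summarize_tool_call_arguments(argument_strings):
    """Return (failure_rate, Counter of JSONDecodeError messages)."""
    errors = Counter()
    for text in argument_strings:
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            # Group failures by decoder message to separate truncation
            # from other kinds of malformed output.
            errors[exc.msg] += 1
    total = len(argument_strings)
    rate = sum(errors.values()) / total if total else 0.0
    return rate, errors


# Placeholder inputs (illustrative only, not real data).
samples_mtp_off = ['{"city": "Beijing"}', '{"unit": "c"}']
samples_mtp_on = ['{"city": "Beijing"}', '{"city": "Beijing"']

rate_off, errors_off = summarize_tool_call_arguments(samples_mtp_off)
rate_on, errors_on = summarize_tool_call_arguments(samples_mtp_on)
print(f"MTP off: {rate_off:.1%} failed, MTP on: {rate_on:.1%} failed")
```

Running the same prompt set through both configurations and comparing the two rates gives a number to track alongside user feedback.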

Extra Tips

Regularly review and update the model to ensure the best possible precision and reduce errors.
Consider implementing additional logging and monitoring to quickly identify and address any recurring issues.
