vllm: GLM-5.1-FP8: Tool results ignored via /v1/chat/completions with --tool-call-parser glm47 but work correctly via /v1/completions [1 comment, 1 participant]

vllm-project/vllm#39611 · Fetched 2026-04-12 13:24:24


GLM-5.1-FP8: Tool results ignored via /v1/chat/completions but work perfectly via /v1/completions

Environment

  • vLLM version: 0.19.1.dev1+g43a9b1afb
  • transformers version: 5.4.0
  • Docker image: vllm/vllm-openai:glm51-cu130
  • Model: zai-org/GLM-5.1-FP8
  • Hardware: NVIDIA B300 DGX, 8 GPUs
  • Startup flags:
    --model zai-org/GLM-5.1-FP8
    --tensor-parallel-size 8
    --tool-call-parser glm47
    --reasoning-parser glm45
    --enable-auto-tool-choice
    --served-model-name glm-5.1-fp8

Bug Description

When using /v1/chat/completions with --tool-call-parser glm47, GLM-5.1-FP8 ignores tool results in multi-turn conversations (the inbound round-trip). The model always responds as if the tool returned no data / an error, even though the tool result content is clearly provided.

When using /v1/completions with the exact same conversation formatted using the correct chat template, the model reads and correctly reports tool results.

This points to the bug being in vLLM's chat completions pipeline, not in the model.

Reproduction

Test 1 (Outbound tool call) — ✅ PASSES

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1-fp8",
    "messages": [
      {"role":"user","content":"What is the weather in Vancouver?"}
    ],
    "tools": [{"type":"function","function":{"name":"get_weather","description":"Get weather for a city","parameters":{"type":"object","properties":{"city":{"type":"string","description":"City name"}},"required":["city"]}}}],
    "tool_choice":"auto",
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Expected and actual result: Correctly returns tool_calls: [{"name": "get_weather", "arguments": {"city": "Vancouver"}}]


Test 2 (Inbound tool result round-trip) — ❌ FAILS

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1-fp8",
    "messages": [
      {"role":"user","content":"What is the weather in Vancouver?"},
      {"role":"assistant","content":null,"tool_calls":[{"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"Vancouver\"}"}}]},
      {"role":"tool","tool_call_id":"call_1","content":"15°C, partly cloudy"}
    ],
    "tools": [{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Expected: "The weather in Vancouver is 15°C, partly cloudy."

Actual: "I'm sorry, it seems the weather data for Vancouver couldn't be retrieved at the moment..." ❌ — GLM completely ignores the tool result content.

Note: Using "content": "" (empty string) instead of null for the assistant message produces the same wrong result.


Test 3 (Direct completions bypass — proves model works) — ✅ PASSES

Using the exact prompt generated by AutoTokenizer.apply_chat_template():

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1-fp8",
    "prompt": "[gMASK]<sop><|system|>\n# Tools\nYou may call one or more functions to assist with the user query.\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"name\": \"get_weather\", \"description\": \"Get weather\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\"}}, \"required\": [\"city\"]}}\n</tools>\nFor each function call, output the function name and arguments within the following XML format:\n<tool_call>{function-name}<arg_key>{arg-key-1}</arg_key><arg_value>{arg-value-1}</arg_value>...</tool_call><|user|>What is the weather in Vancouver?<|assistant|><tool_call>get_weather<arg_key>city</arg_key><arg_value>Vancouver</arg_value></tool_call><|observation|><tool_response>15C, partly cloudy</tool_response><|assistant|>",
    "max_tokens": 100,
    "temperature": 0.1
  }'

Actual result: "The weather in Vancouver is currently **15°C** and **partly cloudy**." ✅ Correct!


Smoking Gun: Token Count Mismatch

Endpoint                                      prompt_tokens  Result
/v1/chat/completions (Test 2)                 173            ❌ Ignores tool result
/v1/completions with correct prompt (Test 3)  170            ✅ Reads tool result

3 extra tokens are being injected by the glm47 parser's chat template rendering pipeline. These extra tokens corrupt the conversation structure so GLM treats the <tool_response> block as unexpected text rather than a function result.
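One way to pin down where the extra tokens sit is to diff the two token sequences. The sketch below is a minimal illustration using Python's difflib. The function name extra_tokens and the toy token lists are hypothetical; the real sequences would have to come from tokenizing the 173-token and 170-token prompts, which this sketch does not do. The toy data only mirrors the hypothesis discussed in this issue and is not a captured trace.

```python
from difflib import SequenceMatcher


def extra_tokens(reference, candidate):
    """Return (index_in_candidate, tokens) for each run found only in candidate.

    Runs that were inserted into, or substituted within, the candidate
    sequence are both reported.
    """
    matcher = SequenceMatcher(a=reference, b=candidate, autojunk=False)
    extras = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extras.append((j1, candidate[j1:j2]))
    return extras


# Toy data (assumption): token strings stand in for real token IDs.
reference = ["<|assistant|>", "<tool_call>", "get_weather", "</tool_call>", "<|observation|>"]
candidate = ["<|assistant|>", "<think>", "</think>", "None",
             "<tool_call>", "get_weather", "</tool_call>", "<|observation|>"]

print(extra_tokens(reference, candidate))
```

With real token IDs, the reported index shows which turn of the conversation the extra tokens land in, which is more informative than the bare count difference.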

Chat Template Analysis

Running AutoTokenizer.apply_chat_template() manually (with arguments as a dict, not a JSON string):

[gMASK]<sop><|system|># Tools
...
<|user|>Weather?<|assistant|><think></think>None<tool_call>get_weather<arg_key>city</arg_key><arg_value>Vancouver</arg_value></tool_call><|observation|><tool_response>15C cloudy</tool_response>

Note that <think></think>None appears before <tool_call>: this is the null content field of the assistant message rendered as the string None (Python's str(None)). Even with chat_template_kwargs: {"enable_thinking": false} in the API request, vLLM's pipeline produces a different (173-token) prompt compared to the manually rendered 170-token version.

Additional finding: The chat template crashes with jinja2.exceptions.UndefinedError: 'str object' has no attribute 'items' when arguments is passed as a JSON string (as in the OpenAI wire format) rather than a dict. vLLM's glm47 parser must be deserializing the JSON string before passing to the template — but the 3-token mismatch suggests this deserialization path introduces extra tokens.
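For client-side rendering, both observations (the literal None and the crash on string arguments) can be avoided by normalizing the messages before they reach the template. The following is a minimal sketch under that assumption; normalize_messages is a hypothetical helper, not part of vLLM or transformers, and it is not a fix for the server-side behavior (the report notes that sending an empty string to the API gives the same wrong result).

```python
import copy
import json


def normalize_messages(messages):
    """Return a deep copy that is safe to pass to a Jinja chat template.

    - content of None (or a missing content key) becomes "" so it cannot
      be rendered as the string "None"
    - tool-call arguments given as a JSON string (OpenAI wire format) are
      parsed into a dict, since the template iterates over .items()
    """
    normalized = copy.deepcopy(messages)
    for message in normalized:
        if message.get("content") is None:
            message["content"] = ""
        for call in message.get("tool_calls") or []:
            arguments = call["function"].get("arguments")
            if isinstance(arguments, str):
                # An empty string is treated as "no arguments" (assumption).
                call["function"]["arguments"] = json.loads(arguments) if arguments.strip() else {}
    return normalized


# The assistant and tool turns from Test 2.
messages = [
    {"role": "user", "content": "What is the weather in Vancouver?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"city\":\"Vancouver\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "15°C, partly cloudy"},
]
clean = normalize_messages(messages)
print(clean[1]["content"] == "", clean[1]["tool_calls"][0]["function"]["arguments"])
```

The normalized list can then be handed to AutoTokenizer.apply_chat_template() in place of the raw wire-format messages; the original list is left untouched.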

Suspected Root Cause

The glm47 parser likely does one or more of the following:

  1. Injects <think></think> reasoning tokens before the <tool_call> in the assistant turn when reconstructing the prompt, even when enable_thinking=False is requested.
  2. Fails to strip None/null content from the assistant message before template rendering, which adds extra tokens.
  3. Through these extra tokens, shifts the position of <|observation|> and <tool_response> so that GLM no longer associates the tool result with the preceding tool call.
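The first two hypotheses can be checked against any rendered prompt string with a simple pattern scan. The sketch below is a heuristic only: the pattern names and the find_artifacts helper are hypothetical, and the patterns are derived solely from the rendered fragments quoted in this issue. An empty think block may be legitimate in other positions, so a match is a hint, not proof.

```python
import re

# Heuristic patterns (assumption): based only on the fragments quoted in this issue.
SUSPECT_PATTERNS = {
    "empty think block at start of assistant turn":
        re.compile(r"<\|assistant\|>\s*<think>\s*</think>"),
    "literal None before tool call":
        re.compile(r"None\s*<tool_call>"),
}


def find_artifacts(prompt):
    """Return the labels of all suspect patterns found in a rendered prompt."""
    return [label for label, pattern in SUSPECT_PATTERNS.items() if pattern.search(prompt)]


# Fragment from the manual apply_chat_template() rendering.
suspect = ("<|user|>Weather?<|assistant|><think></think>None<tool_call>get_weather"
           "<arg_key>city</arg_key><arg_value>Vancouver</arg_value></tool_call>")
# Fragment from the hand-written Test 3 prompt.
clean_fragment = ("<|user|>What is the weather in Vancouver?<|assistant|><tool_call>get_weather"
                  "<arg_key>city</arg_key><arg_value>Vancouver</arg_value></tool_call>")

print(find_artifacts(suspect))
print(find_artifacts(clean_fragment))
```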

Workaround

Use /v1/completions with a manually pre-formatted GLM prompt (bypassing the chat completions pipeline entirely). This works but requires client-side chat template rendering.
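As a rough illustration of what client-side rendering involves, the sketch below rebuilds the Test 3 prompt from OpenAI-style messages and wraps it in a /v1/completions payload. render_glm_prompt is a hypothetical helper that mirrors only the prompt string shown in Test 3; it handles a single tool result per turn, assumes non-string argument values are JSON-encoded, and is no substitute for the model's real chat template, which remains authoritative.

```python
import json

# Fixed text copied from the Test 3 prompt.
TOOLS_HEADER = (
    "[gMASK]<sop><|system|>\n# Tools\n"
    "You may call one or more functions to assist with the user query.\n"
    "You are provided with function signatures within <tools></tools> XML tags:\n"
    "<tools>\n"
)
TOOLS_FOOTER = (
    "\n</tools>\n"
    "For each function call, output the function name and arguments "
    "within the following XML format:\n"
    "<tool_call>{function-name}<arg_key>{arg-key-1}</arg_key>"
    "<arg_value>{arg-value-1}</arg_value>...</tool_call>"
)


def render_glm_prompt(messages, tools):
    """Render messages into the prompt layout used in Test 3 (sketch only)."""
    tool_lines = "\n".join(json.dumps(t["function"], ensure_ascii=False) for t in tools)
    parts = [TOOLS_HEADER, tool_lines, TOOLS_FOOTER]
    for message in messages:
        role = message["role"]
        if role == "user":
            parts.append("<|user|>" + message["content"])
        elif role == "assistant":
            # Null content is rendered as nothing, never as the string "None".
            parts.append("<|assistant|>" + (message.get("content") or ""))
            for call in message.get("tool_calls") or []:
                function = call["function"]
                arguments = function["arguments"]
                if isinstance(arguments, str):  # OpenAI wire format
                    arguments = json.loads(arguments)
                parts.append("<tool_call>" + function["name"])
                for key, value in arguments.items():
                    if not isinstance(value, str):
                        value = json.dumps(value, ensure_ascii=False)  # assumption
                    parts.append(f"<arg_key>{key}</arg_key><arg_value>{value}</arg_value>")
                parts.append("</tool_call>")
        elif role == "tool":
            parts.append("<|observation|><tool_response>" + message["content"] + "</tool_response>")
    parts.append("<|assistant|>")  # generation prompt
    return "".join(parts)


tools = [{"type": "function", "function": {
    "name": "get_weather", "description": "Get weather",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]}}}]
messages = [
    {"role": "user", "content": "What is the weather in Vancouver?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"city\":\"Vancouver\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "15C, partly cloudy"},
]

prompt = render_glm_prompt(messages, tools)
# Body for POST http://localhost:8000/v1/completions, as in Test 3.
payload = {"model": "glm-5.1-fp8", "prompt": prompt, "max_tokens": 100, "temperature": 0.1}
print(json.dumps(payload)[:120])
```

In practice, rendering with the tokenizer's own template is preferable to a hand-rolled renderer like this one, because any change in the template would silently diverge from the copied strings.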

Related Issues

  • #34449 — malformed tool calls with MTP speculative decoding (different: outbound formatting, not inbound)
  • #32436 / #31379 — parser regex crash on zero-argument tools (different: crash vs silent wrong result)
  • #27703 — reasoning parser leaking <think> tags in multi-turn tool conversations (closest: similar thinking-token injection issue)

This issue is distinct because: (a) the model produces correct output when bypassed, (b) the token count difference is measurable, and (c) the symptom is silent wrong results rather than a crash or missing field.

extent analysis

TL;DR

The most likely fix for the issue is to modify the glm47 parser to properly handle null content in assistant messages and avoid injecting extra tokens, such as <think></think>, into the chat template.

Guidance

  • Investigate the glm47 parser's template rendering pipeline to identify where the extra tokens are being injected.
  • Modify the parser to strip null content from assistant messages before template rendering.
  • Verify that the modified parser produces the correct 170-token prompt without extra tokens.
  • Test the modified parser with the provided test cases to ensure it produces the expected results.

Example

No code example is provided as the issue requires modification of the existing glm47 parser, which is not fully specified in the issue body.

Notes

The issue is specific to the glm47 parser and the GLM-5.1-FP8 model, and may not be applicable to other parsers or models. The workaround using /v1/completions with a manually pre-formatted GLM prompt is a temporary solution, but it requires client-side chat template rendering.

Recommendation

Apply a workaround by using /v1/completions with a manually pre-formatted GLM prompt until the glm47 parser is modified to properly handle null content and avoid injecting extra tokens. This will ensure correct results for tool calls in multi-turn conversations.
