vllm: [Bug]: tool_choice="required" + speculative decoding with lukealonso/Qwen3.5-397B-A17B-NVFP4 leads to failed tool calls

GitHub stats
vllm-project/vllm#38106, fetched 2026-04-08 01:32:17
Comments: 7
Participants: 2
Timeline events: 8 (commented ×7, labeled ×1)
Reactions: 0




Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Debian GNU/Linux 13 (trixie) (x86_64)
GCC version                  : (Debian 14.2.0-19) 14.2.0
Clang version                : Could not collect
CMake version                : version 4.2.3
Libc version                 : glibc-2.41

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.13.1 (main, Jan 14 2025, 22:47:38) [Clang 19.1.6 ] (64-bit runtime)
Python platform              : Linux-6.12.48+deb13-amd64-x86_64-with-glibc2.41

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
GPU 2: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
GPU 3: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

Nvidia driver version        : 580.82.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  48
On-line CPU(s) list:                     0-47
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9274F 24-Core Processor
CPU family:                              25
Model:                                   17
Thread(s) per core:                      2
Core(s) per socket:                      24
Socket(s):                               1
Stepping:                                1
Frequency boost:                         enabled
CPU(s) scaling MHz:                      95%
CPU max MHz:                             4304.1870
CPU min MHz:                             1500.0000
BogoMIPS:                                8100.28
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization:                          AMD-V
L1d cache:                               768 KiB (24 instances)
L1i cache:                               768 KiB (24 instances)
L2 cache:                                24 MiB (24 instances)
L3 cache:                                256 MiB (8 instances)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-47
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Vulnerable: Clear CPU buffers attempted, no microcode
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.18.1.dev0+gbcf2be961.d20260325 (git sha: bcf2be961, date: 20260325)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NODE	NODE	NODE	0-47	0		N/A
GPU1	NODE	 X 	NODE	NODE	0-47	0		N/A
GPU2	NODE	NODE	 X 	NODE	0-47	0		N/A
GPU3	NODE	NODE	NODE	 X 	0-47	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_ubuntu
</details>

🐛 Describe the bug

Hello,

Setup

I send requests that need structured output, so I set tool_choice="required" and define a single tool whose schema enforces the output structure.

<details> <summary>Example config, without prompts:</summary>
{
  "messages": [],
  "model": "ai_model",
  "max_completion_tokens": 16000,
  "presence_penalty": 1.5,
  "reasoning_effort": "low",
  "seed": 42,
  "stream": false,
  "temperature": 1.0,
  "tool_choice": "required",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "final_result",
        "description": "The final response which ends this conversation",
        "parameters": {
          "properties": {
            "chain_of_thought": { "description": "", "type": "string" },
            "conclusion": {
              "description": "The conclusion based on the occurrences.",
              "enum": ["detection", "no detection", "unsure"],
              "type": "string"
            }
          },
          "required": ["chain_of_thought", "conclusion"],
          "title": "",
          "type": "object"
        }
      }
    }
  ],
  "top_p": 0.95,
  "top_k": 20,
  "priority": 0.0,
  "chat_template_kwargs": { "enable_thinking": true }
}
</details>

<details> <summary>Setup for my vllm server:</summary>
--model lukealonso/Qwen3.5-397B-A17B-NVFP4
--served-model-name ai_model
--host 0.0.0.0
--port 8000
--tensor-parallel-size 4
--trust-remote-code
--gpu-memory-utilization 0.90
--max-num-batched-tokens 16384
--max-num-seqs 84
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--mm-encoder-tp-mode data
--mm-processor-cache-type shm
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
</details>

With this setup, the speculative decoding model of Qwen3.5 (I haven't tested whether quantization is partly to blame) generates tool calls as XML instead of JSON. The tool_choice="required" path at https://github.com/vllm-project/vllm/blob/bcf2be96120005e9aea171927f85055a6a5c0cf6/vllm/entrypoints/openai/engine/serving.py#L1132 expects JSON, so validation fails there. That error is suppressed, which leaves the tool call list empty, yet the response is still given finish_reason="tool_calls" at https://github.com/vllm-project/vllm/blob/bcf2be96120005e9aea171927f85055a6a5c0cf6/vllm/entrypoints/openai/chat_completion/serving.py#L1445. The result is an output like this (printed by my client library rather than by vllm, but you get the point):

<details> <summary>Output:</summary>
ChatCompletion(
    id='chatcmpl-9cabc085690ac46a',
    choices=[
        Choice(
            finish_reason='tool_calls',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content='',
                refusal=None,
                role='assistant',
                annotations=None,
                audio=None,
                function_call=None,
                tool_calls=[],
                reasoning='...'  # normal reasoning
            ),
            stop_reason=None,
            token_ids=None
        )
    ],
    created=1774435104,
    model='ai_model',
    object='chat.completion',
    service_tier=None,
    system_fingerprint=None,
    usage=CompletionUsage(
        completion_tokens=849,
        prompt_tokens=4592,
        total_tokens=5441,
        completion_tokens_details=None,
        prompt_tokens_details=None
    ),
    prompt_logprobs=None,
    prompt_token_ids=None,
    kv_transfer_params=None
)
</details>
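The failure mode described above can be reproduced in isolation. This is a minimal sketch, not vLLM's code: `json.loads` stands in for the pydantic `TypeAdapter(...).validate_json` call, `extract_required_tool_calls` is a hypothetical helper name, and the XML shape of the model output is an assumption about what the qwen3_coder format looks like.

```python
import contextlib
import json


def extract_required_tool_calls(content: str) -> list:
    """Mimic a JSON-only extraction path in which parse errors are swallowed."""
    tool_calls: list = []
    # Stand-in for validate_json(): any parse failure is silently suppressed,
    # so tool_calls keeps its initial empty value.
    with contextlib.suppress(json.JSONDecodeError):
        tool_calls = json.loads(content)
    return tool_calls


# What the JSON-only path expects from the model.
json_output = '[{"name": "final_result", "parameters": {"conclusion": "unsure"}}]'

# Assumed shape of an XML-style tool call (illustrative only).
xml_output = (
    "<tool_call>\n<function=final_result>\n"
    "<parameter=conclusion>\nunsure\n</parameter>\n"
    "</function>\n</tool_call>"
)

json_calls = extract_required_tool_calls(json_output)
xml_calls = extract_required_tool_calls(xml_output)

print(len(json_calls))  # 1
print(len(xml_calls))   # 0: the parse error was suppressed, nothing was extracted
```

The second result is the interesting one: no exception reaches the caller, so nothing downstream knows that extraction failed.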

Secondary bug

In my opinion, finish_reason="tool_calls" with no tool actually called is a second bug, although realistically it can only occur when another bug has already happened. The tool_choice="required" path could check whether any tool calls were actually extracted and raise an error if none were.
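A guard of the kind suggested here could look roughly like the sketch below. All names (`ParsedCall`, `MissingToolCallError`, `require_tool_calls`) are hypothetical and do not exist in vLLM; where such a check would actually live in the serving code is left open.

```python
from dataclasses import dataclass


@dataclass
class ParsedCall:
    """Hypothetical stand-in for an extracted function call."""
    name: str
    arguments: str


class MissingToolCallError(ValueError):
    """Raised when tool_choice="required" produced no tool call at all."""


def require_tool_calls(tool_choice: str, calls: list) -> list:
    """Return `calls` unchanged, or raise if a required tool call is missing."""
    if tool_choice == "required" and not calls:
        raise MissingToolCallError(
            'tool_choice="required" but no tool call could be extracted '
            "from the model output"
        )
    return calls


# A populated list passes through untouched.
ok = require_tool_calls("required", [ParsedCall("final_result", "{}")])

# An empty list is reported instead of being returned with
# finish_reason="tool_calls".
try:
    require_tool_calls("required", [])
    guard_fired = False
except MissingToolCallError:
    guard_fired = True
```

Failing loudly at this point would turn the silent empty response into an error that a client can see and retry on.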

Solutions I found

  1. Disable speculative decoding. Without it, qwen3.5-397b as a base model seems to generate tool calls almost exclusively as JSON.
  2. Add the tool call parser as a fallback for tool_choice="required". I tried this and it worked, i.e. this patch at https://github.com/vllm-project/vllm/blob/bcf2be96120005e9aea171927f85055a6a5c0cf6/vllm/entrypoints/openai/engine/serving.py#L1128
        elif request.tool_choice == "required":
            tool_calls = []
            with contextlib.suppress(ValidationError):
                print(f"Inside tool extraction: {content=}")
                content = content or ""
                tool_calls = TypeAdapter(list[FunctionDefinition]).validate_json(
                    content
                )
            print(f"Tool calls {tool_calls=}")
            for tool_call in tool_calls:
                function_calls.append(
                    FunctionCall(
                        name=tool_call.name,
                        arguments=json.dumps(tool_call.parameters, ensure_ascii=False),
                    )
                )
            # content = None  # Clear content since tool is called.
            if len(content) > 0 and len(tool_calls) == 0:
                print("Using alternative route.")
                try:
                    tool_parser = tool_parser_cls(tokenizer)
                except RuntimeError as e:
                    logger.exception("Error in tool parser creation.")
                    raise e
                tool_call_info = tool_parser.extract_tool_calls(
                    content if content is not None else "",
                    request=request,  # type: ignore
                )
                if tool_call_info is not None and tool_call_info.tools_called:
                    # extract_tool_calls() returns a list of tool calls.
                    function_calls.extend(
                        FunctionCall(
                            id=tool_call.id,
                            name=tool_call.function.name,
                            arguments=tool_call.function.arguments,
                        )
                        for tool_call in tool_call_info.tool_calls
                    )
                    content = tool_call_info.content
                    if content and content.strip() == "":
                        content = None
                else:
                    # No tool calls.
                    return None, content
            content = None

I can turn this into a PR if you want, otherwise I'll let you fix this in your own time.
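The idea behind the patch (JSON first, then a format-specific parser) can be sketched without any vLLM imports. This is a self-contained illustration under assumptions: the regular expressions encode an assumed XML layout (`<function=NAME>` wrapping `<parameter=KEY>` elements), every parameter value is kept as a string, and none of the function names below are vLLM APIs.

```python
import json
import re

# Assumed XML layout of a tool call; illustrative, not taken from vLLM.
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_json_calls(content: str) -> list:
    """Primary path: content must be a JSON list of {"name", "parameters"}."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        {
            "name": item["name"],
            "arguments": json.dumps(item.get("parameters", {}), ensure_ascii=False),
        }
        for item in data
        if isinstance(item, dict) and "name" in item
    ]


def parse_xml_calls(content: str) -> list:
    """Fallback path: pull function names and parameters out of XML-style text."""
    calls = []
    for name, body in FUNCTION_RE.findall(content):
        params = {key: value.strip() for key, value in PARAMETER_RE.findall(body)}
        calls.append(
            {"name": name, "arguments": json.dumps(params, ensure_ascii=False)}
        )
    return calls


def extract_calls(content: str) -> list:
    """Try JSON first; fall back to the XML parser only if JSON found nothing."""
    return parse_json_calls(content) or parse_xml_calls(content)


xml_output = (
    "<tool_call>\n<function=final_result>\n"
    "<parameter=conclusion>\nunsure\n</parameter>\n"
    "</function>\n</tool_call>"
)
calls = extract_calls(xml_output)
print(calls[0]["name"])  # final_result
```

In the patch itself the fallback is the tool parser configured on the server, so the format handling stays in one place rather than being duplicated as it is in this sketch.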


extent analysis

Fix Plan

To address tool calls being generated as XML instead of JSON when speculative decoding is enabled, there are two options:

  • Disable speculative decoding: As a temporary workaround, this prevents the issue from occurring, at the cost of the speed-up that speculative decoding provides.
  • Implement a fallback tool call parser: Modify the tool_choice="required" path so that, when JSON validation yields no tool calls (the ValidationError is suppressed with contextlib.suppress), the configured tool call parser gets a chance to extract them from the same content.
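For the first option, the only change is to start the server without the --speculative-config argument. If the arguments are assembled programmatically, a small filter like the one below would do it. This is a sketch: `without_speculative_config` is a hypothetical helper, and the argument list is a shortened copy of the server setup quoted in the issue.

```python
def without_speculative_config(args: list) -> list:
    """Return a copy of `args` with --speculative-config and its value removed."""
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This element is the value that followed --speculative-config.
            skip_next = False
            continue
        if arg == "--speculative-config":
            skip_next = True
            continue
        if arg.startswith("--speculative-config="):
            continue
        cleaned.append(arg)
    return cleaned


server_args = [
    "--model", "lukealonso/Qwen3.5-397B-A17B-NVFP4",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "qwen3_coder",
    "--reasoning-parser", "qwen3",
    "--speculative-config", '{"method":"mtp","num_speculative_tokens":3}',
]

cleaned_args = without_speculative_config(server_args)
print("--speculative-config" in cleaned_args)  # False
```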

Here's an example code snippet that demonstrates the fallback parser implementation:

elif request.tool_choice == "required":
    tool_calls = []
    content = content or ""
    # First attempt: validate the content as a JSON list of tool calls.
    with contextlib.suppress(ValidationError):
        tool_calls = TypeAdapter(list[FunctionDefinition]).validate_json(content)
    # Convert the JSON tool calls, as in the patch from the issue.
    for tool_call in tool_calls:
        function_calls.append(
            FunctionCall(
                name=tool_call.name,
                arguments=json.dumps(tool_call.parameters, ensure_ascii=False),
            )
        )
    # Fallback: JSON validation produced nothing, so try the tool call parser.
    if len(content) > 0 and len(tool_calls) == 0:
        try:
            tool_parser = tool_parser_cls(tokenizer)
        except RuntimeError:
            logger.exception("Error in tool parser creation.")
            raise
        tool_call_info = tool_parser.extract_tool_calls(
            content,
            request=request,  # type: ignore
        )
        if tool_call_info is not None and tool_call_info.tools_called:
            function_calls.extend(
                FunctionCall(
                    id=tool_call.id,
                    name=tool_call.function.name,
                    arguments=tool_call.function.arguments,
                )
                for tool_call in tool_call_info.tool_calls
            )
        else:
            # No tool calls.
            return None, content
    # Clear content since a tool is called.
    content = None

Verification

To verify the fix, resend the request that previously failed. Check that the response now contains a non-empty tool_calls list even when the model emits the tool call as XML, and that finish_reason is "tool_calls" only when tool calls are actually present.
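These checks can be automated on the client side. The sketch below assumes the response has already been converted to a plain dict in the OpenAI chat completion layout; `find_problems` is a hypothetical helper, and the required keys and allowed values are taken from the final_result tool schema in the issue.

```python
import json

REQUIRED_KEYS = {"chain_of_thought", "conclusion"}
ALLOWED_CONCLUSIONS = {"detection", "no detection", "unsure"}


def find_problems(choice: dict) -> list:
    """Return a list of human-readable problems found in one response choice."""
    problems = []
    message = choice.get("message") or {}
    tool_calls = message.get("tool_calls") or []

    # The symptom from the issue: finish_reason says tool_calls, list is empty.
    if choice.get("finish_reason") == "tool_calls" and not tool_calls:
        problems.append("finish_reason is 'tool_calls' but tool_calls is empty")

    for call in tool_calls:
        function = call.get("function") or {}
        try:
            arguments = json.loads(function.get("arguments") or "")
        except json.JSONDecodeError:
            problems.append("tool call arguments are not valid JSON")
            continue
        if not isinstance(arguments, dict):
            problems.append("tool call arguments are not a JSON object")
            continue
        missing = REQUIRED_KEYS - arguments.keys()
        if missing:
            problems.append(f"missing required arguments: {sorted(missing)}")
        elif arguments["conclusion"] not in ALLOWED_CONCLUSIONS:
            problems.append(f"unexpected conclusion: {arguments['conclusion']!r}")
    return problems


# The broken response from the issue, reduced to the relevant fields.
broken = {"finish_reason": "tool_calls", "message": {"content": "", "tool_calls": []}}

# What a correct response is expected to look like.
fixed = {
    "finish_reason": "tool_calls",
    "message": {
        "content": None,
        "tool_calls": [
            {
                "function": {
                    "name": "final_result",
                    "arguments": '{"chain_of_thought": "...", "conclusion": "unsure"}',
                }
            }
        ],
    },
}

print(find_problems(broken))  # ["finish_reason is 'tool_calls' but tool_calls is empty"]
print(find_problems(fixed))   # []
```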

Extra Tips

  • Consider adding additional logging or debugging statements to help identify the root cause of the issue and verify the effectiveness of the fix.
  • If the issue persists, you may want to investigate further to determine why the speculative decoding model is generating XML tool calls instead of JSON. This could involve analyzing the model's output or modifying the model's configuration.
