vllm - ✅(Solved) Fix [Bug]: AssertionError: Multiple tool calls in one delta is not supported - Responses API streaming crashes when model generates parallel tool calls [2 pull requests, 5 comments, 2 participants]

GitHub stats

vllm-project/vllm#39584 · Fetched 2026-04-12 13:24:34

Comments: 5 · Participants: 2 · Timeline: 16 · Reactions: 0

Timeline (top): commented ×5, referenced ×5, cross-referenced ×2, subscribed ×2

Error Message

  • HTTP 500 error to the client
  • During handling of the above exception, another exception occurred: ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  • At minimum, document this limitation clearly in the Responses API documentation and return a proper error message instead of crashing.

Fix Action

Fix / Workaround


PR fix notes

PR #39586: fix(responses): handle multiple tool calls per delta in streaming

Description (problem / solution / changelog)

Purpose

Fix #39584

The Responses API streaming code in _process_simple_streaming_events asserted that only one tool call can appear per delta:

assert len(delta_message.tool_calls) == 1, (
    "Multiple tool calls in one delta is not supported"
)

This causes an AssertionError when models generate parallel tool calls (e.g., calling multiple functions simultaneously in a single response).

The fix replaces the assertions and hardcoded tool_calls[0] references with proper iteration over all tool calls in each delta:

  1. First delta: Iterate over all tool calls, emitting a separate ResponseOutputItemAddedEvent for each with incremented output_index
  2. Streaming continuation: Iterate over all tool calls, handling argument deltas and tool-call-to-tool-call transitions (finalizing the previous tool call, starting a new one)
  3. Finalization: Collect arguments from all tool calls instead of asserting single tool call per delta
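The first step in the list above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ToolCallDelta`, `OutputItemAdded`, and `emit_added_events` are hypothetical stand-ins for vLLM's delta and event types, and the real function tracks considerably more state.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class ToolCallDelta:
    """Hypothetical stand-in for one tool call carried by a streaming delta."""
    call_id: str
    name: str


@dataclass
class OutputItemAdded:
    """Hypothetical stand-in for ResponseOutputItemAddedEvent."""
    output_index: int
    call_id: str
    name: str


def emit_added_events(
    tool_calls: list[ToolCallDelta], start_index: int
) -> Iterator[OutputItemAdded]:
    # Iterate over every tool call in the delta instead of asserting len == 1.
    for offset, tc in enumerate(tool_calls):
        yield OutputItemAdded(
            output_index=start_index + offset,  # one output slot per call
            call_id=tc.call_id,
            name=tc.name,
        )


events = list(
    emit_added_events(
        [ToolCallDelta("call_a", "read_file"), ToolCallDelta("call_b", "list_files")],
        start_index=0,
    )
)
```

With a single tool call the loop runs once, so the common single-call path behaves as before.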

Test Plan

  • Syntax validation: python -c "import ast; ast.parse(...)" passes
  • Full test suite requires vLLM installation with CUDA dependencies

Test Result

Syntax check passes. The logic preserves the existing single-tool-call behavior (the common case) while correctly handling multiple tool calls per delta.



Changed files

  • vllm/entrypoints/openai/responses/serving.py (modified, +164/-116)

PR #39600: [Bugfix] Support parallel tool calls in Responses API streaming

Description (problem / solution / changelog)

Summary

Fixes #39584

When a model generates multiple tool calls in a single streaming delta (e.g. Qwen3 with qwen3_xml parser), _process_simple_streaming_events crashed with:

AssertionError: Multiple tool calls in one delta is not supported

There were 3 crash/silent-bug points and 1 ignored parameter:

  • Line 1375: assert len == 1 on first delta: hard crash
  • Line 1574: tool_calls[0] mid-stream: silently drops all calls after the first
  • Line 1693: assert len == 1 at finalization: hard crash
  • parallel_tool_calls=False was never enforced in the Responses streaming path

Root Cause

The function used three scalar variables (current_tool_call_id, current_tool_call_name, current_item_id) to track tool call state. This architecture supports only one tool call at a time: processing a second call overwrites the first call's state.

Fix

Replace the three scalars with dict[int, ToolCallStreamState] keyed by DeltaToolCall.index:

@dataclass
class ToolCallStreamState:
    item_id: str
    call_id: str
    name: str
    output_index: int
    args_parts: list[str] = field(default_factory=list)

tool_call_states: dict[int, ToolCallStreamState] = {}

Each tool call is tracked independently across all streaming phases:

  • First delta: loop all TCs, register each with its own output_index, emit output_item.added per call
  • Mid-stream: route argument fragments to the correct call via tc.index lookup
  • Finalization: iterate all states, emit arguments.done + output_item.done per call independently
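The per-index tracking described above can be illustrated with a small self-contained sketch. The dataclass mirrors the one quoted in the PR description; `apply_fragment`, `finalize`, the ID scheme, and the way `output_index` is assigned are assumptions made for illustration, not the PR's actual logic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCallStreamState:
    item_id: str
    call_id: str
    name: str
    output_index: int
    args_parts: list[str] = field(default_factory=list)


def apply_fragment(
    states: dict[int, ToolCallStreamState],
    index: int,
    name: Optional[str],
    args_fragment: str,
) -> None:
    """Route one argument fragment to the state for its tool call index."""
    state = states.get(index)
    if state is None:
        # First time this index is seen: register it with its own output slot.
        state = ToolCallStreamState(
            item_id=f"fc_{index}",    # hypothetical ID scheme
            call_id=f"call_{index}",  # hypothetical ID scheme
            name=name or "",
            output_index=len(states),  # assumption: no earlier output items
        )
        states[index] = state
    if args_fragment:
        state.args_parts.append(args_fragment)


def finalize(states: dict[int, ToolCallStreamState]) -> list[tuple[int, str, str]]:
    """Join each call's fragments independently, ordered by output_index."""
    ordered = sorted(states.values(), key=lambda s: s.output_index)
    return [(s.output_index, s.name, "".join(s.args_parts)) for s in ordered]


states: dict[int, ToolCallStreamState] = {}
apply_fragment(states, 0, "read_file", '{"path": ')
apply_fragment(states, 1, "list_files", '{"dir": "tests"}')
apply_fragment(states, 0, None, '"a.py"}')  # interleaved continuation of call 0
result = finalize(states)
```

Because each fragment is looked up by index, interleaved fragments from different calls no longer overwrite one another, which is the failure mode of the three shared scalars.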

parallel_tool_calls=False is now enforced by filtering to index == 0 at the start of both the first-delta and mid-stream loops.
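A filter of this kind might look like the sketch below. `FakeDeltaToolCall` and `filter_tool_calls` are hypothetical names for illustration; only the rule itself (keep index == 0 when parallel_tool_calls is False) comes from the PR description.

```python
from dataclasses import dataclass


@dataclass
class FakeDeltaToolCall:
    """Hypothetical minimal stand-in for vLLM's DeltaToolCall."""
    index: int
    name: str


def filter_tool_calls(
    tool_calls: list[FakeDeltaToolCall], parallel_tool_calls: bool
) -> list[FakeDeltaToolCall]:
    if parallel_tool_calls:
        return list(tool_calls)
    # Parallel calls disabled: keep only the first tool call (index 0).
    return [tc for tc in tool_calls if tc.index == 0]


delta = [FakeDeltaToolCall(0, "read_file"), FakeDeltaToolCall(1, "list_files")]
kept_all = filter_tool_calls(delta, parallel_tool_calls=True)
kept_first = filter_tool_calls(delta, parallel_tool_calls=False)
```

Applying the same filter in both the first-delta and mid-stream loops keeps later fragments for a dropped call from being registered as new calls.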

Additionally handles reasoning → tool_call mid-stream transitions (when a reasoning model invokes tools without text content in between).

Testing

Run tests:

pytest tests/entrypoints/openai/responses/test_serving_responses.py::TestParallelToolCallStreaming -v

6 unit tests covering:

  • test_single_tool_call_no_regression: original single-TC behavior unchanged
  • test_two_tool_calls_in_first_delta: two TCs in first delta produce separate output items
  • test_parallel_args_attributed_correctly_by_index: arg fragments routed by DeltaToolCall.index
  • test_parallel_tool_calls_false_keeps_only_first: parallel_tool_calls=False filters index != 0
  • test_first_delta_args_preserved: arguments bundled with name in registration delta are not lost
  • test_reasoning_then_tool_call_transition: reasoning close events fire before tool call open events

Verification:

  • Syntax: python -m py_compile
  • Imports: ruff check --select I
  • Text-only and reasoning-only paths: untouched

Related

  • Closes #39584
  • Related to #39586 (alternative fix; this PR addresses the state management root cause)

PR Checklist (Essential Elements)

  • Purpose of the changes is described in the PR description
  • Test plan is described and includes runnable test commands
  • Tests pass locally (syntax + lint; full CI requires Linux)
  • Docs not applicable (no API surface change)
  • Release notes not applicable (bugfix)

Changed files

  • tests/entrypoints/openai/responses/test_serving_responses.py (modified, +313/-0)
  • vllm/entrypoints/openai/responses/serving.py (modified, +267/-144)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (aarch64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.12.0.dev20260406+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.17.0-1014-nvidia-aarch64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.2.51
CUDA_MODULE_LOADING set to   :
GPU models and configuration : GPU 0: NVIDIA GB10
Nvidia driver version        : 580.142
cuDNN version                : Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_adv.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_cnn.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_precompiled.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_graph.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_heuristic.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_ops.so.9.20.0
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  20
On-line CPU(s) list:                     0-19
Vendor ID:                               ARM
Model name:                              Cortex-X925
Model:                                   1
Thread(s) per core:                      1
Core(s) per socket:                      10
Socket(s):                               1
Stepping:                                r0p1
Frequency boost:                         disabled
CPU(s) scaling MHz:                      100%
CPU max MHz:                             3900.0000
CPU min MHz:                             1378.0000
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
Model name:                              Cortex-A725
Model:                                   1
Thread(s) per core:                      1
Core(s) per socket:                      10
Socket(s):                               1
Stepping:                                r0p1
CPU(s) scaling MHz:                      100%
CPU max MHz:                             2808.0000
CPU min MHz:                             338.0000
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
L1d cache:                               1.3 MiB (20 instances)
L1i cache:                               1.3 MiB (20 instances)
L2 cache:                                25 MiB (20 instances)
L3 cache:                                24 MiB (2 instances)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-19
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pyzmq==27.1.0
[pip3] torch==2.12.0.dev20260406+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0.dev20260402+cu130
[pip3] torchvision==0.27.0.dev20260406+cu130
[pip3] transformers==5.5.0
[pip3] triton==3.7.0+git9c288bc5
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1.dev0+g2a69949bd.d20260408 (git sha: 2a69949bd, date: 20260408)
vLLM Build Flags:
  CUDA Archs: 12.1a; ROCm: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-19    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_REQUIRE_CUDA=cuda>=13.2 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=580,driver<581 brand=grid,driver>=580,driver<581 brand=tesla,driver>=580,driver<581 brand=nvidia,driver>=580,driver<581 brand=quadro,driver>=580,driver<581 brand=quadrortx,driver>=580,driver<581 brand=nvidiartx,driver>=580,driver<581 brand=vapps,driver>=580,driver<581 brand=vpc,driver>=580,driver<581 brand=vcs,driver>=580,driver<581 brand=vws,driver>=580,driver<581 brand=cloudgaming,driver>=580,driver<581 brand=unknown,driver>=590,driver<591 brand=grid,driver>=590,driver<591 brand=tesla,driver>=590,driver<591 brand=nvidia,driver>=590,driver<591 brand=quadro,driver>=590,driver<591 brand=quadrortx,driver>=590,driver<591 brand=nvidiartx,driver>=590,driver<591 brand=vapps,driver>=590,driver<591 brand=vpc,driver>=590,driver<591 brand=vcs,driver>=590,driver<591 brand=vws,driver>=590,driver<591 brand=cloudgaming,driver>=590,driver<591
CUDA_VERSION=13.2.0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
MAX_JOBS=16
TORCH_CUDA_ARCH_LIST=12.1a
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

Description

When using the vLLM Responses API with streaming enabled and multiple tools configured, the server crashes with an AssertionError when the model generates multiple tool calls that get bundled into a single SSE delta event. The assertion at vllm/entrypoints/openai/responses/serving.py:1761 assumes exactly one tool call per delta:

assert len(pm.tool_calls) == 1, (
    "Multiple tool calls in one delta is not supported"
)

This violates the OpenAI Responses API specification, which explicitly states that multiple tool calls per turn are supported and should be emitted as separate events.

Steps to Reproduce

  1. Start vLLM Server

docker run -d --name vllm-qwen35-v2 \
    --gpus all \
    --net=host \
    --ipc=host \
    -v ~/qwen-v2-project/models:/models \
    vllm-qwen35-v2 \
    serve /models/qwen35-122b-hybrid \
    --served-model-name qwen \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.90 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --port 8000 \
    --host 0.0.0.0 \
    --load-format fastsafetensors \
    --attention-backend FLASHINFER \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --override-generation-config '{ "temperature": 0.6, "top_p": 0.9, "top_k": 20, "repetition_penalty": 1.05, "presence_penalty": 0.0, "frequency_penalty": 0.0, "reasoning_effort": "high" }'

  2. Send Request with Multiple Tools

import asyncio
import json

import httpx

TOOLS = [
    {"type": "function", "name": "read_file", "description": "...", "parameters": {...}},
    {"type": "function", "name": "write_file", "description": "...", "parameters": {...}},
    {"type": "function", "name": "list_files", "description": "...", "parameters": {...}},
]

PROMPT = """You need to analyze a codebase and make several changes. Please:

1. Read the file 'tests/engine/xstate/run_xstate_gauntlet.py'
2. Read the file 'tests/engine/xstate/run_xstate_batch.py'
3. List the files in 'tests/engine/xstate/'

Provide your analysis after examining all three sources."""

async def test():
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            "http://192.168.1.176:8000/v1/responses",
            json={
                "model": "qwen",
                "input": [{"type": "message", "role": "user", "content": [{"type": "input_text", "text": PROMPT}]}],
                "tools": TOOLS,
                "stream": True,
                "max_output_tokens": 4096
            },
            headers={"Authorization": "Bearer no-key-required"}
        ) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: ") and line[6:].strip() != "[DONE]":
                    chunk = json.loads(line[6:])
                    if "tool_calls" in chunk and len(chunk["tool_calls"]) > 1:
                        print(f"MULTI-CALL DELTA DETECTED: {chunk}")

asyncio.run(test())

  3. Observe Server Crash

The vLLM server will crash with:

AssertionError: Multiple tool calls in one delta is not supported

Expected Behavior

According to the OpenAI Responses API specification: "The model may choose to call multiple functions in a single turn." Each tool call should be emitted as a separate SSE event with its own output_index, not bundled together in a single delta. The server should handle multiple tool calls gracefully by:

  1. Emitting separate response.output_item.added events for each tool call
  2. Streaming arguments for each call independently via response.function_call_arguments.delta
  3. Completing each call with response.function_call_arguments.done

Actual Behavior

When the model generates multiple tool calls in rapid succession, vLLM bundles them into a single delta event. The _process_simple_streaming_events function then hits the assertion:

assert len(pm.tool_calls) == 1, (
    "Multiple tool calls in one delta is not supported"
)

This causes:

  • Immediate stream termination
  • HTTP 500 error to the client
  • Connection reset (peer closed connection without sending complete message body)

Stack Trace

File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/responses/serving.py", line 1761, in _process_simple_streaming_events
    assert len(pm.tool_calls) == 1, (
           ^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Multiple tool calls in one delta is not supported

During handling of the above exception, another exception occurred:

ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)

Impact

  • Critical: Breaks parallel tool calling, a core feature of the Responses API
  • Blocks production deployments for agents that require multiple sequential tool calls
  • Violates OpenAI API compatibility expectations

Possible Solutions

Option 1: Split Multi-Call Deltas (Recommended)

Modify _process_simple_streaming_events to detect and split multi-call deltas:

if len(delta_message.tool_calls) > 1:
    # Emit each tool call as a separate event
    for i, tool_call in enumerate(delta_message.tool_calls):
        yield ResponseOutputItemAddedEvent(
            type="response.output_item.added",
            output_index=current_output_index + i,
            item=ResponseFunctionToolCallItem(...)
        )
        # ... stream arguments for each call separately
else:
    # Existing single-call logic
    ...
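To make the splitting idea in Option 1 concrete, here is a minimal, self-contained sketch that does not depend on vLLM. It uses plain dicts in place of the real event classes, and the names ToolCallDelta and split_tool_calls are hypothetical, introduced only for illustration. The event type strings are the ones listed under Expected Behavior; the exact fields vLLM emits may differ.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class ToolCallDelta:
    """Hypothetical stand-in for one parsed tool call inside a delta."""
    name: str
    arguments: str


def split_tool_calls(tool_calls: list[ToolCallDelta], start_index: int) -> Iterator[dict]:
    """Yield added / arguments.delta / arguments.done events for each call.

    Every call gets its own output_index, counted up from start_index.
    """
    for offset, call in enumerate(tool_calls):
        output_index = start_index + offset
        yield {
            "type": "response.output_item.added",
            "output_index": output_index,
            "item": {"type": "function_call", "name": call.name, "arguments": ""},
        }
        yield {
            "type": "response.function_call_arguments.delta",
            "output_index": output_index,
            "delta": call.arguments,
        }
        yield {
            "type": "response.function_call_arguments.done",
            "output_index": output_index,
            "arguments": call.arguments,
        }


# Two calls arriving in one delta become six events across two output indices.
events = list(split_tool_calls(
    [ToolCallDelta("get_weather", '{"city": "Paris"}'),
     ToolCallDelta("get_time", '{"tz": "UTC"}')],
    start_index=0,
))
```

In a real patch the caller would also need to advance its own output index counter by the number of calls emitted, so that later items do not reuse an index.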

Option 2: Buffer and Serialize

Buffer multiple tool calls internally and emit them sequentially across multiple iterations of the streaming loop.

Option 3: Document Limitation

At minimum, document this limitation clearly in the Responses API documentation and return a proper error message instead of crashing.

Related Issues

  • OpenAI API Specification: Function Calling Guide
  • vLLM Responses API Implementation: vllm/entrypoints/openai/responses/serving.py

Additional Context

  • This issue occurs with Qwen3.5 models that have aggressive parallel tool generation capabilities
  • The problem is exacerbated with speculative decoding enabled, as draft models may generate multiple tool calls in a single forward pass
  • Similar issues may exist in other tool parsers beyond qwen3_xml


extent analysis

TL;DR

The most likely fix is to modify the _process_simple_streaming_events function to detect and split multi-call deltas into separate events.

Guidance

  1. Identify the root cause: The issue arises from the assert statement in vllm/entrypoints/openai/responses/serving.py that assumes only one tool call per delta, which is not compatible with the OpenAI Responses API specification.
  2. Implement a fix: Modify the _process_simple_streaming_events function to handle multiple tool calls per delta by iterating over each tool call and emitting a separate event for each.
  3. Test the fix: Verify that the server no longer crashes when handling multiple tool calls in a single delta and that each tool call is emitted as a separate SSE event.
  4. Consider alternative solutions: If modifying the _process_simple_streaming_events function is not feasible, consider buffering and serializing multiple tool calls or documenting the limitation and returning a proper error message.
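The buffering alternative mentioned in step 4 could look roughly like the sketch below. It is an illustration only: ToolCallBuffer is a hypothetical helper, not part of vLLM, and tool calls are represented as plain dicts. The idea is that a delta carrying several calls is pushed into a queue, and the streaming loop drains one call per iteration so the single-call code path never sees more than one.

```python
from collections import deque
from typing import Optional


class ToolCallBuffer:
    """Hypothetical FIFO buffer that releases one tool call at a time."""

    def __init__(self) -> None:
        self._pending: deque = deque()

    def push(self, tool_calls: list[dict]) -> None:
        """Queue every tool call from a delta, preserving order."""
        self._pending.extend(tool_calls)

    def pop_one(self) -> Optional[dict]:
        """Return the oldest queued call, or None when the buffer is empty."""
        return self._pending.popleft() if self._pending else None


buffer = ToolCallBuffer()
# A single delta that carries two tool calls.
buffer.push([{"name": "get_weather"}, {"name": "get_time"}])

emitted = []
# Each loop iteration handles exactly one call, as the single-call logic expects.
while (call := buffer.pop_one()) is not None:
    emitted.append(call["name"])
```

Compared with splitting inside the handler, this keeps the existing single-call logic untouched, at the cost of extra state that must be flushed before the stream ends.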

Example

if len(delta_message.tool_calls) > 1:
    # Emit each tool call as a separate event
    for i, tool_call in enumerate(delta_message.tool_calls):
        yield ResponseOutputItemAddedEvent(
            type="response.output_item.added",
            output_index=current_output_index + i,
            item=ResponseFunctionToolCallItem(...)
        )
        # ... stream arguments for each call separately

Notes

This fix assumes that the ResponseOutputItemAddedEvent and ResponseFunctionToolCallItem classes are defined and can be used to emit separate events for each tool call. Additionally, this fix may require modifications to the streaming loop to handle the separate events correctly.
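One way to check that the streaming loop handles the separate events correctly is to validate the decoded event sequence on the client side. The sketch below is an assumption-laden illustration: check_tool_call_events is a hypothetical helper, events are plain dicts already parsed from the SSE data lines, and only the three event types named in the issue are inspected.

```python
def check_tool_call_events(events: list[dict]) -> list[str]:
    """Return human-readable problems found in a list of decoded SSE events."""
    problems: list[str] = []
    added: set[int] = set()
    done: set[int] = set()
    for event in events:
        kind = event.get("type")
        index = event.get("output_index")
        if kind == "response.output_item.added":
            # Only function calls are tracked; other item types are ignored.
            if event.get("item", {}).get("type") != "function_call":
                continue
            if index in added:
                problems.append(f"output_index {index} added twice")
            added.add(index)
        elif kind == "response.function_call_arguments.delta":
            if index not in added:
                problems.append(f"delta for unannounced output_index {index}")
        elif kind == "response.function_call_arguments.done":
            if index not in added:
                problems.append(f"done for unannounced output_index {index}")
            done.add(index)
    # Any announced call without a matching done event is incomplete.
    for index in sorted(added - done):
        problems.append(f"output_index {index} never completed")
    return problems


good_stream = [
    {"type": "response.output_item.added", "output_index": 0, "item": {"type": "function_call"}},
    {"type": "response.function_call_arguments.delta", "output_index": 0, "delta": "{}"},
    {"type": "response.function_call_arguments.done", "output_index": 0, "arguments": "{}"},
    {"type": "response.output_item.added", "output_index": 1, "item": {"type": "function_call"}},
    {"type": "response.function_call_arguments.done", "output_index": 1, "arguments": "{}"},
]
bad_stream = good_stream[:4]  # second call is announced but never completed
```

Running the checker over events collected from a patched server gives a quick signal that each parallel tool call received its own output_index and was closed out.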

Recommendation

Modify the _process_simple_streaming_events function to handle multiple tool calls per delta. Of the options listed, this is the most straightforward and the one most consistent with the OpenAI Responses API specification.
