vllm - ✅(Solved) Fix [Bug]: GPT-OSS-20B /v1/chat/completions streaming crashes with tool_call_parser=openai (IndexError in chat_completion_stream_generator) [2 pull requests, 1 comments, 2 participants]

Q: Expected behavior

`/v1/chat/completions` with streaming enabled should not crash the API server when GPT-OSS emits tool-related output. ---

vllm2026-03-12 04:56:18

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#36849•Fetched 2026-04-08 00:34:17

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Assignees

Timeline (top)

subscribed ×3cross-referenced ×2assigned ×1closed ×1

Error Message

The relevant traceback is: (APIServer pid=1) ERROR [serving.py:1390] Error in chat completion stream generator. (APIServer pid=1) ERROR [serving.py:1390] Traceback (most recent call last): (APIServer pid=1) ERROR [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 1262, in chat_completion_stream_generator (APIServer pid=1) ERROR [serving.py:1390] args = tool_parser.prev_tool_call_arr[index].get( (APIServer pid=1) ERROR [serving.py:1390] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^ (APIServer pid=1) ERROR [serving.py:1390] IndexError: list index out of range (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] Error in chat completion stream generator. (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] Traceback (most recent call last): (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 1262, in chat_completion_stream_generator (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] args = tool_parser.prev_tool_call_arr[index].get( (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^ (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] IndexError: list index out of range

Root Cause

sudo docker logs -f vllm-gptoss /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e' "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature " /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e' "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature " (APIServer pid=1) INFO 03-12 04:37:44 [utils.py:302] (APIServer pid=1) INFO 03-12 04:37:44 [utils.py:302] █ █ █▄ ▄█ (APIServer pid=1) INFO 03-12 04:37:44 [utils.py:302] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.1 (APIServer pid=1) INFO 03-12 04:37:44 [utils.py:302] █▄█▀ █ █ █ █ model openai/gpt-oss-20b (APIServer pid=1) INFO 03-12 04:37:44 [utils.py:302] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀ (APIServer pid=1) INFO 03-12 04:37:44 [utils.py:302] (APIServer pid=1) INFO 03-12 04:37:44 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'openai', 'host': '0.0.0.0', 'port': 8080, 'disable_uvicorn_access_log': True, 'model': 'openai/gpt-oss-20b', 'max_model_len': 100000, 'quantization': 'mxfp4', 'served_model_name': ['vllm/openai/gpt-oss-20b'], 'reasoning_parser': 'openai_gptoss', 'distributed_executor_backend': 'mp', 'tensor_parallel_size': 2, 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'prefix_caching_hash_algo': 'sha256_cbor', 'max_num_batched_tokens': 98304, 'max_num_seqs': 128, 'long_prefill_token_threshold': 8192, 'enable_chunked_prefill': True, 'async_scheduling': True, 'stream_interval': 20, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024], 'max_cudagraph_capture_size': 1024, 'compilation_config': {'level': None, 'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': None, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}} (APIServer pid=1) INFO 03-12 04:37:49 [model.py:531] Resolved architecture: GptOssForCausalLM Parse safetensors files: 100%|██████████| 3/3 [00:00<00:00, 5.11it/s] (APIServer pid=1) INFO 03-12 04:37:51 [model.py:1554] Using max model len 100000 (APIServer pid=1) INFO 03-12 04:37:51 [cache.py:223] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor. (APIServer pid=1) INFO 03-12 04:37:51 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=98304. (APIServer pid=1) INFO 03-12 04:37:51 [vllm.py:747] Asynchronous scheduling is enabled. (APIServer pid=1) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. (APIServer pid=1) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. (EngineCore_DP0 pid=202) INFO 03-12 04:37:56 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='openai/gpt-oss-20b', speculative_config=None, tokenizer='openai/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=100000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=vllm/openai/gpt-oss-20b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [98304], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore_DP0 pid=202) WARNING 03-12 04:37:56 [multiproc_executor.py:945] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. (EngineCore_DP0 pid=202) INFO 03-12 04:37:56 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=2, local_world_size=2 (Worker pid=274) INFO 03-12 04:38:00 [parallel_state.py:1393] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:48837 backend=nccl (Worker pid=273) INFO 03-12 04:38:00 [parallel_state.py:1393] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:48837 backend=nccl (Worker pid=274) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. (Worker pid=274) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. (Worker pid=273) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. (Worker pid=273) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. (Worker pid=273) INFO 03-12 04:38:01 [pynccl.py:111] vLLM is using nccl==2.27.5 (Worker pid=273) WARNING 03-12 04:38:01 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available. (Worker pid=274) WARNING 03-12 04:38:01 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available. (Worker pid=273) WARNING 03-12 04:38:01 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (Worker pid=274) WARNING 03-12 04:38:01 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (Worker pid=273) INFO 03-12 04:38:01 [parallel_state.py:1715] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A (Worker pid=274) INFO 03-12 04:38:01 [parallel_state.py:1715] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1, EPLB rank N/A (Worker pid=274) INFO 03-12 04:38:02 [base.py:106] Offloader set to NoopOffloader (Worker pid=273) INFO 03-12 04:38:02 [base.py:106] Offloader set to NoopOffloader (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:02 [gpu_model_runner.py:4281] Starting to load model openai/gpt-oss-20b... (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:02 [cuda.py:405] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN']. (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:02 [mxfp4.py:165] Using Marlin backend (Worker pid=274) (Worker_TP1 pid=274) INFO 03-12 04:38:02 [mxfp4.py:165] Using Marlin backend Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.27it/s] (Worker pid=274) (Worker_TP1 pid=274) WARNING 03-12 04:38:06 [marlin_utils_fp4.py:338] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads. Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.36it/s] Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.79it/s] Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.64it/s] (Worker pid=273) (Worker_TP0 pid=273) (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:06 [default_loader.py:293] Loading weights took 1.85 seconds (Worker pid=273) (Worker_TP0 pid=273) WARNING 03-12 04:38:06 [marlin_utils_fp4.py:338] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads. (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:07 [gpu_model_runner.py:4364] Model loading took 7.11 GiB memory and 4.377632 seconds (Worker pid=274) (Worker_TP1 pid=274) WARNING 03-12 04:38:09 [allreduce_rms_fusion.py:736] Flashinfer allreduce fusion is not supported for world size 2 or max size is not provided (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:09 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/71286153e3/rank_0_0/backbone for vLLM's torch.compile (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:09 [backends.py:976] Dynamo bytecode transform time: 2.42 s (Worker pid=273) (Worker_TP0 pid=273) WARNING 03-12 04:38:09 [allreduce_rms_fusion.py:736] Flashinfer allreduce fusion is not supported for world size 2 or max size is not provided (Worker pid=273) (Worker_TP0 pid=273) WARNING 03-12 04:38:10 [allreduce_rms_fusion.py:845] AllReduce fusion pass is disabled. (Worker pid=274) (Worker_TP1 pid=274) WARNING 03-12 04:38:10 [allreduce_rms_fusion.py:845] AllReduce fusion pass is disabled. (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:12 [backends.py:350] Cache the graph of compile range (1, 98304) for later use (Worker pid=274) (Worker_TP1 pid=274) INFO 03-12 04:38:12 [backends.py:350] Cache the graph of compile range (1, 98304) for later use (Worker pid=274) (Worker_TP1 pid=274) INFO 03-12 04:38:28 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/029a99f92c01bc24b2abe98c3e21a32e91558afa6cc5eec08d93b4458fda6777/rank_1_0/model (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:28 [backends.py:366] Compiling a graph for compile range (1, 98304) takes 18.39 s (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:28 [monitor.py:35] torch.compile takes 21.20 s in total (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:28 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/029a99f92c01bc24b2abe98c3e21a32e91558afa6cc5eec08d93b4458fda6777/rank_0_0/model (Worker pid=274) (Worker_TP1 pid=274) INFO 03-12 04:38:28 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/029a99f92c01bc24b2abe98c3e21a32e91558afa6cc5eec08d93b4458fda6777/rank_1_0/model (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:28 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/029a99f92c01bc24b2abe98c3e21a32e91558afa6cc5eec08d93b4458fda6777/rank_0_0/model (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:33 [gpu_worker.py:424] Available KV cache memory: 16.62 GiB (EngineCore_DP0 pid=202) INFO 03-12 04:38:33 [kv_cache_utils.py:1314] GPU KV cache size: 1,452,000 tokens (EngineCore_DP0 pid=202) INFO 03-12 04:38:33 [kv_cache_utils.py:1319] Maximum concurrency for 100,000 tokens per request: 14.63x (Worker pid=273) (Worker_TP0 pid=273) 2026-03-12 04:38:33,548 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ... (Worker pid=274) (Worker_TP1 pid=274) 2026-03-12 04:38:33,548 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ... (Worker pid=274) (Worker_TP1 pid=274) 2026-03-12 04:38:34,761 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends (Worker pid=273) (Worker_TP0 pid=273) 2026-03-12 04:38:34,761 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 11/11 [00:03<00:00, 3.59it/s] Capturing CUDA graphs (decode, FULL): 100%|██████████| 8/8 [00:02<00:00, 3.75it/s] (Worker pid=273) (Worker_TP0 pid=273) INFO 03-12 04:38:40 [gpu_model_runner.py:5386] Graph capturing finished in 6 secs, took -3.92 GiB (EngineCore_DP0 pid=202) INFO 03-12 04:38:40 [core.py:282] init engine (profile, create kv cache, warmup model) took 33.53 seconds (EngineCore_DP0 pid=202) WARNING 03-12 04:38:43 [kv_cache_utils.py:96] PYTHONHASHSEED is not set. This will lead to non-reproducible block-hashes when using CBOR-based hash functions such as sha256_cbor or xxhash_cbor. Consider setting PYTHONHASHSEED to a fixed value for reproducibility. (EngineCore_DP0 pid=202) INFO 03-12 04:38:43 [vllm.py:747] Asynchronous scheduling is enabled. (EngineCore_DP0 pid=202) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. (EngineCore_DP0 pid=202) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. (APIServer pid=1) INFO 03-12 04:38:43 [api_server.py:495] Supported tasks: ['generate'] (APIServer pid=1) INFO 03-12 04:38:43 [parser_manager.py:202] "auto" tool choice has been enabled. (APIServer pid=1) WARNING 03-12 04:38:43 [serving.py:225] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use. (APIServer pid=1) INFO 03-12 04:38:46 [parser_manager.py:202] "auto" tool choice has been enabled. (APIServer pid=1) INFO 03-12 04:38:46 [serving.py:185] Warming up chat template processing... (APIServer pid=1) INFO 03-12 04:38:49 [hf.py:318] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this. (APIServer pid=1) INFO 03-12 04:38:49 [serving.py:210] Chat template warmup completed in 2960.8ms (APIServer pid=1) INFO 03-12 04:38:49 [parser_manager.py:202] "auto" tool choice has been enabled. (APIServer pid=1) INFO 03-12 04:38:50 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8080 (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:38] Available routes are: (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /docs, Methods: HEAD, GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /redoc, Methods: HEAD, GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /tokenize, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /detokenize, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /load, Methods: GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /version, Methods: GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /health, Methods: GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /metrics, Methods: GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/models, Methods: GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /ping, Methods: GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /ping, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /invocations, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/chat/completions, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/responses, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/completions, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/completions/render, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/messages, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /inference/v1/generate, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST (APIServer pid=1) INFO 03-12 04:38:50 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST (APIServer pid=1) INFO: Started server process [1] (APIServer pid=1) INFO: Waiting for application startup. (APIServer pid=1) INFO: Application startup complete. (APIServer pid=1) WARNING 03-12 04:39:07 [input_processor.py:254] Passing raw prompts to InputProcessor is deprecated and will be removed in v0.18. You should instead pass the outputs of Renderer.render_cmpl() or Renderer.render_chat(). (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] Error in chat completion stream generator. (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] Traceback (most recent call last): (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 1262, in chat_completion_stream_generator (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] args = tool_parser.prev_tool_call_arr[index].get( (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^ (APIServer pid=1) ERROR 03-12 04:39:08 [serving.py:1390] IndexError: list index out of range (APIServer pid=1) INFO 03-12 04:39:10 [loggers.py:259] Engine 000: Avg prompt throughput: 378.1 tokens/s, Avg generation throughput: 85.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 7.4% (APIServer pid=1) INFO 03-12 04:39:20 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 7.4%

Fix Action

Fix / Workaround

============================== CPU Info

Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 9950X3D 16-Core Processor CPU family: 26 Model: 68 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 0 CPU(s) scaling MHz: 71% CPU max MHz: 5752.0000 CPU min MHz: 600.0000 BogoMIPS: 8599.98 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d Virtualization: AMD-V L1d cache: 768 KiB (16 instances) L1i cache: 512 KiB (16 instances) L2 cache: 16 MiB (16 instances) L3 cache: 128 MiB (2 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected

Workarounds

PR fix notes

PR #36888: fix(serving): add bounds check for prev_tool_call_arr in streaming tool calls

Repository: vllm-project/vllm
Author: gambletan
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36888

Description (problem / solution / changelog)

Summary

Adds bounds checks for tool_parser.prev_tool_call_arr and tool_parser.streamed_args_for_tool before accessing them by index in the streaming chat completion generator
When using /v1/chat/completions with streaming + tool_call_parser=openai, the stream generator crashes with IndexError: list index out of range at tool_parser.prev_tool_call_arr[index] if the tool parser's prev_tool_call_arr is empty when the model emits tool-call-like output

Root Cause

In chat_completion_stream_generator, when output.finish_reason is not None:

auto_tools_called = len(tool_parser.prev_tool_call_arr) > 0 → False (empty)
index = 0 (the else branch)
_should_check_for_unstreamed_tool_arg_tokens returns True because delta_message.tool_calls exists
tool_parser.prev_tool_call_arr[0] → IndexError

Fix

Add index < len(tool_parser.prev_tool_call_arr) and index < len(tool_parser.streamed_args_for_tool) to the guard condition, so the unstreamed-args recovery block is skipped when the arrays are empty.

Fixes #36849

Changed files

vllm/entrypoints/openai/chat_completion/serving.py (modified, +2/-0)

PR #37958: [Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser

Repository: vllm-project/vllm
Author: chaunceyjiang
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/37958

Description (problem / solution / changelog)

Purpose

FIX https://github.com/vllm-project/vllm/issues/37937 FIX https://github.com/vllm-project/vllm/issues/36849

Test Plan

Test Result

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

</details>

Changed files

vllm/entrypoints/openai/chat_completion/serving.py (modified, +6/-6)

Code Example

sudo docker run -d \
  --name vllm-gptoss \
  --restart unless-stopped \
  --gpus all \
  --ipc=host \
  -p 8080:8080 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HOME="/root/.cache/huggingface" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd)/GPT-OSS_MAXED.yaml:/etc/vllm/config.yaml:ro" \
  vllm/vllm-openai:deploy \
  --config /etc/vllm/config.yaml \
  --served-model-name vllm/openai/gpt-oss-20b

RAW_BUFFERClick to expand / collapse

My current environment

<details> <summary>Full environment dump</summary>

============================== System Info

OS : Ubuntu 24.04.3 LTS (x86_64) GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0 Clang version : Could not collect CMake version : version 3.28.3 Libc version : glibc-2.39

============================== PyTorch Info

PyTorch version : 2.10.0+cu128 Is debug build : False CUDA used to build PyTorch : 12.8 ROCM used to build PyTorch : N/A

============================== Python Environment

Python version : 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime) Python platform : Linux-6.8.0-85-generic-x86_64-with-glibc2.39

============================== CUDA / GPU Info

Is CUDA available : True CUDA runtime version : 12.0.140 CUDA_MODULE_LOADING set to : GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5090 GPU 1: NVIDIA GeForce RTX 5090

Nvidia driver version : 580.95.05 cuDNN version : Could not collect HIP runtime version : N/A MIOpen runtime version : N/A Is XNNPACK available : True

============================== CPU Info

============================== Versions of relevant libraries

[pip3] flashinfer-python==0.6.4 [pip3] numpy==2.2.6 [pip3] nvidia-cublas-cu12==12.8.4.1 [pip3] nvidia-cuda-cupti-cu12==12.8.90 [pip3] nvidia-cuda-nvrtc-cu12==12.8.93 [pip3] nvidia-cuda-runtime-cu12==12.8.90 [pip3] nvidia-cudnn-cu12==9.10.2.21 [pip3] nvidia-cudnn-frontend==1.18.0 [pip3] nvidia-cufft-cu12==11.3.3.83 [pip3] nvidia-cufile-cu12==1.13.1.3 [pip3] nvidia-curand-cu12==10.3.9.90 [pip3] nvidia-cusolver-cu12==11.7.3.90 [pip3] nvidia-cusparse-cu12==12.5.8.93 [pip3] nvidia-cusparselt-cu12==0.7.1 [pip3] nvidia-cutlass-dsl==4.4.1 [pip3] nvidia-cutlass-dsl-libs-base==4.4.1 [pip3] nvidia-ml-py==13.590.48 [pip3] nvidia-nccl-cu12==2.27.5 [pip3] nvidia-nvjitlink-cu12==12.8.93 [pip3] nvidia-nvshmem-cu12==3.4.5 [pip3] nvidia-nvtx-cu12==12.8.90 [pip3] pyzmq==27.1.0 [pip3] torch==2.10.0 [pip3] torch_c_dlpack_ext==0.1.5 [pip3] torchaudio==2.10.0 [pip3] torchvision==0.25.0 [pip3] transformers==4.57.6 [pip3] triton==3.6.0 [conda] Could not collect

============================== vLLM Info

ROCM Version : Could not collect vLLM Version : 0.17.1 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled GPU Topology: GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X PHB 0-31 0 N/A GPU1 PHB X 0-31 0 N/A

Legend:

X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks

============================== Environment Variables

PYTORCH_NVML_BASED_CUDA_CHECK=1 TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_rnd

</details>

🐛 Describe the bug

GPT-OSS-20B /v1/chat/completions streaming crashes with tool_call_parser=openai (IndexError in chat_completion_stream_generator)

I’m seeing a reproducible crash with openai/gpt-oss-20b on vLLM 0.17.1 when using the OpenAI-compatible chat completions endpoint with streaming and tool parsing enabled.

The failure happens in chat_completion_stream_generator with:

IndexError: list index out of range

The relevant traceback is:

(APIServer pid=1) ERROR [serving.py:1390] Error in chat completion stream generator.
(APIServer pid=1) ERROR [serving.py:1390] Traceback (most recent call last):
(APIServer pid=1) ERROR [serving.py:1390]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 1262, in chat_completion_stream_generator
(APIServer pid=1) ERROR [serving.py:1390]     args = tool_parser.prev_tool_call_arr[index].get(
(APIServer pid=1) ERROR [serving.py:1390]            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
(APIServer pid=1) ERROR [serving.py:1390] IndexError: list index out of range

What I’m running

Model

openai/gpt-oss-20b

vLLM

0.17.1

Important server args / non-default config

tool_call_parser: "openai"
enable_auto_tool_choice: true
reasoning_parser: "openai_gptoss"
quantization: "mxfp4"
kv_cache_dtype: "fp8_e4m3"
tensor_parallel_size: 2
distributed_executor_backend: "mp"
max_model_len: 100000
enable_prefix_caching: true
enable_chunked_prefill: true
async_scheduling: true
stream_interval: 20

Server log confirms the relevant config and behavior:

non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'openai', ... 'reasoning_parser': 'openai_gptoss', ...}

(APIServer pid=1) INFO [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING [serving.py:225] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.

Server startup

The server is started using Docker with the following command:

sudo docker run -d \
  --name vllm-gptoss \
  --restart unless-stopped \
  --gpus all \
  --ipc=host \
  -p 8080:8080 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e HF_HOME="/root/.cache/huggingface" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd)/GPT-OSS_MAXED.yaml:/etc/vllm/config.yaml:ro" \
  vllm/vllm-openai:deploy \
  --config /etc/vllm/config.yaml \
  --served-model-name vllm/openai/gpt-oss-20b

The full configuration file (GPT-OSS_MAXED.yaml)

model: "openai/gpt-oss-20b" quantization: "mxfp4" kv_cache_dtype: "fp8_e4m3" tensor-parallel-size: 2 distributed-executor-backend: "mp" tool_call_parser: "openai" enable_auto_tool_choice: true reasoning_parser: "openai_gptoss" enable-chunked-prefill: true long-prefill-token-threshold: 8192 max-num-partial-prefills: 1 max-long-partial-prefills: 1 prefix-caching-hash-algo: "sha256_cbor" disable-uvicorn-access-log: true cudagraph-capture-sizes: [1,2,4,8,16,32,64,128,256,512,1024] compilation_config: '{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}' async-scheduling: true max-cudagraph-capture-size: 1024 stream-interval: 20 gpu-memory-utilization: 0.90 max-model-len: 100000 max-num-batched-tokens: 98304 max-num-seqs: 128 enable-prefix-caching: true host: "0.0.0.0" port: 8080

Endpoint / request shape

This happens on:

/v1/chat/completions
stream=true

The crash occurs during streamed generation after the request has started normally.

I’m reproducing this through Open WebUI. At the moment Open WebUI does not fully support the OpenAI /v1/responses workflow for GPT-OSS-style models, so in practice this model is exercised through /v1/chat/completions in my setup.

There are existing Open WebUI feature requests / issues related to incomplete Responses API support and GPT-OSS response parsing. However, this report is specifically about the vLLM server-side crash occurring in the chat-completions streaming path.

Expected behavior

/v1/chat/completions with streaming enabled should not crash the API server when GPT-OSS emits tool-related output.

Actual behavior

The API server crashes the streaming generator with:

IndexError: list index out of range

from:

tool_parser.prev_tool_call_arr[index]

Why this looks like a bug

This appears to be an out-of-sync state issue between streamed tool-call parsing and GPT-OSS output/reasoning handling in the chat-completions path.

The logs also show:

For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.

So GPT-OSS seems to always enter the tool-use path here, even if the frontend/request behavior may not align with what chat_completion_stream_generator expects.

Reproduction notes

The issue is reproducible when these conditions are present together:

reasoning_parser=openai_gptoss
tool_call_parser=openai
/v1/chat/completions
stream=true

Workarounds

The crash can be avoided by disabling the tool parsing path (e.g. removing tool_call_parser / tool usage for this model when using /v1/chat/completions).

The model otherwise loads and serves normally; the failure appears specific to the streamed chat-completions tool-calling path.

Full startup / runtime logs

(see logs attached below)

Environment

Full environment information is provided and generated by collect_env.py.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To resolve the IndexError: list index out of range issue in the chat_completion_stream_generator, we need to ensure that the index used to access tool_parser.prev_tool_call_arr is within the valid range of the list.

Here are the steps to fix the issue:

Check the length of tool_parser.prev_tool_call_arr: Before accessing the list with the index, check if the index is less than the length of the list.
Handle out-of-range indices: If the index is out of range, handle it accordingly. This could involve skipping the current iteration, logging an error, or using a default value.
Verify the logic of index calculation: Review the code that calculates the index to ensure it is correct and does not exceed the list length.

Example code to handle out-of-range indices:

if index < len(tool_parser.prev_tool_call_arr):
    args = tool_parser.prev_tool_call_arr[index].get(...)
else:
    # Handle out-of-range index, e.g., log an error or use a default value
    logging.error(f"Index {index} out of range for tool_parser.prev_tool_call_arr")
    # Use a default value or skip the current iteration
    args = {}

Verification

To verify that the fix worked, test the chat_completion_stream_generator function with different inputs and edge cases, including:

Valid indices within the range of tool_parser.prev_tool_call_arr
Out-of-range indices
Empty tool_parser.prev_tool_call_arr list

Monitor the logs and output to ensure that the function behaves as expected and does not raise any IndexError exceptions.

Extra Tips

Review the documentation and code comments to understand the expected behavior of tool_parser.prev_tool_call_arr and index.
Consider adding additional logging or debugging statements to help diagnose issues with the index calculation or list access.
If the issue persists, try to reproduce it with a minimal example and provide more details about the input data and environment.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

/v1/chat/completions with streaming enabled should not crash the API server when GPT-OSS emits tool-related output.

#api #ssr #output truncation #response parsing #model loading #environment variable

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

vllm - ✅(Solved) Fix [Bug]: GPT-OSS-20B /v1/chat/completions streaming crashes with tool_call_parser=openai (IndexError in chat_completion_stream_generator) [2 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

============================== CPU Info

Workarounds

PR fix notes

PR #36888: fix(serving): add bounds check for prev_tool_call_arr in streaming tool calls

Description (problem / solution / changelog)

Summary

Root Cause

Fix

Changed files

PR #37958: [Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser

Description (problem / solution / changelog)

Purpose

Test Plan

Test Result

Changed files

Code Example

My current environment

============================== System Info

============================== PyTorch Info

============================== Python Environment

============================== CUDA / GPU Info

============================== CPU Info

============================== Versions of relevant libraries

============================== vLLM Info

============================== Environment Variables

🐛 Describe the bug

GPT-OSS-20B /v1/chat/completions streaming crashes with tool_call_parser=openai (IndexError in chat_completion_stream_generator)

What I’m running

Server startup

Endpoint / request shape

Expected behavior

Actual behavior

Why this looks like a bug

Reproduction notes

Workarounds

Full startup / runtime logs

Environment

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING