vllm - ✅ (Solved) Fix [Bug]: tokenizing long redundant sequences causes API server deadlock (harmony and others) [1 pull request, 8 comments, 4 participants]

GitHub stats
vllm-project/vllm#38266 · Fetched 2026-04-08 01:36:52
Comments: 8
Participants: 4
Timeline: 28
Reactions: 0
Timeline (top): commented ×8, subscribed ×6, mentioned ×5, referenced ×5

Error Message

parser.error("-size must be provided.")

Root Cause

  • Long sequences of similar characters can take a very long time to tokenize. Depending on the tokenizer implementation, processing can be quite slow: with Harmony, for instance, up to 10 minutes for 1M characters and 5 s for 100k characters.

Fix Action

Fix / Workaround

PR fix notes

PR #38318: fix(harmony): run render_for_completion in thread pool to unblock event loop

Description (problem / solution / changelog)

Purpose

Resolves #38266. When a request contains a very long repetitive sequence, the synchronous Harmony tokenizer call (render_for_completion) could block the asyncio event loop for minutes, making the server unresponsive to all other requests.

Fix: wrap render_for_completion with make_async() (run_in_executor) and propagate async/await through the entire call chain (_make_request_with_harmony, HarmonyContext.render_for_completion, StreamingHarmonyContext.render_for_completion) so the event loop is free to serve concurrent requests while tokenization runs in a thread. The same fix is applied to OpenAIServingRender._make_request_with_harmony in the Chat Completions path.
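
As a rough illustration of the pattern described above, the sketch below offloads a blocking call to a worker thread with asyncio's run_in_executor. It does not use vLLM's make_async or the Harmony tokenizer: slow_tokenize, heartbeat, and run are hypothetical names, and time.sleep stands in for the tokenizer work. It also assumes the blocking call releases the GIL (time.sleep does); a call that holds the GIL for its whole duration would still stall the event loop even when run from a worker thread.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_tokenize(text: str) -> list[int]:
    """Hypothetical stand-in for a slow, synchronous tokenizer call."""
    time.sleep(0.2)  # simulated blocking work; time.sleep releases the GIL
    return [ord(ch) for ch in text]


async def heartbeat(ticks: list[float], stop: asyncio.Event) -> None:
    """Record a timestamp roughly every 10 ms while the event loop is free."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def run(offload: bool) -> tuple[list[int], int]:
    """Tokenize once and report how often the heartbeat ran in the meantime."""
    loop = asyncio.get_running_loop()
    ticks: list[float] = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    if offload:
        # The blocking call runs in a worker thread; the loop keeps serving.
        with ThreadPoolExecutor(max_workers=1) as pool:
            tokens = await loop.run_in_executor(pool, slow_tokenize, "aaaa")
    else:
        # The blocking call runs on the event loop thread and stalls it.
        tokens = slow_tokenize("aaaa")
    stop.set()
    await beat
    return tokens, len(ticks)


if __name__ == "__main__":
    for offload in (False, True):
        _, tick_count = asyncio.run(run(offload))
        print(f"offload={offload}: heartbeat ran {tick_count} time(s)")
```

With offload=False the heartbeat barely runs while the call is in progress; with offload=True it keeps ticking, which is the behavior the fix aims for with concurrent requests.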

Checked for existing PRs addressing #38266 — none found as of March 26, 2026.

Test Plan

Add a unit test, test_render_for_completion_async.py, to ensure all callers of the function are async and that the function itself releases the event loop.

Also run existing tests that call _make_request_with_harmony, since those call sites were updated to add await.

.venv/bin/python -m pytest tests/entrypoints/openai/parser/test_render_for_completion_async.py -v
.venv/bin/python -m pytest tests/entrypoints/openai/responses/test_serving_responses.py -v

Chat Completion test (requires model artifacts)

.venv/bin/python -m pytest tests/entrypoints/openai/chat_completion/test_serving_chat.py -v
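
The "all callers are async" check described in the test plan can be sketched with inspect.iscoroutinefunction. This is only an assumption about how such a check might look, not the content of test_render_for_completion_async.py; ExampleContext and assert_is_coroutine_method are hypothetical names.

```python
import asyncio
import inspect


class ExampleContext:
    """Hypothetical stand-in; not the real HarmonyContext."""

    async def render_for_completion(self) -> list[int]:
        await asyncio.sleep(0)  # yield to the event loop once
        return [1, 2, 3]


def assert_is_coroutine_method(cls: type, name: str) -> None:
    """Fail if cls.name is not declared with 'async def'."""
    method = getattr(cls, name)
    if not inspect.iscoroutinefunction(method):
        raise AssertionError(
            f"{cls.__name__}.{name} must be declared with 'async def'"
        )


if __name__ == "__main__":
    assert_is_coroutine_method(ExampleContext, "render_for_completion")
    print(asyncio.run(ExampleContext().render_for_completion()))
```

A check like this only catches a signature regression. Whether the event loop is actually released still needs a runtime test.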

Test Result

This fix was developed with AI assistance (Claude). I have reviewed every changed line and personally ran the first two test commands above.

Note: test_serving_chat.py::TestGPTOSSChat requires the Harmony vocab file (not available in a standard dev environment) and was not run locally. The await fixes to that file are mechanical call-site corrections that follow directly from the async signature change.

.venv/bin/python -m pytest tests/entrypoints/openai/parser/test_render_for_completion_async.py -v
====================================== test session starts ======================================
platform darwin -- Python 3.12.13, pytest-9.0.2, pluggy-1.6.0 -- /Users/tzhou/Documents/GitHub/vllm/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/tzhou/Documents/GitHub/vllm
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 9 items                                                                               

tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestMakeAsyncReleasesEventLoop::test_blocking_call_in_executor_allows_other_coroutines PASSED [ 11%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestMakeAsyncReleasesEventLoop::test_make_async_returns_awaitable PASSED [ 22%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_render_for_completion_async_exported PASSED [ 33%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_harmony_context_render_for_completion_is_coroutine PASSED [ 44%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_streaming_harmony_context_render_for_completion_is_coroutine PASSED [ 55%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_make_request_with_harmony_is_coroutine_in_render_serving PASSED [ 66%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_make_request_with_harmony_is_coroutine_in_responses_serving PASSED [ 77%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestGetEncodingThreadSafety::test_get_encoding_initialized_exactly_once_under_concurrency PASSED [ 88%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncResult::test_async_wrapper_preserves_return_value PASSED [100%]
.venv/bin/python -m pytest tests/entrypoints/openai/responses/test_serving_responses.py -v
====================================== test session starts ======================================
platform darwin -- Python 3.12.13, pytest-9.0.2, pluggy-1.6.0 -- /Users/tzhou/Documents/GitHub/vllm/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/tzhou/Documents/GitHub/vllm
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 16 items                                                                              

tests/entrypoints/openai/responses/test_serving_responses.py::test_extract_tool_types PASSED [  6%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestInitializeToolSessions::test_initialize_tool_sessions PASSED [ 12%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestInitializeToolSessions::test_validate_create_responses_input PASSED [ 18%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestValidateGeneratorInput::test_validate_generator_input PASSED [ 25%]
tests/entrypoints/openai/responses/test_serving_responses.py::test_reasoning_tokens_counted_for_text_reasoning_model PASSED [ 31%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_basic_formats PASSED [ 37%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_star_normalization PASSED [ 43%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_filters_non_mcp PASSED [ 50%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_delta_emits_text_events PASSED [ 56%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_delta_second_token_no_added PASSED [ 62%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_commentary_with_function_recipient_not_preamble PASSED [ 68%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_done_emits_text_done_events PASSED [ 75%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_commentary_with_recipient_no_preamble_done PASSED [ 81%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_mixed_delta_reasoning_and_content_emits_reasoning_delta PASSED [ 87%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_transition_without_mixed_delta_no_extra_reasoning_event PASSED [ 93%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_reasoning_only_stream_no_content PASSED [100%]

Changed files

  • tests/entrypoints/openai/chat_completion/test_serving_chat.py (modified, +52/-39)
  • tests/entrypoints/openai/parser/test_render_for_completion_async.py (added, +200/-0)
  • tests/entrypoints/openai/responses/test_serving_responses.py (modified, +1/-1)
  • vllm/entrypoints/openai/chat_completion/batch_serving.py (modified, +1/-1)
  • vllm/entrypoints/openai/parser/harmony_utils.py (modified, +11/-1)
  • vllm/entrypoints/openai/responses/context.py (modified, +8/-8)
  • vllm/entrypoints/openai/responses/serving.py (modified, +5/-5)
  • vllm/entrypoints/serve/render/serving.py (modified, +4/-4)

Code Example

import argparse
import json


def generate_query(target_size=None):
    data = {
        "model": "openai/gpt-oss-120b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello there"},
        ],
        "max_completion_tokens": 500,
    }
    if target_size is not None:
        # Pad the content to reach the target size
        current_size = len(json.dumps(data).encode("utf-8"))
        if current_size < target_size:
            extra_content = "a" * (target_size - current_size)
            data["messages"][1]["content"] += extra_content
    return data


if __name__ == "__main__":
    # Set up argument parsing
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-s",
        "--size",
        type=int,
        default=1,
        help="Target payload size in bytes (default: 1)",
    )
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        default="big-payload.json",
        help="Output file path (default: big-payload.json)",
    )
    args = parser.parse_args()

    # Validate arguments
    if args.size is None:
        parser.error("-size must be provided.")

    with open(args.output, "w") as f:
        total_bytes = args.size
        data = generate_query(total_bytes)
        f.write(json.dumps(data) + "\n")

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-106-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3

Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  32
On-line CPU(s) list:                     0-31
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8452Y
CPU family:                              6
Model:                                   143
Thread(s) per core:                      1
Core(s) per socket:                      32
Socket(s):                               1
Stepping:                                8
BogoMIPS:                                4000.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq dtes64 vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                          VT-x
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               1 MiB (32 instances)
L1i cache:                               1 MiB (32 instances)
L2 cache:                                128 MiB (32 instances)
L3 cache:                                16 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-31
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Mitigation; TSX disabled
Vulnerability Vmscape:                   Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu129
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0+cu129
[pip3] torchvision==0.25.0+cu129
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.0
vLLM Build Flags:
  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    0-31    0               N/A
GPU1    NV18     X      0-31    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=void
NVIDIA_REQUIRE_CUDA=cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 
brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571
TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.9.1
VLLM_ENABLE_CUDA_COMPATIBILITY=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
NVIDIA_CTK_LIBCUDA_DIR=/usr/lib/x86_64-linux-gnu
VLLM_LOGGING_LEVEL=debug
VLLM_LOGGING_CONFIG_PATH=
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

Hello,

It seems there is a critical issue with the tokenization implementation in the web server.

When a query contains a long run of redundant characters, the server takes a long time to process it and becomes unresponsive.

While I focused on v0.17.0, I reproduced the same behavior with v0.18.0 and with the VLLM_USE_V2_MODEL_RUNNER=1 environment variable set.

There are two separate issues I want to emphasize:

  • The tokenizer runtime for a given HTTP query should not affect the serving of other concurrent requests. Currently it does, and this makes the service unavailable.

  • Long sequences of similar characters can take a very long time to process. Depending on the tokenizer implementation it can be quite slow: with harmony, for instance, up to 10 minutes for 1M characters and 5 s for 100k characters.

Taking the gpt-oss models (120b/20b) as an example, when performing a call to /v1/chat/completions, this method is called to perform the tokenization.

This is the function where most of the time is spent. I tried to make the whole code path async (with the help of the make_async wrapper), but unfortunately it had no effect on the concurrency aspect of the issue.

<img width="1902" height="74" alt="Image" src="https://github.com/user-attachments/assets/fdfdad22-8fbe-4d7e-a369-72cd9b1d2f32" />

My guess is that something is wrong in the web serving part, since I assume each request should ideally be encapsulated inside its own dedicated coroutine?
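As a general asyncio point, separate from vLLM's actual code: giving each request its own coroutine does not by itself isolate requests, because all coroutines share one event loop thread and a synchronous call inside any of them stalls the rest. The sketch below illustrates this with a hypothetical `slow_tokenize` stand-in (it only sleeps), not the harmony tokenizer. Note that `time.sleep` releases the GIL; whether a real tokenizer does so when moved to a thread is an assumption this issue does not confirm, and it could matter for why wrapping the code path did not help.

```python
import asyncio
import time


def slow_tokenize(text: str) -> int:
    """Hypothetical stand-in for a slow, synchronous tokenizer call."""
    time.sleep(0.3)  # simulates blocking work
    return len(text)


async def other_request() -> float:
    """A trivial concurrent request; returns how long it took to finish."""
    start = time.monotonic()
    await asyncio.sleep(0.01)
    return time.monotonic() - start


async def handle(offload: bool) -> float:
    """Run one slow tokenization next to one trivial request.

    Returns the latency observed by the trivial request.
    """
    other = asyncio.create_task(other_request())
    await asyncio.sleep(0)  # yield once so the other request can start
    if offload:
        # The blocking call runs in a worker thread; the loop stays free.
        await asyncio.to_thread(slow_tokenize, "a" * 1000)
    else:
        # The blocking call runs on the event loop thread; nothing else runs.
        slow_tokenize("a" * 1000)
    return await other


if __name__ == "__main__":
    print("inline:   ", round(asyncio.run(handle(offload=False)), 3), "s")
    print("offloaded:", round(asyncio.run(handle(offload=True)), 3), "s")
```

In the inline case the trivial request cannot complete until the blocking call returns, even though it lives in its own task.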

How to reproduce

Generate a payload with a long repetitive sequence:

import argparse
import json


def generate_query(target_size=None):
    data = {
        "model": "openai/gpt-oss-120b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello there"},
        ],
        "max_completion_tokens": 500,
    }
    if target_size is not None:
        # Pad the content to reach the target size
        current_size = len(json.dumps(data).encode("utf-8"))
        if current_size < target_size:
            extra_content = "a" * (target_size - current_size)
            data["messages"][1]["content"] += extra_content
    return data


if __name__ == "__main__":
    # Set up argument parsing
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-s",
        "--size",
        type=int,
        default=1,
        help="Target payload size in bytes (default: 1)",
    )
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        default="big-payload.json",
        help="Output file path (default: big-payload.json)",
    )
    args = parser.parse_args()

    # Validate arguments (--size has a default, so check its value, not None)
    if args.size <= 0:
        parser.error("--size must be a positive number of bytes.")

    with open(args.output, "w") as f:
        total_bytes = args.size
        data = generate_query(total_bytes)
        f.write(json.dumps(data) + "\n")

  • with about 1 MB of padding (roughly 1M repeated characters)

python3 generate-big-payload.py -s 1000000

  • perform the query and ⚠️ deadlock the server ⚠️

curl https://<address>/v1/chat/completions -d @big-payload.json -H "Content-Type: application/json"

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the issue of the tokenizer runtime impacting serving other concurrent requests and the high processing time caused by long sequences of similar characters, we can implement the following steps:

  • Async Tokenization: Run the tokenization function in a thread pool or process pool so that it does not block the event loop that serves other requests.
  • Timeout and Cancellation: Implement a timeout and cancellation mechanism for the tokenization task to prevent it from running indefinitely and to allow for cancellation when a request is cancelled or times out.
  • Rate Limiting: Implement rate limiting on the number of concurrent tokenization tasks to prevent overwhelming the server with too many requests.

Here's an example of how you can modify the harmony_utils.py file to use async tokenization:

import asyncio
from concurrent.futures import ThreadPoolExecutor

# Shared pool with 5 worker threads, created once rather than per call
_executor = ThreadPoolExecutor(max_workers=5)

async def tokenize_async(input_text):
    loop = asyncio.get_running_loop()
    # Run the blocking tokenizer in the pool so the event loop stays free
    future = loop.run_in_executor(_executor, tokenize, input_text)
    try:
        # Stop waiting after 10 seconds
        result = await asyncio.wait_for(future, timeout=10)
    except asyncio.TimeoutError:
        # Only the wait is abandoned: a worker thread that has already
        # started cannot be interrupted and keeps running in the background
        result = None
    return result

def tokenize(input_text):
    # Original tokenization function implementation
    # ...
    pass

To limit the number of concurrent tokenization tasks, you can use asyncio.Semaphore from the standard library:

import asyncio
from asyncio import Semaphore

sem = Semaphore(5)  # Allow up to 5 concurrent tokenization tasks

async def tokenize_async(input_text):
    async with sem:
        # Submit the tokenization task to the thread pool
        result = await asyncio.to_thread(tokenize, input_text)
    return result
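To check that such a cap really holds, the sketch below counts how many tokenization calls are in flight at once. The `tokenize` function here is a hypothetical placeholder that sleeps, and `run_batch` is an illustrative helper, not part of vLLM. The semaphore is created inside the running coroutine, which avoids binding it to a different event loop on older Python versions.

```python
import asyncio
import time


def tokenize(text: str) -> list[str]:
    """Hypothetical placeholder tokenizer; sleeps to mimic slow work."""
    time.sleep(0.05)
    return text.split()


async def run_batch(texts: list[str], limit: int) -> tuple[list[list[str]], int]:
    """Tokenize all texts with at most `limit` calls in flight.

    Returns the results and the peak number of concurrent calls observed.
    """
    sem = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def one(text: str) -> list[str]:
        nonlocal active, peak
        async with sem:
            # Counters are only touched on the event loop thread.
            active += 1
            peak = max(peak, active)
            try:
                return await asyncio.to_thread(tokenize, text)
            finally:
                active -= 1

    results = await asyncio.gather(*(one(t) for t in texts))
    return list(results), peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_batch(["a b"] * 6, limit=2))
    print("peak concurrency:", peak)
```

A cap like this bounds how many worker threads slow inputs can occupy, but it does not shorten any single tokenization.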

Verification

To verify that the fix worked, you can test the server with a large payload and check that it no longer deadlocks and that the tokenization task completes within the expected time limit.

You can use the generate-big-payload.py script to generate a large payload and test the server:

python3 generate-big-payload.py -s 1000000
curl https://<address>/v1/chat/completions -d @big-payload.json -H "Content-Type: application/json"

Monitor the server's performance and check that it can handle multiple concurrent requests without deadlocking.
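One way to make "without deadlocking" measurable is to track event loop lag: how late a short `asyncio.sleep` wakes up. The sketch below is a generic asyncio probe under that assumption; `max_loop_lag` and `demo` are illustrative names, and the `time.sleep` call merely stands in for a blocking tokenizer call.

```python
import asyncio
import time


async def max_loop_lag(duration: float, interval: float = 0.01) -> float:
    """Return the worst delay (seconds) between a sleep's due time and its wake-up."""
    loop = asyncio.get_running_loop()
    worst = 0.0
    end = loop.time() + duration
    while loop.time() < end:
        start = loop.time()
        await asyncio.sleep(interval)
        # Anything beyond `interval` is time the loop was busy elsewhere.
        worst = max(worst, loop.time() - start - interval)
    return worst


async def demo() -> tuple[float, float]:
    """Measure lag on an idle loop, then while a blocking call runs on the loop."""
    idle = await max_loop_lag(0.2)
    monitor = asyncio.create_task(max_loop_lag(0.3))
    await asyncio.sleep(0.05)  # let the monitor take a few samples
    time.sleep(0.2)  # deliberately blocks the loop
    busy = await monitor
    return idle, busy


if __name__ == "__main__":
    idle, busy = asyncio.run(demo())
    print(f"idle lag: {idle:.3f}s, lag while blocked: {busy:.3f}s")
```

If the fix works, the lag observed while a large payload is being tokenized should stay close to the idle value.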

Extra Tips

To prevent similar issues in the future, consider implementing the following:

  • Monitoring and Logging: Implement monitoring and logging to detect and diagnose performance issues and errors.
  • Load Testing: Perform regular load testing to identify performance bottlenecks and optimize the server for high traffic.
  • Code Review: Regularly review code changes to ensure that they do not introduce performance issues.
