vllm - [Bug]: Non-monotonic latency at ctx=8192: concurrency 1 slower than concurrency 2 (local vLLM) [3 comments, 2 participants]

vllm · 2026-03-16 22:52:26


GitHub stats

vllm-project/vllm#37235•Fetched 2026-04-08 00:48:37


Author: NoahLundSyrdal

Participants: KrxGu, NoahLundSyrdal

Timeline (top): commented ×3, closed ×1, labeled ×1

I observed non-monotonic latency behavior when serving long-context requests locally with vLLM.

For a ctx=8192 prompt bucket, p95 latency at concurrency=1 was significantly worse than concurrency=2.

Example:

concurrency=1 → p95 ≈ 30.29s
concurrency=2 → p95 ≈ 6.37s

This inversion (~4.7× difference) did not appear in smaller context buckets (512 or 2048).

Root Cause

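The issue does not establish a root cause. According to the vLLM documentation chatbot and related issues (e.g. #3096 and #4498), latency, especially TTFT (time to first token), typically increases with concurrency for long contexts because requests queue behind the prefill stage. The measured inversion runs against that expectation.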
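The expectation can be illustrated with a toy queueing model. This is a sketch under a strong assumption: all requests arrive together, prefills run strictly one at a time, and each takes the same fixed time. It is not how vLLM's scheduler is implemented, and `prefill_seconds` is an invented parameter.

```python
def ttft_under_serial_prefill(num_concurrent, prefill_seconds):
    """Return the time to first token of each request, in arrival order.

    Toy assumption: all requests arrive at t=0 and their prefills are
    processed one after another, each taking prefill_seconds.
    """
    return [prefill_seconds * (i + 1) for i in range(num_concurrent)]


def worst_ttft(num_concurrent, prefill_seconds):
    # The last request in the queue waits for every prefill ahead of it.
    return max(ttft_under_serial_prefill(num_concurrent, prefill_seconds))
```

Under this model the worst TTFT at concurrency=2 is twice the value at concurrency=1, which is why a lower p95 at concurrency=2 is surprising.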

Environment (output of python collect_env.py)

OS: Ubuntu 24.04.4 LTS
Python: 3.12.3
PyTorch: 2.10.0+cpu
CUDA available: False
vLLM version: 0.15.0


Prior investigation

Contrary to the expected increase in latency with concurrency, my experiments with a ctx=8192 prompt bucket showed the opposite trend between concurrency=1 and concurrency=2:

concurrency=1 → p95 ≈ 30.29s
concurrency=2 → p95 ≈ 6.37s

This suggests something unusual in how the scheduler or batching interacts with very long prompts at low concurrency.
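One way to probe this is to check whether the concurrency=1 p95 comes from a few very slow requests or from uniformly slow ones. The sketch below is a generic diagnostic and not something taken from the issue. The function name, the `slow_factor` threshold, and the sample latencies are all made up for illustration.

```python
import statistics


def summarize_latencies(latencies, slow_factor=3.0):
    """Report the median, the maximum, and which requests were unusually slow.

    slow_factor is an arbitrary threshold chosen for this sketch: a request
    counts as slow when it took more than slow_factor times the median.
    """
    median = statistics.median(latencies)
    slow = [i for i, value in enumerate(latencies) if value > slow_factor * median]
    return {"median": median, "max": max(latencies), "slow_indices": slow}


# Synthetic latencies in seconds, NOT measurements from the issue:
# one slow first request followed by four fast ones.
example = [30.0, 6.0, 6.1, 5.9, 6.2]
print(summarize_latencies(example))
```

With fewer than 20 samples, a nearest-rank p95 equals the maximum, so a single slow request is enough to set it. Per-request latencies are therefore worth inspecting before blaming the scheduler.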

Serve command used for the experiment:

vllm serve Qwen/Qwen2-0.5B-Instruct --host 0.0.0.0 --port 8000

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The proposed fix is to adjust how long-context requests are batched at low concurrency. Nothing in the issue confirms that batching is the cause, so treat the steps below as an experiment rather than a verified fix.

Step-by-Step Solution:

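Update vLLM configuration: Adjust the batching-related engine arguments for long-context requests, for example --max-num-batched-tokens, --max-num-seqs, or --enable-chunked-prefill. These are passed on the vllm serve command line or in a YAML file given to --config, not in a vllm_config.json file.
Implement custom batching logic: vLLM does not document a setting for plugging in a user-supplied batching function, so the snippet below only illustrates the grouping idea outside of vLLM.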

Example code snippet:

# Standalone illustration of the grouping idea; it does not plug into vLLM.

def custom_batching_function(requests, ctx_size, max_batch_size=2):
    """Split a list of requests into batches of at most max_batch_size."""
    # A lone long-context request is dispatched immediately in its own batch.
    if ctx_size == 8192 and len(requests) == 1:
        return [list(requests)]
    # Default: fixed-size batches in arrival order; the last one may be short.
    return [
        list(requests[i:i + max_batch_size])
        for i in range(0, len(requests), max_batch_size)
    ]

# Hypothetical configuration: 'batching_function' is not a documented vLLM
# option, so this dict is illustrative only and has no effect on a server.
vllm_config = {
    'batching_function': custom_batching_function,
    # Other configuration parameters...
}

Apply the updated configuration: Restart the vLLM server with the updated configuration.

Verification

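To verify the fix, re-run the experiment with the updated configuration and measure the p95 latency at concurrency=1 and concurrency=2.

Example command (config.yaml is a placeholder name; vllm serve reads --config files as YAML, so a vllm_config.json holding a Python function would not work):

vllm serve Qwen/Qwen2-0.5B-Instruct --host 0.0.0.0 --port 8000 --config config.yaml

Compare the new p95 values with the previous measurements (≈ 30.29s at concurrency=1 and ≈ 6.37s at concurrency=2), using the same ctx=8192 prompts and request counts.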
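The measurement itself can be sketched as a small load runner. This is a minimal sketch, not the benchmark used in the issue: `measure`, `p95`, and `fake_request` are hypothetical names, and `fake_request` sleeps instead of contacting a server so that the example runs anywhere.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def measure(send_request, num_requests, concurrency):
    """Call send_request() num_requests times using `concurrency` workers.

    Returns the per-request wall-clock latencies in seconds.
    """
    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(num_requests)))


def fake_request():
    # Stand-in for an HTTP call to the server; nothing is sent.
    time.sleep(0.01)


if __name__ == "__main__":
    for concurrency in (1, 2):
        latencies = measure(fake_request, num_requests=20, concurrency=concurrency)
        print(f"concurrency={concurrency} p95={p95(latencies):.3f}s")
```

To measure the real server, replace `fake_request` with a function that posts a ctx=8192 prompt to the endpoint started by the serve command, and keep the request count identical across concurrency levels so the percentiles are comparable.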

Extra Tips

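Monitor system resources while benchmarking and adjust the batching parameters accordingly. The reported environment is CPU-only (PyTorch 2.10.0+cpu, CUDA available: False), so CPU utilization and memory are the resources to watch.
Consider a batching strategy that takes both context size and concurrency level into account, for example by capping the total number of prompt tokens per batch instead of the number of requests.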
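A context-aware strategy could, for instance, group requests by a token budget rather than by a fixed request count. The following is a generic sketch of that idea, assuming each request is represented only by its prompt length. `batch_by_token_budget` and `max_tokens_per_batch` are hypothetical names, and the code does not reproduce vLLM's scheduler.

```python
def batch_by_token_budget(prompt_lengths, max_tokens_per_batch):
    """Greedily group requests, in arrival order, under a token budget.

    Each batch's total prompt tokens stays within max_tokens_per_batch.
    A single request larger than the budget still gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for length in prompt_lengths:
        # Close the current batch if this request would exceed the budget.
        if current and used + length > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches
```

With a budget of 8192 tokens, short prompts share a batch while an 8192-token prompt is scheduled alone, so batch size shrinks automatically as context grows.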
