vllm - 💡(How to fix) Fix [Bug]: Non-monotonic latency at ctx=8192: concurrency 1 slower than concurrency 2 (local vLLM) [3 comments, 2 participants]

vllm-project/vllm#37235 · Fetched 2026-04-08 00:48:37
Comments: 3 · Participants: 2 · Timeline events: 5 · Reactions: 0
Timeline (top): commented ×3, closed ×1, labeled ×1


Your current environment

The output of python collect_env.py:

OS: Ubuntu 24.04.4 LTS
Python: 3.12.3
PyTorch: 2.10.0+cpu
CUDA available: False
vLLM version: 0.15.0

🐛 Describe the bug

Summary

I observed non-monotonic latency behavior when serving long-context requests locally with vLLM.

For a ctx=8192 prompt bucket, p95 latency at concurrency=1 was significantly worse than concurrency=2.

Example:

  • concurrency=1 → p95 ≈ 30.29s
  • concurrency=2 → p95 ≈ 6.37s

This inversion (~4.7× difference) did not appear in smaller context buckets (512 or 2048).
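As a quick check on the quoted figure, the sketch below recomputes the ratio between the two reported p95 values and shows one common way to derive a p95 from raw latency samples (nearest-rank). The issue does not say which percentile method was used, so `percentile_nearest_rank` is an assumption made for illustration.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the value at or above pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# p95 values reported in the issue for the ctx=8192 bucket (seconds).
p95_c1 = 30.29
p95_c2 = 6.37

ratio = p95_c1 / p95_c2
print(f"concurrency=1 is {ratio:.2f}x slower than concurrency=2 at p95")
```

With 20 or fewer samples per bucket, a nearest-rank p95 is simply the largest or second-largest latency, so one slow request at concurrency=1 would be enough to produce a gap of this size.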

Prior investigation

According to the vLLM documentation chatbot and related issues (e.g. #3096 and #4498), latency—especially TTFT—typically increases with concurrency for long contexts because requests queue behind the prefill stage.

However, in my experiments with a ctx=8192 prompt bucket, I observed the opposite behavior between concurrency=1 and concurrency=2:

  • concurrency=1 → p95 ≈ 30.29s
  • concurrency=2 → p95 ≈ 6.37s

This suggests something unusual in how the scheduler or batching interacts with very long prompts at low concurrency.
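To make the expected behavior concrete, here is a toy model of the queueing effect described above. It assumes prefills are processed strictly one at a time and that each takes the same time; real vLLM scheduling (batched or chunked prefill) is more involved, and `serialized_ttft` is a hypothetical helper, not part of vLLM.

```python
def serialized_ttft(prefill_seconds, concurrency):
    """Time to first token for each of `concurrency` requests that arrive
    together, when prefills run one after another (toy model)."""
    return [prefill_seconds * (i + 1) for i in range(concurrency)]


def worst_ttft(prefill_seconds, concurrency):
    """Worst-case TTFT in the toy model: the last request in the queue."""
    return max(serialized_ttft(prefill_seconds, concurrency))


for level in (1, 2, 4):
    print(level, serialized_ttft(2.0, level))
```

Under this model the worst TTFT grows linearly with concurrency, which is the opposite of the measured drop from concurrency=1 to concurrency=2.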

Serve command used for the experiment:

vllm serve Qwen/Qwen2-0.5B-Instruct --host 0.0.0.0 --port 8000
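The issue does not include the benchmark client, so the following is only a minimal sketch of how latency at a fixed concurrency could be measured against this server. It assumes the OpenAI-compatible `/v1/completions` endpoint on `localhost:8000`; `max_tokens=16` and the helper names are illustrative choices, not values from the issue.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumed local endpoint
MODEL = "Qwen/Qwen2-0.5B-Instruct"


def send_completion(prompt, max_tokens=16):
    """Send one completion request and wait for the full response."""
    payload = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        response.read()


def timed(send, prompt):
    """Return the wall-clock seconds taken by send(prompt)."""
    start = time.perf_counter()
    send(prompt)
    return time.perf_counter() - start


def run_level(prompts, concurrency, send=send_completion):
    """Run all prompts with at most `concurrency` requests in flight and
    return the per-request latencies in seconds."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda p: timed(send, p), prompts))
```

Calling `run_level(prompts, 1)` and then `run_level(prompts, 2)` with the same prompt list gives comparable samples. Running a few warm-up requests before the first level keeps one-time startup costs out of the concurrency=1 numbers.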

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

extent analysis

Fix Plan

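The issue does not establish a root cause, so this plan is a hypothesis to test, not a confirmed fix. The proposal is to change how long-context requests are batched at low concurrency.

Step-by-Step Solution:

  1. Update vLLM configuration: Adjust the batching parameters for long-context requests in the configuration file passed to vllm serve with --config. The plan names this file vllm_config.json; vLLM documents the --config file as YAML, so a .yaml file is the safer choice. Existing scheduler options such as --max-num-batched-tokens, --max-num-seqs, and --enable-chunked-prefill are the documented knobs to try first.
  2. Implement custom batching logic: Create a custom batching function that prioritizes long-context requests at low concurrency. As far as the vLLM documentation shows, there is no batching_function setting, so the example snippet only illustrates the grouping logic and would have to be wired into the scheduler by hand.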

Example code snippet:

# Illustrative sketch only: standalone Python, not a vLLM API.
def custom_batching_function(requests, ctx_size, max_batch_size=2):
    """Group a list of requests into batches (a list of lists)."""
    # A lone long-context request is dispatched immediately as its own batch.
    # The default path below returns the same result for a single request;
    # the branch is kept only to make the intent explicit.
    if ctx_size == 8192 and len(requests) == 1:
        return [list(requests)]
    # Default batching logic: fill batches in arrival order.
    return [
        list(requests[i:i + max_batch_size])
        for i in range(0, len(requests), max_batch_size)
    ]

# Hypothetical configuration: 'batching_function' is not a documented vLLM
# option, so this dict is a placeholder for wherever such a hook would live.
vllm_config = {
    'batching_function': custom_batching_function,
    # Other configuration parameters...
}
  3. Apply the updated configuration: Restart the vLLM server with the updated configuration.

Verification

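To verify the fix, restart the server with the updated configuration, re-run the same ctx=8192 experiment, and measure p95 latency at concurrency=1 and concurrency=2.

Example command (vLLM documents the --config file as YAML, hence the .yaml extension here):

vllm serve Qwen/Qwen2-0.5B-Instruct --host 0.0.0.0 --port 8000 --config vllm_config.yaml

Compare the new p95 values with the original measurements (≈ 30.29s at concurrency=1, ≈ 6.37s at concurrency=2). The inversion is resolved if p95 at concurrency=1 is no longer higher than at concurrency=2.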
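The comparison can be automated with a small check like the one below. It is a sketch that assumes the p95 values have already been collected into a dict keyed by concurrency level; `find_inversions` is a hypothetical helper name.

```python
def find_inversions(p95_by_concurrency):
    """Return (lower, higher) concurrency pairs where the higher level has a
    lower p95 than the level just below it."""
    levels = sorted(p95_by_concurrency)
    return [
        (lower, higher)
        for lower, higher in zip(levels, levels[1:])
        if p95_by_concurrency[higher] < p95_by_concurrency[lower]
    ]


# Values reported in the issue for the ctx=8192 bucket (seconds).
reported = {1: 30.29, 2: 6.37}
print(find_inversions(reported))
```

An empty list after the change means latency no longer drops when moving to the next concurrency level.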

Extra Tips

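  • The reported environment is CPU-only (PyTorch 2.10.0+cpu, CUDA available: False), so monitor CPU and memory usage while adjusting the batching parameters to avoid overloading the system.
  • Consider a batching strategy that takes both context size and concurrency level into account, since the inversion appeared only in the ctx=8192 bucket and not at 512 or 2048.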
