vllm - ✅ (Solved) Fix [Performance]: qwen3.5 vs qwen3 [1 pull request, 9 comments, 6 participants]

fangbaolei · 2026-03-10T09:41:41Z


Issue: vllm-project/vllm#36627 (fetched 2026-04-08 00:35:51)

Timeline: commented ×9, subscribed ×9, mentioned ×2, closed ×1

Fix Action

Fixed

Fixed by PR: [Core] Guard mamba prefill split fragmentation (https://github.com/vllm-project/vllm/pull/36844)

PR fix notes

PR #36844: [Core] Guard mamba prefill split fragmentation

Repository: vllm-project/vllm
Author: yunseoLee0343
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/36844

Description (problem / solution / changelog)

Summary

Prefill fragmentation in `_mamba_block_aligned_split()` may amplify TTFT when hybrid (Mamba / DeltaNet) layers are used.

What this PR does

1. Add scheduler instrumentation
2. Add fragmentation counters
3. Add minimal zero-collapse safeguard
4. Add benchmark script
5. Add unit tests

Instrumentation

The following counters are introduced to observe scheduling behavior:

- `mamba_fragmentation_count`
- `mamba_zero_collapse_count`
- `_scheduler_iteration`

These metrics help track:

- how often prefill chunks are fragmented
- whether alignment collapses to zero
- how many scheduler rounds occur during prefill
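As a rough illustration of how counters like these could be maintained, here is a minimal sketch. The `PrefillSplitStats` class, its `record` method, and the rule that a round counts as fragmented when fewer tokens are scheduled than requested are all assumptions made for this example; none of it is code from the PR.

```python
from dataclasses import dataclass


@dataclass
class PrefillSplitStats:
    """Hypothetical holder for the counters named in the PR description."""

    mamba_fragmentation_count: int = 0
    mamba_zero_collapse_count: int = 0
    scheduler_iteration: int = 0

    def record(self, requested: int, scheduled: int) -> None:
        """Update the counters after one scheduling round of a prefill."""
        self.scheduler_iteration += 1
        if requested > 0 and scheduled == 0:
            # Alignment left nothing to run although tokens were requested.
            self.mamba_zero_collapse_count += 1
        elif scheduled < requested:
            # Assumed definition: the chunk was cut short by alignment.
            self.mamba_fragmentation_count += 1


stats = PrefillSplitStats()
# (requested, scheduled) pairs for three made-up scheduling rounds.
for requested, scheduled in [(2000, 1984), (16, 0), (16, 16)]:
    stats.record(requested, scheduled)
print(stats)
```

A high `mamba_zero_collapse_count` relative to `scheduler_iteration` would point to rounds that made no progress.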

Benchmark

Benchmark script included: `benchmarks/mamba_prefill_fragmentation.py`

Default configuration:

model = Qwen3-5.2B
prompt length ≈ 2000 tokens
batch size = 1

Metrics reported:

- TTFT
- tokens/sec
- scheduler rounds
- fragmentation events
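For reference, the two timing metrics are commonly derived from three timestamps per request. The sketch below assumes that convention; the `RunTiming` type and both helper functions are hypothetical and are not taken from the benchmark script.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timestamps (in seconds) for one request; field names are illustrative."""

    request_sent: float
    first_token: float
    last_token: float
    output_tokens: int


def ttft_seconds(t: RunTiming) -> float:
    """Time to first token: delay between sending and the first output token."""
    return t.first_token - t.request_sent


def decode_tokens_per_sec(t: RunTiming) -> float:
    """Decode throughput, excluding the first token (covered by TTFT)."""
    decode_time = t.last_token - t.first_token
    if decode_time <= 0 or t.output_tokens <= 1:
        return 0.0
    return (t.output_tokens - 1) / decode_time


run = RunTiming(request_sent=0.0, first_token=0.5, last_token=2.5, output_tokens=101)
print(ttft_seconds(run), decode_tokens_per_sec(run))
```

Separating TTFT from decode throughput matters here because the reported slowdown is in the prefill phase.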

Safety

The safeguard only triggers when alignment collapses to zero while the scheduler requested tokens. It prevents no-progress scheduling rounds without changing normal block alignment behavior.
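The idea behind the safeguard can be sketched as follows. This is a simplified model, not the PR's code or vLLM's actual `_mamba_block_aligned_split()`: the function names, the rounding rule, and the block size of 16 are assumptions chosen to show how alignment can collapse to zero and how a guard restores progress.

```python
def block_aligned_split(num_new_tokens: int, num_computed_tokens: int,
                        block_size: int) -> int:
    """Shrink the chunk so it ends on a block boundary (assumed behaviour)."""
    end = num_computed_tokens + num_new_tokens
    aligned_end = (end // block_size) * block_size
    # If the boundary lies at or before the computed tokens, nothing is left.
    return max(aligned_end - num_computed_tokens, 0)


def guarded_split(num_new_tokens: int, num_computed_tokens: int,
                  block_size: int) -> int:
    """Fall back to the unaligned chunk when alignment would schedule nothing."""
    aligned = block_aligned_split(num_new_tokens, num_computed_tokens, block_size)
    if aligned == 0 and num_new_tokens > 0:
        return num_new_tokens  # zero-collapse guard: guarantee progress
    return aligned


# 10 requested tokens with no computed tokens cannot fill a 16-token block.
print(block_aligned_split(10, 0, 16))  # alignment collapses to zero
print(guarded_split(10, 0, 16))        # the guard schedules the 10 tokens
print(guarded_split(40, 0, 16))        # normal case: rounded down to 32
```

In the normal case the guard returns the same value as the aligned split, which matches the description that regular block alignment is unchanged.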

Motivation

Related discussion: https://github.com/vllm-project/vllm/issues/36627

Notes

I currently do not have access to an A10/A100 GPU, so the patch is based on source analysis and local instrumentation. Benchmark validation from others with production GPUs would be appreciated.

Changed files

- `.buildkite/performance-benchmarks/scripts/compare-json-results.py` (modified, +92/-301)
- `benchmarks/mamba_prefill_fragmentation.py` (added, +149/-0)
- `tests/config/test_config_utils.py` (modified, +1/-1)
- `tests/cuda/test_cuda_compatibility_path.py` (modified, +3/-2)
- `tests/models/language/generation/test_gemma.py` (modified, +3/-1)
- `tests/models/multimodal/generation/test_common.py` (modified, +120/-40)
- `tests/models/multimodal/generation/vlm_utils/custom_inputs.py` (modified, +3/-1)
- `tests/models/multimodal/generation/vlm_utils/model_utils.py` (modified, +14/-14)
- `tests/models/multimodal/processing/test_qwen2_5_omni_embed.py` (modified, +4/-6)
- `tests/test_mamba_scheduler_split.py` (added, +41/-0)
- `tests/v1/kv_connector/unit/test_offloading_connector.py` (modified, +27/-29)
- `vllm/benchmarks/datasets.py` (modified, +5/-3)
- `vllm/envs.py` (modified, +76/-75)
- `vllm/model_executor/layers/activation.py` (modified, +11/-9)
- `vllm/model_executor/layers/mamba/ops/mamba_ssm.py` (modified, +3/-2)
- `vllm/model_executor/layers/quantization/kv_cache.py` (modified, +2/-2)
- `vllm/model_executor/model_loader/bitsandbytes_loader.py` (modified, +2/-3)
- `vllm/model_executor/models/whisper_causal.py` (modified, +9/-11)
- `vllm/multimodal/inputs.py` (modified, +5/-3)
- `vllm/third_party/pynvml.py` (modified, +2788/-1702)
- `vllm/v1/core/sched/scheduler.py` (modified, +31/-0)
- `vllm/v1/engine/core.py` (modified, +3/-2)


Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

Why do my performance tests show that Qwen3.5 is no better than Qwen3? TTFT in particular is much slower than with Qwen3.

Environment: A10 24G, Driver Version: 590.48.01, CUDA Version: 13.1

vLLM version: vllm/vllm-openai:nightly

Service startup command:

docker run --rm \
  --name qwen-server \
  --runtime nvidia \
  --gpus device=0 \
  -v /data/cache/huggingface:/root/.cache/huggingface \
  -v /work:/models \
  -p 13003:8000 \
  --ipc=host \
  vllm/vllm-openai:nightly \
  --model /models/Qwen3-8B \
  --served-model-name Qwen3-8B \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching


extent analysis

Fix Plan

This plan tunes the service startup command for the Qwen3.5 model. It does not change the scheduler behavior addressed by PR #36844.

Step-by-Step Solution:

Lower gpu-memory-utilization: Reduce the value from 0.9 to 0.7 to leave memory headroom on the GPU. This also leaves less room for the KV cache.
Keep prefix caching enabled: The original command already passes --enable-prefix-caching, so this flag is unchanged.
Lower max-model-len: Reduce the value from 8192 to 4096 to cut the memory needed per request, at the cost of a shorter maximum context.

Example Code:

# Updated service startup command
docker run --rm \
  --name qwen-server \
  --runtime nvidia \
  --gpus device=0 \
  -v /data/cache/huggingface:/root/.cache/huggingface \
  -v /work:/models \
  -p 13003:8000 \
  --ipc=host \
  vllm/vllm-openai:nightly \
  --model /models/Qwen3.5-8B \
  --served-model-name Qwen3.5-8B \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.7 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching

Verification

Rerun the same performance test against the updated server and compare TTFT and throughput with the earlier results. Watch GPU memory utilization and adjust the gpu-memory-utilization parameter as needed.
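One way to compare TTFT between the two servers is to time the first streamed chunk from the OpenAI-compatible completions endpoint. The sketch below uses only the standard library. The helper names are hypothetical, and the base URL and model name must match your own deployment (the port 13003 and served name come from the startup command in this issue).

```python
import json
import time
import urllib.request


def first_token_time(lines, clock=time.perf_counter):
    """Return the clock reading at the first SSE chunk carrying text, else None."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if choices and choices[0].get("text"):
            return clock()
    return None


def measure_ttft(base_url, model, prompt, max_tokens=16):
    """Send one streaming completion request and return TTFT in seconds."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens, "stream": True}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions", data=body,
        headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        first = first_token_time(response)
    return None if first is None else first - start
```

For example, `measure_ttft("http://localhost:13003", "Qwen3-8B", prompt)` returns the delay for one request. Averaging over several runs with the same prompt against each model gives a fairer comparison, and because prefix caching is enabled, a repeated prompt may be served from cache on later runs.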

Extra Tips

Keep the GPU driver and CUDA versions up to date to ensure compatibility and optimal performance.
Consider placing a load balancer in front of multiple server instances to distribute the workload and avoid a single point of failure.
