vllm - ✅(Solved) Fix [Performance]: qwen3.5 vs qwen3 [1 pull request, 9 comments, 6 participants]

GitHub stats
vllm-project/vllm#36627 · Fetched 2026-04-08 00:35:51
Comments: 9
Participants: 6
Timeline events: 24
Reactions: 3
Timeline (top): commented ×9, subscribed ×9, mentioned ×2, closed ×1

Fix Action

Fixed

PR fix notes

PR #36844: [Core] Guard mamba prefill split fragmentation

Description (problem / solution / changelog)

Summary

Prefill fragmentation in _mamba_block_aligned_split() may amplify TTFT when hybrid (Mamba / DeltaNet) layers are used.

What this PR does

  1. Add scheduler instrumentation
  2. Add fragmentation counters
  3. Add minimal zero-collapse safeguard
  4. Add benchmark script
  5. Add unit tests

Instrumentation

The following counters are introduced to observe scheduling behavior:

  • mamba_fragmentation_count
  • mamba_zero_collapse_count
  • _scheduler_iteration

These metrics help track:

  • how often prefill chunks are fragmented
  • whether alignment collapses to zero
  • how many scheduler rounds occur during prefill
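
As a rough illustration of how counters like these could be maintained, here is a minimal sketch. The class name, the `record` method, and the classification rules are assumptions made for illustration; the PR notes only name the counters, and the actual code in vllm/v1/core/sched/scheduler.py may differ.

```python
from dataclasses import dataclass


@dataclass
class MambaSchedulerCounters:
    """Hypothetical container for the three counters named in the PR notes."""

    mamba_fragmentation_count: int = 0  # chunks shortened by block alignment
    mamba_zero_collapse_count: int = 0  # chunks that alignment reduced to zero
    scheduler_iteration: int = 0        # scheduling decisions observed so far

    def record(self, requested: int, scheduled: int) -> None:
        """Update the counters for one scheduling decision (assumed rules)."""
        self.scheduler_iteration += 1
        if requested > 0 and scheduled == 0:
            self.mamba_zero_collapse_count += 1
        elif 0 < scheduled < requested:
            self.mamba_fragmentation_count += 1


counters = MambaSchedulerCounters()
counters.record(requested=700, scheduled=512)  # shortened: fragmentation
counters.record(requested=100, scheduled=0)    # collapsed to zero
counters.record(requested=512, scheduled=512)  # unchanged
```

Keeping zero collapse separate from ordinary fragmentation makes it possible to tell "prefill was split into more rounds" apart from "a round made no progress at all".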

Benchmark

Benchmark script included: benchmarks/mamba_prefill_fragmentation.py

Default configuration:

model = Qwen3-5.2B
prompt length ≈ 2000 tokens
batch size = 1

Metrics reported:

  • TTFT
  • tokens/sec
  • scheduler rounds
  • fragmentation events
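
For readers who want to reproduce the first two metrics by hand, the sketch below derives TTFT and a decode-phase tokens/sec figure from timestamps. The `PrefillTiming` type and both helper functions are hypothetical, and the benchmark script may define tokens/sec differently (for example, including prefill time).

```python
from dataclasses import dataclass


@dataclass
class PrefillTiming:
    """Hypothetical per-request timestamps, in seconds."""

    request_sent: float
    first_token: float
    last_token: float
    output_tokens: int


def ttft_seconds(t: PrefillTiming) -> float:
    """Time to first token: delay between sending and the first output token."""
    return t.first_token - t.request_sent


def decode_tokens_per_second(t: PrefillTiming) -> float:
    """Output tokens per second after the first token (assumed definition)."""
    decode_time = t.last_token - t.first_token
    if decode_time <= 0 or t.output_tokens <= 1:
        return 0.0
    return (t.output_tokens - 1) / decode_time


# Synthetic timestamps for illustration only, not measurements.
example = PrefillTiming(request_sent=0.0, first_token=0.5,
                        last_token=2.5, output_tokens=101)
```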

Safety

The safeguard only triggers when alignment collapses to zero while the scheduler requested tokens. It prevents no-progress scheduling rounds without changing normal block alignment behavior.
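
The idea can be sketched as follows. This is a simplified stand-in, not the code of _mamba_block_aligned_split(): the function name, its signature, and the choice to fall back to the unaligned request are all assumptions made for illustration.

```python
def block_aligned_split(num_new_tokens: int, block_size: int) -> int:
    """Round a prefill chunk down to a multiple of block_size.

    Illustrative only. If rounding would yield zero tokens even though
    tokens were requested, return the unaligned request so the round
    still makes progress (the fallback choice is an assumption).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    aligned = (num_new_tokens // block_size) * block_size
    if aligned == 0 and num_new_tokens > 0:
        return num_new_tokens  # zero-collapse safeguard
    return aligned
```

With a block size of 512, a request for 700 tokens is still trimmed to 512 as before; only a request smaller than one block, which would otherwise round down to 0, takes the safeguard path.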

Motivation

Related discussion: https://github.com/vllm-project/vllm/issues/36627

Notes

I currently do not have access to an A10/A100 GPU, so the patch is based on source analysis and local instrumentation. Benchmark validation from others with production GPUs would be appreciated.

Changed files

  • .buildkite/performance-benchmarks/scripts/compare-json-results.py (modified, +92/-301)
  • benchmarks/mamba_prefill_fragmentation.py (added, +149/-0)
  • tests/config/test_config_utils.py (modified, +1/-1)
  • tests/cuda/test_cuda_compatibility_path.py (modified, +3/-2)
  • tests/models/language/generation/test_gemma.py (modified, +3/-1)
  • tests/models/multimodal/generation/test_common.py (modified, +120/-40)
  • tests/models/multimodal/generation/vlm_utils/custom_inputs.py (modified, +3/-1)
  • tests/models/multimodal/generation/vlm_utils/model_utils.py (modified, +14/-14)
  • tests/models/multimodal/processing/test_qwen2_5_omni_embed.py (modified, +4/-6)
  • tests/test_mamba_scheduler_split.py (added, +41/-0)
  • tests/v1/kv_connector/unit/test_offloading_connector.py (modified, +27/-29)
  • vllm/benchmarks/datasets.py (modified, +5/-3)
  • vllm/envs.py (modified, +76/-75)
  • vllm/model_executor/layers/activation.py (modified, +11/-9)
  • vllm/model_executor/layers/mamba/ops/mamba_ssm.py (modified, +3/-2)
  • vllm/model_executor/layers/quantization/kv_cache.py (modified, +2/-2)
  • vllm/model_executor/model_loader/bitsandbytes_loader.py (modified, +2/-3)
  • vllm/model_executor/models/whisper_causal.py (modified, +9/-11)
  • vllm/multimodal/inputs.py (modified, +5/-3)
  • vllm/third_party/pynvml.py (modified, +2788/-1702)
  • vllm/v1/core/sched/scheduler.py (modified, +31/-0)
  • vllm/v1/engine/core.py (modified, +3/-2)

Code Example

The output of `python collect_env.py`

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

Why does the actual performance test show that Qwen3.5 is not better than Qwen3, especially with TTFT being much slower than Qwen3?

Environment: A10 24G, Driver Version: 590.48.01, CUDA Version: 13.1

vLLM version: vllm/vllm-openai:nightly

Service startup command:

docker run --rm \
  --name qwen-server \
  --runtime nvidia \
  --gpus device=0 \
  -v /data/cache/huggingface:/root/.cache/huggingface \
  -v /work:/models \
  -p 13003:8000 \
  --ipc=host \
  vllm/vllm-openai:nightly \
  --model /models/Qwen3-8B \
  --served-model-name Qwen3-8B \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching

<img width="840" height="417" alt="Image" src="https://github.com/user-attachments/assets/ae0d2ec0-b173-4cb9-9d8f-6bf7370c0a89" />

Your current environment (if you think it is necessary)

The output of `python collect_env.py`


extent analysis

Fix Plan

To improve performance, we will focus on optimizing the Qwen3.5 model configuration and adjusting the service startup command.

Step-by-Step Solution:

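  1. Update the gpu-memory-utilization parameter: Lower it (the example below uses 0.7 instead of 0.9) to leave headroom and avoid GPU out-of-memory errors.
  2. Keep prefix caching enabled: The original startup command already passes --enable-prefix-caching; it only speeds up requests that share a prompt prefix.
  3. Adjust the max-model-len parameter: A smaller value (4096 instead of 8192) lowers the per-request KV cache requirement at the cost of a shorter maximum context length.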

Example Code:

# Updated service startup command
docker run --rm \
  --name qwen-server \
  --runtime nvidia \
  --gpus device=0 \
  -v /data/cache/huggingface:/root/.cache/huggingface \
  -v /work:/models \
  -p 13003:8000 \
  --ipc=host \
  vllm/vllm-openai:nightly \
  --model /models/Qwen3.5-8B \
  --served-model-name Qwen3.5-8B \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.7 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching

Verification

Rerun the same performance test and compare TTFT and throughput against the previous run to confirm that the updated configuration helps. Check GPU memory utilization and adjust the gpu-memory-utilization parameter as needed.
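
One simple way to compare two runs is the ratio of median TTFT values. The helper below is a hypothetical sketch; the function name and the sample values are illustrative and are not measurements from the issue.

```python
from statistics import median


def median_ttft_ratio(baseline_ttft_s: list[float],
                      candidate_ttft_s: list[float]) -> float:
    """Return median(candidate) / median(baseline); above 1.0 means slower."""
    if not baseline_ttft_s or not candidate_ttft_s:
        raise ValueError("both sample lists must be non-empty")
    return median(candidate_ttft_s) / median(baseline_ttft_s)


# Synthetic placeholder samples in seconds, for illustration only.
synthetic_baseline = [1.0, 2.0, 3.0]
synthetic_candidate = [2.0, 4.0, 6.0]
ratio = median_ttft_ratio(synthetic_baseline, synthetic_candidate)
```

The median is used here instead of the mean so that a single slow warm-up request does not dominate the comparison.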

Extra Tips

  • Regularly update the driver version and CUDA version to ensure compatibility and optimal performance.
  • Consider implementing a load balancer to distribute the workload and prevent single-point failures.
