vllm: [Roadmap] [Draft] vLLM Roadmap Q2 2026

GitHub stats
vllm-project/vllm#39749
Fetched 2026-04-15 06:20:35
Comments: 1
Participants: 2
Timeline: 47
Reactions: 7
Timeline (top): subscribed ×24, mentioned ×17, added_to_project_v2 ×1, comment_deleted ×1


In #32455, we broke down vLLM’s goals into special interest groups (SIGs). Below are each SIG’s area and roadmap. The regular meetings of these SIGs are listed on this public calendar.

Core

Slack Channel: #sig-core
Members: @WoosukKwon @njhill

The team focuses on the vLLM Engine Core, including the Scheduler, KV Cache Manager, Distributed, Model Runner, and KV Connector code paths.

  • Model Runner V2 hardening and making it the default:
    • Expand testing coverage.
    • Support wide-EP out of the box.
  • Continue filling gaps in the Model Runner V2 Design Docs. SIG Core’s current goal is a stable and efficient core that is principled, modular, and clean. This means MRV1 will stay in Q2 to handle long-tail use cases as we enable more use cases for MRV2.
  • KV cache manager rethink for complex KV cache layouts
  • Offloading: CPU offloading, disk offloading, and the overall connector API for this path
  • Address known scheduler issues (excessive preemption, prefill HoL blocking); see Scheduler Items
  • Further process management hardening/simplification
  • Work out auto-tuning / out-of-box performance improvement
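
To make the prefill head-of-line (HoL) blocking item above concrete, here is a toy sketch. It is not vLLM's scheduler: it assumes a fixed per-step token budget shared by requests in arrival order, and an optional per-request cap per step. The function name `prefill_finish_steps` and the parameters `budget` and `max_chunk` are hypothetical and chosen only for illustration.

```python
def prefill_finish_steps(prompt_lens, budget, max_chunk=None):
    """Return the step at which each request finishes its prefill.

    Toy model (an assumption, not vLLM's scheduler): every step can
    process at most `budget` prompt tokens, handed out in arrival order.
    If `max_chunk` is set, no single request may take more than
    `max_chunk` tokens in one step, so later requests get a share.
    """
    if budget <= 0 or (max_chunk is not None and max_chunk <= 0):
        raise ValueError("budget and max_chunk must be positive")
    remaining = list(prompt_lens)
    finish = [None] * len(remaining)
    step = 0
    while any(r > 0 for r in remaining):
        step += 1
        left = budget
        for i, r in enumerate(remaining):  # arrival (FCFS) order
            if r == 0 or left == 0:
                continue
            take = min(r, left)
            if max_chunk is not None:
                take = min(take, max_chunk)
            remaining[i] -= take
            left -= take
            if remaining[i] == 0:
                finish[i] = step
    return finish


# One long prompt followed by two short ones.
prompts = [8000, 100, 100]
uncapped = prefill_finish_steps(prompts, budget=2048)
capped = prefill_finish_steps(prompts, budget=2048, max_chunk=1024)
print(uncapped, capped)
```

In this toy model the short requests wait behind the long prompt when it may consume the whole budget, and finish in the first step once it is capped, at the cost of the long prompt finishing later. Whether a real scheduler should make that trade depends on workload and latency targets.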

Large Scale Serving

Slack Channel: #sig-large-scale-serving
Project board: https://github.com/orgs/vllm-project/projects/47
Members: @tlrmchlsmth

The team focuses on pushing vLLM to the speed of light in disaggregated, wide-EP, and elastic settings on clusters of GB200, B200, and H200. The team is also responsible for interfacing with ecosystem projects such as llm-d and Dynamo, and with the AMD team.

  • Zero cost async EPLB
  • Experimental fault tolerant EP
  • Elastic EP (scale up/down) production ready
  • Bidirectional KV transfers
  • Numerics monitoring/debug harness

Call for experiments/prototypes:

  • Experimental AFD
  • Pipeline parallel optimizations?

Model Performance

Channel: #sig-model-performance
Members: @robertgshaw2-redhat @simon-mo

The team focuses on pure performance and reliability engineering within vLLM. The work involves capturing performance traces, enabling the right set of kernels by default, and continuously monitoring performance. The work also covers monitoring and logging for production stability.

  • Nightly performance evaluation for prioritized models on hardware cluster
    • Models: Kimi K2.5, Qwen 3.5, DeepSeek V3.2, Minimax 2.7, GLM 5.1
    • Hardware: GB200, B300, H200, (maybe) MI355
    • Workload: InferenceX, and bottom-up workloads (bs=1, bs=16, etc.)
  • Weekly progress on performance gaps and trace sharing
  • No accuracy regressions as we turn on performance enhancements by default, verified by a nightly accuracy sweep.

Acceleration

Channel: #sig-quantization
Members: @mgoin @dsikka

The team covers vLLM's quantization support, including native online quantization, LLM Compressor, and external integrations like ModelOpt.

  • Complete the vLLM online quantization refactor so it becomes more flexible, more memory-efficient, and suitable for production/RL workloads.
  • Build on INT8 dynamic per-token KV-cache quantization as an initial foundation for future dynamic KV-cache compression such as per-token FP8, NVFP4, etc.
  • Make quantization backend dispatch deterministic, inspectable, and maintainable by implementing a single-source-of-truth dispatch oracle, explicit backend capability checks, and clearer unsupported-configuration errors.
  • Investigate and integrate efficient transforms/rotations where they improve low-bit quantization accuracy, especially for MXFP4 and sensitive layers such as attention projections.
  • Continue improving weight reloading for RL by reducing memory usage further and supporting reload of already-quantized/swizzled weights.
  • Expand the path toward broader bitwidth kernel support for W{1-8}A{16/8/4}, including integration with humming-kernel.
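
As an illustration of the INT8 dynamic per-token quantization mentioned in the list above, here is a minimal pure-Python sketch of the general technique: each token vector gets its own scale, computed at run time from its largest absolute value. It is an assumption-laden teaching example, not vLLM's kernel or API; the function names are hypothetical, and a real implementation would operate on GPU tensors and may pick scales per head rather than per whole vector.

```python
def quantize_per_token_int8(token):
    """Symmetric INT8 quantization with one dynamic scale per token vector.

    Returns (quantized_values, scale) such that value ~= q * scale.
    """
    amax = max((abs(x) for x in token), default=0.0)
    # An all-zero vector has no range; use a scale of 1.0 to avoid dividing by zero.
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in token]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from INT8 codes and their scale."""
    return [v * scale for v in q]


# Two token vectors with very different magnitudes (made-up values).
kv_rows = [[0.5, -1.0, 0.25, 0.0], [10.0, -2.5, 5.0, 7.5]]
for row in kv_rows:
    q, scale = quantize_per_token_int8(row)
    approx = dequantize_int8(q, scale)
    max_err = max(abs(a - b) for a, b in zip(row, approx))
    print(q, scale, max_err)
```

Because the scale is chosen per token, a token with small values keeps fine resolution even when another token in the same cache has large values. The rounding error per element is bounded by half of that token's scale.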

Torch Compile

Channel: #sig-torch-compile
Members: @ProExpertProg @zou3519

The team focuses on improving performance, portability, and developer productivity via PyTorch compilation integration. Work includes custom compile & fusion passes, vLLM IR for kernel registration, reducing compile time via caching, improving developer UX with torch.compile, and co-development of new torch.compile features.

  • Improve torch.compile compilation times overall.
    • Targeting up to 1.3x cold compile time speedups (with PyTorch 2.12)
    • Reduce warm compile time to <= 2s (i.e., up to a 5x speedup) (with PyTorch 2.12)
    • Add an option to overlap weight loading and compilation (unstable feature for Q2, stable in Q3)
  • Full vLLM IR migration
  • Ship the improved perf dashboard to track compile speedups and break down warm and cold start times.
  • vLLM begins using at least 1 custom Helion kernel by default
  • Support for torch.compile x CUDA streams (in PyTorch 2.12)
  • Support for torch.compile x nvsymmetric memory integration (in PyTorch 2.12)
  • Unwrap wrapped custom ops (MLA, Fused MoE) - exposes more operators to Inductor and custom passes for optimization.
  • Continue enabling more optimizations by default (Inductor partition, attn+quant fusion, Async TP)
  • Support encoder compilation for roughly a quarter of vLLM multimodal models
  • Drive alignment with the OSS community on a backed → unbacked migration plan, and execute initial adoption by enabling X+ models to use unbacked shapes by default.
  • Inductor generates more fusions natively. This may include padding/quant/collective fusions. (PyTorch 2.12)
  • Roll out Inductor PDL where profitable, and improve its implementation
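
The compile-time targets in the list above can be read as simple arithmetic. The sketch below assumes the "<= 2s" and "up to 5x" figures describe the same warm-start workload, which the roadmap does not state explicitly; the 60-second cold baseline is a made-up number used only to show what a 1.3x speedup means.

```python
def implied_baseline_s(target_s: float, speedup: float) -> float:
    """Baseline time that the given speedup would reduce to target_s."""
    return target_s * speedup


def time_after_speedup_s(baseline_s: float, speedup: float) -> float:
    """Time remaining after applying a speedup factor to a baseline."""
    return baseline_s / speedup


# Warm start: a 5x speedup landing at 2 s implies a baseline of up to 10 s.
warm_baseline = implied_baseline_s(2.0, 5.0)

# Cold start: a 1.3x speedup removes about 23% of the compile time.
cold_fraction_saved = 1.0 - 1.0 / 1.3

# Hypothetical 60 s cold compile, for illustration only.
example_cold_after = time_after_speedup_s(60.0, 1.3)

print(warm_baseline, round(cold_fraction_saved, 3), round(example_cold_after, 1))
```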

RL

Channel: #sig-post-training

The team focuses on delivering the best engine features in vLLM for RL rollout, including weight sync, KV cache reset, and ease of modification.

  • Complete modular weight sync milestone 3 #31848
  • Enhancements for, and collaboration with, open-source RL training runs
  • Harden external launcher mode

MultiModality & Omni Modality

Channel: #sig-multi-modality
Members: @ywang96 @DarkLight1337

The team supports the abstractions, model support, and optimizations of multi-modality input.

  • Enhance testing coverage on ViT CUDA graph + torch.compile
  • Turn encoder optimizations from the MLPerf sprint on by default whenever available
  • Make the API more flexible, with less abstraction

On the vLLM-Omni side:

  • Large-scale serving: support a “PD”-like setup for vLLM-Omni, in which individual “stages” can be initialized with different numbers of replicas
  • Support large-scale users of vLLM-Omni

CI, Build, and Release

Channel: #sig-ci
Members: @khluu

The team focuses on developing world-class infrastructure for vLLM’s CI system and ensuring we have a secure and reliable build and release process.

  • Bring time to signal to 30 minutes
  • Model eval coverage for popular models x hardware matrix
  • Automatic test target determination
  • Improving signals with nightly torch
  • More AMD test coverage
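
One way to picture the "automatic test target determination" item above is a mapping from changed file paths to the test suites that cover them, with a conservative fallback for unknown files. The sketch below is only an illustration of that idea: the path patterns, suite names, and the function `select_test_targets` are hypothetical and do not reflect vLLM's repository layout or CI configuration.

```python
from fnmatch import fnmatchcase


def select_test_targets(changed_files, rules, fallback):
    """Pick test targets for a change set.

    `rules` is a list of (glob_pattern, set_of_targets) pairs. A file that
    matches no rule is treated as unknown, and the full `fallback` set is
    returned so that nothing is silently skipped.
    """
    targets = set()
    for path in changed_files:
        matched = False
        for pattern, tests in rules:
            if fnmatchcase(path, pattern):
                targets |= tests
                matched = True
        if not matched:
            return set(fallback)
    return targets


# Hypothetical rules; an empty set means "no tests needed for this path".
RULES = [
    ("src/scheduler/*", {"tests/scheduler"}),
    ("src/quant/*", {"tests/quant", "tests/kernels"}),
    ("docs/*", set()),
]
ALL_TESTS = {"tests"}

print(select_test_targets(["src/scheduler/policy.py", "docs/index.md"], RULES, ALL_TESTS))
print(select_test_targets(["setup.py"], RULES, ALL_TESTS))
```

Falling back to the full suite on an unmatched path trades some CI time for safety; a real system would likely derive the mapping from dependency information rather than hand-written globs.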

On improving release gating signals, going beyond a green build and tests:

  • All CI tests (that are not soft-fail)
  • Separate e2e integration tests into a long-running release test suite
  • Model eval
  • Perf benchmark for regression

This roadmap covers the majority of tracked items. The vLLM team continues to review issues and pull requests and is open to wider collaboration on expanding model, hardware, and optimization coverage. Please feel free to leave feedback and comments, and work directly with the SIG areas for deeper collaboration.

extent analysis

TL;DR

The issue is a roadmap rather than a bug report. For the engine core, the highlighted priorities are Model Runner V2 hardening, expanded testing coverage, and wide-EP support out of the box, which the roadmap ties to a stable and efficient core.

Guidance

  • Review the SIG Core's goals and roadmap to understand the priorities for the vLLM Engine Core, including the Scheduler, KV Cache Manager, and Model Runner.
  • Focus on addressing known scheduler issues, such as excessive preemption and prefill HoL blocking, to improve the overall performance and stability of the engine.
  • Explore the use of torch.compile to improve performance, portability, and developer productivity, particularly in the context of PyTorch compilation integration.
  • Investigate the quantization support, including native online, LLM Compressor, and external integrations like ModelOpt, to enhance the engine's capabilities.

Example

No specific code snippet is provided due to the lack of technical details in the issue body. However, it is recommended to review the Model Runner V2 Design Docs for more information on the design and implementation.

Notes

The provided issue body appears to be a roadmap for the vLLM engine, outlining various special interest groups (SIGs) and their goals. It does not describe a specific problem or issue to be solved. Therefore, the guidance provided is general and focused on understanding the priorities and goals of the vLLM engine.

Recommendation

Because the issue is a roadmap, there is no workaround to apply. Use the SIG Core goals and roadmap to prioritize development of the vLLM Engine Core, which targets the engine's overall performance and stability.
