vllm - [Roadmap] [Draft] vLLM Roadmap Q2 2026 [1 comment, 2 participants]

simon-mo · 2026-04-13T23:29:59Z

In #32455, we broke down vLLM’s goal into various special interest groups (SIGs). Please find below each SIG’s area and roadmap. You can find the regular meetings of these SIGs on [this public calendar](https://zoom-lfx.platform.linuxfoundation.org/meetings/vllm).

### Core

Slack Channel: #sig-core
Members: @WoosukKwon @njhill

The team focuses on the vLLM Engine Core, including the Scheduler, KV Cache Manager, Distributed, Model Runner, and KV Connector code paths.

- [ ] Model Runner V2 hardening and making it the default:
  - [ ] Expand testing coverage
  - [ ] Support wide-EP out of the box
  - [ ] Continue filling gaps: [Model Runner V2 Design Docs](https://docs.google.com/document/d/1gFqtDkcoqhy9j-X0ndshzbhapX1uNey1-wBENwGPI80/edit?tab=t.192m4u5k37xt#heading=h.i84xcin8owkj)

_Currently, SIG Core’s goal is to focus on a stable and efficient core that is principled, modular, and clean. This means MRV1 will stay in Q2 to handle long-tail use cases as we enable more use cases for MRV2._

- [ ] KV cache manager rethink for complex KV cache layouts
- [ ] Offloading: CPU offloading + Disk + overall connector API on this part of the path
- [ ] Address known scheduler issues (avoid excessive preemption, prefill HoL blocking): [Scheduler Items](https://docs.google.com/document/d/1jqbSj5sG38eNW4yUjlAqiAPp7DnS0IJ6hWijTm2yIk8/edit?tab=t.0#heading=h.xyx62nm8mec4)
- [ ] Further process management hardening/simplification
- [ ] Work out auto-tuning / out-of-box performance improvement

### Large Scale Serving

Slack Channel: #sig-large-scale-serving
Project board: https://github.com/orgs/vllm-project/projects/47
Members: @tlrmchlsmth

The team focuses on pushing vLLM to the speed of light in disaggregated, wide-EP, and elastic settings on clusters of GB200, B200, and H200. The team is also responsible for interfacing with ecosystem projects such as llm-d and Dynamo, and with the AMD team.

- [ ] Zero cost async EPLB
- [ ] Experimental fault tolerant EP
- [ ] Elastic EP (scale up/down) production ready
- [ ] Bidirectional KV transfers
- [ ] Numerics monitoring/debug harness

Call for experiments/prototypes:

- [ ] Experimental AFD
- [ ] Pipeline parallel optimizations?

### Model Performance

Channel: #sig-model-performance
Members: @robertgshaw2-redhat @simon-mo

The team focuses on pure performance and reliability engineering within vLLM. The work involves capturing performance traces, enabling the right set of kernels by default, and continuously monitoring them. The work also covers monitoring and logging for production stability.

- [ ] Nightly performance evaluation for prioritized models on hardware cluster
  - Models: Kimi K2.5, Qwen 3.5, DeepSeek V3.2, Minimax 2.7, GLM 5.1
  - Hardware: GB200, B300, H200, (maybe) MI355
  - Workload: InferenceX, and bottom-up workloads (bs=1, bs=16, etc.)
- [ ] Weekly progress on performance gaps and trace sharing
- [ ] No accuracy regression as we turn on performance enhancements by default, checked by a nightly accuracy sweep.

### Acceleration

Channel: [#sig-quantization](https://vllm-dev.slack.com/archives/C07RFT1DVT2)
Members: @mgoin @dsikka

vLLM's quantization support, including native online, LLM Compressor, and external integrations like ModelOpt.

- [ ] Complete the vLLM online quantization refactor so it becomes more flexible, more memory-efficient, and suitable for production/RL workloads.
- [ ] Build on INT8 dynamic per-token KV-cache quantization as an initial foundation for future dynamic KV-cache compression like per-token FP8, NVFP4, etc.
- [ ] Make quantization backend dispatch deterministic, inspectable, and maintainable by implementing a single-source-of-truth dispatch oracle, explicit backend capability checks, and clearer unsupported-configuration errors.
- [ ] Investigate and integrate efficient transforms/rotations where they improve low-bit quantization accuracy, especially for MXFP4 and sensitive layers such as attention projections.
- [ ] Continue improving weight reloading for RL by reducing memory usage further and supporting reload of already-quantized/swizzled weights.
- [ ] Expand the path toward broader bitwidth kernel support for W{1-8}A{16/8/4}, including integration with humming-kernel.
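To make the "INT8 dynamic per-token KV-cache quantization" item concrete, here is a minimal sketch of the general technique in plain Python. It assumes symmetric quantization with one scale per token (row), computed at run time from that row's largest absolute value. The function names are hypothetical and this is not vLLM's implementation or API, which the roadmap does not describe.

```python
from typing import List, Tuple


def quantize_per_token_int8(
    rows: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric dynamic INT8 quantization with one scale per token (row).

    Hypothetical helper for illustration only; not a vLLM function.
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in rows:
        # "Dynamic": the scale is derived from the data itself at run time.
        amax = max((abs(x) for x in row), default=0.0)
        scale = amax / 127.0 if amax > 0 else 1.0
        # Round to nearest and clamp to the symmetric INT8 range.
        q = [max(-127, min(127, round(x / scale))) for x in row]
        q_rows.append(q)
        scales.append(scale)
    return q_rows, scales


def dequantize_per_token_int8(
    q_rows: List[List[int]], scales: List[float]
) -> List[List[float]]:
    """Recover approximate float values from INT8 codes and per-token scales."""
    return [[v * s for v in q] for q, s in zip(q_rows, scales)]


# Two "tokens" with very different magnitudes each get their own scale,
# so the small-valued token is not crushed by the large-valued one.
kv = [[0.5, -1.0, 0.25], [10.0, 0.0, -5.0]]
codes, scales = quantize_per_token_int8(kv)
restored = dequantize_per_token_int8(codes, scales)
```

The per-token scale is the design point: a single tensor-wide scale would be set by the largest token and waste most of the INT8 range on the smaller ones. The same structure carries over to other per-token formats, with a different code range and rounding rule.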


vllm-project/vllm#39749 • Fetched 2026-04-15 06:20:35


Author: simon-mo

Participants: bitbottrap, simon-mo

Timeline (top): subscribed ×24, mentioned ×17, added_to_project_v2 ×1, comment_deleted ×1


extent analysis

TL;DR

For the vLLM engine core, the priorities are completing Model Runner V2 hardening, expanding testing coverage, and supporting wide-EP out of the box, since the roadmap ties these to a stable and efficient core.

Guidance

Review SIG Core's goals and roadmap to understand the priorities for the vLLM Engine Core, including the Scheduler, KV Cache Manager, and Model Runner.
Address the known scheduler issues, namely excessive preemption and prefill head-of-line (HoL) blocking, to improve engine performance and stability.
Explore torch.compile integration to improve performance, portability, and developer productivity.
Investigate the quantization support, including native online quantization, LLM Compressor, and external integrations like ModelOpt.
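The scheduler point above mentions prefill head-of-line (HoL) blocking. As a rough illustration of what that means, the toy model below shares a fixed per-step token budget among queued prompts in arrival order, with an optional cap on how many tokens one request may take per step. Everything here (the function name, the budget, the cap) is a hypothetical simplification for illustration; it is not vLLM's scheduler or its configuration.

```python
from typing import List, Optional


def prefill_finish_steps(
    prompt_lens: List[int],
    budget: int,
    per_request_cap: Optional[int] = None,
) -> List[int]:
    """Return the step at which each request finishes prefill.

    Toy model: each step hands out `budget` tokens in arrival order.
    With `per_request_cap` set, no request may take more than that many
    tokens per step, leaving room for the requests queued behind it.
    Both values are assumed to be positive integers.
    """
    remaining = list(prompt_lens)
    finish = [0] * len(remaining)
    step = 0
    while any(r > 0 for r in remaining):
        step += 1
        left = budget
        for i, r in enumerate(remaining):
            if r == 0:
                continue  # this request already finished prefill
            if left == 0:
                break  # budget for this step is exhausted
            take = min(r, left)
            if per_request_cap is not None:
                take = min(take, per_request_cap)
            remaining[i] -= take
            left -= take
            if remaining[i] == 0:
                finish[i] = step
    return finish


# One long prompt ahead of two short ones, 2048 tokens per step.
blocked = prefill_finish_steps([8000, 100, 100], budget=2048)
# → [4, 4, 5]: the short prompts wait behind the long one.
capped = prefill_finish_steps([8000, 100, 100], budget=2048, per_request_cap=1024)
# → [8, 1, 1]: the short prompts finish at once, the long one takes longer.
```

The two results show the trade-off in this toy model: capping the head request removes the blocking for short prompts, but the long prompt needs twice as many steps to finish.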

Example

The issue body contains no code and too little technical detail to derive a snippet from. For design and implementation details, see the Model Runner V2 Design Docs.

Notes

The provided issue body appears to be a roadmap for the vLLM engine, outlining various special interest groups (SIGs) and their goals. It does not describe a specific problem or issue to be solved. Therefore, the guidance provided is general and focused on understanding the priorities and goals of the vLLM engine.

Recommendation

Because this issue is a roadmap and not a bug report, there is no workaround to apply. Use SIG Core's goals and roadmap as the priority list for work on the vLLM Engine Core, starting with the stability and performance items.


#api #optimization #configuration error #parallel task #integration issue
