vllm - 💡(How to fix) Fix [Feature]: Polymorphic buffer management for V1 worker (CPU/GPU staged tensors, lower hot-path overhead) [1 participants]

masterFoad · 2026-05-04T08:08:42Z


vllm-project/vllm#41615 • Fetched 2026-05-05 05:44:40

This proposes a polymorphic buffer management path for the V1 worker so internal staged tensors can be handled consistently across CPU and GPU, while reducing hot-path overhead from transient buffer work.

Prototype branch: masterFoad/vllm:pr-polymorphic-buffers-clean
Branch URL: https://github.com/masterFoad/vllm/tree/pr-polymorphic-buffers-clean
Head commit: 31e0aa38c

🚀 The feature, motivation and pitch

Motivation

The current buffer/staging flow can incur avoidable overhead in decode and input-prep paths due to temporary allocations and host-side staging patterns.

A unified buffer abstraction should improve maintainability and reduce per-step overhead in performance-critical paths.

Proposed Change

- Introduce `DeviceMemoryManager` to centralize staged tensor allocation strategy.
- Use polymorphic staged-write tensors via `.device_tensor` instead of GPU-only assumptions.
- Keep CPU-compatible and CUDA-compatible staged write paths under one abstraction.
- Refactor V1 worker callsites to use the new staged buffer flow.
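
To make the shape of this proposal concrete, here is a minimal sketch of a staged-write buffer with one interface and two backends. It is not the prototype's code: the issue does not show the actual `DeviceMemoryManager` interface, so `ToyDeviceMemoryManager`, `StagedBuffer`, `stage_write`, and `flush` are hypothetical names, and plain `array` objects stand in for tensors so the example runs without PyTorch or a GPU. Only the `.device_tensor` accessor is taken from the issue text.

```python
from abc import ABC, abstractmethod
from array import array


class StagedBuffer(ABC):
    """Hypothetical staged-write buffer: write on the host, read via .device_tensor."""

    def __init__(self, size: int) -> None:
        # Allocated once and reused every step, so the hot path never allocates.
        self.host = array("q", [0] * size)

    def stage_write(self, offset: int, values: list[int]) -> None:
        """Copy values into the preallocated host staging area."""
        if offset < 0 or offset + len(values) > len(self.host):
            raise IndexError("staged write out of bounds")
        self.host[offset:offset + len(values)] = array("q", values)

    @abstractmethod
    def flush(self) -> None:
        """Make staged writes visible through .device_tensor."""

    @property
    @abstractmethod
    def device_tensor(self) -> array:
        """Storage that consumers read, whichever backend is in use."""


class CpuStagedBuffer(StagedBuffer):
    """CPU backend: the staging area is the device storage, so no copy is needed."""

    def flush(self) -> None:
        pass

    @property
    def device_tensor(self) -> array:
        return self.host


class SimulatedDeviceStagedBuffer(StagedBuffer):
    """Stand-in for a CUDA backend: separate storage, filled by an explicit copy."""

    def __init__(self, size: int) -> None:
        super().__init__(size)
        self._device = array("q", [0] * size)

    def flush(self) -> None:
        # A real backend would issue a host-to-device copy here (assumption).
        self._device[:] = self.host

    @property
    def device_tensor(self) -> array:
        return self._device


class ToyDeviceMemoryManager:
    """Hypothetical manager that centralizes the choice of backend."""

    def __init__(self, device: str) -> None:
        self.device = device

    def allocate(self, size: int) -> StagedBuffer:
        if self.device == "cpu":
            return CpuStagedBuffer(size)
        return SimulatedDeviceStagedBuffer(size)


def run_step(buf: StagedBuffer, token_ids: list[int]) -> list[int]:
    """Caller code is identical for both backends."""
    buf.stage_write(0, token_ids)
    buf.flush()
    return list(buf.device_tensor[:len(token_ids)])
```

In this sketch the call site (`run_step`) never branches on the device; the manager picks the backend once at allocation time, which is the property the proposal describes as keeping CPU and CUDA staged-write paths under one abstraction.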

Scope in Prototype Branch

Representative files touched:

- `vllm/v1/worker/gpu/buffer_utils.py`
- `vllm/v1/worker/gpu/model_runner.py`
- `vllm/v1/worker/gpu/states.py`
- `vllm/v1/worker/gpu/block_table.py`
- `vllm/v1/worker/gpu/model_states/default.py`
- `vllm/v1/worker/gpu/sample/bad_words.py`
- `vllm/v1/worker/gpu/sample/logit_bias.py`
- `vllm/v1/worker/gpu/sample/penalties.py`
- `vllm/v1/worker/gpu/sample/sampler.py`
- `vllm/v1/worker/gpu/sample/states.py`
- `vllm/v1/worker/gpu/spec_decode/rejection_sampler.py`

Preliminary Local Result

Local setup:

- OS: WSL Ubuntu 24.04
- GPU: RTX 4060 Laptop GPU, 8GB
- PyTorch: `2.11.0+cu129`
- vLLM: `0.7.3`
- Model: `Qwen2.5-1.5B-Instruct`

Throughput result:

| Branch | Throughput |
|---|---:|
| `main` | 4.37 req/s |
| `pr-polymorphic-buffers-clean` | 4.78 req/s |

Delta: +9.4%

These are prototype/local measurements and are shared only to indicate potential.
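
The reported delta can be reproduced from the two throughput figures with a simple relative-change calculation. The helper name `relative_delta` is only illustrative.

```python
def relative_delta(baseline: float, candidate: float) -> float:
    """Return the percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# Figures from the throughput result above: main vs. prototype branch.
delta = relative_delta(4.37, 4.78)
print(f"{delta:+.1f}%")  # prints "+9.4%"
```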

Request for Feedback

If this direction is useful, I can split the work into a smaller reviewable sequence:

1. Memory abstraction groundwork.
2. Targeted hot-path integrations.
3. Focused validation and benchmark reporting.

extent analysis

TL;DR

The issue proposes a polymorphic buffer management path for the V1 worker to reduce hot-path overhead; a local prototype measured +9.4% throughput over `main`.

Guidance

Review the prototype branch pr-polymorphic-buffers-clean and its associated files to understand the proposed changes.
Consider splitting the work into smaller, reviewable sequences as suggested: memory abstraction groundwork, targeted hot-path integrations, and focused validation and benchmark reporting.
Verify the performance improvements by running local benchmarks and comparing the results with the main branch.
Evaluate the feasibility of integrating the proposed changes into the main codebase, considering factors such as maintainability and compatibility.
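
For the benchmark comparison suggested above, a small timing harness is enough to get a first requests-per-second figure on each branch. The issue does not say how its numbers were measured, so this is only a sketch under assumptions: `measure_throughput` and `handle_request` are hypothetical names, the callable is whatever issues one request against the build under test, and requests run sequentially, which will understate throughput for a server that batches concurrent requests.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    handle_request: Callable[[str], object],
    prompts: Sequence[str],
    warmup: int = 1,
) -> float:
    """Return requests per second for sequential calls to handle_request."""
    # Warm-up calls are excluded from timing so one-time setup does not skew results.
    for prompt in prompts[:warmup]:
        handle_request(prompt)

    start = time.perf_counter()
    for prompt in prompts:
        handle_request(prompt)
    elapsed = time.perf_counter() - start

    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")
    return len(prompts) / elapsed


if __name__ == "__main__":
    # Placeholder workload; replace with a call into the build under test.
    def fake_request(prompt: str) -> str:
        time.sleep(0.001)
        return prompt.upper()

    print(f"{measure_throughput(fake_request, ['hello'] * 20):.2f} req/s")
```

Running the same harness with the same prompts on `main` and on the prototype branch, several times each, gives a like-for-like comparison; a single run on one machine is not enough to confirm the reported gain.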

Notes

The proposed changes aim to improve performance by reducing overhead in decode and input-prep paths. However, the effectiveness of these changes may depend on specific use cases and hardware configurations.

Recommendation

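Consider adopting the proposed polymorphic buffer management path. The preliminary local result shows a +9.4% throughput gain (from 4.37 to 4.78 req/s), but it comes from a single laptop-GPU setup and the author shares it only to indicate potential, so it should be validated independently before merging.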
