vllm: [Feature]: Polymorphic buffer management for V1 worker (CPU/GPU staged tensors, lower hot-path overhead)

GitHub stats
vllm-project/vllm#41615 (fetched 2026-05-05 05:44:40)
Comments: 0
Participants: 1
Timeline: 1 (labeled ×1)
Reactions: 0

This proposes a polymorphic buffer management path for the V1 worker so internal staged tensors can be handled consistently across CPU and GPU, while reducing hot-path overhead from transient buffer work.

Prototype branch: masterFoad/vllm:pr-polymorphic-buffers-clean
Branch URL: https://github.com/masterFoad/vllm/tree/pr-polymorphic-buffers-clean
Head commit: 31e0aa38c

🚀 The feature, motivation and pitch

Motivation

The current buffer/staging flow can incur avoidable overhead in decode and input-prep paths due to temporary allocations and host-side staging patterns.

A unified buffer abstraction should improve maintainability and reduce per-step overhead in performance-critical paths.

Proposed Change

  • Introduce DeviceMemoryManager to centralize staged tensor allocation strategy.
  • Use polymorphic staged-write tensors via .device_tensor instead of GPU-only assumptions.
  • Keep CPU-compatible and CUDA-compatible staged write paths under one abstraction.
  • Refactor V1 worker callsites to use the new staged buffer flow.
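
The bullets above describe one interface with device-specific behaviour behind it. A minimal sketch of that shape follows. It is not the code from the prototype branch: only the names DeviceMemoryManager and .device_tensor come from the issue, and everything else (the class names, stage_write, flush, get, and the use of plain Python lists as stand-ins for tensors) is a hypothetical illustration.

```python
from typing import Protocol


class StagedBuffer(Protocol):
    """Hypothetical interface shared by every staged buffer."""

    device_tensor: list[int]

    def stage_write(self, start: int, values: list[int]) -> None: ...

    def flush(self) -> None: ...


class CpuStagedBuffer:
    """CPU case: the staging area and the device tensor are the same storage."""

    def __init__(self, size: int) -> None:
        self.device_tensor = [0] * size

    def stage_write(self, start: int, values: list[int]) -> None:
        if start < 0 or start + len(values) > len(self.device_tensor):
            raise IndexError("staged write out of bounds")
        self.device_tensor[start:start + len(values)] = values

    def flush(self) -> None:
        pass  # writes already landed in device_tensor, nothing to copy


class CudaLikeStagedBuffer:
    """GPU-like case: writes land in a preallocated host staging area and
    are copied into the device tensor only when flush() is called."""

    def __init__(self, size: int) -> None:
        self._staging = [0] * size        # stand-in for pinned host memory
        self.device_tensor = [0] * size   # stand-in for a CUDA tensor
        self._dirty: list[tuple[int, int]] = []

    def stage_write(self, start: int, values: list[int]) -> None:
        if start < 0 or start + len(values) > len(self._staging):
            raise IndexError("staged write out of bounds")
        self._staging[start:start + len(values)] = values
        self._dirty.append((start, start + len(values)))

    def flush(self) -> None:
        # Copy only the ranges touched since the last flush.
        for lo, hi in self._dirty:
            self.device_tensor[lo:hi] = self._staging[lo:hi]
        self._dirty.clear()


class DeviceMemoryManager:
    """Chooses the buffer strategy once and reuses buffers by name, so the
    per-step path does not allocate new temporaries."""

    def __init__(self, device: str) -> None:
        self.device = device
        self._buffers: dict[str, StagedBuffer] = {}

    def get(self, name: str, size: int) -> StagedBuffer:
        buf = self._buffers.get(name)
        if buf is None:
            cls = CpuStagedBuffer if self.device == "cpu" else CudaLikeStagedBuffer
            buf = cls(size)
            self._buffers[name] = buf
        return buf


def prepare_step(manager: DeviceMemoryManager) -> list[int]:
    """Call site that is identical for CPU and GPU-like devices."""
    buf = manager.get("input_ids", 8)
    buf.stage_write(2, [11, 12, 13])
    buf.flush()
    return buf.device_tensor
```

In this sketch the call site only ever touches stage_write, flush, and .device_tensor, which is the property the proposal is after: the worker code does not branch on device type, and repeated calls to get with the same name return the same preallocated buffer.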

Scope in Prototype Branch

Representative files touched:

  • vllm/v1/worker/gpu/buffer_utils.py
  • vllm/v1/worker/gpu/model_runner.py
  • vllm/v1/worker/gpu/states.py
  • vllm/v1/worker/gpu/block_table.py
  • vllm/v1/worker/gpu/model_states/default.py
  • vllm/v1/worker/gpu/sample/bad_words.py
  • vllm/v1/worker/gpu/sample/logit_bias.py
  • vllm/v1/worker/gpu/sample/penalties.py
  • vllm/v1/worker/gpu/sample/sampler.py
  • vllm/v1/worker/gpu/sample/states.py
  • vllm/v1/worker/gpu/spec_decode/rejection_sampler.py

Preliminary Local Result

Local setup:

  • OS: WSL Ubuntu 24.04
  • GPU: RTX 4060 Laptop GPU, 8GB
  • PyTorch: 2.11.0+cu129
  • vLLM: 0.7.3
  • Model: Qwen2.5-1.5B-Instruct

Throughput result:

Branch                          Throughput
main                            4.37 req/s
pr-polymorphic-buffers-clean    4.78 req/s

Delta: +9.4%
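
For reference, the delta is the relative change between the two reported throughput figures. A small check, using only the numbers reported in this issue (the function name is illustrative):

```python
def relative_delta_percent(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Throughput figures reported in the issue, in req/s.
delta = relative_delta_percent(4.37, 4.78)
print(f"{delta:+.1f}%")
```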

These are prototype/local measurements and are shared only to indicate potential.

Request for Feedback

If this direction is useful, I can split the work into a smaller reviewable sequence:

  1. Memory abstraction groundwork.
  2. Targeted hot-path integrations.
  3. Focused validation and benchmark reporting.

extent analysis

TL;DR

Implement the proposed polymorphic buffer management path to reduce hot-path overhead and improve performance.

Guidance

  • Review the prototype branch pr-polymorphic-buffers-clean and its associated files to understand the proposed changes.
  • Consider splitting the work into smaller, reviewable sequences as suggested: memory abstraction groundwork, targeted hot-path integrations, and focused validation and benchmark reporting.
  • Verify the performance improvements by running local benchmarks and comparing the results with the main branch.
  • Evaluate the feasibility of integrating the proposed changes into the main codebase, considering factors such as maintainability and compatibility.

Notes

The proposed changes aim to improve performance by reducing overhead in decode and input-prep paths. However, the effectiveness of these changes may depend on specific use cases and hardware configurations.

Recommendation

Consider adopting the proposed polymorphic buffer management path. This is a feature proposal rather than a workaround, and the +9.4% throughput improvement comes from a single preliminary local measurement, so it should be reproduced before it is treated as representative.
