vllm - ✅(Solved) Fix [Doc]: comprehensive rewrite of disaggregated prefilling (PD) documentation [1 pull requests, 1 participants]

neweyes · 2026-03-30T07:50:41Z




GitHub stats

vllm-project/vllm#38524•Fetched 2026-04-08 01:53:32


Author: neweyes
Participants: neweyes
Assignees: neweyes
Timeline (top): cross-referenced ×2, assigned ×1, labeled ×1

Fix Action

Fixed

Fixed by PR: [Docs]: comprehensive rewrite of disaggregated prefilling (PD) documentation (https://github.com/vllm-project/vllm/pull/38525)

PR fix notes

PR #38525: [Docs]: comprehensive rewrite of disaggregated prefilling (PD) documentation

Repository: vllm-project/vllm
Author: neweyes
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38525

Description (problem / solution / changelog)

Closes #38524

Background

The existing disaggregated prefilling documentation is mostly example-oriented and lacks a clear explanation of system design, deployment requirements, and production usage. Users may find it difficult to:

- Understand when to use PD disaggregation
- Deploy Prefill and Decode instances correctly
- Choose appropriate KV cache connectors
- Reason about latency trade-offs (TTFT vs ITL)

What’s Changed

This PR introduces a comprehensive rewrite of the disagg_prefill.md documentation, covering:

1. Concept & Motivation

- Clear explanation of Prefill–Decode (PD) disaggregation
- When to use it (TTFT, ITL, tail latency control)
- Explicit clarification that it optimizes latency, not throughput
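
To make the two latency terms concrete, here is a minimal sketch that derives TTFT (time to first token) and mean ITL (inter-token latency) from token arrival timestamps. The function name and the sample timestamps are illustrative assumptions, not taken from the PR or from vLLM.

```python
from statistics import mean


def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT and mean ITL (both in seconds) for one request.

    request_start: time the request was sent.
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay until the first token arrives (dominated by prefill).
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive tokens (dominated by decode steps).
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = mean(gaps) if gaps else 0.0
    return {"ttft": ttft, "mean_itl": mean_itl}


# Hypothetical timestamps: first token at 0.5 s, then one token every 0.25 s.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(metrics)
```

Separating prefill and decode lets the two numbers be tuned independently, which is why the PR frames the feature as a latency optimization rather than a throughput one.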

2. Architecture & Workflow

- End-to-end pipeline: Request → Prefill → KV Transfer → Decode
- Detailed explanation of:
  - Request routing
  - KV cache generation and transfer
  - Decode execution
- Improved diagrams and structure
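
The request-routing step can be pictured with a toy router that pairs each request with one prefill instance and one decode instance. This is only a sketch: `ToyRouter`, the round-robin policy, and the instance names are hypothetical and do not describe vLLM's actual proxy or scheduler.

```python
from itertools import cycle


class ToyRouter:
    """Hypothetical router: assigns each request one prefill and one decode instance."""

    def __init__(self, prefill_pool: list[str], decode_pool: list[str]):
        # Round-robin is an assumption made for illustration only.
        self._prefill = cycle(prefill_pool)
        self._decode = cycle(decode_pool)

    def plan(self, request_id: str) -> list[tuple[str, str]]:
        prefill, decode = next(self._prefill), next(self._decode)
        # Stages mirror the pipeline: Request -> Prefill -> KV Transfer -> Decode.
        return [
            ("request", request_id),
            ("prefill", prefill),
            ("kv_transfer", f"{prefill}->{decode}"),
            ("decode", decode),
        ]


router = ToyRouter(["prefill-0", "prefill-1"], ["decode-0"])
plans = [router.plan(f"req-{i}") for i in range(3)]
for plan in plans:
    print(plan)
```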

3. Deployment Guidance

- Environment and hardware requirements
- Deployment prerequisites (multi-instance, routing layer)
- Network requirements (RDMA / TCP considerations)
- Recommendations for production setups (Kubernetes, monitoring)
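
A rough way to reason about the network requirement is to estimate how large one request's KV cache is and how long it would take to move over a given link. The sketch below uses the standard size formula (keys plus values, per layer, per KV head, per token). The model shape and link speeds are illustrative assumptions, and the estimate ignores protocol overhead and any overlap of transfer with compute.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, dtype_bytes: int = 2) -> int:
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * dtype_bytes


def transfer_seconds(num_bytes: int, link_gbit_per_s: float) -> float:
    """Idealized transfer time over a link of the given raw bandwidth."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values,
# and a 4096-token prompt.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=4096)
print(f"KV cache: {size / 2**20:.0f} MiB")
for gbit in (10, 100):
    print(f"{gbit} Gbit/s link: {transfer_seconds(size, gbit) * 1000:.1f} ms")
```

Under these assumptions the transfer time is added on top of prefill before the first decoded token, so link bandwidth feeds directly into TTFT.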

4. Connectors & Usage

- Detailed introduction of all supported connectors
- Connector comparison table
- Design considerations:
  - latency vs throughput
  - network capabilities
  - memory constraints
  - system complexity
- Advanced usage with MultiConnector

5. Developer-Oriented Section

- Internal abstractions:
  - Connector
  - LookupBuffer
  - Pipe
- High-level architecture and workflow
- Guidance for third-party connector implementations
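
One way to picture how the three abstractions could layer is the toy, in-process sketch below: a pipe moves items in one direction, a lookup buffer adds keyed retrieval on top of it, and a connector exposes send/receive to the two instances. All class and method names here are hypothetical simplifications chosen for illustration. They are not vLLM's real interfaces.

```python
from collections import deque


class ToyPipe:
    """One-directional FIFO channel (stand-in for a tensor transport)."""

    def __init__(self):
        self._queue = deque()

    def send(self, item):
        self._queue.append(item)

    def recv(self):
        return self._queue.popleft()

    def empty(self) -> bool:
        return not self._queue


class ToyLookupBuffer:
    """Keyed store on top of a pipe: the producer inserts, the consumer selects and removes."""

    def __init__(self, pipe: ToyPipe):
        self._pipe = pipe
        self._entries = {}

    def insert(self, key, kv_cache):
        self._pipe.send((key, kv_cache))

    def drop_select(self, key):
        # Drain whatever has arrived, then hand out (and forget) the requested entry.
        while not self._pipe.empty():
            arrived_key, kv_cache = self._pipe.recv()
            self._entries[arrived_key] = kv_cache
        return self._entries.pop(key, None)


class ToyConnector:
    """Facade used by the prefill side (send_kv) and the decode side (recv_kv)."""

    def __init__(self):
        self._buffer = ToyLookupBuffer(ToyPipe())

    def send_kv(self, request_id, kv_cache):
        self._buffer.insert(request_id, kv_cache)

    def recv_kv(self, request_id):
        return self._buffer.drop_select(request_id)


connector = ToyConnector()
connector.send_kv("req-a", [1, 2])
connector.send_kv("req-b", [3, 4, 5])
first = connector.recv_kv("req-b")   # out-of-order lookup by request id
second = connector.recv_kv("req-b")  # already consumed
third = connector.recv_kv("req-a")
print(first, second, third)
```

A third-party connector would replace the in-process queue with a real transport while keeping the keyed, consume-once semantics visible to the decode side.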

Why

To make disaggregated prefilling:

- Easier to understand
- Easier to use

Impact

- Documentation only (no code changes)
- Improves usability and onboarding for PD disaggregation
- Bridges the gap between examples and real-world deployments

Notes

This PR focuses on documentation completeness and structure. Future improvements may include:

- Performance tuning guidelines
- More production deployment examples
- Additional diagrams

Changed files

- docs/assets/features/disagg_prefill/work_steps.png (added, +0/-0)
- docs/features/disagg_prefill.md (modified, +324/-48)


📚 The doc issue

Proposed Improvement

Enhance the documentation to include:

- Conceptual explanation and usage scenarios
- End-to-end workflow (Prefill → KV transfer → Decode)
- Deployment prerequisites and best practices
- Detailed connector documentation and comparison
- Design considerations for different environments
- Internal architecture (Connector, LookupBuffer, Pipe)

Expected Outcome

- Better onboarding experience
- Easier production adoption
- Clearer understanding of system design
- Improved extensibility for contributors

Suggest a potential alternative/fix

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix: Enhance Documentation

To address the issue, we will enhance the documentation with the proposed improvements.

Fix Plan

- Update the documentation to include:
  - Conceptual explanation and usage scenarios
  - End-to-end workflow (Prefill → KV transfer → Decode)
  - Deployment prerequisites and best practices
  - Detailed connector documentation and comparison
  - Design considerations for different environments
  - Internal architecture (Connector, LookupBuffer, Pipe)
- Use clear headings and concise language
- Add code snippets and examples where relevant

Example Code Snippet

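### End-to-End Workflow
After a request is routed, the end-to-end workflow consists of the following steps:
1. Prefill: the prefill instance processes the prompt and generates its KV cache
2. KV transfer: a connector moves that KV cache from the prefill instance to the decode instance
3. Decode: the decode instance generates output tokens using the transferred KV cache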

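### Example Use Case
```python
# Illustrative sketch only: these functions are hypothetical stand-ins, not vLLM APIs.
from queue import Queue

def prefill(prompt: str) -> list[str]:
    """Process the prompt and return a stand-in for its KV cache."""
    return prompt.split()

def decode(kv_cache: list[str], max_new_tokens: int = 3) -> list[str]:
    """Emit placeholder tokens, one per step, continuing from the cached context."""
    return [f"tok{len(kv_cache) + i}" for i in range(max_new_tokens)]

channel: Queue = Queue()                 # stand-in for a KV connector

channel.put(prefill("hello PD world"))   # Prefill, then KV transfer (send)
kv_cache = channel.get()                 # KV transfer (receive)
print(decode(kv_cache))                  # Decode
```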

Verification

- Review the updated documentation for completeness and clarity
- Test the end-to-end workflow with example code snippets

Extra Tips

- Use a consistent tone and style throughout the documentation
- Include diagrams and illustrations to help explain complex concepts
- Regularly review and update the documentation to ensure it remains accurate and relevant
