vllm - 💡(How to fix) Fix [Usage]: How to use the vLLM framework to perform inference testing with Prefill-Decode (PD) Separation for the DeepSeek-R1 NVFP4 model across multiple GB300 server nodes (N Prefill nodes + M Decode nodes)? [1 participants]

vllm, 2026-03-18 13:42:34

vllm-project/vllm#37437•Fetched 2026-04-08 00:58:40

Author

Alan-D-Chen

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

Dear vLLM Team,

Pardon me for troubling you, and thank you so much for your great work on the vLLM engine.

I am currently trying to run inference with Prefill-Decode Disaggregation for the DeepSeek-R1 NVFP4 model across a cluster of GB300 servers. My goal is to deploy a fully disaggregated setup: N dedicated Prefill nodes + M dedicated Decode nodes across multiple physical machines.

However, I am facing the following challenges:

I am using a GB300-optimized vLLM 0.11.0+custom build, which does not support high-level PD disaggregation CLI arguments such as --separate-prefill-decode, --prefill-node-ips, --decode-node-ips, --role, etc. These flags return "unrecognized arguments".
I have tried using basic distributed arguments (--node-rank, --master-addr, --nnodes) to simulate Prefill/Decode splitting, but I cannot achieve real, strict Prefill-Decode Disaggregation where the two stages are fully isolated on separate nodes.

My environment: multiple GB300 nodes with InfiniBand, Docker with --network=host, and the same DeepSeek-R1 NVFP4 model accessible across all nodes.

I would really appreciate your guidance on:

How to properly enable true Prefill-Decode Disaggregation across N Prefill nodes and M Decode nodes in vLLM, even for versions without the high-level PD flags.
The correct distributed configuration (tensor parallel, distributed init, KV cache settings) for GB300 and DeepSeek-R1 NVFP4.
How to verify that Prefill and Decode workloads are actually running on their dedicated nodes.

Thank you very much for your time and help.

Best regards,

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

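To achieve Prefill-Decode Disaggregation in vLLM without the high-level CLI arguments listed in the question, run the prefill and decode stages as two separately launched groups and configure the distributed settings of each group by hand. Note that upstream vLLM exposes disaggregated prefill through the --kv-transfer-config option rather than through flags such as --separate-prefill-decode or --role; run vllm serve --help on the custom 0.11.0 build to check whether that option is available. Process-group settings only isolate the two groups from each other; the KV cache produced by prefill still has to be handed to decode by such a transfer mechanism.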
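
As a rough illustration of how the two roles could be expressed on a build that does expose upstream vLLM's --kv-transfer-config option, the sketch below assembles one vllm serve command per role. The connector name NixlConnector, the role assignment (kv_producer for prefill, kv_consumer for decode), the ports, the tensor-parallel size of 4 and the <model_path> placeholder are all assumptions to check against the documentation of the build in use; some connectors expect kv_both on both sides.

```python
import json
import shlex

# Role names accepted in the kv_role field of --kv-transfer-config.
VALID_ROLES = {"kv_producer", "kv_consumer", "kv_both"}


def serve_command(model, role, port, tp_size, connector="NixlConnector"):
    """Build a `vllm serve` argv list for one prefill or decode instance.

    `connector` is an assumption; use whichever connector the build supports.
    """
    if role not in VALID_ROLES:
        raise ValueError(f"unknown kv_role: {role}")
    kv_cfg = {"kv_connector": connector, "kv_role": role}
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--tensor-parallel-size", str(tp_size),
        "--kv-transfer-config", json.dumps(kv_cfg),
    ]


# Hypothetical values: model path, ports and TP size are placeholders.
prefill_cmd = serve_command("<model_path>", "kv_producer", port=8100, tp_size=4)
decode_cmd = serve_command("<model_path>", "kv_consumer", port=8200, tp_size=4)

if __name__ == "__main__":
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(prefill_cmd))
    print(shlex.join(decode_cmd))
```

A router in front of the two instances is still needed to send each request first to a prefill instance and then to a decode instance; that part is not shown here.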

Step 1: Modify Configuration

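Decide on the following settings for each group. The names below are descriptive labels, not literal vLLM option names:

distributed_init: each node joins the process group of its own stage (prefill or decode)
tensor_parallel: shard the model across the GPUs of a node; in vLLM this is set with --tensor-parallel-size
kv_cache_settings: configure according to your GB300 and DeepSeek-R1 NVFP4 model requirements, for example with --gpu-memory-utilization and --kv-cache-dtype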

Step 2: Environment Variables

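Set or derive the following values on every node. NODE_RANK and NNODES are passed to the launcher, which in turn exports RANK, LOCAL_RANK and WORLD_SIZE to each process:

NODE_RANK: unique rank for each node within its group, from 0 to NNODES - 1
MASTER_ADDR: address of the group's rank-0 (master) node, reachable from every node in the group
MASTER_PORT: a free port on the master node; use a different port for the prefill group and the decode group
NNODES: total number of nodes in the group (N for prefill, M for decode)
WORLD_SIZE: total number of processes in the group (equal to NNODES * number of GPUs per node when one process drives each GPU)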
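
The relationship between these values can be checked with a few lines of arithmetic. The sketch below assumes one process per GPU and the usual node-major rank layout; the function names and the example of 2 nodes with 4 GPUs each are illustrative only.

```python
def world_size(nnodes: int, gpus_per_node: int) -> int:
    """Total number of processes in one group, assuming one process per GPU."""
    return nnodes * gpus_per_node


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of a process under node-major ordering."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank


def rank_table(nnodes: int, gpus_per_node: int) -> list[tuple[int, int, int]]:
    """List (node_rank, local_rank, global_rank) for every process in the group."""
    return [
        (node, local, global_rank(node, local, gpus_per_node))
        for node in range(nnodes)
        for local in range(gpus_per_node)
    ]


if __name__ == "__main__":
    # Hypothetical prefill group: 2 nodes with 4 GPUs each.
    print("WORLD_SIZE =", world_size(2, 4))
    for node, local, rank in rank_table(2, 4):
        print(f"node_rank={node} local_rank={local} rank={rank}")
```

Because the prefill and decode groups are launched separately, each group has its own WORLD_SIZE and its own ranks starting from 0.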

Step 3: Launch Commands

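Launch your Prefill and Decode nodes using the following commands. torchrun replaces the deprecated torch.distributed.launch module. Run the command once on every node of the group, with a different --node_rank on each:

# Prefill node (node_rank runs from 0 to N-1)
torchrun --nnodes=N --node_rank=<node_rank> --nproc_per_node=<gpus_per_node> --master_addr=<prefill_master_addr> --master_port=29500 your_prefill_script.py

# Decode node (node_rank runs from 0 to M-1)
torchrun --nnodes=M --node_rank=<node_rank> --nproc_per_node=<gpus_per_node> --master_addr=<decode_master_addr> --master_port=29501 your_decode_script.py

Replace N and M with the number of Prefill and Decode nodes, respectively. Each group has its own master address and port, so the two stages form independent process groups.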
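
Writing out one launch line per node by hand is error-prone, so a small generator can help. The sketch below is only an illustration: the host names, the script names and the choice of the first host in each list as the group's master are assumptions, and it uses only the standard torchrun options --nnodes, --node_rank, --nproc_per_node, --master_addr and --master_port.

```python
import shlex


def launch_commands(hosts, script, master_port, gpus_per_node):
    """Return {host: torchrun command line} for one group (prefill or decode).

    The first host in `hosts` acts as the group's master (node_rank 0).
    """
    if not hosts:
        raise ValueError("a group needs at least one host")
    master_addr = hosts[0]
    commands = {}
    for node_rank, host in enumerate(hosts):
        argv = [
            "torchrun",
            f"--nnodes={len(hosts)}",
            f"--node_rank={node_rank}",
            f"--nproc_per_node={gpus_per_node}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            script,
        ]
        commands[host] = shlex.join(argv)
    return commands


if __name__ == "__main__":
    # Hypothetical cluster: 2 prefill nodes and 3 decode nodes, 4 GPUs each.
    prefill = launch_commands(["p0", "p1"], "your_prefill_script.py", 29500, 4)
    decode = launch_commands(["d0", "d1", "d2"], "your_decode_script.py", 29501, 4)
    for host, cmd in {**prefill, **decode}.items():
        print(f"[{host}] {cmd}")
```

Each printed line is meant to be run on the host named in brackets, for example through ssh or inside the Docker container on that node.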

Step 4: Code Modifications

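Modify your_prefill_script.py and your_decode_script.py to include the following code snippet:

import os

import torch
import torch.distributed as dist

# Bind this process to its own GPU; the launcher exports LOCAL_RANK
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Initialize the distributed backend from MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
dist.init_process_group(backend="nccl", init_method="env://")

# ... (rest of your code)

# Cleanup
dist.destroy_process_group()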

Verification

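To verify that Prefill and Decode workloads are running on their dedicated nodes, monitor GPU usage on each node with nvidia-smi while sending requests: prefill nodes should show activity when new prompts arrive, and decode nodes should stay busy while tokens are being generated. Because utilization alone cannot tell the two stages apart, also add logging statements to each script that print the stage, host name, node rank and GPU ID being used.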
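
One possible shape for such a logging statement is sketched below. It assumes the launcher exports RANK and LOCAL_RANK (as torchrun does) and falls back to 0 when they are missing; the helper names and the "prefill"/"decode" stage labels are illustrative, not part of vLLM.

```python
import logging
import os
import socket


def placement_record(stage: str) -> dict:
    """Collect where this process runs: stage label, host, ranks, visible GPUs."""
    return {
        "stage": stage,
        "host": socket.gethostname(),
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "visible_gpus": os.environ.get("CUDA_VISIBLE_DEVICES", "all"),
    }


def log_placement(stage: str) -> dict:
    """Log one line describing this process and return the record."""
    rec = placement_record(stage)
    logging.getLogger("pd_placement").info(
        "stage=%s host=%s rank=%d local_rank=%d visible_gpus=%s",
        rec["stage"], rec["host"], rec["rank"], rec["local_rank"], rec["visible_gpus"],
    )
    return rec


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Call with "prefill" in your_prefill_script.py and "decode" in your_decode_script.py.
    log_placement("prefill")
```

Collecting these lines from all nodes and grouping them by host gives a quick check that no host reports both stages.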

Extra Tips

Ensure that all GB300 nodes have the same versions of PyTorch and vLLM installed, for example by running the same Docker image on every node.
Launch your scripts with torchrun, the successor to torch.distributed.launch; it sets the environment variables that init_method='env://' reads.
Size the KV cache for your model and expected context lengths, since an undersized cache limits how many sequences can be served concurrently.
