vllm - 💡(How to fix) Fix [Usage]: How to use the vLLM framework to perform inference testing with Prefill-Decode (PD) Separation for the DeepSeek-R1 NVFP4 model across multiple GB300 server nodes (N Prefill nodes + M Decode nodes)? [1 participants]

GitHub stats
vllm-project/vllm#37437, fetched 2026-04-08 00:58:40
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): added_to_project_v2 ×1, labeled ×1, project_v2_item_status_changed ×1

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

Dear vLLM Team,

Pardon me for troubling you, and thank you so much for your great work on the vLLM engine. I am currently trying to run inference with Prefill-Decode Disaggregation for the DeepSeek-R1 NVFP4 model across a cluster of GB300 servers. My goal is to deploy a fully disaggregated setup: N dedicated Prefill nodes + M dedicated Decode nodes across multiple physical machines.

However, I am facing the following challenges:

  • I am using a GB300-optimized vLLM 0.11.0+custom build, which does not support high-level PD disaggregation CLI arguments such as --separate-prefill-decode, --prefill-node-ips, --decode-node-ips, --role, etc. These flags return "unrecognized arguments".
  • I have tried using basic distributed arguments (--node-rank, --master-addr, --nnodes) to simulate Prefill/Decode splitting, but I cannot achieve real, strict Prefill-Decode Disaggregation where the two stages are fully isolated on separate nodes.

My environment: multiple GB300 nodes with InfiniBand, Docker with --network=host, and the same DeepSeek-R1 NVFP4 model accessible across all nodes.

I would really appreciate your guidance on:

  • How to properly enable true Prefill-Decode Disaggregation across N Prefill nodes and M Decode nodes in vLLM, even for versions without the high-level PD flags.
  • The correct distributed configuration (tensor parallel, distributed init, KV cache settings) for GB300 and DeepSeek-R1 NVFP4.
  • How to verify that Prefill and Decode workloads are actually running on their dedicated nodes.

Thank you very much for your time and help.

Best regards,

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To achieve Prefill-Decode Disaggregation in vLLM without high-level CLI arguments, you can manually configure the distributed settings.

Step 1: Modify Configuration

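The setting names below are descriptive labels, not literal vLLM option names. Map each one to the option your build actually exposes (check `vllm serve --help`):

  • distributed_init: initialize one distributed process group per stage, so the Prefill nodes and the Decode nodes form two separate groups
  • tensor_parallel: shard the model across the GPUs of each stage; in vLLM this is set with --tensor-parallel-size
  • kv_cache_settings: configure according to your GB300 and DeepSeek-R1 NVFP4 model requirements; in vLLM, KV cache transfer between Prefill and Decode instances is configured with --kv-transfer-config, if your build supports it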
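
As an illustration of the KV cache part of the configuration, the sketch below builds a JSON value of the kind vLLM's --kv-transfer-config option accepts, one for the Prefill side and one for the Decode side. The helper name kv_transfer_config is hypothetical, and the connector name and role strings are assumptions that should be checked against the connectors and roles your 0.11.0+custom build actually ships.

```python
import json

# Assumed role strings for vLLM's KV-transfer configuration; verify them
# against `vllm serve --help` or the docs for your specific build.
PREFILL_ROLE = "kv_producer"   # Prefill instances produce KV cache
DECODE_ROLE = "kv_consumer"    # Decode instances consume KV cache


def kv_transfer_config(role: str, connector: str = "NixlConnector") -> str:
    """Hypothetical helper: return a JSON string for --kv-transfer-config.

    The default connector name is an assumption; substitute whichever
    connector your build supports.
    """
    if role not in (PREFILL_ROLE, DECODE_ROLE):
        raise ValueError(f"unknown role: {role!r}")
    return json.dumps({"kv_connector": connector, "kv_role": role}, sort_keys=True)


if __name__ == "__main__":
    print("prefill:", kv_transfer_config(PREFILL_ROLE))
    print("decode: ", kv_transfer_config(DECODE_ROLE))
```

Generating the value from one function keeps the Prefill and Decode sides consistent, so only the role differs between the two groups.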

Step 2: Environment Variables

Set the following environment variables:

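  • NODE_RANK: unique rank for each node within its group (0 to NNODES - 1)
  • MASTER_ADDR: address of the master (rank 0) node of the group
  • MASTER_PORT: port on the master node; use a different port for the Prefill group and the Decode group
  • NNODES: total number of nodes in the group
  • WORLD_SIZE: total number of processes in the group (equal to NNODES * number of GPUs per node)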
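
The arithmetic behind these variables can be sketched as follows. The function names are hypothetical, and the list of required variables reflects what PyTorch's env:// initialization reads (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE); the node and GPU counts in the example are placeholders, not a statement about GB300 hardware.

```python
import os


def world_size(nnodes: int, gpus_per_node: int) -> int:
    """Total number of processes in one group: one process per GPU."""
    if nnodes < 1 or gpus_per_node < 1:
        raise ValueError("nnodes and gpus_per_node must be positive")
    return nnodes * gpus_per_node


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of a process from its node rank and its GPU index on that node."""
    return node_rank * gpus_per_node + local_rank


def missing_env(env=None) -> list:
    """Return the variables env:// initialization needs that are not set."""
    env = os.environ if env is None else env
    required = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    return [name for name in required if name not in env]


if __name__ == "__main__":
    # Placeholder sizes: 2 nodes with 4 GPUs each.
    print("WORLD_SIZE =", world_size(2, 4))
    print("rank of node 1, GPU 3 =", global_rank(1, 3, 4))
    print("missing variables:", missing_env())
```

Running missing_env() at the top of each script is a cheap way to catch a node that was started without the launcher.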

Step 3: Launch Commands

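Launch your Prefill and Decode nodes using the following commands, run once on every node of the respective group:

# Prefill node (run on each of the N Prefill nodes, with node_rank = 0 .. N-1)
torchrun --nnodes=N --node_rank=<node_rank> --nproc_per_node=<gpus_per_node> --master_addr=<prefill_master_addr> --master_port=29500 your_prefill_script.py

# Decode node (run on each of the M Decode nodes, with node_rank = 0 .. M-1)
torchrun --nnodes=M --node_rank=<node_rank> --nproc_per_node=<gpus_per_node> --master_addr=<decode_master_addr> --master_port=29501 your_decode_script.py

Replace N and M with the number of Prefill and Decode nodes, respectively, and give each node in a group its own --node_rank. torchrun is used here because torch.distributed.launch is deprecated in its favor.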
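
To avoid hand-editing the rank on every node, the per-node command lines can be generated. This is a minimal sketch using torchrun, the launcher PyTorch recommends in place of torch.distributed.launch; the helper name launch_commands, the addresses, and the GPU count are hypothetical placeholders.

```python
def launch_commands(script, nnodes, gpus_per_node, master_addr, master_port):
    """Return one torchrun argument list per node of a group.

    Each node gets a distinct --node_rank in the range 0 .. nnodes-1.
    """
    commands = []
    for node_rank in range(nnodes):
        commands.append([
            "torchrun",
            f"--nnodes={nnodes}",
            f"--node_rank={node_rank}",
            f"--nproc_per_node={gpus_per_node}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            script,
        ])
    return commands


if __name__ == "__main__":
    # Placeholder topology: 2 Prefill nodes and 3 Decode nodes, 4 GPUs each.
    groups = {
        "prefill": launch_commands("your_prefill_script.py", 2, 4, "10.0.0.1", 29500),
        "decode": launch_commands("your_decode_script.py", 3, 4, "10.0.0.11", 29501),
    }
    for name, commands in groups.items():
        for command in commands:
            print(f"[{name}]", " ".join(command))
```

Each printed line is meant to be run on a different physical node; the two groups use different master addresses and ports so their rendezvous do not collide.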

Step 4: Code Modifications

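Modify your_prefill_script.py and your_decode_script.py to include the following code snippets:

import os

import torch
import torch.distributed as dist

# Bind this process to its own GPU; LOCAL_RANK is set by the launcher
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Initialize distributed backend (reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE)
dist.init_process_group('nccl', init_method='env://')

# ... (rest of your code)

# Cleanup
dist.destroy_process_group()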

Verification

To verify that Prefill and Decode workloads are running on their dedicated nodes, you can use tools like nvidia-smi to monitor GPU usage on each node. Additionally, you can add logging statements in your code to print the node rank and GPU ID being used.
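
The logging suggestion can be sketched as a single line printed by every process at startup. The helper name placement_line and the role labels are hypothetical; RANK and LOCAL_RANK are the variables the PyTorch launcher sets, and CUDA_VISIBLE_DEVICES is included only if it happens to be set.

```python
import os
import socket


def placement_line(role, env=None, host=None):
    """Format one log line describing where this process runs."""
    env = os.environ if env is None else env
    host = socket.gethostname() if host is None else host
    rank = env.get("RANK", "?")
    local_rank = env.get("LOCAL_RANK", "?")
    visible = env.get("CUDA_VISIBLE_DEVICES", "all")
    return (
        f"[{role}] host={host} rank={rank} "
        f"local_rank={local_rank} visible_gpus={visible}"
    )


if __name__ == "__main__":
    # Call with "prefill" in your_prefill_script.py and "decode" in your_decode_script.py.
    print(placement_line("prefill"), flush=True)
```

Collecting these lines from all nodes and grouping them by host shows at a glance whether any host reports both roles, which would mean the two stages are not isolated.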

Extra Tips

  • Ensure that your GB300 nodes have the same version of PyTorch and vLLM installed.
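  • Use torchrun (the successor to the deprecated torch.distributed.launch) to launch your scripts. It sets RANK, LOCAL_RANK, and WORLD_SIZE for each process; your script still has to call dist.init_process_group itself.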
  • Configure your kv_cache_settings according to your model's requirements to optimize performance.
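
The first tip can be checked mechanically. The sketch below reads the installed package versions on one node and compares reports gathered from several nodes; the function names and the example host names are hypothetical, and how the reports are collected (SSH, shared file, log scraping) is left open.

```python
from importlib import metadata


def installed_versions(packages=("torch", "vllm")):
    """Return {package: version or None} for the current Python environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed on this node
    return versions


def mismatches(reports):
    """Given {host: {package: version}}, return packages whose versions differ."""
    seen = {}
    for versions in reports.values():
        for package, version in versions.items():
            seen.setdefault(package, set()).add(version)
    return {package: found for package, found in seen.items() if len(found) > 1}


if __name__ == "__main__":
    # Run on each node and compare the printed dictionaries.
    print(installed_versions())
```

An empty result from mismatches means every reporting node agrees on every package; a custom build string such as 0.11.0+custom is compared verbatim, so a node with a stock build is flagged as well.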
