vllm - 💡(How to fix) Fix [Bug]: Disaggregate prefill script cannot work due to inconsistent request id between P node and D node. [1 participants]

vllm · 2026-04-02 12:30:25

vllm-project/vllm#38808•Fetched 2026-04-08 02:34:47

Author

Taeyang123456

Fix Action

Fix / Workaround

    kv_cache = extract_kv_from_layer(kv_layer, request.block_ids)
    tensor_id = original_request_id + "#" + layer_name
    self.p2p_nccl_engine.send_tensor(tensor_id, kv_cache, remote_address)

With this patch applied, disaggregated_prefill.sh runs correctly.


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...        
==============================                                 
        System Info                                            
==============================                                
OS                           : Ubuntu 24.04.3 LTS (x86_64)                                                                             
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0                                                                   
Clang version                : Could not collect                                                                                       
CMake version                : Could not collect                                                                                       
Libc version                 : glibc-2.39                  
                                                                   
==============================                       
       PyTorch Info                                  
==============================                       
PyTorch version              : 2.10.0+cu128                     
Is debug build               : False                 
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.8.0-100-generic-x86_64-with-glibc2.39
                                                                                                                                                                                                                                           
==============================                                     
       CUDA / GPU Info        
==============================            
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration :     
GPU 0: NVIDIA H20                                                                                                                      
GPU 1: NVIDIA H20                     
                                                                   
Nvidia driver version        : 580.126.20
cuDNN version                : Could not collect
HIP runtime version          : N/A             
MIOpen runtime version       : N/A 
Is XNNPACK available         : True 

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  48
On-line CPU(s) list:                     0-47
Vendor ID:                               GenuineIntel

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.18.1.dev0+gbcf2be961.d20260324 (git sha: bcf2be961, date: 20260324)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    0-47    0               N/A
GPU1    NV18     X      0-47    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

</details>

🐛 Describe the bug

I tried to run disaggregated prefill with Qwen3-14B using the following command:

cd examples/online_serving && bash disaggregated_prefill.sh

The run hangs, and I never receive the request output from the LLM. The log looks like this:

INFO:__main__:[prefill] start request_id=___prefill_addr_localhost:14579___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d url=http://localhost:8100/v1/completions
(EngineCore pid=71849) INFO 04-02 19:44:09 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {'NCCL_MAX_NCHANNELS': None, 'NCCL_MIN_NCHANNELS': None, 'NCCL_CUMEM_ENABLE': '0', 'NCCL_BUFFSIZE': None, 'NCCL_PROTO': None, 'NCCL_ALGO': None}
(EngineCore pid=71856) INFO 04-02 19:44:09 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {'NCCL_MAX_NCHANNELS': None, 'NCCL_MIN_NCHANNELS': None, 'NCCL_CUMEM_ENABLE': '0', 'NCCL_BUFFSIZE': None, 'NCCL_PROTO': None, 'NCCL_ALGO': None}
(EngineCore pid=71856) INFO 04-02 19:44:09 [p2p_nccl_engine.py:387] 🤝ncclCommInitRank Success, 127.0.0.1:14580👈127.0.0.1:14579, MyRank:1
(EngineCore pid=71849) INFO 04-02 19:44:09 [p2p_nccl_engine.py:226] 🤝ncclCommInitRank Success, 127.0.0.1:14579👉localhost:14580, MyRank:0
(APIServer pid=71627) INFO:     127.0.0.1:52874 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:__main__:[prefill] done request_id=___prefill_addr_localhost:14579___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d status=200 elapsed=0.34s
INFO:__main__:[decode] start request_id=___prefill_addr_localhost:14579___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d url=http://localhost:8200/v1/completions
(APIServer pid=71627) INFO 04-02 19:44:17 [loggers.py:259] Engine 000: Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%
(APIServer pid=71627) INFO 04-02 19:44:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%

Nothing else is printed after these messages.

After some debugging, I found that the request ID differs between the prefill node and the decode node. The OpenAI API layer may wrap the original request_id with a 'cmpl-' prefix and a '-0-<suffix>' suffix, so the decode node cannot receive the correct KV tensor from the prefill node.
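To make the mismatch concrete, here is a minimal sketch of how a wrapped ID changes the per-layer key. The wrapped format follows the description above; the suffix value and the layer name are made up for illustration, and `make_tensor_id` is a hypothetical helper that only mirrors the `request_id + "#" + layer_name` concatenation used in the patch.

```python
# Illustration only: the suffix and the layer name are assumptions.
original_id = (
    "___prefill_addr_localhost:14579"
    "___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d"
)
layer_name = "model.layers.0.self_attn.attn"  # assumed layer name


def make_tensor_id(request_id: str, layer_name: str) -> str:
    """Build the per-layer key the same way the patch does."""
    return request_id + "#" + layer_name


# One side sees the raw ID, the other sees the wrapped one.
wrapped_id = "cmpl-" + original_id + "-0-1a2b3c4d"

sender_key = make_tensor_id(original_id, layer_name)
receiver_key = make_tensor_id(wrapped_id, layer_name)

# The keys differ, so a lookup by receiver_key never finds the sent tensor.
keys_match = sender_key == receiver_key
print("keys match:", keys_match)
```

Because the key is a plain string, any extra prefix or suffix on one side is enough to make the lookup miss.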

Here is my temporary workaround for this mismatch. In the class <code>P2pNcclConnector</code>, define a static method that extracts the original request ID (the method uses <code>re</code>, so the module must import it):

    @staticmethod
    def extract_original_request_id(request_id: str) -> str:
        """
        Extract the original request_id from a potentially wrapped version.

        The OpenAI API wraps request_ids with `cmpl-` prefix and `-0-<suffix>`,
        causing tensor_id mismatch between prefill (saves with original ID) and
        decode (looks up with wrapped ID).

        Example:
        - Original: `___prefill_addr_172.17.89.91:14579___decode_addr_..._`
        - Wrapped: `cmpl-___prefill_addr_172.17.89.91:14579___decode_addr_..._`

        Args:
            request_id: The potentially wrapped request_id.

        Returns:
            The original request_id without the cmpl- prefix and -0-<suffix>.
        """
        pattern = r"(___prefill_addr_.*___decode_addr_.*_[a-f0-9]+)"
        match = re.search(pattern, request_id)
        if match:
            return match.group(1)
        # If no match, return the original request_id unchanged
        return request_id
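The method can be tried outside vLLM. The sketch below copies the regular expression from the patch into a standalone function and runs it on sample IDs; the `-0-9f8e7d6c` suffix is a hypothetical value chosen only to match the wrapped shape described in the docstring.

```python
import re

# Same pattern as in the patch above, compiled once.
_REQUEST_ID_PATTERN = re.compile(
    r"(___prefill_addr_.*___decode_addr_.*_[a-f0-9]+)"
)


def extract_original_request_id(request_id: str) -> str:
    """Return the embedded original ID, or the input unchanged if none is found."""
    match = _REQUEST_ID_PATTERN.search(request_id)
    return match.group(1) if match else request_id


original = (
    "___prefill_addr_localhost:14579"
    "___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d"
)
wrapped = "cmpl-" + original + "-0-9f8e7d6c"  # hypothetical suffix

print(extract_original_request_id(wrapped) == original)
print(extract_original_request_id("cmpl-plain-id"))
```

The greedy `.*` backtracks to the last underscore followed by hex characters, so the match stops at the `-` that starts the appended suffix.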

Then, in the <code>start_load_kv</code> function:

        # Load the KV for each request each layer
        for request in metadata.requests:
            #.........
            # call extract_original_request_id
            #.........
            original_request_id = self.extract_original_request_id(
                request.request_id
            )
            ip, port = self.parse_request_id(original_request_id, False)
            remote_address = ip + ":" + str(port + self._rank)
            for layer_name in forward_context.no_compile_layers:
                layer = forward_context.no_compile_layers[layer_name]

                # Only process layers that have kv_cache
                # attribute (attention layers) Skip non-attention
                # layers like FusedMoE
                kv_cache = getattr(layer, "kv_cache", None)
                if kv_cache is None:
                    continue

                layer = kv_cache[forward_context.virtual_engine]
                #.........
                # concat tensor id with original_request_id
                #.........
                tensor_id = original_request_id + "#" + layer_name
                kv_cache = self.p2p_nccl_engine.recv_tensor(tensor_id, remote_address)

                if kv_cache is None:
                    logger.warning("🚧kv_cache is None, %s", original_request_id)
                    continue

                inject_kv_into_layer(
                    layer, kv_cache, request.block_ids, original_request_id
                )
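The load path derives the peer address from the request ID before it builds the tensor ID. The following is a rough sketch of that step, assuming the ID layout visible in the log above. `parse_peer_address` is a hypothetical helper written for this example; it is not vLLM's `parse_request_id`, whose exact implementation is not shown in the issue.

```python
import re


def parse_peer_address(request_id: str, is_prefill: bool) -> tuple:
    """Hypothetical helper: return (host, port) of the peer node.

    A prefill node needs the decode address; a decode node needs the
    prefill address. The ID layout is assumed from the issue's log.
    """
    if is_prefill:
        pattern = r"___decode_addr_(.*):(\d+)_"
    else:
        pattern = r"___prefill_addr_(.*):(\d+)___"
    match = re.search(pattern, request_id)
    if match is None:
        raise ValueError("request_id does not embed an address: " + request_id)
    return match.group(1), int(match.group(2))


request_id = (
    "___prefill_addr_localhost:14579"
    "___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d"
)

# Decode side (is_prefill=False) looks up the prefill node's address.
ip, port = parse_peer_address(request_id, False)
rank = 0  # assumed worker rank; the patch adds it to the base port
remote_address = ip + ":" + str(port + rank)
print(remote_address)
```

The address lookup tolerates a wrapped ID, since it only searches for the embedded markers. The tensor ID is the part that needs an exact string match.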

Likewise, in the send path that calls <code>extract_kv_from_layer</code>:

        for request in connector_metadata.requests:
            #.........
            # call extract_original_request_id
            #.........
            original_request_id = self.extract_original_request_id(
                request.request_id
            )
            ip, port = self.parse_request_id(original_request_id, True)
            remote_address = ip + ":" + str(port + self._rank)

            kv_cache = extract_kv_from_layer(kv_layer, request.block_ids)
            tensor_id = original_request_id + "#" + layer_name
            self.p2p_nccl_engine.send_tensor(tensor_id, kv_cache, remote_address)

With these patches applied, disaggregated_prefill.sh runs correctly.
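The effect of normalizing the ID on both sides can be checked with a toy simulation. `InMemoryTensorStore` below is a made-up stand-in for a transport keyed by tensor ID; it is not vLLM's P2P NCCL engine, and the two wrapped suffixes are hypothetical.

```python
import re


class InMemoryTensorStore:
    """Toy transport keyed by tensor_id (an assumption, not vLLM's engine)."""

    def __init__(self) -> None:
        self._store = {}

    def send_tensor(self, tensor_id: str, tensor: list) -> None:
        self._store[tensor_id] = tensor

    def recv_tensor(self, tensor_id: str):
        # Returns None when no tensor was stored under this key.
        return self._store.get(tensor_id)


def extract_original_request_id(request_id: str) -> str:
    match = re.search(
        r"(___prefill_addr_.*___decode_addr_.*_[a-f0-9]+)", request_id
    )
    return match.group(1) if match else request_id


original = (
    "___prefill_addr_localhost:14579"
    "___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d"
)
prefill_side_id = "cmpl-" + original + "-0-aaaa1111"  # hypothetical suffix
decode_side_id = "cmpl-" + original + "-0-bbbb2222"   # hypothetical suffix
layer_name = "layer0"

store = InMemoryTensorStore()

# Send path: normalize the ID before building the key.
send_key = extract_original_request_id(prefill_side_id) + "#" + layer_name
store.send_tensor(send_key, [0.1, 0.2])

# Load path without normalization misses; with normalization it hits.
missed = store.recv_tensor(decode_side_id + "#" + layer_name)
found = store.recv_tensor(
    extract_original_request_id(decode_side_id) + "#" + layer_name
)
print(missed, found)
```

Both sides have to apply the same extraction. Normalizing only one of them would still leave the keys different whenever the other side sees a wrapped ID.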

extent analysis

TL;DR

The issue can be resolved by extracting the original request ID from the potentially wrapped version using a static method in the P2pNcclConnector class.

Guidance

Identify the request_id mismatch between the prefill and decode nodes, caused by the OpenAI API layer wrapping the original request_id with a 'cmpl-' prefix and a '-0-<suffix>' suffix.
Implement a static method, extract_original_request_id, in the P2pNcclConnector class to recover the original request_id from the wrapped version.
Call extract_original_request_id in start_load_kv and in the send path that calls extract_kv_from_layer, so that both sides build the same tensor ID.
Verify that disaggregated_prefill.sh runs correctly after applying the patch.

Example

@staticmethod
def extract_original_request_id(request_id: str) -> str:
    pattern = r"(___prefill_addr_.*___decode_addr_.*_[a-f0-9]+)"
    match = re.search(pattern, request_id)
    if match:
        return match.group(1)
    return request_id

Notes

The provided patch assumes that the wrapped request_id follows a specific pattern. If the pattern changes, the regular expression in the extract_original_request_id method may need to be updated.
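One way to notice a format change early is to use a stricter pattern that returns nothing when the ID does not look as expected. The sketch below assumes the trailing token is exactly 32 lowercase hex characters, as in the logged ID; that length is an inference from the log, not something the issue states, and `extract_strict` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: host:port pairs followed by a 32-character hex token.
_STRICT_PATTERN = re.compile(
    r"___prefill_addr_.+?:\d+___decode_addr_.+?:\d+_[a-f0-9]{32}(?![a-f0-9])"
)


def extract_strict(request_id: str) -> Optional[str]:
    """Return the embedded ID, or None if the expected layout is absent."""
    match = _STRICT_PATTERN.search(request_id)
    return match.group(0) if match else None


original = (
    "___prefill_addr_localhost:14579"
    "___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d"
)
wrapped = "cmpl-" + original + "-0-1a2b3c4d"          # hypothetical suffix
truncated = "cmpl-" + original[:-1] + "-0-1a2b3c4d"   # 31 hex characters

print(extract_strict(wrapped) == original)
print(extract_strict(truncated))
print(extract_strict("cmpl-plain-id"))
```

Returning `None` instead of the unchanged input lets the caller log a warning when the layout is unexpected, rather than silently building a key that will never match.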

Recommendation

Apply the workaround: implement extract_original_request_id and call it in start_load_kv and in the send path that calls extract_kv_from_layer. According to the report, this resolves the request_id mismatch and lets disaggregated_prefill.sh run correctly.
