vllm - [Bug]: Disaggregate prefill script cannot work due to inconsistent request id between P node and D node.

GitHub stats
vllm-project/vllm#38808 (fetched 2026-04-08 02:34:47)
Comments: 0 · Participants: 1 · Timeline: 1 (labeled ×1) · Reactions: 0

Fix Action

Fix / Workaround

    kv_cache = extract_kv_from_layer(kv_layer, request.block_ids)
    tensor_id = original_request_id + "#" + layer_name
    self.p2p_nccl_engine.send_tensor(tensor_id, kv_cache, remote_address)

With this patch, disaggregated_prefill.sh runs correctly.


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...        
==============================                                 
        System Info                                            
==============================                                
OS                           : Ubuntu 24.04.3 LTS (x86_64)                                                                             
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0                                                                   
Clang version                : Could not collect                                                                                       
CMake version                : Could not collect                                                                                       
Libc version                 : glibc-2.39                  
                                                                   
==============================                       
       PyTorch Info                                  
==============================                       
PyTorch version              : 2.10.0+cu128                     
Is debug build               : False                 
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.8.0-100-generic-x86_64-with-glibc2.39

==============================                                     
       CUDA / GPU Info        
==============================            
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration :     
GPU 0: NVIDIA H20                                                                                                                      
GPU 1: NVIDIA H20                     
                                                                   
Nvidia driver version        : 580.126.20
cuDNN version                : Could not collect
HIP runtime version          : N/A             
MIOpen runtime version       : N/A 
Is XNNPACK available         : True 

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  48
On-line CPU(s) list:                     0-47
Vendor ID:                               GenuineIntel

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.18.1.dev0+gbcf2be961.d20260324 (git sha: bcf2be961, date: 20260324)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    0-47    0               N/A
GPU1    NV18     X      0-47    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
</details>

🐛 Describe the bug

I tried to run disaggregated prefill with Qwen3-14B using the following command:

cd examples/online_serving && bash disaggregated_prefill.sh

The procedure hangs, and I never get the request output from the LLM. The log looks like this:

INFO:__main__:[prefill] start request_id=___prefill_addr_localhost:14579___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d url=http://localhost:8100/v1/completions
(EngineCore pid=71849) INFO 04-02 19:44:09 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {'NCCL_MAX_NCHANNELS': None, 'NCCL_MIN_NCHANNELS': None, 'NCCL_CUMEM_ENABLE': '0', 'NCCL_BUFFSIZE': None, 'NCCL_PROTO': None, 'NCCL_ALGO': None}
(EngineCore pid=71856) INFO 04-02 19:44:09 [p2p_nccl_engine.py:52] set_p2p_nccl_context, original_values: {'NCCL_MAX_NCHANNELS': None, 'NCCL_MIN_NCHANNELS': None, 'NCCL_CUMEM_ENABLE': '0', 'NCCL_BUFFSIZE': None, 'NCCL_PROTO': None, 'NCCL_ALGO': None}
(EngineCore pid=71856) INFO 04-02 19:44:09 [p2p_nccl_engine.py:387] 🤝ncclCommInitRank Success, 127.0.0.1:14580👈127.0.0.1:14579, MyRank:1
(EngineCore pid=71849) INFO 04-02 19:44:09 [p2p_nccl_engine.py:226] 🤝ncclCommInitRank Success, 127.0.0.1:14579👉localhost:14580, MyRank:0
(APIServer pid=71627) INFO:     127.0.0.1:52874 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:__main__:[prefill] done request_id=___prefill_addr_localhost:14579___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d status=200 elapsed=0.34s
INFO:__main__:[decode] start request_id=___prefill_addr_localhost:14579___decode_addr_localhost:14580_aa38055cf06b4fbcadf37f77befaf21d url=http://localhost:8200/v1/completions
(APIServer pid=71627) INFO 04-02 19:44:17 [loggers.py:259] Engine 000: Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%
(APIServer pid=71627) INFO 04-02 19:44:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%

Nothing else is printed after these messages.

After some debugging, I found that the request id differs between the prefill node and the decode node. The OpenAI API layer may wrap the original request_id with a 'cmpl-' prefix and a '-0-<suffix>' suffix, so the decode node cannot receive the correct KV tensor from the prefill node.
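The failure mode can be illustrated with a toy model. This is a minimal sketch, not vLLM code: `wrap_request_id`, the dict standing in for the tensor store, the layer name, and the suffix value are all hypothetical, and the wrapped format is assumed from the description above.

```python
# Toy model of the request id mismatch; not vLLM code.
# Hypothetical: wrap_request_id, the dict "store", and the suffix value.


def wrap_request_id(request_id: str, suffix: str) -> str:
    """Mimic the wrapping described above: 'cmpl-' prefix and '-0-<suffix>'."""
    return f"cmpl-{request_id}-0-{suffix}"


def make_tensor_id(request_id: str, layer_name: str) -> str:
    """Tensor ids are built as '<request_id>#<layer_name>'."""
    return request_id + "#" + layer_name


original = (
    "___prefill_addr_localhost:14579___decode_addr_localhost:14580_"
    "aa38055cf06b4fbcadf37f77befaf21d"
)

# Key one side would use if it saw the original id.
sent = {make_tensor_id(original, "layer.0"): "kv-bytes"}

# Key the other side would look up if it saw a wrapped id.
lookup_key = make_tensor_id(wrap_request_id(original, "1a2b3c4d"), "layer.0")

print(lookup_key in sent)  # False
```

Any difference in the request id, on either side, changes the tensor id, so the receiver would wait for a key that is never sent.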

Here is my temporary solution for this mismatch. In class <code>P2pNcclConnector</code>, define a static method that extracts the original request id:

    # Requires `import re` at module level.
    @staticmethod
    def extract_original_request_id(request_id: str) -> str:
        """
        Extract the original request_id from a potentially wrapped version.

        The OpenAI API wraps request_ids with `cmpl-` prefix and `-0-<suffix>`,
        causing tensor_id mismatch between prefill (saves with original ID) and
        decode (looks up with wrapped ID).

        Example:
        - Original: `___prefill_addr_172.17.89.91:14579___decode_addr_..._`
        - Wrapped: `cmpl-___prefill_addr_172.17.89.91:14579___decode_addr_..._`

        Args:
            request_id: The potentially wrapped request_id.

        Returns:
            The original request_id without the cmpl- prefix and -0-<suffix>.
        """
        pattern = r"(___prefill_addr_.*___decode_addr_.*_[a-f0-9]+)"
        match = re.search(pattern, request_id)
        if match:
            return match.group(1)
        # If no match, return the original request_id unchanged
        return request_id
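As a quick sanity check, the method above can be exercised outside the class. The sketch below copies the same regular expression into a standalone function; the wrapped id is built by hand with a hypothetical suffix, since the issue does not show a real wrapped value.

```python
import re


def extract_original_request_id(request_id: str) -> str:
    # Same pattern as the static method above.
    pattern = r"(___prefill_addr_.*___decode_addr_.*_[a-f0-9]+)"
    match = re.search(pattern, request_id)
    return match.group(1) if match else request_id


original = (
    "___prefill_addr_localhost:14579___decode_addr_localhost:14580_"
    "aa38055cf06b4fbcadf37f77befaf21d"
)
wrapped = f"cmpl-{original}-0-9f8e7d6c"  # hypothetical suffix

# The wrapper is stripped, an unwrapped id is returned as-is,
# and an id that does not match the pattern passes through unchanged.
assert extract_original_request_id(wrapped) == original
assert extract_original_request_id(original) == original
assert extract_original_request_id("cmpl-plain-id") == "cmpl-plain-id"
```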

Then, in <code>start_load_kv</code>:

        # Load the KV for each request each layer
        for request in metadata.requests:
            # Recover the original request id before parsing addresses.
            original_request_id = self.extract_original_request_id(
                request.request_id
            )
            ip, port = self.parse_request_id(original_request_id, False)
            remote_address = ip + ":" + str(port + self._rank)
            for layer_name in forward_context.no_compile_layers:
                layer = forward_context.no_compile_layers[layer_name]

                # Only process layers that have kv_cache
                # attribute (attention layers) Skip non-attention
                # layers like FusedMoE
                kv_cache = getattr(layer, "kv_cache", None)
                if kv_cache is None:
                    continue

                layer = kv_cache[forward_context.virtual_engine]
                # Build the tensor id from the original request id.
                tensor_id = original_request_id + "#" + layer_name
                kv_cache = self.p2p_nccl_engine.recv_tensor(tensor_id, remote_address)

                if kv_cache is None:
                    logger.warning("🚧kv_cache is None, %s", original_request_id)
                    continue

                inject_kv_into_layer(
                    layer, kv_cache, request.block_ids, original_request_id
                )
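The loop above derives the remote address from the request id. The sketch below shows one way such parsing could work; `parse_addr` is a hypothetical stand-in for `parse_request_id`, its patterns are assumptions based on the id format in the logs, and `rank` is an illustrative value.

```python
import re


def parse_addr(request_id: str, is_prefill: bool) -> tuple[str, int]:
    """Hypothetical stand-in for parse_request_id.

    Assumption: the prefill side needs the decode address and the decode
    side needs the prefill address, matching the True/False arguments
    used in the patched loops.
    """
    if is_prefill:
        pattern = r"___decode_addr_(.*):(\d+)"
    else:
        pattern = r"___prefill_addr_(.*):(\d+)___"
    match = re.search(pattern, request_id)
    if match is None:
        raise ValueError(f"no address found in request id: {request_id!r}")
    return match.group(1), int(match.group(2))


original = (
    "___prefill_addr_localhost:14579___decode_addr_localhost:14580_"
    "aa38055cf06b4fbcadf37f77befaf21d"
)

rank = 0  # hypothetical rank offset
ip, port = parse_addr(original, is_prefill=False)
remote_address = ip + ":" + str(port + rank)
print(remote_address)  # localhost:14579
```

Under these assumed patterns, address parsing would also succeed on a wrapped id, which would be consistent with the NCCL connection succeeding in the log while the tensor lookup still fails.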

Also in the send-side loop, where <code>extract_kv_from_layer</code> is called:

        for request in connector_metadata.requests:
            # Recover the original request id before parsing addresses.
            original_request_id = self.extract_original_request_id(
                request.request_id
            )
            ip, port = self.parse_request_id(original_request_id, True)
            remote_address = ip + ":" + str(port + self._rank)

            kv_cache = extract_kv_from_layer(kv_layer, request.block_ids)
            tensor_id = original_request_id + "#" + layer_name
            self.p2p_nccl_engine.send_tensor(tensor_id, kv_cache, remote_address)

With these patches, disaggregated_prefill.sh runs correctly.


extent analysis

TL;DR

The issue can be resolved by extracting the original request ID from the potentially wrapped version using a static method in the P2pNcclConnector class.

Guidance

  • Identify the request_id mismatch between prefill and decode nodes caused by the OpenAI API layer wrapping the original request_id with a 'cmpl-' prefix and '-0-<suffix>'.
  • Implement a static method extract_original_request_id in the P2pNcclConnector class to extract the original request_id from the wrapped version.
  • Call extract_original_request_id in start_load_kv and in the send-side loop that calls extract_kv_from_layer, so that both sides build the same tensor ID.
  • Verify that the disaggregated_prefill.sh script runs correctly after applying the patch.

Example

@staticmethod
def extract_original_request_id(request_id: str) -> str:
    pattern = r"(___prefill_addr_.*___decode_addr_.*_[a-f0-9]+)"
    match = re.search(pattern, request_id)
    if match:
        return match.group(1)
    return request_id

Notes

The provided patch assumes that the wrapped request_id follows a specific pattern. If the pattern changes, the regular expression in the extract_original_request_id method may need to be updated.
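One way to reduce that sensitivity is to anchor the pattern more tightly. The sketch below is a variant, not the author's patch: it assumes the trailing id is always 32 lowercase hex characters, as in the request ids shown in the logs, and the function name is hypothetical.

```python
import re

# Assumption: the original id ends with a 32-character lowercase hex string,
# as in the logged request ids. Adjust if the proxy uses another format.
_ORIGINAL_ID = re.compile(
    r"___prefill_addr_.+?:\d+___decode_addr_.+?:\d+_[a-f0-9]{32}"
)


def extract_original_request_id_strict(request_id: str) -> str:
    """Return the embedded original id, or the input unchanged if absent."""
    match = _ORIGINAL_ID.search(request_id)
    return match.group(0) if match else request_id


original = (
    "___prefill_addr_localhost:14579___decode_addr_localhost:14580_"
    "aa38055cf06b4fbcadf37f77befaf21d"
)

# Hypothetical suffix containing '_' followed by hex characters. The looser
# pattern from the patch would keep part of this suffix in its match.
wrapped = f"cmpl-{original}-0-ab_cd"

assert extract_original_request_id_strict(wrapped) == original
assert extract_original_request_id_strict(original) == original
assert extract_original_request_id_strict("cmpl-other") == "cmpl-other"
```

Whether real suffixes can contain an underscore is not stated in the issue, so the tighter pattern is a defensive choice rather than a required fix.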

Recommendation

Apply the workaround by implementing extract_original_request_id and calling it in start_load_kv and in the send-side loop that calls extract_kv_from_layer. This resolves the request_id mismatch and allows disaggregated_prefill.sh to run correctly.
