vllm - ✅(Solved) Fix [Bug]: KeyError on model.layers.N.self_attn.attn during initialize_attn_backend with pipeline_parallel_size=4 (V1 engine + Ray) [1 pull request, 1 participant]

GitHub stats

vllm-project/vllm#40649 (fetched 2026-04-23 07:23:38)

  • Comments: 0
  • Participants: 1
  • Timeline: 4
  • Reactions: 0
  • Timeline (top): cross-referenced ×1, labeled ×1, subscribed ×1, unsubscribed ×1

Error Message

KeyError: 'model.layers.20.self_attn.attn'  # PP rank 1 (layers 20–39)
KeyError: 'model.layers.40.self_attn.attn'  # PP rank 2 (layers 40–59)
KeyError: 'model.layers.60.self_attn.attn'  # PP rank 3 (layers 60–79)


PR fix notes

PR #40678: [Bugfix] Fix KeyError in get_attn_backends_for_group when using PP

Description (problem / solution / changelog)

Latest Status

I cannot reproduce this bug using a 1-node Ray cluster with 4× RTX 4090, so I will close it for now. I will try to reproduce it with a 4-node Ray cluster (1× RTX 4090 per node) and then re-open.

Purpose

Closes #40649.

When a model uses pipeline parallelism, each rank holds only its local stage's layers. However, the code still iterates over kv_cache_group_spec.layer_names, which is global and contains layer names from other ranks, so the lookup raises a KeyError. Changing the code to access only the local layer names on each rank fixes this error, because the layers are already filtered to the local stage by get_layers_from_vllm_config.

Test Plan

The described error can be reproduced using this script:

<details>
"""
Minimal repro for vllm-project/vllm#40649:
KeyError in `get_attn_backends_for_group` on non-zero PP ranks.

Bug: `get_attn_backends_for_group` iterates `kv_cache_group_spec.layer_names`
(global across all PP stages) but indexes into `layers` (filtered to only the
local stage by `get_layers_from_vllm_config`). On any PP rank > 0 the lookup
misses and raises KeyError.

Fix: iterate `layers` directly, which is already the local-rank subset.

Run:
    .venv/bin/python repro_40649.py            # buggy behavior
    .venv/bin/python repro_40649.py --fixed    # patched behavior
"""

import argparse
import sys


class FakeLayer:
    def __init__(self, idx: int) -> None:
        self.idx = idx

    def get_attn_backend(self) -> str:
        return f"FakeAttnBackend(layer={self.idx})"


def simulate_pp_rank(
    pp_rank: int,
    pp_size: int,
    num_layers: int,
    fixed: bool,
) -> None:
    # kv_cache_group_spec.layer_names is global across all PP stages.
    global_layer_names = [
        f"model.layers.{i}.self_attn.attn" for i in range(num_layers)
    ]

    # Each PP rank only owns its local stage. This mirrors what
    # `get_layers_from_vllm_config(..., kv_cache_group_spec.layer_names)`
    # returns on a PP worker: the spec's names filtered to the layers
    # actually present in this worker's static_forward_context.
    per_rank = num_layers // pp_size
    local_start = pp_rank * per_rank
    local_end = local_start + per_rank
    layers = {
        f"model.layers.{i}.self_attn.attn": FakeLayer(i)
        for i in range(local_start, local_end)
    }

    # The loop in get_attn_backends_for_group.
    visited: list[str] = []
    try:
        iterable = layers if fixed else global_layer_names
        for layer_name in iterable:
            layers[layer_name].get_attn_backend()
            visited.append(layer_name)
    except KeyError as e:
        print(
            f"[pp_rank={pp_rank}] FAILED after visiting {len(visited)} layers: "
            f"KeyError({e})"
        )
        return
    print(
        f"[pp_rank={pp_rank}] OK — built backends for {len(visited)} local layers"
    )


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--fixed", action="store_true", help="use the patched loop")
    parser.add_argument("--pp-size", type=int, default=4)
    parser.add_argument("--num-layers", type=int, default=80)
    args = parser.parse_args()

    label = "FIXED" if args.fixed else "BUGGY"
    print(f"=== {label}: pp_size={args.pp_size}, num_layers={args.num_layers} ===")
    for rank in range(args.pp_size):
        simulate_pp_rank(rank, args.pp_size, args.num_layers, fixed=args.fixed)
    return 0


if __name__ == "__main__":
    sys.exit(main())
</details>

Running the script first without and then with --fixed prints:

=== BUGGY: pp_size=4, num_layers=80 ===
  [pp_rank=0] FAILED after visiting 20 layers: KeyError('model.layers.20.self_attn.attn')
  [pp_rank=1] FAILED after visiting 0 layers: KeyError('model.layers.0.self_attn.attn')
  [pp_rank=2] FAILED after visiting 0 layers: KeyError('model.layers.0.self_attn.attn')
  [pp_rank=3] FAILED after visiting 0 layers: KeyError('model.layers.0.self_attn.attn')

=== FIXED: pp_size=4, num_layers=80 ===
  [pp_rank=0] OK — built backends for 20 local layers
  [pp_rank=1] OK — built backends for 20 local layers
  [pp_rank=2] OK — built backends for 20 local layers
  [pp_rank=3] OK — built backends for 20 local layers

Test Result

Alternatively, run the original issue's command, which serves Llama-3.3-70B-AWQ across a 4-node Ray cluster (1× RTX 4090 each):

vllm serve casperhansen/llama-3.3-70b-instruct-awq \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 4 \
    --distributed-executor-backend ray \
    --quantization awq_marlin \
    --attention-backend FLASH_ATTN \
    --max-model-len 8192 \
    --trust-remote-code

Before: crashes on PP ranks 1/2/3 with KeyError: 'model.layers.N.self_attn.attn' during initialize_attn_backend.
After: engine initializes; a one-shot /v1/completions request returns a response.
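
A one-shot request like the one mentioned above could be sketched as follows. This assumes the server listens on vLLM's default address http://localhost:8000 and that the served model name equals the model path passed to vllm serve; the helper names, prompt, and max_tokens value are illustrative and not taken from the PR.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 16}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request: urllib.request.Request) -> str:
    """Send the request and return the first completion's text.

    Needs a running server; raises urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Against a running server, send_completion(build_completion_request("http://localhost:8000", "casperhansen/llama-3.3-70b-instruct-awq", "Hello")) should return generated text once the engine has initialized.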

I will paste the before and after results within an hour.



Changed files

  • vllm/v1/worker/gpu_model_runner.py (modified, +1/-1)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
/vllm-workspace# python3 collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   :
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version        : 595.79
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            GenuineIntel
Model name:                           13th Gen Intel(R) Core(TM) i9-13900K
CPU family:                           6
Model:                                183
Thread(s) per core:                   2
Core(s) per socket:                   16
Socket(s):                            1
Stepping:                             1
BogoMIPS:                             5990.39
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni vnmi umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                       VT-x
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            768 KiB (16 instances)
L1i cache:                            512 KiB (16 instances)
L2 cache:                             32 MiB (16 instances)
L3 cache:                             36 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-31
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Mitigation; Clear Register File
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu129
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0+cu129
[pip3] torchvision==0.25.0+cu129
[pip3] transformers==5.5.4
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1
vLLM Build Flags:
  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NCCL_P2P_DISABLE=1
NVIDIA_REQUIRE_CUDA=cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571
TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
NCCL_SOCKET_IFNAME=eth
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=WARN
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.9.1
VLLM_ENABLE_CUDA_COMPATIBILITY=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
NCCL_IB_DISABLE=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

Current Environment

  • vLLM version: 0.19.1
  • Python: 3.12
  • CUDA: 12.x
  • GPU: 4× NVIDIA RTX 4090 (24 GB each), one per node
  • OS: Ubuntu (WSL2 detected on workers — pin_memory=False)
  • Ray version: (see logs — connected to existing cluster)
  • Quantization: awq_marlin
  • Attention backend: FLASH_ATTN (FlashAttention v2)

Bug Description

When serving a model with --pipeline-parallel-size 4 and --tensor-parallel-size 1 across 4 nodes (1 GPU per node) using the Ray distributed executor backend, the V1 engine crashes during KV cache initialization with a KeyError on attention layer names.

The error occurs in get_attn_backends_for_group inside gpu_model_runner.py. Each pipeline parallel worker only holds its local stage layers (20 layers each for an 80-layer model with pp=4), but the KV cache group spec references global layer indices (e.g. model.layers.20, model.layers.40, model.layers.60). These global keys don't exist in the worker-local layers dict, causing a KeyError.

The bug affects PP ranks 1, 2, and 3 (all non-zero ranks). PP rank 0 (layers 0–19) succeeds.
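
The stage boundaries described above can be sketched as follows. This is a minimal illustration that assumes an even, contiguous split of layers across stages; stage_layer_range is a hypothetical helper, not a vLLM function, and vLLM's actual partitioning may differ.

```python
def stage_layer_range(num_layers: int, pp_size: int, pp_rank: int) -> range:
    """Return the global layer indices owned by one pipeline stage.

    Hypothetical helper: assumes an even, contiguous split of layers.
    """
    if num_layers % pp_size != 0:
        raise ValueError("this sketch only handles evenly divisible splits")
    per_stage = num_layers // pp_size
    start = pp_rank * per_stage
    return range(start, start + per_stage)


# 80 layers across 4 stages, as in the report.
for rank in range(4):
    owned = stage_layer_range(80, 4, rank)
    print(f"PP rank {rank}: layers {owned.start}-{owned.stop - 1}")
```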

All prior initialization steps succeed:

  • Ray placement group created across 4 nodes ✓
  • NCCL world_size=4, all ranks assigned ✓
  • Weights loaded on all workers ✓
  • torch.compile completed on all workers ✓
  • CUDA graph profiling completed ✓
  • KV cache size calculated (121,680 tokens) ✓
  • Crash occurs at initialize_attn_backend during initialize_from_config

To reproduce

vllm serve casperhansen/llama-3.3-70b-instruct-awq \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --distributed-executor-backend ray \
  --quantization awq_marlin \
  --attention-backend FLASH_ATTN \
  --max-model-len 8192 \
  --trust-remote-code

Across a 4-node Ray cluster, each node with 1× RTX 4090.

Expected behavior

The engine should initialize successfully. Each PP worker should look up attention layers using its local stage indices (0–19), not the global model indices (0–79).

Error output

Fails on PP ranks 1, 2, 3 with:

KeyError: 'model.layers.20.self_attn.attn'  # PP rank 1 (layers 20–39)
KeyError: 'model.layers.40.self_attn.attn'  # PP rank 2 (layers 40–59)
KeyError: 'model.layers.60.self_attn.attn'  # PP rank 3 (layers 60–79)
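
To relate a failing key to the stage that reported it, the layer index can be parsed out of the name. The sketch below assumes the model.layers.N.self_attn.attn naming shown in the errors and an even 20-layer-per-stage split; owning_pp_rank is a hypothetical helper, not part of vLLM.

```python
import re

# Matches the attention layer names shown in the KeyError messages.
LAYER_KEY = re.compile(r"model\.layers\.(\d+)\.self_attn\.attn")


def owning_pp_rank(layer_name: str, num_layers: int = 80, pp_size: int = 4) -> int:
    """Return the PP rank that owns a layer, assuming an even contiguous split."""
    match = LAYER_KEY.fullmatch(layer_name)
    if match is None:
        raise ValueError(f"unexpected layer name: {layer_name!r}")
    layer_index = int(match.group(1))
    return layer_index // (num_layers // pp_size)


for key in (
    "model.layers.20.self_attn.attn",
    "model.layers.40.self_attn.attn",
    "model.layers.60.self_attn.attn",
):
    print(key, "-> PP rank", owning_pp_rank(key))
```

Under these assumptions each failing key is the first layer of the stage that raised it, which matches the rank annotations in the error output.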

Full traceback:

File "vllm/v1/worker/gpu_model_runner.py", line 6781, in initialize_kv_cache
    self.initialize_attn_backend(kv_cache_config)
File "vllm/v1/worker/gpu_model_runner.py", line 6204, in initialize_attn_backend
    attn_backends = get_attn_backends_for_group(kv_cache_group_spec)
File "vllm/v1/worker/gpu_model_runner.py", line 6163, in get_attn_backends_for_group
    attn_backend = layers[layer_name].get_attn_backend()
                   ~~~~~~^^^^^^^^^^^^
KeyError: 'model.layers.20.self_attn.attn'

Root cause hypothesis

In get_attn_backends_for_group (line ~6163), layer_name contains a global model index (e.g. model.layers.20.self_attn.attn), but the layers dict on each PP worker is keyed by local stage indices starting from 0. The lookup fails for any worker that is not PP rank 0.
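
One way to check this hypothesis on a worker is to compare the names in the KV cache group spec with the keys actually present in the worker-local layers dict. The sketch below uses a plain list and dict as stand-ins; missing_layer_names is a hypothetical helper, and how the two collections are obtained inside vLLM is left open.

```python
def missing_layer_names(spec_layer_names, local_layers):
    """Return the spec names that have no entry in the worker-local mapping."""
    return [name for name in spec_layer_names if name not in local_layers]


# Stand-in data: a spec listing all 80 layers, and a worker holding
# layers 20-39 keyed by their global names.
spec = [f"model.layers.{i}.self_attn.attn" for i in range(80)]
local = {f"model.layers.{i}.self_attn.attn": object() for i in range(20, 40)}

missing = missing_layer_names(spec, local)
print(f"{len(missing)} spec names are absent locally; first: {missing[0]}")
print("sample of local keys:", sorted(local)[:3])
```

Printing a sample of the local keys alongside the missing names shows whether the dict is keyed by global names or by stage-local indices.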

This appears to be a regression in the V1 engine's initialize_attn_backend path — the V0 engine handled this correctly by using local layer indexing.

Additional context

  • Confirmed reproduced with both quantization=awq and quantization=awq_marlin — quantization is not the cause
  • Confirmed reproduced with explicit --attention-backend FLASH_ATTN — attention backend selection is not the cause
  • VLLM_USE_V1=0 is not available in v0.19.1 (V0 engine was fully removed in v0.11.0)
  • The cluster, NCCL, and all GPU workers initialize correctly — this is purely a layer index mapping bug in initialize_attn_backend
  • Model: casperhansen/llama-3.3-70b-instruct-awq (LlamaForCausalLM, 80 layers, split as 20 layers per PP stage)


extent analysis

TL;DR

The most likely fix, following PR #40678, is to make get_attn_backends_for_group iterate only over the attention layers present on the local PP rank instead of every name in kv_cache_group_spec.layer_names.

Guidance

  • Locate get_attn_backends_for_group in vllm/v1/worker/gpu_model_runner.py.
  • Check how the layers dict is keyed on each PP worker. According to PR #40678, it is already filtered to the local stage by get_layers_from_vllm_config, while kv_cache_group_spec.layer_names remains global.
  • Change the loop so that it only looks up names that exist in layers, for example by iterating over layers directly.
  • Test the modified code with the provided reproduction command to confirm that the KeyError is resolved.

Example

# Simplified sketch of the loop in get_attn_backends_for_group; not the actual vLLM source.
# `layers` is assumed to hold only this PP rank's attention layers, already
# filtered to the local stage as described in PR #40678.
def get_attn_backends_for_group(layers):
    attn_backends = []
    # Iterate over the local layers instead of kv_cache_group_spec.layer_names,
    # so names owned by other pipeline stages are never looked up.
    for layer_name in layers:
        attn_backend = layers[layer_name].get_attn_backend()
        attn_backends.append(attn_backend)
    return attn_backends

Note: The above example is a simplified representation and may require additional modifications to work correctly.

Notes

  • The modification should be made in the vllm codebase, specifically in vllm/v1/worker/gpu_model_runner.py.
  • The issue's hypothesis that layers is keyed by local stage indices starting from 0 differs from the PR description; verify the keying before patching.
  • The PR author could not reproduce the bug on a single node with 4× RTX 4090 and closed the PR pending a 4-node test, so treat the fix as unconfirmed.

Recommendation

Apply the workaround by restricting the loop in get_attn_backends_for_group to locally present layers, since looking up global layer names in the local dict is the most likely cause of the KeyError.
