vllm - [Bug]: `sharded_state` load fails for FP8 models: `_filter_subtensors` drops `q_scale/k_scale/v_scale/prob_scale` parameters

vllm • 2026-04-28 23:35:50

GitHub stats

vllm-project/vllm#41174•Fetched 2026-04-29 06:11:54

Author

mickelliu

Participants

mickelliu

Timeline (top)

cross-referenced ×1, labeled ×1

Error Message

ValueError: Missing keys ('language_model.model.layers.3.self_attn.attn.q_scale', 'language_model.model.layers.3.self_attn.attn.k_scale', 'language_model.model.layers.3.self_attn.attn.v_scale', 'language_model.model.layers.3.self_attn.attn.prob_scale', ...) in loaded state!

PR fix notes

PR #41179: Fix sharded_state load for FP8 models with aliased scale keys

Repository: vllm-project/vllm
Author: rixav77
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41179

Description (problem / solution / changelog)

Summary

Fixes #41174

_filter_subtensors deduplicates tensors that share storage at save time: when the _q_scale buffer and the q_scale parameter alias the same storage, only _q_scale is written to the checkpoint. On a freshly constructed model at load time, the two tensors have separate storage and are not deduplicated, so both keys remain in the expected state dict while only the underscore-prefixed key exists in the checkpoint. The loader then raises ValueError: Missing keys.
Added a _has_loaded_alias() helper that checks whether an unloaded key's underscore-prefixed counterpart was already loaded from the checkpoint. Remaining keys that pass this check are skipped instead of triggering the error.
Added unit tests for the alias detection logic.
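
The failure mode and the alias check described above can be illustrated with a small, framework-free sketch. This is not the code from the PR: plain Python lists stand in for tensors, `id()` stands in for a storage pointer, and `filter_shared`, `has_loaded_alias_sketch`, and the `attn.*` key names are hypothetical names chosen for illustration.

```python
def filter_shared(state: dict[str, object]) -> dict[str, object]:
    """Keep one key per underlying object (stand-in for storage-based dedup)."""
    seen: set[int] = set()
    kept: dict[str, object] = {}
    # Sorted order makes "_q_scale" win over "q_scale" in this toy example.
    for key in sorted(state):
        if id(state[key]) not in seen:
            seen.add(id(state[key]))
            kept[key] = state[key]
    return kept


def has_loaded_alias_sketch(key: str, loaded_keys: set[str]) -> bool:
    """True if the underscore-prefixed sibling of `key` was already loaded."""
    prefix, sep, leaf = key.rpartition(".")
    if leaf.startswith("_"):
        # Already the prefixed form, so it cannot be excused by an alias.
        return False
    return f"{prefix}{sep}_{leaf}" in loaded_keys


# Save side: buffer and parameter alias the same object, so one key is dropped.
scale = [1.0]
saved = filter_shared({"attn._q_scale": scale, "attn.q_scale": scale})

# Load side: a fresh model holds two separate objects, so both keys are expected.
expected = {"attn._q_scale": [0.0], "attn.q_scale": [0.0]}
remaining = {k for k in expected if k not in saved}

# Keys whose alias was loaded are skipped; anything left would still be an error.
unresolved = {k for k in remaining if not has_loaded_alias_sketch(k, set(saved))}
```

In this sketch `remaining` holds only the unprefixed key, and the alias check clears it, which mirrors the behavior the PR summary describes. Whether the real helper uses the same signature or key-matching rule is an assumption.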

Why this is not duplicating an existing PR

No open PRs address issue #41174. Verified via:

gh pr list --repo vllm-project/vllm --state open --search "41174 in:body"
gh pr list --repo vllm-project/vllm --state open --search "_filter_subtensors"

AI assistance disclosure

This PR was developed with AI assistance (Claude). All changes have been reviewed and understood by the submitter.

Test plan

Unit tests for _has_loaded_alias covering alias detection, non-alias keys, already-prefixed keys, and top-level keys
Existing test_filter_subtensors still passes (no change to dedup logic)
Manual verification with FP8 model + sharded_state save/load (requires GPU — CI will cover this via existing test_sharded_state_loader)

🤖 Generated with Claude Code

Changed files

tests/model_executor/model_loader/test_sharded_state_loader.py (modified, +25/-0)
vllm/model_executor/model_loader/sharded_state_loader.py (modified, +37/-1)

Code Example

ValueError: Missing keys ('language_model.model.layers.3.self_attn.attn.q_scale',
  'language_model.model.layers.3.self_attn.attn.k_scale',
  'language_model.model.layers.3.self_attn.attn.v_scale',
  'language_model.model.layers.3.self_attn.attn.prob_scale', ...) in loaded state!

---

(Worker_TP0 pid=1668288) ERROR 04-28 23:13:07 [multiproc_executor.py:857]     raise ValueError(f"Missing keys {tuple(state_dict)} in loaded state!")
(Worker_TP0 pid=1668288) ERROR 04-28 23:13:07 [multiproc_executor.py:857] ValueError: Missing keys ('language_model.model.layers.3.self_attn.attn.q_scale', 'language_model.model.layers.3.self_attn.attn.k_scale', 'language_model.model.layers.3.self_attn.attn.v_scale', 'language_model.model.layers.3.self_attn.attn.prob_scale', 'language_model.model.layers.7.self_attn.attn.q_scale', 'language_model.model.layers.7.self_attn.attn.k_scale', 'language_model.model.layers.7.self_attn.attn.v_scale', 'language_model.model.layers.7.self_attn.attn.prob_scale', 'language_model.model.layers.11.self_attn.attn.q_scale', 'language_model.model.layers.11.self_attn.attn.k_scale', 'language_model.model.layers.11.self_attn.attn.v_scale', 'language_model.model.layers.11.self_attn.attn.prob_scale', 'language_model.model.layers.15.self_attn.attn.q_scale', 'language_model.model.layers.15.self_attn.attn.k_scale', 'language_model.model.layers.15.self_attn.attn.v_scale', 'language_model.model.layers.15.self_attn.attn.prob_scale', 'language_model.model.layers.19.self_attn.attn.q_scale', 'language_model.model.layers.19.self_attn.attn.k_scale', 'language_model.model.layers.19.self_attn.attn.v_scale', 'language_model.model.layers.19.self_attn.attn.prob_scale', 'language_model.model.layers.23.self_attn.attn.q_scale', 'language_model.model.layers.23.self_attn.attn.k_scale', 'language_model.model.layers.23.self_attn.attn.v_scale', 'language_model.model.layers.23.self_attn.attn.prob_scale', 'language_model.model.layers.27.self_attn.attn.q_scale', 'language_model.model.layers.27.self_attn.attn.k_scale', 'language_model.model.layers.27.self_attn.attn.v_scale', 'language_model.model.layers.27.self_attn.attn.prob_scale', 'language_model.model.layers.31.self_attn.attn.q_scale', 'language_model.model.layers.31.self_attn.attn.k_scale', 'language_model.model.layers.31.self_attn.attn.v_scale', 'language_model.model.layers.31.self_attn.attn.prob_scale', 'language_model.model.layers.35.self_attn.attn.q_scale', 
'language_model.model.layers.35.self_attn.attn.k_scale', 'language_model.model.layers.35.self_attn.attn.v_scale', 'language_model.model.layers.35.self_attn.attn.prob_scale', 'language_model.model.layers.39.self_attn.attn.q_scale', 'language_model.model.layers.39.self_attn.attn.k_scale', 'language_model.model.layers.39.self_attn.attn.v_scale', 'language_model.model.layers.39.self_attn.attn.prob_scale', 'language_model.model.layers.43.self_attn.attn.q_scale', 'language_model.model.layers.43.self_attn.attn.k_scale', 'language_model.model.layers.43.self_attn.attn.v_scale', 'language_model.model.layers.43.self_attn.attn.prob_scale', 'language_model.model.layers.47.self_attn.attn.q_scale', 'language_model.model.layers.47.self_attn.attn.k_scale', 'language_model.model.layers.47.self_attn.attn.v_scale', 'language_model.model.layers.47.self_attn.attn.prob_scale', 'language_model.model.layers.51.self_attn.attn.q_scale', 'language_model.model.layers.51.self_attn.attn.k_scale', 'language_model.model.layers.51.self_attn.attn.v_scale', 'language_model.model.layers.51.self_attn.attn.prob_scale', 'language_model.model.layers.55.self_attn.attn.q_scale', 'language_model.model.layers.55.self_attn.attn.k_scale', 'language_model.model.layers.55.self_attn.attn.v_scale', 'language_model.model.layers.55.self_attn.attn.prob_scale', 'language_model.model.layers.59.self_attn.attn.q_scale', 'language_model.model.layers.59.self_attn.attn.k_scale', 'language_model.model.layers.59.self_attn.attn.v_scale', 'language_model.model.layers.59.self_attn.attn.prob_scale') in loaded state!
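
A long missing-keys tuple like the one above is easier to read once it is grouped by layer. The following is a minimal sketch for doing that with the standard library; the helper name `group_missing_keys` and the regular expression are assumptions written for this key layout, and the sample list contains only a few keys copied from the log.

```python
import re
from collections import defaultdict

# Matches keys shaped like "...layers.<N>.self_attn.attn.<name>".
KEY_RE = re.compile(r"layers\.(\d+)\.self_attn\.attn\.(\w+)$")


def group_missing_keys(keys: list[str]) -> dict[int, set[str]]:
    """Map layer index -> set of missing attribute names."""
    by_layer: dict[int, set[str]] = defaultdict(set)
    for key in keys:
        match = KEY_RE.search(key)
        if match:
            by_layer[int(match.group(1))].add(match.group(2))
    return dict(by_layer)


sample = [
    "language_model.model.layers.3.self_attn.attn.q_scale",
    "language_model.model.layers.3.self_attn.attn.k_scale",
    "language_model.model.layers.3.self_attn.attn.v_scale",
    "language_model.model.layers.3.self_attn.attn.prob_scale",
    "language_model.model.layers.7.self_attn.attn.q_scale",
    "language_model.model.layers.7.self_attn.attn.k_scale",
]

grouped = group_missing_keys(sample)
for layer in sorted(grouped):
    print(layer, sorted(grouped[layer]))
```

Feeding the full tuple from the log through the same function shows at a glance which layers and which scale names are affected.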

---

# Save (succeeds)
  python save_sharded_state.py \
      --model Qwen/Qwen3.5-397B-A17B-FP8 \
      --tensor-parallel-size 8 \
      --output /path/to/sharded

---

# Load (fails)
  vllm serve /path/to/sharded \
      --load-format sharded_state \
      --tensor-parallel-size 8

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : version 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 | packaged by conda-forge | (main, Mar  5 2026, 16:50:00) [GCC 14.3.0] (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version        : 580.126.09
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.20.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.20.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.20.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.20.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.20.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.20.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.20.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.20.0
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn.so.9.7.1
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn_adv.so.9.7.1
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn_cnn.so.9.7.1
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.7.1
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.7.1
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn_graph.so.9.7.1
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.7.1
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn_ops.so.9.7.1
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 7R13 Processor
CPU family:                              25
Model:                                   1
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                1
BogoMIPS:                                5299.99
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               3 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                48 MiB (96 instances)
L3 cache:                                384 MiB (12 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-47,96-143
NUMA node1 CPU(s):                       48-95,144-191
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Mitigation; Clear CPU buffers
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.5.0.dev0
[pip3] nvidia-cutlass-dsl-libs-base==4.5.0.dev0
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.5.3
[pip3] triton==3.6.0
[conda] flashinfer-python                           0.6.6            pypi_0
[conda] numpy                                       2.2.6            pypi_0
[conda] nvidia-cublas                               13.1.0.3         pypi_0
[conda] nvidia-cublas-cu12                          12.8.4.1         pypi_0
[conda] nvidia-cuda-cupti                           13.0.85          pypi_0
[conda] nvidia-cuda-cupti-cu12                      12.8.90          pypi_0
[conda] nvidia-cuda-nvrtc                           13.0.88          pypi_0
[conda] nvidia-cuda-nvrtc-cu12                      12.8.93          pypi_0
[conda] nvidia-cuda-runtime                         13.0.96          pypi_0
[conda] nvidia-cuda-runtime-cu12                    12.8.90          pypi_0
[conda] nvidia-cudnn-cu12                           9.10.2.21        pypi_0
[conda] nvidia-cudnn-cu13                           9.19.0.56        pypi_0
[conda] nvidia-cudnn-frontend                       1.18.0           pypi_0
[conda] nvidia-cufft                                12.0.0.61        pypi_0
[conda] nvidia-cufft-cu12                           11.3.3.83        pypi_0
[conda] nvidia-cufile                               1.15.1.6         pypi_0
[conda] nvidia-cufile-cu12                          1.13.1.3         pypi_0
[conda] nvidia-curand                               10.4.0.35        pypi_0
[conda] nvidia-curand-cu12                          10.3.9.90        pypi_0
[conda] nvidia-cusolver                             12.0.4.66        pypi_0
[conda] nvidia-cusolver-cu12                        11.7.3.90        pypi_0
[conda] nvidia-cusparse                             12.6.3.3         pypi_0
[conda] nvidia-cusparse-cu12                        12.5.8.93        pypi_0
[conda] nvidia-cusparselt-cu12                      0.7.1            pypi_0
[conda] nvidia-cusparselt-cu13                      0.8.0            pypi_0
[conda] nvidia-cutlass-dsl                          4.5.0.dev0       pypi_0
[conda] nvidia-cutlass-dsl-libs-base                4.5.0.dev0       pypi_0
[conda] nvidia-nccl-cu12                            2.27.5           pypi_0
[conda] nvidia-nccl-cu13                            2.28.9           pypi_0
[conda] nvidia-nvjitlink                            13.0.88          pypi_0
[conda] nvidia-nvjitlink-cu12                       12.8.93          pypi_0
[conda] nvidia-nvshmem-cu12                         3.4.5            pypi_0
[conda] nvidia-nvshmem-cu13                         3.4.5            pypi_0
[conda] nvidia-nvtx                                 13.0.85          pypi_0
[conda] nvidia-nvtx-cu12                            12.8.90          pypi_0
[conda] pyzmq                                       27.1.0           py312hda471dd_2
[conda] torch                                       2.10.0           pypi_0
[conda] torch-c-dlpack-ext                          0.1.5            pypi_0
[conda] torchaudio                                  2.10.0           pypi_0
[conda] torchvision                                 0.25.0           pypi_0
[conda] transformers                                5.5.3            pypi_0
[conda] triton                                      3.6.0            pypi_0

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1
vLLM Build Flags:
  CUDA Archs: 7.5;8.0;8.6;8.9;9.0;10.0;10.3;12.0;12.1+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	0-7,96-103	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	0-7,96-103	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	0-7,96-103	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	0-7,96-103	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	48-55,144-151	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	48-55,144-151	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	48-55,144-151	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	48-55,144-151	1		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

</details>

🐛 Describe the bug

Environment

vLLM v0.19.1
Model: Qwen/Qwen3.5-397B-A17B-FP8 (FP8 quantized, TP=8)
Platform: 8x H100

What happened

Saved a sharded checkpoint using the official save_sharded_state.py example. Loading with --load-format sharded_state fails with:

  ValueError: Missing keys ('language_model.model.layers.3.self_attn.attn.q_scale',
  'language_model.model.layers.3.self_attn.attn.k_scale',
  'language_model.model.layers.3.self_attn.attn.v_scale',
  'language_model.model.layers.3.self_attn.attn.prob_scale', ...) in loaded state!

Full error message:

(Worker_TP0 pid=1668288) ERROR 04-28 23:13:07 [multiproc_executor.py:857]     raise ValueError(f"Missing keys {tuple(state_dict)} in loaded state!")
(Worker_TP0 pid=1668288) ERROR 04-28 23:13:07 [multiproc_executor.py:857] ValueError: Missing keys ('language_model.model.layers.3.self_attn.attn.q_scale', 'language_model.model.layers.3.self_attn.attn.k_scale', 'language_model.model.layers.3.self_attn.attn.v_scale', 'language_model.model.layers.3.self_attn.attn.prob_scale', 'language_model.model.layers.7.self_attn.attn.q_scale', 'language_model.model.layers.7.self_attn.attn.k_scale', 'language_model.model.layers.7.self_attn.attn.v_scale', 'language_model.model.layers.7.self_attn.attn.prob_scale', 'language_model.model.layers.11.self_attn.attn.q_scale', 'language_model.model.layers.11.self_attn.attn.k_scale', 'language_model.model.layers.11.self_attn.attn.v_scale', 'language_model.model.layers.11.self_attn.attn.prob_scale', 'language_model.model.layers.15.self_attn.attn.q_scale', 'language_model.model.layers.15.self_attn.attn.k_scale', 'language_model.model.layers.15.self_attn.attn.v_scale', 'language_model.model.layers.15.self_attn.attn.prob_scale', 'language_model.model.layers.19.self_attn.attn.q_scale', 'language_model.model.layers.19.self_attn.attn.k_scale', 'language_model.model.layers.19.self_attn.attn.v_scale', 'language_model.model.layers.19.self_attn.attn.prob_scale', 'language_model.model.layers.23.self_attn.attn.q_scale', 'language_model.model.layers.23.self_attn.attn.k_scale', 'language_model.model.layers.23.self_attn.attn.v_scale', 'language_model.model.layers.23.self_attn.attn.prob_scale', 'language_model.model.layers.27.self_attn.attn.q_scale', 'language_model.model.layers.27.self_attn.attn.k_scale', 'language_model.model.layers.27.self_attn.attn.v_scale', 'language_model.model.layers.27.self_attn.attn.prob_scale', 'language_model.model.layers.31.self_attn.attn.q_scale', 'language_model.model.layers.31.self_attn.attn.k_scale', 'language_model.model.layers.31.self_attn.attn.v_scale', 'language_model.model.layers.31.self_attn.attn.prob_scale', 'language_model.model.layers.35.self_attn.attn.q_scale', 
'language_model.model.layers.35.self_attn.attn.k_scale', 'language_model.model.layers.35.self_attn.attn.v_scale', 'language_model.model.layers.35.self_attn.attn.prob_scale', 'language_model.model.layers.39.self_attn.attn.q_scale', 'language_model.model.layers.39.self_attn.attn.k_scale', 'language_model.model.layers.39.self_attn.attn.v_scale', 'language_model.model.layers.39.self_attn.attn.prob_scale', 'language_model.model.layers.43.self_attn.attn.q_scale', 'language_model.model.layers.43.self_attn.attn.k_scale', 'language_model.model.layers.43.self_attn.attn.v_scale', 'language_model.model.layers.43.self_attn.attn.prob_scale', 'language_model.model.layers.47.self_attn.attn.q_scale', 'language_model.model.layers.47.self_attn.attn.k_scale', 'language_model.model.layers.47.self_attn.attn.v_scale', 'language_model.model.layers.47.self_attn.attn.prob_scale', 'language_model.model.layers.51.self_attn.attn.q_scale', 'language_model.model.layers.51.self_attn.attn.k_scale', 'language_model.model.layers.51.self_attn.attn.v_scale', 'language_model.model.layers.51.self_attn.attn.prob_scale', 'language_model.model.layers.55.self_attn.attn.q_scale', 'language_model.model.layers.55.self_attn.attn.k_scale', 'language_model.model.layers.55.self_attn.attn.v_scale', 'language_model.model.layers.55.self_attn.attn.prob_scale', 'language_model.model.layers.59.self_attn.attn.q_scale', 'language_model.model.layers.59.self_attn.attn.k_scale', 'language_model.model.layers.59.self_attn.attn.v_scale', 'language_model.model.layers.59.self_attn.attn.prob_scale') in loaded state!

Root cause

For FP8 models, each attention layer has two sets of scale attributes:

_q_scale, _k_scale, _v_scale, _prob_scale: buffers registered with register_buffer() in set_default_quant_scales() (attention.py:92-95)
q_scale, k_scale, v_scale, prob_scale: nn.Parameter objects created by BaseKVCacheMethod.create_weights() (kv_cache.py:41-44)

Each buffer and its matching Parameter share the same underlying tensor storage. During save, ShardedStateLoader._filter_subtensors() deduplicates tensors that share memory and keeps only one key per storage (the _q_scale variant). During load, model.state_dict() contains both sets of keys, but the checkpoint only has the underscore versions, so the non-underscore Parameters are never matched and the strict missing-keys check raises.
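
The failure mode can be reproduced in miniature without vLLM or PyTorch. The sketch below is a simplified model, not vLLM's code: FakeTensor, filter_aliases, and strict_load are hypothetical names, the identity of a shared list stands in for a shared data pointer, and keeping the lexicographically smaller key is an assumption chosen to match the report (the underscore key survives).

```python
from dataclasses import dataclass, field


@dataclass
class FakeTensor:
    """Stand-in for a tensor: `storage` is a list shared between aliases."""
    storage: list = field(default_factory=lambda: [1.0])


def filter_aliases(state: dict) -> dict:
    """Keep one key per storage (assumption: the smallest key wins)."""
    kept = {}  # id(storage) -> key that is written to the checkpoint
    for key in sorted(state):
        kept.setdefault(id(state[key].storage), key)
    return {key: state[key] for key in kept.values()}


def strict_load(model_state: dict, checkpoint: dict) -> None:
    """Raise if the model expects a key the checkpoint does not contain."""
    missing = tuple(k for k in model_state if k not in checkpoint)
    if missing:
        raise ValueError(f"Missing keys {missing} in loaded state!")


shared = [1.0]
model_state = {
    "attn._q_scale": FakeTensor(shared),  # plays the role of the buffer
    "attn.q_scale": FakeTensor(shared),   # plays the role of the aliasing Parameter
}
checkpoint = filter_aliases(model_state)  # only "attn._q_scale" survives

try:
    strict_load(model_state, checkpoint)
    error = None
except ValueError as exc:
    error = str(exc)
```

Here `error` holds a message of the same form as the one in the log: the model still lists both names, while the deduplicated checkpoint carries only one of them.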

To reproduce:

  # Save (succeeds)
  python save_sharded_state.py \
      --model Qwen/Qwen3.5-397B-A17B-FP8 \
      --tensor-parallel-size 8 \
      --output /path/to/sharded

  # Load (fails)
  vllm serve /path/to/sharded \
      --load-format sharded_state \
      --tensor-parallel-size 8

extent analysis

TL;DR

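In FP8 models each attention scale is exposed under two state_dict keys (a buffer such as _q_scale and a Parameter such as q_scale) backed by the same storage. ShardedStateLoader._filter_subtensors() writes only one key per storage when saving, so the strict key check on load reports the other key as missing.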

Guidance

The root cause stated in the report is aliasing: _q_scale, _k_scale, _v_scale, _prob_scale and q_scale, k_scale, v_scale, prob_scale share the same underlying tensor storage.
Because of the deduplication at save time, the shards contain only the underscore keys, while model.state_dict() at load time lists both sets.
A fix has to reconcile the two sides. Options include keeping both aliased keys when saving, treating a key as satisfied on load when a tensor sharing its storage was loaded, or removing or renaming one set of attributes in the model code.
model.state_dict() already returns both sets of keys, so the mismatch lies between what the save step wrote and what the load step expects, not in state_dict() itself.

Example

The issue text does not include a patch. A fix requires a change inside vLLM, either in the sharded-state loader or in the code that registers the two sets of scales (attention.py and kv_cache.py).

Notes

The report covers an FP8 checkpoint (Qwen/Qwen3.5-397B-A17B-FP8, TP=8, 8x H100) on vLLM v0.19.1; models whose layers do not carry these aliased scale attributes may not be affected.
Every missing key in the log belongs to a self_attn.attn module (layers 3, 7, 11, ..., 59).

Recommendation

PR #41179 ("Fix sharded_state load for FP8 models with aliased scale keys") targets this bug, so first check whether your vLLM build includes it. If it does not, a local patch to the sharded-state load path that accepts a key when its aliased counterpart is present in the checkpoint should avoid the error without re-saving the shards.
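
One way to make a strict load tolerant of aliased keys is sketched below. It is only an illustration under stated assumptions, not the change made in PR #41179: find_aliases and load_with_aliases are hypothetical helpers, and plain Python lists stand in for tensors that share storage.

```python
def find_aliases(model_state: dict) -> dict:
    """Map each key to the smallest key that shares its storage."""
    canonical = {}  # id(storage) -> canonical key
    alias_of = {}
    for key in sorted(model_state):
        alias_of[key] = canonical.setdefault(id(model_state[key]), key)
    return alias_of


def load_with_aliases(model_state: dict, checkpoint: dict) -> None:
    """Copy checkpoint values in place; fall back to a key's alias if needed."""
    alias_of = find_aliases(model_state)
    missing = []
    for key, target in model_state.items():
        source = key if key in checkpoint else alias_of[key]
        if source not in checkpoint:
            missing.append(key)  # neither the key nor its alias was saved
            continue
        target[:] = checkpoint[source]  # in-place write, visible to all aliases
    if missing:
        raise ValueError(f"Missing keys {tuple(missing)} in loaded state!")


shared = [0.0]  # one storage, two names
model_state = {"attn._q_scale": shared, "attn.q_scale": shared}
checkpoint = {"attn._q_scale": [0.5]}  # what a deduplicating save wrote

load_with_aliases(model_state, checkpoint)
```

After the call both names read the loaded value because they still point at the same storage, and a key that has no saved alias at all continues to raise the missing-keys error.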

PR fix notes

PR #41179: Fix sharded_state load for FP8 models with aliased scale keys
