vllm - ✅(Solved) Fix [Bug]: runai_streamer loads both Ministral consolidated and HF sharded safetensors [1 pull requests, 1 comments, 1 participants]

vllm · 2026-04-24 03:54:35


GitHub stats

vllm-project/vllm#40765•Fetched 2026-04-24 10:36:22


Author

dhayanesh

Participants

dhayanesh


Error Message

KeyError: 'layers.23.self_attn.qkv_proj.weight_scale_inv'

Root Cause

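The Run:ai streamer loader lists and streams every *.safetensors file it finds. For Mistral-format repos that publish both consolidated*.safetensors and HF-style model-*.safetensors shards, this mixes two checkpoint layouts and fails during weight loading. Repos with only consolidated Mistral-format safetensors, for example mistralai/Pixtral-12B-2409, are not affected because there are no HF-style shards to mix in.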

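Fix / Workaround

PR #40774 (open, not merged when this page was fetched) changes RunaiModelStreamerLoader to select only consolidated*.safetensors for Mistral-format weights, matching the existing behavior of the default loader.

The CPU section of the reporter's collect_env.py output follows.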

============================== CPU Info

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 6
CPU(s) scaling MHz: 34%
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 2.3 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 60 MiB (48 instances)
L3 cache: 72 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

PR fix notes

PR #40774: [Core] Fix Run:ai streamer loading for Mistral-format safetensors

Repository: vllm-project/vllm
Author: dhayanesh
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40774

Description (problem / solution / changelog)

Purpose

Fixes #40765.

runai_streamer currently streams every *.safetensors file it finds. For Mistral-format checkpoints that publish both consolidated*.safetensors and Hugging Face sharded model-*.safetensors, this mixes two checkpoint layouts and can fail during weight loading.

This change makes RunaiModelStreamerLoader select consolidated*.safetensors for Mistral-format weights, matching the existing behavior in the default loader. For object-storage paths, the detection requires both a Mistral config marker (params.json) and consolidated weights, so an unrelated consolidated.safetensors file does not accidentally exclude HF shards.
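The selection rule described above can be sketched roughly as follows. This is an illustration, not the code from the PR: the function name select_safetensors and the idea of passing a flat list of object keys are assumptions made for the example.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def select_safetensors(keys: list[str]) -> list[str]:
    """Pick which safetensors files to stream from an object-storage listing.

    Hypothetical helper: the name and signature are not taken from vLLM.
    """
    names = {key: basename(key) for key in keys}
    all_safetensors = [k for k, n in names.items() if fnmatchcase(n, "*.safetensors")]
    consolidated = [
        k for k, n in names.items() if fnmatchcase(n, "consolidated*.safetensors")
    ]
    has_params = any(n == "params.json" for n in names.values())

    # Treat the location as Mistral-format only when BOTH markers are present,
    # so a stray consolidated.safetensors does not hide the HF shards.
    if has_params and consolidated:
        return sorted(consolidated)
    return sorted(all_safetensors)
```

With a listing that contains params.json, consolidated.safetensors, and HF shards, only the consolidated file is returned. Without params.json, every *.safetensors file is returned, which is the regression case the test plan mentions.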

Test Plan

Added unit tests for local Mistral-format safetensors selection.
Added unit tests for object-storage Mistral-format safetensors selection.
Added a regression test that object-storage paths without params.json keep normal *.safetensors behavior.

Test Result

python -m py_compile \
  vllm/model_executor/model_loader/runai_streamer_loader.py \
  tests/model_executor/model_loader/runai_streamer_loader/test_runai_model_streamer_loader.py

passed

git diff --check

passed

Attempted focused pytest:

python -m pytest -q \
  tests/model_executor/model_loader/runai_streamer_loader/test_runai_model_streamer_loader.py \
  -k "mistral_consolidated or without_mistral_config"

Blocked in this local source checkout because vLLM was not installed/built:
ModuleNotFoundError: No module named 'vllm._C'

Validated manually in a pip-installed vLLM 0.19.1 environment with the same patch:

Ministral-3 with load_format="runai_streamer" selected only
consolidated.safetensors, loaded 1145 tensors instead of 2290, and inference
completed successfully without config_format="mistral" or tokenizer_mode="mistral".

Changed files

tests/model_executor/model_loader/runai_streamer_loader/test_runai_model_streamer_loader.py (modified, +94/-0)
vllm/model_executor/model_loader/runai_streamer_loader.py (modified, +94/-11)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

============================== System Info

OS : Ubuntu 24.04.3 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : Could not collect
CMake version : version 3.28.3
Libc version : glibc-2.39

============================== PyTorch Info

PyTorch version : 2.10.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A
XPU used to build PyTorch : N/A

============================== Python Environment

Python version : 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-6.8.0-49-generic-x86_64-with-glibc2.39

============================== CUDA / GPU Info

Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to :
GPU models and configuration : GPU 0: NVIDIA A40
Nvidia driver version : 570.195.03
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.8.0
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True

============================== CPU Info

============================== Versions of relevant libraries

[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.5.0.dev0
[pip3] nvidia-cutlass-dsl-libs-base==4.5.0.dev0
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[conda] Could not collect

============================== vLLM Info

ROCM Version : Could not collect
vLLM Version : 0.19.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-23,48-71      0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

============================== Environment Variables

PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_runner

</details>

🐛 Describe the bug

load_format="runai_streamer" fails when loading mistralai/Ministral-3-14B-Instruct-2512 from object storage locations such as S3. The same failure occurs when loading from a local directory.

The model directory contains both Mistral-format consolidated weights and HF-style sharded weights:

consolidated.safetensors
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
params.json
config.json

The Run:ai streamer loader currently lists and streams every *.safetensors file. For this model, that mixes two different checkpoint layouts. The progress bar shows 2290 tensors, which is exactly double the expected 1145 tensors from the consolidated checkpoint.
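One way to check the doubled tensor count without loading any weights is to read only the safetensors headers. The sketch below relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header keyed by tensor name). The helper names tensor_names and count_by_layout are made up for this example, and grouping files by the "consolidated" name prefix is an assumption.

```python
import json
import struct
from pathlib import Path


def tensor_names(path: Path) -> list[str]:
    """Return the tensor names listed in a safetensors header."""
    with path.open("rb") as f:
        # First 8 bytes: header size as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def count_by_layout(model_dir: Path) -> dict[str, int]:
    """Count tensors per checkpoint layout in a local model directory."""
    counts = {"consolidated": 0, "hf_shards": 0}
    for file in sorted(model_dir.glob("*.safetensors")):
        layout = "consolidated" if file.name.startswith("consolidated") else "hf_shards"
        counts[layout] += len(tensor_names(file))
    return counts
```

Calling count_by_layout(Path("./Ministral-3-14B-Instruct-2512")) on the directory from the report should show how the streamed total splits between the two layouts.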

This eventually fails during weight loading:

KeyError: 'layers.23.self_attn.qkv_proj.weight_scale_inv'

Minimal reproduction:

from vllm import LLM, SamplingParams

llm = LLM(
    model="./Ministral-3-14B-Instruct-2512",
    load_format="runai_streamer",
    max_model_len=8192,
)

outputs = llm.chat(
    [{"role": "user", "content": "Write a short explanation of what vLLM is."}],
    sampling_params=SamplingParams(temperature=0.1, max_tokens=256),
    use_tqdm=False,
)

print(outputs[0].outputs[0].text)

Observed behavior:

Loading safetensors using Runai Model Streamer:   0% Completed | 0/2290
...
KeyError: 'layers.23.self_attn.qkv_proj.weight_scale_inv'
RuntimeError: Engine core initialization failed.

Expected behavior:

runai_streamer should follow the same Mistral-format weight selection behavior as vLLM's normal loader: when a Mistral-format checkpoint is detected, stream only consolidated*.safetensors instead of all *.safetensors.

After filtering to consolidated.safetensors, the same sample loads 1145 tensors and inference succeeds without requiring users to pass config_format="mistral" or tokenizer_mode="mistral".

Scope

This is not specific to Ministral-3. It affects Mistral-format repositories or object-storage locations that contain both consolidated*.safetensors and HF-style model-*.safetensors shards.

Examples from Hugging Face repo file listings:

mistralai/Mistral-7B-v0.3
  consolidated*.safetensors: 1
  model-*.safetensors: 3

mistralai/Mistral-7B-Instruct-v0.3
  consolidated*.safetensors: 1
  model-*.safetensors: 3

mistralai/Ministral-8B-Instruct-2410
  consolidated*.safetensors: 1
  model-*.safetensors: 4

mistralai/Ministral-3-14B-Instruct-2512
  consolidated*.safetensors: 1
  model-*.safetensors: 4

mistralai/Mistral-Small-3.1-24B-Instruct-2503
  consolidated*.safetensors: 1
  model-*.safetensors: 10

mistralai/Mistral-Small-3.2-24B-Instruct-2506
  consolidated*.safetensors: 1
  model-*.safetensors: 10

mistralai/Mistral-Large-Instruct-2411
  consolidated*.safetensors: 51
  model-*.safetensors: 51

Repos with only consolidated Mistral-format safetensors, for example mistralai/Pixtral-12B-2409, do not hit the duplicate-layout issue because there are no HF-style shards to mix in.
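The per-repo counts above can be reproduced from any flat list of file names. The following is a small sketch under that assumption; count_layouts and mixes_layouts are illustrative names, and how the listing is obtained (a local directory listing, an object-store listing, or a Hugging Face repo file listing) is left to the caller.

```python
from fnmatch import fnmatchcase

# The two layouts counted in the listings above.
LAYOUT_PATTERNS = ("consolidated*.safetensors", "model-*.safetensors")


def count_layouts(filenames: list[str]) -> dict[str, int]:
    """Count how many file names match each checkpoint layout pattern."""
    return {
        pattern: sum(fnmatchcase(name, pattern) for name in filenames)
        for pattern in LAYOUT_PATTERNS
    }


def mixes_layouts(filenames: list[str]) -> bool:
    """True when both layouts are present, i.e. the case this issue describes."""
    return all(count > 0 for count in count_layouts(filenames).values())


if __name__ == "__main__":
    # File listing taken from the bug description.
    listing = [
        "consolidated.safetensors",
        "model-00001-of-00004.safetensors",
        "model-00002-of-00004.safetensors",
        "model-00003-of-00004.safetensors",
        "model-00004-of-00004.safetensors",
        "params.json",
        "config.json",
    ]
    print(count_layouts(listing))
    print(mixes_layouts(listing))
```

A location that has only consolidated weights yields a zero count for model-*.safetensors, so mixes_layouts returns False for it.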

Suspected cause

vllm/model_executor/model_loader/runai_streamer_loader.py always uses *.safetensors:

safetensors_pattern = "*.safetensors"
...
hf_weights_files = list_safetensors(path=hf_folder)

For Mistral-format repos that include both consolidated.safetensors and HF shards, this streams both checkpoint formats together.

DefaultModelLoader already has special handling for Mistral-format weights by selecting consolidated*.safetensors. RunaiModelStreamerLoader appears to need equivalent handling, including object storage paths.


extent analysis

TL;DR

The runai_streamer loader should be modified to stream only consolidated*.safetensors for Mistral-format checkpoints, instead of all *.safetensors files.

Guidance

Modify the vllm/model_executor/model_loader/runai_streamer_loader.py file to handle Mistral-format weights by selecting only consolidated*.safetensors files.
Update the safetensors_pattern variable to consolidated*.safetensors when a Mistral-format checkpoint is detected.
Ensure the RunaiModelStreamerLoader class has equivalent handling for object storage paths as the DefaultModelLoader class.
Verify that the modified loader correctly streams the consolidated weights and resolves the KeyError issue.

Example

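# Sketch of the change in runai_streamer_loader.py.
# is_mistral_format() is a hypothetical helper, not an existing vLLM function.
if is_mistral_format(hf_folder):
    # Mistral-format checkpoint: stream only the consolidated weights.
    safetensors_pattern = "consolidated*.safetensors"
else:
    # Otherwise keep the current behavior and stream every safetensors file.
    safetensors_pattern = "*.safetensors"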

Notes

This sketch assumes an is_mistral_format function exists to detect Mistral-format checkpoints. If it does not, a detection check has to be added. Per PR #40774, detection for object-storage paths requires both a Mistral config marker (params.json) and consolidated weights.
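For a local folder, the is_mistral_format helper named in the example might look like the sketch below. It is an assumption, not vLLM code: it applies the params.json plus consolidated-weights rule that the PR describes for object-storage paths, and safetensors_pattern_for is an illustrative wrapper.

```python
from pathlib import Path


def is_mistral_format(hf_folder: str) -> bool:
    """Hypothetical detector for a Mistral-format checkpoint in a local folder."""
    folder = Path(hf_folder)
    has_params = (folder / "params.json").is_file()
    has_consolidated = any(folder.glob("consolidated*.safetensors"))
    # Require both markers so an unrelated consolidated file alone
    # does not switch the loader away from the HF shards.
    return has_params and has_consolidated


def safetensors_pattern_for(hf_folder: str) -> str:
    """Choose the glob pattern the loader would stream for this folder."""
    if is_mistral_format(hf_folder):
        return "consolidated*.safetensors"
    return "*.safetensors"
```

Requiring both markers keeps plain Hugging Face checkpoints on the existing *.safetensors path.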

Recommendation

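Until PR #40774 is merged, apply the equivalent change to runai_streamer_loader.py so that Mistral-format checkpoints stream only consolidated*.safetensors. The reporter validated the same patch manually on a pip-installed vLLM 0.19.1: Ministral-3 loaded 1145 tensors instead of 2290 and inference completed without the KeyError.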
