vllm - ✅(Solved) Fix [Bug]: Certain Ranks Take a Long Time to Load Weights [1 pull request, 2 comments, 2 participants]

robertgshaw2-redhat · 2026-04-05T15:58:44Z



GitHub stats

vllm-project/vllm#39030 • Fetched 2026-04-08 02:52:56

Author: robertgshaw2-redhat

Participants: robertgshaw2-redhat, zeel2104

Timeline (top): commented ×2, labeled ×2

Fix Action

Fix / Workaround

I noticed that certain ranks sometimes take a very long time to load weights. I monkeypatched the log so that it reports the load time per rank.

PR fix notes

PR #40068: fix: stagger checkpoint file reads across ranks to reduce I/O contention

Repository: vllm-project/vllm
Author: ianliuy
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40068

Description (problem / solution / changelog)

What's broken?

In multi-rank TP/EP setups, one rank can take 7x longer to load weights than the others. A user reported Worker_TP1_EP1 taking 197s while the other three workers completed in ~27-30s on a B200 VM with 163 safetensors shards (#39030).

Additionally, per-rank load times are invisible because the timing log uses scope="local" (only local rank 0 reports), so this class of issue requires monkeypatching to diagnose.

Who is affected?

Any multi-GPU deployment (TP≥2 or EP≥2) loading large sharded checkpoints, especially on:

- Network filesystems (NFS, Lustre)
- Cloud VMs with network-attached storage
- Shared local storage under heavy concurrent read load

Single-GPU deployments and users of --load-format fastsafetensors or --load-format instanttensor are not affected.

When does it trigger?

The slowdown is non-deterministic: it depends on I/O scheduling, page cache pressure, and storage backend characteristics. It is more likely to manifest with:

- A large number of checkpoint shards (100+)
- A high TP/EP world size (4+)
- Slower or contended storage

Where is the bug?

vllm/model_executor/model_loader/weight_utils.py — safetensors_weights_iterator():

sorted_files = sorted(hf_weights_files, key=_natural_sort_key)
# All ranks iterate files in the SAME order → I/O contention
for st_file in tqdm(sorted_files, ...):
    with safe_open(st_file, framework="pt") as f:
        for name in f.keys():
            param = f.get_tensor(name)  # All ranks hit the same file simultaneously
            yield name, param

vllm/model_executor/model_loader/default_loader.py — timing log:

logger.info_once("Loading weights took %.2f seconds", ..., scope="local")
# scope="local" → only local rank 0 reports, masking per-rank slowdowns

Why does it happen?

When N ranks all read the same checkpoint files in the same sorted order, at any given moment all N processes are issuing I/O requests for the same file. This causes:

1. Storage-level contention: The storage backend (NFS server, cloud disk controller, or even the page-fault handler on local NVMe) must serve N concurrent read streams for identical data.
2. Page cache thrashing: Under memory pressure, OS page cache pages loaded by one rank may be evicted before another rank reads them.
3. Amplified variance: A rank that starts even slightly late (e.g., 4 seconds, as observed) faces the worst contention, because all other ranks have moved ahead and are competing for the same I/O bandwidth.

The prefetch mechanism (_prefetch_all_checkpoints) already distributes prefetch work across ranks, but the main loading loop does not benefit from this distribution since all ranks still read all files in the same order.
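The "same file at the same moment" argument can be illustrated with a toy model. This is a minimal sketch, assuming all ranks advance in lockstep (one file per step), which real workers only approximate; `readers_per_step` is a hypothetical helper, not part of vLLM.

```python
from collections import Counter


def readers_per_step(orders):
    """For each step, count how many ranks read the most-contended file."""
    return [max(Counter(step).values()) for step in zip(*orders)]


num_files, world_size = 163, 4

# Current behavior: every rank walks the same sorted file list.
identical_orders = [list(range(num_files)) for _ in range(world_size)]
contention = readers_per_step(identical_orders)

# Under the lockstep assumption, all 4 ranks hit the same file at every step.
print(min(contention), max(contention))
```

In this model the contention never drops below the world size, which is the pattern the staggered order is meant to break up.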

How did we fix it?

1. Stagger file reading order per rank (weight_utils.py)

Each rank rotates the sorted file list by an offset proportional to its rank:

# Before: all ranks read files [0, 1, 2, ..., 162]
# After:  rank 0 reads [0, 1, ...], rank 1 reads [40, 41, ...], etc.
if torch.distributed.is_initialized():
    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
    if world_size > 1 and len(sorted_files) > 1:
        offset = len(sorted_files) * rank // world_size
        sorted_files = sorted_files[offset:] + sorted_files[:offset]

With this rotation, each rank starts at a different position in the file list, so ranks that progress at similar speeds read different files at any given moment, which spreads I/O load across the storage backend. Since tensors are matched by name (not file position), the reading order does not affect correctness.
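As a worked example of the offset formula, the sketch below applies the same rotation to 163 shards and 4 ranks without any distributed setup. Shard indices stand in for file paths, and `rotate_for_rank` is a hypothetical standalone helper that mirrors the PR snippet rather than the actual vLLM function.

```python
def rotate_for_rank(files, rank, world_size):
    """Rotate `files` by an offset proportional to `rank` (same formula as the PR)."""
    if world_size <= 1 or len(files) <= 1:
        return list(files)
    offset = len(files) * rank // world_size
    return files[offset:] + files[:offset]


files = list(range(163))  # shard indices stand in for file paths
world_size = 4
orders = [rotate_for_rank(files, rank, world_size) for rank in range(world_size)]

start_indices = [order[0] for order in orders]
print(start_indices)  # [0, 40, 81, 122]

# Each rank still reads every shard exactly once.
assert all(sorted(order) == files for order in orders)

# If all ranks advance one file per step, no two ranks share a file at any step.
assert all(len(set(step)) == world_size for step in zip(*orders))
```

Integer division rounds the offsets down, so rank 1 starts at index 40 (163 * 1 // 4), rank 2 at 81, and rank 3 at 122. The last assertion holds only under the lockstep assumption; ranks that drift apart in speed can still overlap.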

2. Per-rank timing visibility (default_loader.py)

- Added "Starting load weights" log with scope="process" so every rank reports when it begins loading
- Changed "Loading weights took" log from scope="local" to scope="process" so every rank reports its load duration

This makes per-rank load-time variance immediately visible without monkeypatching.
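The idea of per-rank timing can be sketched with the standard library alone. This is an illustrative stand-in, not the vLLM logger: it uses Python's `logging` module instead of `logger.info_once(..., scope=...)`, and `log_load_time` is a hypothetical helper introduced only for this example.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("weight_loading_sketch")


@contextmanager
def log_load_time(rank: int):
    """Log start and duration of a weight-loading block for one rank."""
    logger.info("[rank %d] Starting to load weights", rank)
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("[rank %d] Loading weights took %.2f seconds", rank, elapsed)
```

Each process would wrap its own loading loop in `with log_load_time(rank): ...`, so a slow rank shows up in the logs without patching the loader.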

Alternatives considered:

- Rank-0-only loading + NCCL broadcast: Most efficient for I/O but very invasive, touches the entire weight loading pipeline, high risk
- Only rank 0 calls _prepare_weights(): Helps for non-local models (eliminates file-lock serialization) but doesn't address the main loading loop contention; left for a follow-up

How do we know it works?

- Existing test_ep_weight_filter.py tests pass (they use safetensors_weights_iterator without distributed init, so the stagger is correctly skipped)
- The stagger is guarded by torch.distributed.is_initialized(), so there is no behavior change in single-process mode
- ruff check passes on both modified files
- The fix is minimal (14 lines added, 1 line changed) and only affects the file iteration order, not the loading logic itself
- Note: Full multi-GPU validation requires a distributed setup; the fix is designed to be safe by construction (order-independent tensor matching)
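The "safe by construction" claim can be checked in a single process. The sketch below is a hypothetical test, assuming each shard can be modeled as a dict of tensor name to value; it does not call safetensors or vLLM, and `iterate_weights` and `load_state` are illustrative names.

```python
def iterate_weights(shards, order):
    """Yield (name, value) pairs shard by shard, in the given shard order."""
    for shard_index in order:
        for name, value in shards[shard_index].items():
            yield name, value


def load_state(shards, rank, world_size):
    """Assemble a name-keyed state using the rank's rotated shard order."""
    n = len(shards)
    offset = n * rank // world_size if world_size > 1 and n > 1 else 0
    order = list(range(offset, n)) + list(range(offset))
    return dict(iterate_weights(shards, order))


# Seven toy shards, two uniquely named "tensors" each.
shards = [
    {f"layer{i}.weight": float(i), f"layer{i}.bias": -float(i)}
    for i in range(7)
]
states = [load_state(shards, rank, 4) for rank in range(4)]

# Matching by name makes the result independent of the reading order.
assert all(state == states[0] for state in states)
```

This only covers the ordering argument; it says nothing about I/O timing, which still needs a real multi-GPU run.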

Changed files

vllm/model_executor/model_loader/default_loader.py (modified, +2/-1)
vllm/model_executor/model_loader/weight_utils.py (modified, +13/-0)

Code Example

r_TP0_EP0 pid=3425910) INFO 04-05 11:38:59 [default_loader.py:369] Starting load weights
(Worker_TP3_EP3 pid=3425913) INFO 04-05 11:38:59 [default_loader.py:369] Starting load weights
Loading safetensors checkpoint shards:   0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   1% Completed | 1/163 [00:00<00:48,  3.31it/s]
Loading safetensors checkpoint shards:   1% Completed | 2/163 [00:00<00:41,  3.88it/s]
Loading safetensors checkpoint shards:   2% Completed | 4/163 [00:00<00:35,  4.51it/s]
Loading safetensors checkpoint shards:   3% Completed | 5/163 [00:01<00:29,  5.38it/s]
Loading safetensors checkpoint shards:   4% Completed | 7/163 [00:01<00:29,  5.25it/s]
Loading safetensors checkpoint shards:   5% Completed | 8/163 [00:01<00:26,  5.93it/s]
Loading safetensors checkpoint shards:   6% Completed | 9/163 [00:01<00:27,  5.69it/s]
Loading safetensors checkpoint shards:   6% Completed | 10/163 [00:01<00:28,  5.41it/s]
Loading safetensors checkpoint shards:   7% Completed | 12/163 [00:02<00:19,  7.73it/s]
Loading safetensors checkpoint shards:   8% Completed | 13/163 [00:02<00:25,  5.81it/s]
Loading safetensors checkpoint shards:   9% Completed | 14/163 [00:02<00:23,  6.43it/s]
Loading safetensors checkpoint shards:   9% Completed | 15/163 [00:02<00:24,  6.16it/s]
Loading safetensors checkpoint shards:  10% Completed | 16/163 [00:02<00:26,  5.59it/s]
Loading safetensors checkpoint shards:  11% Completed | 18/163 [00:03<00:26,  5.41it/s]
(Worker_TP1_EP1 pid=3425911) INFO 04-05 11:39:03 [default_loader.py:369] Starting load weights
Loading safetensors checkpoint shards:  12% Completed | 19/163 [00:03<00:24,  5.88it/s]
Loading safetensors checkpoint shards:  13% Completed | 21/163 [00:03<00:25,  5.55it/s]
Loading safetensors checkpoint shards:  13% Completed | 22/163 [00:03<00:23,  6.13it/s]
Loading safetensors checkpoint shards:  14% Completed | 23/163 [00:04<00:23,  6.04it/s]
Loading safetensors checkpoint shards:  15% Completed | 24/163 [00:04<00:25,  5.49it/s]
Loading safetensors checkpoint shards:  16% Completed | 26/163 [00:04<00:25,  5.42it/s]
Loading safetensors checkpoint shards:  17% Completed | 27/163 [00:04<00:23,  5.81it/s]
Loading safetensors checkpoint shards:  18% Completed | 29/163 [00:05<00:25,  5.34it/s]
Loading safetensors checkpoint shards:  18% Completed | 30/163 [00:05<00:22,  5.88it/s]
Loading safetensors checkpoint shards:  19% Completed | 31/163 [00:05<00:22,  5.86it/s]
Loading safetensors checkpoint shards:  20% Completed | 32/163 [00:05<00:24,  5.25it/s]
Loading safetensors checkpoint shards:  21% Completed | 34/163 [00:05<00:17,  7.30it/s]
Loading safetensors checkpoint shards:  21% Completed | 35/163 [00:06<00:22,  5.64it/s]
Loading safetensors checkpoint shards:  22% Completed | 36/163 [00:06<00:20,  6.27it/s]
Loading safetensors checkpoint shards:  23% Completed | 37/163 [00:06<00:20,  6.07it/s]
Loading safetensors checkpoint shards:  23% Completed | 38/163 [00:06<00:22,  5.61it/s]
Loading safetensors checkpoint shards:  25% Completed | 40/163 [00:07<00:22,  5.49it/s]
Loading safetensors checkpoint shards:  25% Completed | 41/163 [00:07<00:20,  6.03it/s]
Loading safetensors checkpoint shards:  26% Completed | 43/163 [00:07<00:21,  5.63it/s]
Loading safetensors checkpoint shards:  28% Completed | 45/163 [00:07<00:19,  6.20it/s]
Loading safetensors checkpoint shards:  28% Completed | 46/163 [00:08<00:20,  5.75it/s]
Loading safetensors checkpoint shards:  29% Completed | 48/163 [00:08<00:20,  5.64it/s]
Loading safetensors checkpoint shards:  30% Completed | 49/163 [00:08<00:18,  6.03it/s]
Loading safetensors checkpoint shards:  31% Completed | 51/163 [00:08<00:20,  5.59it/s]
Loading safetensors checkpoint shards:  33% Completed | 53/163 [00:09<00:17,  6.14it/s]
Loading safetensors checkpoint shards:  33% Completed | 54/163 [00:09<00:19,  5.62it/s]
Loading safetensors checkpoint shards:  34% Completed | 56/163 [00:09<00:14,  7.35it/s]
Loading safetensors checkpoint shards:  35% Completed | 57/163 [00:09<00:18,  5.86it/s]
Loading safetensors checkpoint shards:  36% Completed | 59/163 [00:10<00:16,  6.31it/s]
Loading safetensors checkpoint shards:  37% Completed | 60/163 [00:10<00:17,  5.78it/s]
Loading safetensors checkpoint shards:  38% Completed | 62/163 [00:10<00:18,  5.48it/s]
Loading safetensors checkpoint shards:  39% Completed | 63/163 [00:10<00:16,  5.93it/s]
Loading safetensors checkpoint shards:  40% Completed | 65/163 [00:11<00:17,  5.48it/s]
Loading safetensors checkpoint shards:  41% Completed | 67/163 [00:11<00:16,  5.99it/s]
Loading safetensors checkpoint shards:  42% Completed | 68/163 [00:11<00:17,  5.52it/s]
Loading safetensors checkpoint shards:  43% Completed | 70/163 [00:12<00:16,  5.51it/s]
Loading safetensors checkpoint shards:  44% Completed | 71/163 [00:12<00:15,  5.91it/s]
Loading safetensors checkpoint shards:  45% Completed | 73/163 [00:12<00:15,  5.63it/s]
Loading safetensors checkpoint shards:  46% Completed | 75/163 [00:12<00:14,  6.23it/s]
Loading safetensors checkpoint shards:  47% Completed | 76/163 [00:13<00:15,  5.73it/s]
Loading safetensors checkpoint shards:  48% Completed | 78/163 [00:13<00:11,  7.47it/s]
Loading safetensors checkpoint shards:  48% Completed | 79/163 [00:13<00:14,  5.71it/s]
Loading safetensors checkpoint shards:  50% Completed | 81/163 [00:13<00:13,  6.20it/s]
Loading safetensors checkpoint shards:  50% Completed | 82/163 [00:14<00:13,  5.81it/s]
Loading safetensors checkpoint shards:  52% Completed | 84/163 [00:14<00:14,  5.62it/s]
Loading safetensors checkpoint shards:  52% Completed | 85/163 [00:14<00:12,  6.09it/s]
Loading safetensors checkpoint shards:  53% Completed | 87/163 [00:14<00:13,  5.72it/s]
Loading safetensors checkpoint shards:  55% Completed | 89/163 [00:15<00:11,  6.25it/s]
Loading safetensors checkpoint shards:  55% Completed | 90/163 [00:15<00:12,  5.79it/s]
Loading safetensors checkpoint shards:  56% Completed | 92/163 [00:15<00:12,  5.67it/s]
Loading safetensors checkpoint shards:  57% Completed | 93/163 [00:15<00:11,  6.09it/s]
Loading safetensors checkpoint shards:  58% Completed | 95/163 [00:16<00:11,  5.74it/s]
Loading safetensors checkpoint shards:  60% Completed | 97/163 [00:16<00:10,  6.35it/s]
Loading safetensors checkpoint shards:  60% Completed | 98/163 [00:16<00:11,  5.83it/s]
Loading safetensors checkpoint shards:  61% Completed | 100/163 [00:16<00:08,  7.64it/s]
Loading safetensors checkpoint shards:  62% Completed | 101/163 [00:17<00:10,  6.01it/s]
Loading safetensors checkpoint shards:  63% Completed | 103/163 [00:17<00:09,  6.47it/s]
Loading safetensors checkpoint shards:  64% Completed | 104/163 [00:17<00:09,  6.01it/s]
Loading safetensors checkpoint shards:  65% Completed | 106/163 [00:18<00:09,  5.77it/s]
Loading safetensors checkpoint shards:  66% Completed | 107/163 [00:18<00:08,  6.26it/s]
Loading safetensors checkpoint shards:  67% Completed | 109/163 [00:18<00:09,  5.84it/s]
Loading safetensors checkpoint shards:  68% Completed | 111/163 [00:18<00:08,  6.37it/s]
Loading safetensors checkpoint shards:  69% Completed | 112/163 [00:19<00:08,  5.91it/s]
Loading safetensors checkpoint shards:  70% Completed | 114/163 [00:19<00:08,  5.77it/s]
Loading safetensors checkpoint shards:  71% Completed | 115/163 [00:19<00:07,  6.17it/s]
Loading safetensors checkpoint shards:  72% Completed | 117/163 [00:19<00:07,  5.76it/s]
Loading safetensors checkpoint shards:  73% Completed | 119/163 [00:20<00:06,  6.37it/s]
Loading safetensors checkpoint shards:  74% Completed | 120/163 [00:20<00:07,  5.85it/s]
Loading safetensors checkpoint shards:  75% Completed | 122/163 [00:20<00:05,  7.66it/s]
Loading safetensors checkpoint shards:  75% Completed | 123/163 [00:20<00:06,  5.99it/s]
Loading safetensors checkpoint shards:  77% Completed | 125/163 [00:21<00:05,  6.43it/s]
Loading safetensors checkpoint shards:  77% Completed | 126/163 [00:21<00:06,  6.00it/s]
Loading safetensors checkpoint shards:  79% Completed | 128/163 [00:21<00:06,  5.76it/s]
Loading safetensors checkpoint shards:  79% Completed | 129/163 [00:21<00:05,  6.26it/s]
Loading safetensors checkpoint shards:  80% Completed | 131/163 [00:22<00:05,  5.79it/s]
Loading safetensors checkpoint shards:  82% Completed | 133/163 [00:22<00:04,  6.26it/s]
Loading safetensors checkpoint shards:  82% Completed | 134/163 [00:22<00:05,  5.72it/s]
Loading safetensors checkpoint shards:  83% Completed | 136/163 [00:23<00:04,  5.59it/s]
Loading safetensors checkpoint shards:  84% Completed | 137/163 [00:23<00:04,  6.01it/s]
Loading safetensors checkpoint shards:  85% Completed | 139/163 [00:23<00:04,  5.70it/s]
Loading safetensors checkpoint shards:  87% Completed | 141/163 [00:23<00:03,  7.12it/s]
Loading safetensors checkpoint shards:  87% Completed | 142/163 [00:24<00:03,  5.79it/s]
Loading safetensors checkpoint shards:  88% Completed | 144/163 [00:24<00:03,  6.29it/s]
Loading safetensors checkpoint shards:  89% Completed | 145/163 [00:24<00:03,  5.92it/s]
Loading safetensors checkpoint shards:  90% Completed | 147/163 [00:24<00:02,  5.72it/s]
Loading safetensors checkpoint shards:  91% Completed | 148/163 [00:24<00:02,  6.22it/s]
Loading safetensors checkpoint shards:  92% Completed | 150/163 [00:25<00:02,  5.81it/s]
Loading safetensors checkpoint shards:  93% Completed | 152/163 [00:25<00:01,  6.37it/s]
Loading safetensors checkpoint shards:  94% Completed | 153/163 [00:25<00:01,  5.91it/s]
Loading safetensors checkpoint shards:  95% Completed | 155/163 [00:26<00:01,  5.67it/s]
Loading safetensors checkpoint shards:  96% Completed | 156/163 [00:26<00:01,  6.09it/s]
Loading safetensors checkpoint shards:  97% Completed | 158/163 [00:26<00:00,  5.74it/s]
(Worker_TP2_EP2 pid=3425912) INFO 04-05 11:39:27 [default_loader.py:385] Loading weights took 27.12 seconds
Loading safetensors checkpoint shards:  98% Completed | 160/163 [00:26<00:00,  6.70it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:26<00:00,  6.06it/s]
(Worker_TP0_EP0 pid=3425910) 
(Worker_TP0_EP0 pid=3425910) INFO 04-05 11:39:27 [default_loader.py:385] Loading weights took 26.92 seconds
(Worker_TP0_EP0 pid=3425910) WARNING 04-05 11:39:27 [kv_cache.py:109] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(Worker_TP0_EP0 pid=3425910) WARNING 04-05 11:39:27 [kv_cache.py:123] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(Worker_TP0_EP0 pid=3425910) INFO 04-05 11:39:27 [nvfp4.py:404] Using MoEPrepareAndFinalizeNoDPEPMonolithic
(Worker_TP3_EP3 pid=3425913) INFO 04-05 11:39:30 [default_loader.py:385] Loading weights took 30.42 seconds
(Worker_TP0_EP0 pid=3425910) INFO 04-05 11:39:30 [gpu_model_runner.py:4820] Model loading took 91.67 GiB memory and 31.993121 seconds
(Worker_TP0_EP0 pid=3425910) INFO 04-05 11:39:30 [interface.py:484] Setting kv cache block size to 32 for FLASHINFER_MLA backend.
(Worker_TP2_EP2 pid=3425912) INFO 04-05 11:39:31 [interface.py:484] Setting kv cache block size to 32 for FLASHINFER_MLA backend.
(Worker_TP3_EP3 pid=3425913) INFO 04-05 11:39:34 [interface.py:484] Setting kv cache block size to 32 for FLASHINFER_MLA backend.
(Worker_TP1_EP1 pid=3425911) INFO 04-05 11:42:21 [default_loader.py:385] Loading weights took 197.43 seconds
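
To quantify the gap from a log like this, the per-rank durations can be extracted with a small script. This is a minimal sketch, assuming the log line format shown above; `load_times` is a hypothetical helper and the sample contains only the four "Loading weights took" lines copied from this log.

```python
import re

# Matches e.g. "(Worker_TP1_EP1 pid=3425911) ... Loading weights took 197.43 seconds"
LOAD_TIME_RE = re.compile(
    r"\((Worker_\w+) pid=\d+\).*Loading weights took ([\d.]+) seconds"
)


def load_times(log_text):
    """Map each worker name to its reported weight-loading time in seconds."""
    return {m.group(1): float(m.group(2)) for m in LOAD_TIME_RE.finditer(log_text)}


sample = """\
(Worker_TP2_EP2 pid=3425912) INFO 04-05 11:39:27 [default_loader.py:385] Loading weights took 27.12 seconds
(Worker_TP0_EP0 pid=3425910) INFO 04-05 11:39:27 [default_loader.py:385] Loading weights took 26.92 seconds
(Worker_TP3_EP3 pid=3425913) INFO 04-05 11:39:30 [default_loader.py:385] Loading weights took 30.42 seconds
(Worker_TP1_EP1 pid=3425911) INFO 04-05 11:42:21 [default_loader.py:385] Loading weights took 197.43 seconds
"""

times = load_times(sample)
slowest = max(times, key=times.get)
fastest = min(times, key=times.get)
ratio = times[slowest] / times[fastest]
print(f"{slowest} took {ratio:.1f}x as long as {fastest}")
```

For the four lines above, the slowest rank is Worker_TP1_EP1 and the ratio against the fastest rank is roughly 7.3.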


Your current environment

B200 VM

🐛 Describe the bug

I noticed that certain ranks sometimes take a very long time to load weights. I monkeypatched the log so that it reports the load time per rank (output shown under Code Example above).

(Worker_TP1_EP1 pid=3425911) INFO 04-05 11:42:21 [default_loader.py:385] Loading weights took 197.43 seconds

I don't know why this is happening.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

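One rank (Worker_TP1_EP1) took 197.43 seconds to load weights while the other three finished in roughly 27 to 30 seconds. PR #40068 attributes this to I/O contention: all ranks read the same 163 safetensors shards in the same order, so they compete for the same file at the same time.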

Guidance

Monitor disk and network I/O, page cache pressure, CPU, and memory during loading to find the bottleneck, especially when the checkpoint sits on network-attached or shared storage.
Compare the "Loading weights took" times reported by each worker to see whether the slowdown follows a specific rank or moves between runs.
Check that the checkpoint shards are complete and uncorrupted, so that a bad file can be ruled out as the cause.
Consider changing how the checkpoint is loaded (for example, the order in which ranks read shards) or moving it to faster storage.
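To make the per-rank comparison concrete, here is a minimal sketch that extracts the "Loading weights took" lines from a log in the format shown above and flags workers that are much slower than the median. The helper names `load_times` and `slow_workers` and the 2x threshold are assumptions made for illustration; they are not part of vLLM.

```python
import re
from statistics import median

# Matches log lines in the format shown in the report, e.g.
# (Worker_TP1_EP1 pid=3425911) INFO 04-05 11:42:21 [default_loader.py:385] Loading weights took 197.43 seconds
LOAD_RE = re.compile(
    r"\((?P<worker>Worker_\w+) pid=\d+\).*Loading weights took (?P<secs>[\d.]+) seconds"
)


def load_times(log_text):
    """Return {worker_name: seconds} for every 'Loading weights took' line."""
    times = {}
    for line in log_text.splitlines():
        match = LOAD_RE.search(line)
        if match:
            times[match.group("worker")] = float(match.group("secs"))
    return times


def slow_workers(times, factor=2.0):
    """Return workers whose load time exceeds factor * median (arbitrary threshold)."""
    if not times:
        return []
    cutoff = factor * median(times.values())
    return sorted(worker for worker, secs in times.items() if secs > cutoff)


# The four timing lines from the report above.
sample = """\
(Worker_TP2_EP2 pid=3425912) INFO 04-05 11:39:27 [default_loader.py:385] Loading weights took 27.12 seconds
(Worker_TP0_EP0 pid=3425910) INFO 04-05 11:39:27 [default_loader.py:385] Loading weights took 26.92 seconds
(Worker_TP3_EP3 pid=3425913) INFO 04-05 11:39:30 [default_loader.py:385] Loading weights took 30.42 seconds
(Worker_TP1_EP1 pid=3425911) INFO 04-05 11:42:21 [default_loader.py:385] Loading weights took 197.43 seconds
"""

times = load_times(sample)
print(times)
print("slow:", slow_workers(times))
```

Running the same check over several startups would show whether the slow rank is always the same one or changes from run to run.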

Example

The issue report itself includes no reproduction code; the slowdown depends on storage and system behavior rather than on a specific snippet.

Notes

The slowdown may depend on the specific hardware and storage environment, so confirming the root cause on a given system needs further investigation. In the log above, the kv_cache.py warnings and the KV cache block size messages appear only after a worker has finished loading weights, so they are probably unrelated to the slow load.

Recommendation

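PR #40068, listed as open and not merged, proposes staggering the order in which each rank reads checkpoint files to reduce I/O contention. Until a fix is available, consider moving the checkpoint to faster or less contended storage. The PR description also states that users of `--load-format fastsafetensors/instanttensor` are not affected.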
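One way to change the checkpoint loading process, following the approach described in PR #40068, is to have each rank start reading at a different shard. The sketch below only illustrates the idea: `rotate_for_rank` is a hypothetical helper, and its offset formula is an assumption, not the code from the PR.

```python
def rotate_for_rank(files, rank, world_size):
    """Rotate an already-sorted shard list so each rank starts at a different file.

    Hypothetical helper: the offset formula is an assumption for illustration.
    Every rank still visits every file exactly once, only the order differs.
    """
    files = list(files)
    if not files or world_size <= 1:
        return files
    offset = (rank * len(files) // world_size) % len(files)
    return files[offset:] + files[:offset]


# Eight made-up shard names and four ranks.
shards = [f"model-{i:05d}.safetensors" for i in range(8)]
for rank in range(4):
    order = rotate_for_rank(shards, rank, world_size=4)
    print(rank, order[0])
```

Because the list is rotated rather than partitioned, every rank still loads the full set of shards. The point is only that the ranks are less likely to request the same file at the same moment.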

