vllm - ✅ (Solved) [Bug]: Uninitialized `PerTensorScaleParameter` slots corrupt fused-on-disk quantized models (NVFP4 / compressed-tensors) [1 pull request, 1 participant]

vllm-project/vllm#39764 (fetched 2026-04-16 06:36:50)

When a quantized checkpoint stores already-fused weights (e.g. a single qkv_proj instead of separate q_proj/k_proj/v_proj), the per-tensor scale parameters are loaded incorrectly. Only slot 0 of the scale tensor receives the checkpoint value; the remaining slots retain uninitialized (indeterminate) values from torch.empty. process_weights_after_loading then calls .max() over all slots, so an indeterminate value that happens to be larger than the true scale silently becomes the effective scale, leading to incorrect dequantization.

Root Cause

In MergedColumnParallelLinear.weight_loader_v2 and QKVParallelLinear.weight_loader_v2, when loaded_shard_id is None (fused-on-disk checkpoint), the code writes the loaded scalar scale into shard 0 only, but the parameter retains its full shape [N]. The other N−1 slots are never written.

  1. Quantization schemes such as CompressedTensorsW4A4Fp4 allocate each PerTensorScaleParameter with torch.empty(N), where N is the number of output partitions (3 for QKV, 2 for gate_up). See https://github.com/vllm-project/vllm/blob/2a3c32ce674950f94fdd447979e4621267125e41/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py#L61-L66

  2. When loaded_shard_id is None (fused checkpoint), only shard 0 is written; the other N−1 slots keep their indeterminate values.

  3. process_weights_after_loading calls .max() over all N slots, so an indeterminate value larger than the true scale silently becomes the effective scale. See https://github.com/vllm-project/vllm/blob/2a3c32ce674950f94fdd447979e4621267125e41/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py#L105-L106
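
The three steps above can be mimicked without a GPU. The following is a minimal sketch in plain Python, not vLLM code: lists stand in for tensors, and the helper names (`allocate_scale_slots`, `load_fused_scale`, `effective_scale`) and the stale values are hypothetical, chosen only to illustrate why the result depends on whatever was left in memory.

```python
# Plain-Python stand-in for the tensor logic; no vLLM or torch calls are used.
# All function names and stale values here are hypothetical.

def allocate_scale_slots(stale_values):
    """Mimic torch.empty(N): the slots start with whatever was in memory."""
    return list(stale_values)


def load_fused_scale(slots, checkpoint_scale):
    """Mimic the fused-on-disk path: only slot 0 gets the checkpoint value."""
    slots[0] = checkpoint_scale
    return slots


def effective_scale(slots):
    """Mimic the post-load reduction: take the max over all slots."""
    return max(slots)


TRUE_SCALE = 0.00119617  # example scale for a fused qkv_proj (N = 3)

# Leftover memory happens to hold small values: the result looks correct.
lucky = load_fused_scale(allocate_scale_slots([0.0, 0.0, 0.0]), TRUE_SCALE)
# Leftover memory holds a larger value in slot 2: it wins the max().
unlucky = load_fused_scale(allocate_scale_slots([0.0, 0.0, 0.5]), TRUE_SCALE)

print(effective_scale(lucky) == TRUE_SCALE)    # True
print(effective_scale(unlucky) == TRUE_SCALE)  # False
```

Because the stale contents vary between runs, the same checkpoint can yield different effective scales, which is what the comparison script in this issue detects.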

Fix Action

PR fix notes

PR #39765: [Bugfix] Properly initialize PerTensorScaleParameter for fused-on-disk checkpoints

Description (problem / solution / changelog)

Purpose

Resolves https://github.com/vllm-project/vllm/issues/39764

When a quantized checkpoint stores already-fused weights (e.g., a single qkv_proj or gate_up_proj), weight_loader_v2 writes the loaded scale into shard 0 only, leaving the remaining slots at their torch.empty indeterminate values. process_weights_after_loading then calls .max() over all slots, so a stale value larger than the true scale silently becomes the effective scale, leading to incorrect dequantization.

This PR fixes MergedColumnParallelLinear.weight_loader_v2 and QKVParallelLinear.weight_loader_v2 so that the unwritten slots of PerTensorScaleParameter.data are discarded: the tensor is narrowed to shape [1] before shard 0 is written, so .max() always returns the loaded value.
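
The narrowing idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the PR: lists stand in for tensors, slicing stands in for narrowing to shape [1], and `load_fused_scale_narrowed` and `effective_scale` are hypothetical names.

```python
# Plain-Python stand-in for the fix; not the actual change in linear.py.
# Function names and stale values are hypothetical.

def load_fused_scale_narrowed(slots, checkpoint_scale):
    """Drop the unwritten slots first, then write the checkpoint value."""
    narrowed = slots[:1]            # stand-in for narrowing to shape [1]
    narrowed[0] = checkpoint_scale  # shard 0 is the only slot left
    return narrowed


def effective_scale(slots):
    """Mimic the post-load reduction: take the max over all slots."""
    return max(slots)


TRUE_SCALE = 0.00082237

# Whatever the stale contents were, only the loaded value survives.
stale_cases = ([0.0, 0.0, 0.0], [9.9, 0.5, 7.0], [float("inf"), 1.0, 2.0])
results = [
    effective_scale(load_fused_scale_narrowed(list(stale), TRUE_SCALE))
    for stale in stale_cases
]
print(all(r == TRUE_SCALE for r in results))  # True
```

With a single slot remaining, the max reduction has nothing else to pick, so the outcome no longer depends on uninitialized memory.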

Test Plan

Run the reproduction script from https://github.com/vllm-project/vllm/issues/39764

Test Result

<details> <summary>The output of <code>python compare.py</code></summary>

% python compare.py
Running with max_model_len=2048 ...
Running with max_model_len=65536 ...

Parameter                             len=2048       len=65536    Match?
------------------------------------------------------------------------
  L0.gup.input                      0.00036337      0.00036337        OK
  L0.gup.weight                     0.00028409      0.00028409        OK
  L0.qkv.input                      0.00119617      0.00119617        OK
  L0.qkv.weight                     0.00082237      0.00082237        OK
  L1.gup.input                      0.00051440      0.00051440        OK
  L1.gup.weight                     0.00042230      0.00042230        OK
  L1.qkv.input                      0.00170068      0.00170068        OK
  L1.qkv.weight                     0.00051440      0.00051440        OK
  L10.gup.input                     0.00137363      0.00137363        OK
  L10.gup.weight                    0.00026596      0.00026596        OK
  L10.qkv.input                     0.00425532      0.00425532        OK
  L10.qkv.weight                    0.00040064      0.00040064        OK
  L11.gup.input                     0.00138122      0.00138122        OK
  L11.gup.weight                    0.00024802      0.00024802        OK
  L11.qkv.input                     0.00450450      0.00450450        OK
  L11.qkv.weight                    0.00036337      0.00036337        OK
  L12.gup.input                     0.00147929      0.00147929        OK
  L12.gup.weight                    0.00031726      0.00031726        OK
  L12.qkv.input                     0.00434783      0.00434783        OK
  L12.qkv.weight                    0.00036337      0.00036337        OK
  L13.gup.input                     0.00221239      0.00221239        OK
  L13.gup.weight                    0.00051230      0.00051230        OK
  L13.qkv.input                     0.00507614      0.00507614        OK
  L13.qkv.weight                    0.00030941      0.00030941        OK
  L14.gup.input                     0.00236967      0.00236967        OK
  L14.gup.weight                    0.00040064      0.00040064        OK
  L14.qkv.input                     0.00507614      0.00507614        OK
  L14.qkv.weight                    0.00030488      0.00030488        OK
  L15.gup.input                     0.00238095      0.00238095        OK
  L15.gup.weight                    0.00029621      0.00029621        OK
  L15.qkv.input                     0.00515464      0.00515464        OK
  L15.qkv.weight                    0.00028802      0.00028802        OK
  L16.gup.input                     0.00250000      0.00250000        OK
  L16.gup.weight                    0.00030788      0.00030788        OK
  L16.qkv.input                     0.00492611      0.00492611        OK
  L16.qkv.weight                    0.00038820      0.00038820        OK
  L17.gup.input                     0.00268817      0.00268817        OK
  L17.gup.weight                    0.00058140      0.00058140        OK
  L17.qkv.input                     0.00487805      0.00487805        OK
  L17.qkv.weight                    0.00033784      0.00033784        OK
  L18.gup.input                     0.00282486      0.00282486        OK
  L18.gup.weight                    0.00033784      0.00033784        OK
  L18.qkv.input                     0.00497512      0.00497512        OK
  L18.qkv.weight                    0.00031095      0.00031095        OK
  L19.gup.input                     0.00285714      0.00285714        OK
  L19.gup.weight                    0.00038820      0.00038820        OK
  L19.qkv.input                     0.00497512      0.00497512        OK
  L19.qkv.weight                    0.00029621      0.00029621        OK
  L2.gup.input                      0.00303030      0.00303030        OK
  L2.gup.weight                     0.00056054      0.00056054        OK
  L2.qkv.input                      0.00155280      0.00155280        OK
  L2.qkv.weight                     0.00047348      0.00047348        OK
  L20.gup.input                     0.00268817      0.00268817        OK
  L20.gup.weight                    0.00020833      0.00020833        OK
  L20.qkv.input                     0.00520833      0.00520833        OK
  L20.qkv.weight                    0.00031250      0.00031250        OK
  L21.gup.input                     0.00264550      0.00264550        OK
  L21.gup.weight                    0.00024225      0.00024225        OK
  L21.qkv.input                     0.00518135      0.00518135        OK
  L21.qkv.weight                    0.00030788      0.00030788        OK
  L22.gup.input                     0.00252525      0.00252525        OK
  L22.gup.weight                    0.00023148      0.00023148        OK
  L22.qkv.input                     0.00507614      0.00507614        OK
  L22.qkv.weight                    0.00034722      0.00034722        OK
  L23.gup.input                     0.00246305      0.00246305        OK
  L23.gup.weight                    0.00022810      0.00022810        OK
  L23.qkv.input                     0.00537634      0.00537634        OK
  L23.qkv.weight                    0.00035311      0.00035311        OK
  L24.gup.input                     0.00238095      0.00238095        OK
  L24.gup.weight                    0.00019654      0.00019654        OK
  L24.qkv.input                     0.00564972      0.00564972        OK
  L24.qkv.weight                    0.00035714      0.00035714        OK
  L25.gup.input                     0.00232558      0.00232558        OK
  L25.gup.weight                    0.00023674      0.00023674        OK
  L25.qkv.input                     0.00564972      0.00564972        OK
  L25.qkv.weight                    0.00038820      0.00038820        OK
  L26.gup.input                     0.00186567      0.00186567        OK
  L26.gup.weight                    0.00034341      0.00034341        OK
  L26.qkv.input                     0.00591716      0.00591716        OK
  L26.qkv.weight                    0.00033602      0.00033602        OK
  L27.gup.input                     0.00243902      0.00243902        OK
  L27.gup.weight                    0.00044643      0.00044643        OK
  L27.qkv.input                     0.00540541      0.00540541        OK
  L27.qkv.weight                    0.00033069      0.00033069        OK
  L28.gup.input                     0.00314465      0.00314465        OK
  L28.gup.weight                    0.00029904      0.00029904        OK
  L28.qkv.input                     0.00578035      0.00578035        OK
  L28.qkv.weight                    0.00029904      0.00029904        OK
  L29.gup.input                     0.00970874      0.00970874        OK
  L29.gup.weight                    0.00070621      0.00070621        OK
  L29.qkv.input                     0.00549451      0.00549451        OK
  L29.qkv.weight                    0.00031726      0.00031726        OK
  L3.gup.input                      0.00125000      0.00125000        OK
  L3.gup.weight                     0.00052966      0.00052966        OK
  L3.qkv.input                      0.00299401      0.00299401        OK
  L3.qkv.weight                     0.00083333      0.00083333        OK
  L30.gup.input                     0.00564972      0.00564972        OK
  L30.gup.weight                    0.00036337      0.00036337        OK
  L30.qkv.input                     0.00465116      0.00465116        OK
  L30.qkv.weight                    0.00029343      0.00029343        OK
  L31.gup.input                     0.00621118      0.00621118        OK
  L31.gup.weight                    0.00046992      0.00046992        OK
  L31.qkv.input                     0.00381679      0.00381679        OK
  L31.qkv.weight                    0.00047710      0.00047710        OK
  L4.gup.input                      0.00724638      0.00724638        OK
  L4.gup.weight                     0.00053191      0.00053191        OK
  L4.qkv.input                      0.00398406      0.00398406        OK
  L4.qkv.weight                     0.00091241      0.00091241        OK
  L5.gup.input                      0.00099206      0.00099206        OK
  L5.gup.weight                     0.00024802      0.00024802        OK
  L5.qkv.input                      0.00427350      0.00427350        OK
  L5.qkv.weight                     0.00077640      0.00077640        OK
  L6.gup.input                      0.00209205      0.00209205        OK
  L6.gup.weight                     0.00028802      0.00028802        OK
  L6.qkv.input                      0.00416667      0.00416667        OK
  L6.qkv.weight                     0.00072254      0.00072254        OK
  L7.gup.input                      0.00790514      0.00790514        OK
  L7.gup.weight                     0.00049801      0.00049801        OK
  L7.qkv.input                      0.00390625      0.00390625        OK
  L7.qkv.weight                     0.00062500      0.00062500        OK
  L8.gup.input                      0.00146199      0.00146199        OK
  L8.gup.weight                     0.00037879      0.00037879        OK
  L8.qkv.input                      0.00408163      0.00408163        OK
  L8.qkv.weight                     0.00066489      0.00066489        OK
  L9.gup.input                      0.00140449      0.00140449        OK
  L9.gup.weight                     0.00041118      0.00041118        OK
  L9.qkv.input                      0.00398406      0.00398406        OK
  L9.qkv.weight                     0.00053191      0.00053191        OK

==> 0 / 128 scales differ between runs.

</details>

Changed files

  • vllm/model_executor/layers/linear.py (modified, +18/-6)

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 10.5.0-1ubuntu1~22.04.3) 10.5.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Apr  7 2026, 20:45:25) [Clang 22.1.1 ] (64-bit runtime)
Python platform              : Linux-6.14.0-37-generic-x86_64-with-glibc2.35
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.48
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version        : 590.48.01
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  40
On-line CPU(s) list:                     0-39
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) w5-3535X
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      20
Socket(s):                               1
Stepping:                                8
CPU max MHz:                             4800.0000
CPU min MHz:                             800.0000
BogoMIPS:                                5808.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                          VT-x
L1d cache:                               960 KiB (20 instances)
L1i cache:                               640 KiB (20 instances)
L2 cache:                                40 MiB (20 instances)
L3 cache:                                52.5 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-39
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.7
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.29.7
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] open-clip-torch==2.32.0
[pip3] pytorch-lightning==2.5.2
[pip3] pyzmq==27.1.0
[pip3] segmentation-models-pytorch==0.5.0
[pip3] sentence-transformers==5.2.0
[pip3] terratorch==1.2.2
[pip3] torch==2.11.0+cu130
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchgeo==0.7.0
[pip3] torchmetrics==1.7.4
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==4.57.5
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.6.0
[pip3] tritonclient==2.64.0
[pip3] vector-quantize-pytorch==1.21.2
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1rc1.dev230+g0fce51cb7 (git sha: 0fce51cb7)
vLLM Build Flags:
  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-39    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 
brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
NCCL_VERSION=2.27.7-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=13.0.0
VLLM_ENABLE_CUDA_COMPATIBILITY=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

---

#!/usr/bin/env python3
"""Quantize microsoft/Phi-3-mini-4k-instruct to NVFP4 using llm-compressor."""

from compressed_tensors.offload import dispatch_model
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"
SAVE_DIR = "Phi-3-mini-4k-instruct-NVFP4"

# Load model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto",
                                              trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Calibration dataset
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 1024

ds = load_dataset("HuggingFaceH4/ultrachat_200k",
                   split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False,
        )
    }


ds = ds.map(preprocess)


def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# NVFP4 quantization
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4", ignore=["lm_head"]
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
print(f"Saved to {SAVE_DIR}")

---

#!/usr/bin/env python3
import json, subprocess, sys, textwrap

MODEL = "Phi-3-mini-4k-instruct-NVFP4"

WORKER = textwrap.dedent(f"""\
import os, json, sys
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
import torch
from vllm import LLM

max_len = int(sys.argv[1])
llm = LLM("{MODEL}", trust_remote_code=True, dtype="auto",
           max_model_len=max_len, enforce_eager=True,
           gpu_memory_utilization=0.9,
           hf_overrides={{"max_position_embeddings": max(max_len, 4096)}})

def extract(model):
    out = {{}}
    for i, layer in enumerate(model.model.layers):
        for tag, mod in [("qkv", layer.self_attn.qkv_proj),
                         ("gup", layer.mlp.gate_up_proj)]:
            for attr in ("input_global_scale", "weight_global_scale"):
                if hasattr(mod, attr):
                    key = f"L{{i}}.{{tag}}.{{attr.split('_')[0]}}"
                    out[key] = getattr(mod, attr).item()
    return out

print(json.dumps(llm.llm_engine.apply_model(extract)[0]))
""")


def run(max_len):
    r = subprocess.run(
        [sys.executable, "-c", WORKER, str(max_len)],
        capture_output=True, text=True)
    for line in r.stdout.strip().splitlines():
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            continue
    print(r.stderr[-2000:], file=sys.stderr)
    raise RuntimeError(f"Failed for max_len={max_len}")


print("Running with max_model_len=2048 ...")
scales_2k = run(2048)
print("Running with max_model_len=65536 ...")
scales_64k = run(65536)

print("\n{:<30s}  {:>14s}  {:>14s}  {:>8s}".format(
    "Parameter", "len=2048", "len=65536", "Match?"))
print("-" * 72)

mismatches = 0
for key in sorted(scales_2k):
    v1, v2 = scales_2k[key], scales_64k[key]
    match = "OK" if v1 == v2 else "DIFFER"
    if v1 != v2:
        mismatches += 1
    print(f"  {key:<28s}  {v1:>14.8f}  {v2:>14.8f}  {match:>8s}")

print(f"\n==> {mismatches} / {len(scales_2k)} scales differ between runs.")

---

% python compare.py 
Running with max_model_len=2048 ...
Running with max_model_len=65536 ...

Parameter                             len=2048       len=65536    Match?
------------------------------------------------------------------------
  L0.gup.input                      0.00029988      0.00036337    DIFFER
  L0.gup.weight                     0.00028409      0.00028409        OK
  L0.qkv.input                      0.00119617      0.00119617        OK
  L0.qkv.weight                     0.00082237      0.00082237        OK
  L1.gup.input                      0.00051440      0.00051440        OK
  L1.gup.weight                     0.00042230      0.00042230        OK
  L1.qkv.input                      0.00170068      0.00020024    DIFFER
  L1.qkv.weight                     0.00051440      0.00020551    DIFFER
  L10.gup.input                     0.00137363      0.00137363        OK
  L10.gup.weight                    0.00026596      0.00026596        OK
  L10.qkv.input                     0.00425532      0.00425532        OK
  L10.qkv.weight                    0.00040064      0.00040064        OK
  L11.gup.input                     0.00138122      0.00138122        OK
  L11.gup.weight                    0.00024802      0.00024802        OK
  L11.qkv.input                     0.00450450      0.00450450        OK
  L11.qkv.weight                    0.00036337      0.00036337        OK
  L12.gup.input                     0.00147929      0.00147929        OK
  L12.gup.weight                    0.00031726      0.00031726        OK
  L12.qkv.input                     0.00434783      0.00434783        OK
  L12.qkv.weight                    0.00036337      0.00036337        OK
  L13.gup.input                     0.00221239      0.00221239        OK
  L13.gup.weight                    0.00051230      0.00051230        OK
  L13.qkv.input                     0.00507614      0.00507614        OK
  L13.qkv.weight                    0.00030941      0.00030941        OK
  L14.gup.input                     0.00236967      0.00236967        OK
  L14.gup.weight                    0.00040064      0.00040064        OK
  L14.qkv.input                     0.00507614      0.00507614        OK
  L14.qkv.weight                    0.00030488      0.00030488        OK
  L15.gup.input                     0.00238095      0.00238095        OK
  L15.gup.weight                    0.00029621      0.00029621        OK
  L15.qkv.input                     0.00515464      0.00515464        OK
  L15.qkv.weight                    0.00028802      0.00028802        OK
  L16.gup.input                     0.00250000      0.00250000        OK
  L16.gup.weight                    0.00030788      0.00030788        OK
  L16.qkv.input                     0.00492611      0.00492611        OK
  L16.qkv.weight                    0.00038820      0.00038820        OK
  L17.gup.input                     0.00268817      0.00268817        OK
  L17.gup.weight                    0.00058140      0.00058140        OK
  L17.qkv.input                     0.00487805      0.00487805        OK
  L17.qkv.weight                    0.00033784      0.00033784        OK
  L18.gup.input                     0.00282486      0.00282486        OK
  L18.gup.weight                    0.00033784      0.00033784        OK
  L18.qkv.input                     0.00497512      0.00497512        OK
  L18.qkv.weight                    0.00031095      0.00031095        OK
  L19.gup.input                     0.00285714      0.00285714        OK
  L19.gup.weight                    0.00038820      0.00038820        OK
  L19.qkv.input                     0.00497512      0.00497512        OK
  L19.qkv.weight                    0.00029621      0.00029621        OK
  L2.gup.input                      0.00303030      0.00303030        OK
  L2.gup.weight                     0.00056054      0.00056054        OK
  L2.qkv.input                      0.00155280      0.00155280        OK
  L2.qkv.weight                     0.00047348      0.00047348        OK
  L20.gup.input                     0.00268817      0.00268817        OK
  L20.gup.weight                    0.00020833      0.00020833        OK
  L20.qkv.input                     0.00520833      0.00520833        OK
  L20.qkv.weight                    0.00031250      0.00031250        OK
  L21.gup.input                     0.00264550      0.00264550        OK
  L21.gup.weight                    0.00024225      0.00024225        OK
  L21.qkv.input                     0.00518135      0.00518135        OK
  L21.qkv.weight                    0.00030788      0.00030788        OK
  L22.gup.input                     0.00252525      0.00252525        OK
  L22.gup.weight                    0.00023148      0.00023148        OK
  L22.qkv.input                     0.00507614      0.00507614        OK
  L22.qkv.weight                    0.00034722      0.00034722        OK
  L23.gup.input                     0.00246305      0.00246305        OK
  L23.gup.weight                    0.00022810      0.00022810        OK
  L23.qkv.input                     0.00537634      0.00537634        OK
  L23.qkv.weight                    0.00035311      0.00035311        OK
  L24.gup.input                     0.00238095      0.00238095        OK
  L24.gup.weight                    0.00019654      0.00019654        OK
  L24.qkv.input                     0.00564972      0.00564972        OK
  L24.qkv.weight                    0.00035714      0.00035714        OK
  L25.gup.input                     0.00232558      0.00232558        OK
  L25.gup.weight                    0.00023674      0.00023674        OK
  L25.qkv.input                     0.00564972      0.00564972        OK
  L25.qkv.weight                    0.00038820      0.00038820        OK
  L26.gup.input                     0.00186567      0.00186567        OK
  L26.gup.weight                    0.00034341      0.00034341        OK
  L26.qkv.input                     0.00591716      0.00591716        OK
  L26.qkv.weight                    0.00033602      0.00033602        OK
  L27.gup.input                     0.00243902      0.00243902        OK
  L27.gup.weight                    0.00044643      0.00044643        OK
  L27.qkv.input                     0.00540541      0.00540541        OK
  L27.qkv.weight                    0.00033069      0.00033069        OK
  L28.gup.input                     0.00314465      0.00314465        OK
  L28.gup.weight                    0.00029904      0.00029904        OK
  L28.qkv.input                     0.00578035      0.00578035        OK
  L28.qkv.weight                    0.00029904      0.00029904        OK
  L29.gup.input                     0.00970874      0.00970874        OK
  L29.gup.weight                    0.00070621      0.00070621        OK
  L29.qkv.input                     0.00549451      0.00549451        OK
  L29.qkv.weight                    0.00031726      0.00031726        OK
  L3.gup.input                      0.00125000      0.00125000        OK
  L3.gup.weight                     0.00052966      0.00052966        OK
  L3.qkv.input                      0.00299401      0.00299401        OK
  L3.qkv.weight                     0.00083333      0.00083333        OK
  L30.gup.input                     0.00564972      0.00564972        OK
  L30.gup.weight                    0.00036337      0.00036337        OK
  L30.qkv.input                     0.00465116      0.00465116        OK
  L30.qkv.weight                    0.00029343      0.00029343        OK
  L31.gup.input                     0.00621118      0.00621118        OK
  L31.gup.weight                    0.00046992      0.00046992        OK
  L31.qkv.input                     0.00381679      0.00381679        OK
  L31.qkv.weight                    0.00047710      0.00047710        OK
  L4.gup.input                      0.00724638      0.00724638        OK
  L4.gup.weight                     0.00053191      0.00053191        OK
  L4.qkv.input                      0.00398406      0.00398406        OK
  L4.qkv.weight                     0.00091241      0.00091241        OK
  L5.gup.input                      0.00099206      0.00099206        OK
  L5.gup.weight                     0.00024802      0.00024802        OK
  L5.qkv.input                      0.00427350      0.00427350        OK
  L5.qkv.weight                     0.00077640      0.00077640        OK
  L6.gup.input                      0.00209205      0.00209205        OK
  L6.gup.weight                     0.00028802      0.00028802        OK
  L6.qkv.input                      0.00416667      0.00416667        OK
  L6.qkv.weight                     0.00072254      0.00072254        OK
  L7.gup.input                      0.00790514      0.00790514        OK
  L7.gup.weight                     0.00049801      0.00049801        OK
  L7.qkv.input                      0.00390625      0.00390625        OK
  L7.qkv.weight                     0.00062500      0.00062500        OK
  L8.gup.input                      0.00146199      0.00146199        OK
  L8.gup.weight                     0.00037879      0.00037879        OK
  L8.qkv.input                      0.00408163      0.00408163        OK
  L8.qkv.weight                     0.00066489      0.00066489        OK
  L9.gup.input                      0.00140449      0.00140449        OK
  L9.gup.weight                     0.00041118      0.00041118        OK
  L9.qkv.input                      0.00398406      0.00398406        OK
  L9.qkv.weight                     0.00053191      0.00053191        OK

==> 3 / 128 scales differ between runs.

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 10.5.0-1ubuntu1~22.04.3) 10.5.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Apr  7 2026, 20:45:25) [Clang 22.1.1 ] (64-bit runtime)
Python platform              : Linux-6.14.0-37-generic-x86_64-with-glibc2.35
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.48
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version        : 590.48.01
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  40
On-line CPU(s) list:                     0-39
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) w5-3535X
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      20
Socket(s):                               1
Stepping:                                8
CPU max MHz:                             4800.0000
CPU min MHz:                             800.0000
BogoMIPS:                                5808.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                          VT-x
L1d cache:                               960 KiB (20 instances)
L1i cache:                               640 KiB (20 instances)
L2 cache:                                40 MiB (20 instances)
L3 cache:                                52.5 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-39
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.7
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.29.7
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] open-clip-torch==2.32.0
[pip3] pytorch-lightning==2.5.2
[pip3] pyzmq==27.1.0
[pip3] segmentation-models-pytorch==0.5.0
[pip3] sentence-transformers==5.2.0
[pip3] terratorch==1.2.2
[pip3] torch==2.11.0+cu130
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchgeo==0.7.0
[pip3] torchmetrics==1.7.4
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==4.57.5
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.6.0
[pip3] tritonclient==2.64.0
[pip3] vector-quantize-pytorch==1.21.2
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1rc1.dev230+g0fce51cb7 (git sha: 0fce51cb7)
vLLM Build Flags:
  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-39    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 
brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
NCCL_VERSION=2.27.7-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=13.0.0
VLLM_ENABLE_CUDA_COMPATIBILITY=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

Summary

When a quantized checkpoint stores already-fused weights (e.g. a single qkv_proj instead of separate q_proj/k_proj/v_proj), the per-tensor scale parameters are loaded incorrectly. Only slot 0 of the scale tensor receives the checkpoint value; the remaining slots retain uninitialized (indeterminate) values from torch.empty. process_weights_after_loading then calls .max() over all slots, so an indeterminate value that happens to be larger than the true scale silently becomes the effective scale, leading to incorrect dequantization.

Root cause

In MergedColumnParallelLinear.weight_loader_v2 and QKVParallelLinear.weight_loader_v2, when loaded_shard_id is None (fused-on-disk checkpoint), the code writes the loaded scalar scale into shard 0 only, but the parameter retains its full shape [N]. The other N−1 slots are never written.

  1. For example, CompressedTensorsW4A4Fp4 allocates PerTensorScaleParameter with torch.empty(N), where N = number of output partitions (3 for QKV, 2 for gate_up). https://github.com/vllm-project/vllm/blob/2a3c32ce674950f94fdd447979e4621267125e41/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py#L61-L66

  2. When loaded_shard_id is None (fused checkpoint), only shard 0 is written; the other N−1 slots keep their indeterminate values.

  3. process_weights_after_loading calls .max() over all N slots, so an indeterminate value larger than the true scale silently becomes the effective scale. https://github.com/vllm-project/vllm/blob/2a3c32ce674950f94fdd447979e4621267125e41/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py#L105-L106
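
The three steps above can be reproduced in miniature without a GPU. The following is a plain-Python sketch, not vLLM code: `allocate_scale_slots`, `load_fused_scale`, `effective_scale`, and the "garbage" values are hypothetical stand-ins for `torch.empty(N)`, the `loaded_shard_id is None` path, the `.max()` reduction, and whatever bytes happened to be in memory.

```python
def allocate_scale_slots(n, garbage):
    """Stand-in for torch.empty(n): slots start with leftover memory contents."""
    return list(garbage[:n])


def load_fused_scale(slots, checkpoint_scale):
    """Stand-in for the fused-on-disk path: only slot 0 is written."""
    slots[0] = checkpoint_scale
    return slots


def effective_scale(slots):
    """Stand-in for the .max() over all slots after loading."""
    return max(slots)


TRUE_SCALE = 0.00051440  # illustrative checkpoint value

# Case A: leftover memory happens to hold small values, so the result looks fine.
benign = load_fused_scale(allocate_scale_slots(3, [9.9, 0.0, 1e-9]), TRUE_SCALE)

# Case B: leftover memory holds a larger float, which wins the max().
corrupt = load_fused_scale(allocate_scale_slots(3, [9.9, 0.0017, 0.0]), TRUE_SCALE)

print("benign :", effective_scale(benign))
print("corrupt:", effective_scale(corrupt))
```

Case A is why the bug can go unnoticed: the result is only wrong when an unwritten slot happens to exceed the loaded scale.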

Affected quantization schemes

Any scheme that uses PerTensorScaleParameter initialized with torch.empty(N) (where N = number of output partitions) is vulnerable, including CompressedTensorsW4A4Fp4, CompressedTensorsW4A16Fp4, CompressedTensorsW8A8Int8, CompressedTensorsW8A16Fp8, ModelOptNvFp4LinearMethod, and QuarkW8A8Int8. Note that Fp8LinearMethod and ModelOptFp8LinearMethod initialize the slots with a sentinel value (torch.finfo(torch.float32).min ≈ −3.4e38) instead of torch.empty. An unwritten slot holding that sentinel can never win the .max() reduction, so these two methods are not affected.
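
The difference between the two initialization strategies can be illustrated with a small sketch, again in plain Python rather than vLLM code. `allocate_with_sentinel` and `load_fused_scale` are hypothetical helper names; the constant is the float32 minimum that `torch.finfo(torch.float32).min` returns.

```python
FLOAT32_MIN = -3.4028234663852886e38  # value of torch.finfo(torch.float32).min


def allocate_with_sentinel(n):
    """Stand-in for a sentinel-initialized scale parameter of shape [n]."""
    return [FLOAT32_MIN] * n


def load_fused_scale(slots, checkpoint_scale):
    """Fused-on-disk path: only slot 0 receives the checkpoint value."""
    slots[0] = checkpoint_scale
    return slots


checkpoint_scale = 0.00051440  # illustrative value
slots = load_fused_scale(allocate_with_sentinel(3), checkpoint_scale)

# The untouched slots still hold the sentinel, which loses every comparison
# against a finite loaded scale, so max() returns the checkpoint value.
print(max(slots) == checkpoint_scale)
```

Sentinel initialization makes the reduction deterministic even though slots 1..N−1 are still never written. Whether the affected schemes should adopt the same sentinel or fill every slot on the fused path is a design decision this sketch does not settle.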

Reproduction

Prerequisites

  • A GPU with NVFP4 support (e.g. RTX 5090)
  • vLLM installed from the main branch
  • An NVFP4-quantized model with fused weights on disk (e.g. Phi-3-mini-4k-instruct-NVFP4 from Step 1)

Step 1: Create an NVFP4 checkpoint (fused weights)

Use any model whose architecture stores fused projections on disk (e.g. Phi-3, PLaMo-2/3), where qkv_proj and gate_up_proj are saved as single tensors rather than split into q_proj/k_proj/v_proj and gate_proj/up_proj.
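
To check whether a given checkpoint falls into this category, one option is to inspect its tensor names, for instance the keys of `weight_map` in `model.safetensors.index.json` for a sharded Hugging Face checkpoint. The sketch below is a hypothetical helper, not part of vLLM or llm-compressor, and the tensor names in the example are illustrative assumptions rather than names read from a real checkpoint.

```python
import re

# Name fragments that identify fused vs. split projections.
FUSED = re.compile(r"\.(qkv_proj|gate_up_proj)\.")
SPLIT = re.compile(r"\.(q_proj|k_proj|v_proj|gate_proj|up_proj)\.")


def classify_layout(tensor_names):
    """Return 'fused', 'split', 'mixed', or 'unknown' for a list of names."""
    names = list(tensor_names)
    has_fused = any(FUSED.search(n) for n in names)
    has_split = any(SPLIT.search(n) for n in names)
    if has_fused and has_split:
        return "mixed"
    if has_fused:
        return "fused"
    if has_split:
        return "split"
    return "unknown"


# Illustrative names only (assumed, not taken from an actual checkpoint).
example = [
    "model.layers.0.self_attn.qkv_proj.weight_global_scale",
    "model.layers.0.mlp.gate_up_proj.input_global_scale",
]
print(classify_layout(example))
```

A result of "fused" means the per-tensor scales arrive as a single scalar per fused module, which is the case this issue is about.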

<details> <summary>Quantization script (click to expand)</summary>

Almost the same as https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py.

#!/usr/bin/env python3
"""Quantize microsoft/Phi-3-mini-4k-instruct to NVFP4 using llm-compressor."""

from compressed_tensors.offload import dispatch_model
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"
SAVE_DIR = "Phi-3-mini-4k-instruct-NVFP4"

# Load model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto",
                                              trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Calibration dataset
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 1024

ds = load_dataset("HuggingFaceH4/ultrachat_200k",
                   split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False,
        )
    }


ds = ds.map(preprocess)


def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# NVFP4 quantization
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4", ignore=["lm_head"]
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
print(f"Saved to {SAVE_DIR}")
</details>

Step 2: A (fairly) easy way to observe the effect

In my environment (RTX 5090, vLLM main at 2a3c32ce), changing max_model_len alters the GPU memory layout, which changes the indeterminate values left by torch.empty in the unwritten scale slots. The script below loads the same checkpoint twice with different max_model_len values and compares the resulting scales. Any difference shows that the scales depend on memory state rather than on checkpoint values. Results may vary depending on GPU model and driver version.

#!/usr/bin/env python3
import json, subprocess, sys, textwrap

MODEL = "Phi-3-mini-4k-instruct-NVFP4"

WORKER = textwrap.dedent(f"""\
import os, json, sys
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
import torch
from vllm import LLM

max_len = int(sys.argv[1])
llm = LLM("{MODEL}", trust_remote_code=True, dtype="auto",
           max_model_len=max_len, enforce_eager=True,
           gpu_memory_utilization=0.9,
           hf_overrides={{"max_position_embeddings": max(max_len, 4096)}})

def extract(model):
    out = {{}}
    for i, layer in enumerate(model.model.layers):
        for tag, mod in [("qkv", layer.self_attn.qkv_proj),
                         ("gup", layer.mlp.gate_up_proj)]:
            for attr in ("input_global_scale", "weight_global_scale"):
                if hasattr(mod, attr):
                    key = f"L{{i}}.{{tag}}.{{attr.split('_')[0]}}"
                    out[key] = getattr(mod, attr).item()
    return out

print(json.dumps(llm.llm_engine.apply_model(extract)[0]))
""")


def run(max_len):
    r = subprocess.run(
        [sys.executable, "-c", WORKER, str(max_len)],
        capture_output=True, text=True)
    for line in r.stdout.strip().splitlines():
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            continue
    print(r.stderr[-2000:], file=sys.stderr)
    raise RuntimeError(f"Failed for max_len={max_len}")


print("Running with max_model_len=2048 ...")
scales_2k = run(2048)
print("Running with max_model_len=65536 ...")
scales_64k = run(65536)

print("\n{:<30s}  {:>14s}  {:>14s}  {:>8s}".format(
    "Parameter", "len=2048", "len=65536", "Match?"))
print("-" * 72)

mismatches = 0
for key in sorted(scales_2k):
    v1, v2 = scales_2k[key], scales_64k[key]
    match = "OK" if v1 == v2 else "DIFFER"
    if v1 != v2:
        mismatches += 1
    print(f"  {key:<28s}  {v1:>14.8f}  {v2:>14.8f}  {match:>8s}")

print(f"\n==> {mismatches} / {len(scales_2k)} scales differ between runs.")
% python compare.py 
Running with max_model_len=2048 ...
Running with max_model_len=65536 ...

Parameter                             len=2048       len=65536    Match?
------------------------------------------------------------------------
  L0.gup.input                      0.00029988      0.00036337    DIFFER
  L0.gup.weight                     0.00028409      0.00028409        OK
  L0.qkv.input                      0.00119617      0.00119617        OK
  L0.qkv.weight                     0.00082237      0.00082237        OK
  L1.gup.input                      0.00051440      0.00051440        OK
  L1.gup.weight                     0.00042230      0.00042230        OK
  L1.qkv.input                      0.00170068      0.00020024    DIFFER
  L1.qkv.weight                     0.00051440      0.00020551    DIFFER
  L10.gup.input                     0.00137363      0.00137363        OK
  L10.gup.weight                    0.00026596      0.00026596        OK
  L10.qkv.input                     0.00425532      0.00425532        OK
  L10.qkv.weight                    0.00040064      0.00040064        OK
  L11.gup.input                     0.00138122      0.00138122        OK
  L11.gup.weight                    0.00024802      0.00024802        OK
  L11.qkv.input                     0.00450450      0.00450450        OK
  L11.qkv.weight                    0.00036337      0.00036337        OK
  L12.gup.input                     0.00147929      0.00147929        OK
  L12.gup.weight                    0.00031726      0.00031726        OK
  L12.qkv.input                     0.00434783      0.00434783        OK
  L12.qkv.weight                    0.00036337      0.00036337        OK
  L13.gup.input                     0.00221239      0.00221239        OK
  L13.gup.weight                    0.00051230      0.00051230        OK
  L13.qkv.input                     0.00507614      0.00507614        OK
  L13.qkv.weight                    0.00030941      0.00030941        OK
  L14.gup.input                     0.00236967      0.00236967        OK
  L14.gup.weight                    0.00040064      0.00040064        OK
  L14.qkv.input                     0.00507614      0.00507614        OK
  L14.qkv.weight                    0.00030488      0.00030488        OK
  L15.gup.input                     0.00238095      0.00238095        OK
  L15.gup.weight                    0.00029621      0.00029621        OK
  L15.qkv.input                     0.00515464      0.00515464        OK
  L15.qkv.weight                    0.00028802      0.00028802        OK
  L16.gup.input                     0.00250000      0.00250000        OK
  L16.gup.weight                    0.00030788      0.00030788        OK
  L16.qkv.input                     0.00492611      0.00492611        OK
  L16.qkv.weight                    0.00038820      0.00038820        OK
  L17.gup.input                     0.00268817      0.00268817        OK
  L17.gup.weight                    0.00058140      0.00058140        OK
  L17.qkv.input                     0.00487805      0.00487805        OK
  L17.qkv.weight                    0.00033784      0.00033784        OK
  L18.gup.input                     0.00282486      0.00282486        OK
  L18.gup.weight                    0.00033784      0.00033784        OK
  L18.qkv.input                     0.00497512      0.00497512        OK
  L18.qkv.weight                    0.00031095      0.00031095        OK
  L19.gup.input                     0.00285714      0.00285714        OK
  L19.gup.weight                    0.00038820      0.00038820        OK
  L19.qkv.input                     0.00497512      0.00497512        OK
  L19.qkv.weight                    0.00029621      0.00029621        OK
  L2.gup.input                      0.00303030      0.00303030        OK
  L2.gup.weight                     0.00056054      0.00056054        OK
  L2.qkv.input                      0.00155280      0.00155280        OK
  L2.qkv.weight                     0.00047348      0.00047348        OK
  L20.gup.input                     0.00268817      0.00268817        OK
  L20.gup.weight                    0.00020833      0.00020833        OK
  L20.qkv.input                     0.00520833      0.00520833        OK
  L20.qkv.weight                    0.00031250      0.00031250        OK
  L21.gup.input                     0.00264550      0.00264550        OK
  L21.gup.weight                    0.00024225      0.00024225        OK
  L21.qkv.input                     0.00518135      0.00518135        OK
  L21.qkv.weight                    0.00030788      0.00030788        OK
  L22.gup.input                     0.00252525      0.00252525        OK
  L22.gup.weight                    0.00023148      0.00023148        OK
  L22.qkv.input                     0.00507614      0.00507614        OK
  L22.qkv.weight                    0.00034722      0.00034722        OK
  L23.gup.input                     0.00246305      0.00246305        OK
  L23.gup.weight                    0.00022810      0.00022810        OK
  L23.qkv.input                     0.00537634      0.00537634        OK
  L23.qkv.weight                    0.00035311      0.00035311        OK
  L24.gup.input                     0.00238095      0.00238095        OK
  L24.gup.weight                    0.00019654      0.00019654        OK
  L24.qkv.input                     0.00564972      0.00564972        OK
  L24.qkv.weight                    0.00035714      0.00035714        OK
  L25.gup.input                     0.00232558      0.00232558        OK
  L25.gup.weight                    0.00023674      0.00023674        OK
  L25.qkv.input                     0.00564972      0.00564972        OK
  L25.qkv.weight                    0.00038820      0.00038820        OK
  L26.gup.input                     0.00186567      0.00186567        OK
  L26.gup.weight                    0.00034341      0.00034341        OK
  L26.qkv.input                     0.00591716      0.00591716        OK
  L26.qkv.weight                    0.00033602      0.00033602        OK
  L27.gup.input                     0.00243902      0.00243902        OK
  L27.gup.weight                    0.00044643      0.00044643        OK
  L27.qkv.input                     0.00540541      0.00540541        OK
  L27.qkv.weight                    0.00033069      0.00033069        OK
  L28.gup.input                     0.00314465      0.00314465        OK
  L28.gup.weight                    0.00029904      0.00029904        OK
  L28.qkv.input                     0.00578035      0.00578035        OK
  L28.qkv.weight                    0.00029904      0.00029904        OK
  L29.gup.input                     0.00970874      0.00970874        OK
  L29.gup.weight                    0.00070621      0.00070621        OK
  L29.qkv.input                     0.00549451      0.00549451        OK
  L29.qkv.weight                    0.00031726      0.00031726        OK
  L3.gup.input                      0.00125000      0.00125000        OK
  L3.gup.weight                     0.00052966      0.00052966        OK
  L3.qkv.input                      0.00299401      0.00299401        OK
  L3.qkv.weight                     0.00083333      0.00083333        OK
  L30.gup.input                     0.00564972      0.00564972        OK
  L30.gup.weight                    0.00036337      0.00036337        OK
  L30.qkv.input                     0.00465116      0.00465116        OK
  L30.qkv.weight                    0.00029343      0.00029343        OK
  L31.gup.input                     0.00621118      0.00621118        OK
  L31.gup.weight                    0.00046992      0.00046992        OK
  L31.qkv.input                     0.00381679      0.00381679        OK
  L31.qkv.weight                    0.00047710      0.00047710        OK
  L4.gup.input                      0.00724638      0.00724638        OK
  L4.gup.weight                     0.00053191      0.00053191        OK
  L4.qkv.input                      0.00398406      0.00398406        OK
  L4.qkv.weight                     0.00091241      0.00091241        OK
  L5.gup.input                      0.00099206      0.00099206        OK
  L5.gup.weight                     0.00024802      0.00024802        OK
  L5.qkv.input                      0.00427350      0.00427350        OK
  L5.qkv.weight                     0.00077640      0.00077640        OK
  L6.gup.input                      0.00209205      0.00209205        OK
  L6.gup.weight                     0.00028802      0.00028802        OK
  L6.qkv.input                      0.00416667      0.00416667        OK
  L6.qkv.weight                     0.00072254      0.00072254        OK
  L7.gup.input                      0.00790514      0.00790514        OK
  L7.gup.weight                     0.00049801      0.00049801        OK
  L7.qkv.input                      0.00390625      0.00390625        OK
  L7.qkv.weight                     0.00062500      0.00062500        OK
  L8.gup.input                      0.00146199      0.00146199        OK
  L8.gup.weight                     0.00037879      0.00037879        OK
  L8.qkv.input                      0.00408163      0.00408163        OK
  L8.qkv.weight                     0.00066489      0.00066489        OK
  L9.gup.input                      0.00140449      0.00140449        OK
  L9.gup.weight                     0.00041118      0.00041118        OK
  L9.qkv.input                      0.00398406      0.00398406        OK
  L9.qkv.weight                     0.00053191      0.00053191        OK

==> 3 / 128 scales differ between runs.


extent analysis

TL;DR

The issue can be fixed by ensuring that all slots of the scale tensor are properly initialized and updated when loading fused weights from a checkpoint.
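
To see why a single stale slot matters, here is a minimal sketch that models the per-tensor scale slots as a plain Python list (standing in for the tensor) and reduces them with a maximum, as the issue describes `process_weights_after_loading` doing. The function name `effective_scale` and the stale values are illustrative assumptions, not vLLM code or measured data.

```python
def effective_scale(slots):
    """Reduce per-shard scale slots to a single scale by taking the maximum."""
    return max(slots)


true_scale = 0.00033602  # L26.qkv.weight from the table above

# Only slot 0 was written by the loader; slots 1 and 2 hold stale memory.
# The stale values below are made up for illustration.
harmless = [true_scale, 0.0, 1e-9]
harmful = [true_scale, 0.0, 0.5]

print(effective_scale(harmless) == true_scale)  # True: stale values are smaller
print(effective_scale(harmful) == true_scale)   # False: 0.5 wins the max
```

Whether the result is corrupted depends only on whether the stale memory happens to exceed the true scale, which would explain why only some scales differ from run to run.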

Guidance

  1. Initialize the scale tensor deterministically: allocate it with a known fill value (or fill it immediately after torch.empty) so that no slot holds indeterminate memory when process_weights_after_loading takes .max() over all slots.
  2. Update all slots when loading fused weights: in MergedColumnParallelLinear.weight_loader_v2 and QKVParallelLinear.weight_loader_v2, write the checkpoint scale to every slot of the scale tensor when the checkpoint is already fused, not just to slot 0.
  3. Verify scale tensor values: after loading, check that every slot of each scale tensor equals the checkpoint value, and that the scales are identical across repeated runs.
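
The verification in step 3 could be sketched as follows. This is a hypothetical helper, not part of vLLM: `find_bad_slots` and its `rel_tol` parameter are assumed names, and the slots are passed as a plain list (for a real tensor, `tensor.tolist()` would produce one).

```python
import math


def find_bad_slots(slots, expected, rel_tol=1e-6):
    """Return indices of slots that are non-finite or differ from `expected`."""
    bad = []
    for i, value in enumerate(slots):
        if not math.isfinite(value) or not math.isclose(
            value, expected, rel_tol=rel_tol
        ):
            bad.append(i)
    return bad


# All three slots carry the checkpoint scale: nothing to report.
print(find_bad_slots([0.00033602, 0.00033602, 0.00033602], 0.00033602))  # []

# Slot 0 is correct; slots 1 and 2 hold made-up stale values.
print(find_bad_slots([0.00033602, float("nan"), 0.5], 0.00033602))  # [1, 2]
```

Checking each slot before the maximum is taken catches stale values that are smaller than the true scale as well, which the reduced value alone would hide.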

Example

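import torch

NUM_SLOTS = 3  # hypothetical: one slot per logical shard, e.g. q, k, v

# Allocate with a deterministic sentinel instead of torch.empty, so an
# unwritten slot can never win a later .max() reduction
scale_tensor = torch.full((NUM_SLOTS,), torch.finfo(torch.float32).min)

# Illustrative stand-in for the per-tensor-scale branch of weight_loader_v2;
# this is a sketch, not the actual vLLM implementation
def load_scale(scale_tensor, loaded_scale_value, loaded_shard_id=None):
    if loaded_shard_id is None:
        # Fused-on-disk checkpoint: one scale applies to every slot
        scale_tensor[:] = loaded_scale_value
    else:
        # Unfused checkpoint: write only the slot for this shard
        scale_tensor[loaded_shard_id] = loaded_scale_value
    return scale_tensor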

Notes

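  • The issue affects vLLM's loading of quantized checkpoints that store already-fused weights (for example a single qkv_proj instead of separate q_proj/k_proj/v_proj), such as NVFP4 / compressed-tensors models.
  • Because the unwritten slots hold whatever was left in memory, the corruption can vary from run to run; the reproduction script above reports 3 of 128 scales differing between runs.
  • The provided reproduction script can be used to confirm the issue and to test potential fixes.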
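
The run-to-run comparison that the reproduction script reports can be sketched as a dictionary diff. The helper name `diff_scales` is an assumption for illustration; the keys follow the `L<layer>.<module>.<kind>` format of the table above, and the value 0.5 is a made-up corrupted scale.

```python
def diff_scales(run_a, run_b):
    """Return the sorted keys whose scale values differ between two runs."""
    keys = sorted(set(run_a) | set(run_b))
    return [k for k in keys if run_a.get(k) != run_b.get(k)]


run_a = {"L26.qkv.weight": 0.00033602, "L26.gup.weight": 0.00034341}
# 0.5 is a made-up value standing in for a corrupted scale.
run_b = {"L26.qkv.weight": 0.00033602, "L26.gup.weight": 0.5}

mismatched = diff_scales(run_a, run_b)
print(f"==> {len(mismatched)} / {len(run_a)} scales differ between runs.")
```

After a fix is applied, the same comparison should report zero differing scales on every repetition.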

Recommendation

Patch the weight_loader_v2 methods of MergedColumnParallelLinear and QKVParallelLinear so that loading an already-fused checkpoint writes the scale to every slot, and allocate the scale tensor with a deterministic fill value rather than torch.empty. Then rerun the reproduction script and confirm that no scales differ between runs.
