vllm - 💡(How to fix) Fix [Bug]: `runai_streamer` with `distributed=true` produces empty output for Nemotron-H (Nano-Omni-30B) model [1 participants]

GitHub stats
vllm-project/vllm#41749 · Fetched 2026-05-06 06:15:05
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): labeled ×1

When loading nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 with --load-format runai_streamer and --model-loader-extra-config '{"distributed":true}', the model loads without error and serves requests, but generates only empty tokens regardless of the prompt. The same model loaded with the default vLLM loader or with runai_streamer in non-distributed mode works correctly.

Root Cause

The bug is in vllm/model_executor/models/nano_nemotron_vl.py, in the NanoNemotronVL.load_weights method. It is not in the RunAI model streamer package itself.

Fix / Workaround

The correct fix is to avoid materializing the iterator. Each weight should be dispatched to the appropriate sub-model parameter immediately upon being yielded.

The preferred fix is to restructure load_weights to dispatch each weight inline without buffering, mirroring how every other model in the codebase handles this.


Your current environment

<details><summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, May  4 2026, 09:06:35) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.11.0-1016-nvidia-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200

Nvidia driver version        : 595.58.03
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8468
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   32
Socket(s):                            2
Stepping:                             8
BogoMIPS:                             4200.00
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            4 MiB (128 instances)
L1i cache:                            4 MiB (128 instances)
L2 cache:                             256 MiB (64 instances)
L3 cache:                             32 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-63
NUMA node1 CPU(s):                    64-127

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.7.0
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.20.1
vLLM Build Flags:
  CUDA Archs: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     PHB     PHB     PHB     0-63    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     PHB     PIX     PHB     PHB     0-63    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     PHB     PHB     PIX     PHB     0-63    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PIX     0-63    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    PIX     PHB     PHB     PHB     SYS     SYS     SYS     SYS     64-127  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    PHB     PIX     PHB     PHB     SYS     SYS     SYS     SYS     64-127  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    PHB     PHB     PIX     PHB     SYS     SYS     SYS     SYS     64-127  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      PHB     PHB     PHB     PIX     SYS     SYS     SYS     SYS     64-127  1               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
  NIC0: mlx5_0  NIC1: mlx5_1  NIC2: mlx5_2  NIC3: mlx5_3
  NIC4: mlx5_4  NIC5: mlx5_5  NIC6: mlx5_6  NIC7: mlx5_7

==============================
     Environment Variables
==============================
CUDA_VERSION=13.0.2
TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
VLLM_USAGE_SOURCE=production-docker-image
VLLM_ENABLE_CUDA_COMPATIBILITY=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
</details>

🐛 Describe the bug

Summary

When loading nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 with --load-format runai_streamer and --model-loader-extra-config '{"distributed":true}', the model loads without error and serves requests, but generates only empty tokens regardless of the prompt. The same model loaded with the default vLLM loader or with runai_streamer in non-distributed mode works correctly.

Environment

vLLM version: 0.20.1
Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 (architecture: NanoNemotronVL)
Load format: runai_streamer with --model-loader-extra-config '{"distributed":true}'
Hardware: 4× NVIDIA H200 (140 GB each)
Tensor parallel size: 4

Steps to Reproduce

vllm serve /data/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --load-format runai_streamer \
  --model-loader-extra-config '{"distributed":true}'

Then query the running server:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16",
    "messages": [{"role": "user", "content": "Write a haiku about a robot learning to paint."}],
    "max_tokens": 128,
    "temperature": 0
  }'

Expected Behavior

The model should produce coherent text, matching the output of the default vLLM loader. Example of correct output (from default loader):

We need to write a haiku (5-7-5 syllable structure). About a robot learning to paint. So maybe:

Metal brush in hand (5)
Learning colors, strokes unfold (7)
Dreams in silent code (5)
...

Actual Behavior

The model returns an empty content string despite reporting 128 completion tokens generated and finish_reason: length.

Comparison Across Loading Variants

All three variants were tested sequentially on the same node and PVC, same hardware, same prompt, temperature=0:

| Variant | Command | Output | Status |
| --- | --- | --- | --- |
| default | vllm serve ... (no extra flags) | Coherent reasoning + haiku text | ✅ PASS |
| runai_memlimit | --load-format runai_streamer + RUNAI_STREAMER_MEMORY_LIMIT=-1 | Coherent reasoning + haiku text | ✅ PASS |
| runai_dist | --load-format runai_streamer + {"distributed":true} | Empty string, 128 tokens | ❌ FAIL |

The runai_memlimit variant (non-distributed runai_streamer) succeeds, ruling out any issue with the runai_streamer load format itself. The failure is specific to the distributed streaming path.

Root Cause Analysis

The bug is in vllm/model_executor/models/nano_nemotron_vl.py, in the NanoNemotronVL.load_weights method. It is not in the RunAI model streamer package itself.

How the distributed streamer yields tensors

RunaiModelStreamerLoader.load_weights calls model.load_weights(iterator) where iterator is produced by runai_safetensors_weights_iterator. In distributed mode, this function creates a SafetensorsStreamer and yields tensors via get_tensors():

def get_tensors(self) -> Iterator[torch.tensor]:
    for file_index, ready_chunk_index, buffer in self.file_streamer.get_chunks():
        tensor_metadata = self.files_to_tensors_metadata[file_index][ready_chunk_index]
        yield tensor_metadata.name, safetensors_pytorch.create_torch_tensor(
            buffer, tensor_metadata
        )

The critical detail: each yielded tensor is a view into buffer, which is a slot from the streamer's internal reusable GPU memory pool. Once get_chunks() advances to the next tensor, that same buffer is overwritten with the new tensor's data. Any Python reference that still points to the previous tensor silently sees the new data.
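
The aliasing described above can be reproduced without a GPU. The following is a minimal sketch that uses only the Python standard library: a `bytearray` stands in for the streamer's reusable buffer slot and a `memoryview` stands in for the tensor view. The name `reusable_buffer_stream` and the payload values are hypothetical and are not part of vLLM or the RunAI streamer.

```python
from typing import Iterator, List, Tuple


def reusable_buffer_stream(
    payloads: List[Tuple[str, bytes]],
) -> Iterator[Tuple[str, memoryview]]:
    """Yield (name, view) pairs that all alias one reusable buffer slot."""
    buffer = bytearray(max(len(data) for _, data in payloads))
    for name, data in payloads:
        buffer[: len(data)] = data  # overwrite the single slot in place
        yield name, memoryview(buffer)[: len(data)]  # a view, not a copy


payloads = [("a", b"\x01\x01"), ("b", b"\x02\x02"), ("c", b"\x03\x03")]

# Buffering the views first and reading them later: every stored view
# now shows whatever was written to the slot last.
buffered = list(reusable_buffer_stream(payloads))
stale = {name: bytes(view) for name, view in buffered}

# Reading each view before advancing the generator sees the right data.
fresh = {name: bytes(view) for name, view in reusable_buffer_stream(payloads)}

print(stale)
print(fresh)
```

In this toy model, `stale` holds the last payload under every name while `fresh` matches the input, which mirrors the difference between buffering the iterator and consuming it one item at a time.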

The bug in nano_nemotron_vl.load_weights

NanoNemotronVL.load_weights fully materializes the weights iterator into lists before doing any loading:

llm_weights = []
vision_weights = []
sound_weights = []

for name, w in weights:          # w is a view into the reusable buffer
    if is_llm(name):
        llm_weights.append((".".join(name.split(".")[1:]), w))   # saves stale reference
    elif is_vision_weights(name):
        vision_weights.append((hf_key, w))                       # saves stale reference
    elif is_sound_weights(name):
        sound_weights.append((name, w))                          # saves stale reference
    elif is_adapter_weights((name, w)):
        default_weight_loader(param, w)  # ← correct: loads immediately, no reference kept

# By the time the loop finishes, the buffer has been overwritten N times.
# Every w in every list now points to the same memory location,
# containing only the data of the LAST tensor yielded by the iterator.
self.language_model.load_weights(llm_weights)   # loads wrong data for all but one param
self.vision_model.load_weights(vision_weights)

After the loop completes, every stored w is a view into the same buffer slot — which now holds the data of the last tensor that was yielded. All sub-model parameters are therefore loaded with incorrect values, silently corrupting the model.

Note that the adapter weights are handled correctly in the same function: default_weight_loader(param, w) is called inline, the tensor data is copied into the parameter immediately, and no reference to w is retained.

Why the other variants are unaffected

| Variant | Buffer behavior | Effect |
| --- | --- | --- |
| default (safetensors) | Each tensor is backed by a memory-mapped file region (permanent, independent memory) | Saved references remain valid; list-buffering works by coincidence |
| runai_memlimit (MEMORY_LIMIT=-1) | Disables the memory pool; each tensor gets its own allocation | Saved references remain valid; list-buffering works by coincidence |
| runai_dist (distributed=true) | Reusable GPU buffer pool is active | Saved references become stale after each next() call, causing corruption |

AutoWeightsLoader calls weight_loader(param, w) for each tensor immediately as it is yielded, copying the data into the parameter before advancing the iterator. The buffer is free to be reused by the time next() is called. No references to w are ever stored.

NanoNemotronVL is the only model in vLLM whose load_weights buffers the full iterator into lists before loading, making it uniquely susceptible to this failure mode.

Proposed Fix

The correct fix is to avoid materializing the iterator. Each weight should be dispatched to the appropriate sub-model parameter immediately upon being yielded.

The minimal safe fix — at the cost of extra GPU memory — is to clone each tensor before appending, so the stored value is independent of the reusable buffer. The extra GPU memory is released after the weights are loaded, but can still OOM if the entire model does not fit.
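
The copy-before-append idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the vLLM patch: `stream` is a hypothetical stand-in for the distributed iterator, and `bytes(view)` plays the role that `w.clone()` would play for a `torch.Tensor` in the real method.

```python
from typing import Iterator, List, Tuple


def stream(payloads: List[Tuple[str, bytes]]) -> Iterator[Tuple[str, memoryview]]:
    """Hypothetical stand-in for an iterator that reuses one buffer slot."""
    buffer = bytearray(max(len(data) for _, data in payloads))
    for name, data in payloads:
        buffer[: len(data)] = data
        yield name, memoryview(buffer)[: len(data)]


def collect_with_copy(
    weights: Iterator[Tuple[str, memoryview]],
) -> List[Tuple[str, bytes]]:
    """Buffer the weights safely by copying each view before storing it."""
    collected = []
    for name, view in weights:
        # The copy owns its memory, so later overwrites of the shared
        # slot cannot change it (analogous to w.clone() on a tensor).
        collected.append((name, bytes(view)))
    return collected


payloads = [("llm.w0", b"\x0a"), ("llm.w1", b"\x0b")]
copied = collect_with_copy(stream(payloads))
print(copied)
```

The trade-off is the one noted above: every buffered copy stays alive until loading finishes, so peak memory grows with the number of buffered weights.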

The preferred fix is to restructure load_weights to dispatch each weight inline without buffering, mirroring how every other model in the codebase handles this.
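
A runnable toy version of inline dispatch might look like the sketch below. The `SubModel` class, the `load_weights_inline` function, and the routing by first name component are all hypothetical simplifications; the real method would route with the existing predicates (is_llm, is_vision_weights, is_sound_weights) and call the real sub-model loaders.

```python
from typing import Dict, Iterator, List, Tuple


def stream(payloads: List[Tuple[str, bytes]]) -> Iterator[Tuple[str, memoryview]]:
    """Hypothetical stand-in for an iterator that reuses one buffer slot."""
    buffer = bytearray(max(len(data) for _, data in payloads))
    for name, data in payloads:
        buffer[: len(data)] = data
        yield name, memoryview(buffer)[: len(data)]


class SubModel:
    """Toy sub-model that copies each weight into its own storage."""

    def __init__(self) -> None:
        self.params: Dict[str, bytes] = {}

    def load_weight(self, name: str, view: memoryview) -> None:
        self.params[name] = bytes(view)  # copy now; keep no reference to view


def load_weights_inline(
    weights: Iterator[Tuple[str, memoryview]], routes: Dict[str, SubModel]
) -> None:
    """Dispatch every weight to its sub-model before advancing the iterator."""
    for name, view in weights:
        prefix, _, rest = name.partition(".")  # strip the leading component
        routes[prefix].load_weight(rest, view)


llm, vision = SubModel(), SubModel()
payloads = [
    ("language_model.w0", b"\x01"),
    ("vision_model.w0", b"\x02"),
    ("language_model.w1", b"\x03"),
]
load_weights_inline(stream(payloads), {"language_model": llm, "vision_model": vision})
print(llm.params, vision.params)
```

Because each weight is consumed while its view is still valid, no extra copy of the whole model is held at any point, unlike the buffered variant.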


extent analysis

TL;DR

The issue can be fixed by modifying the NanoNemotronVL.load_weights method to load weights immediately as they are yielded by the iterator, rather than buffering them in lists.

Guidance

  • Identify the NanoNemotronVL.load_weights method in the vllm/model_executor/models/nano_nemotron_vl.py file and modify it to load weights inline.
  • Consider cloning each tensor before appending to the lists to avoid overwriting issues, but be aware that this may increase GPU memory usage.
  • Restructure the load_weights method to dispatch each weight immediately, similar to how other models in the codebase handle this.
  • Verify that the fix works by testing the model with the runai_streamer load format and --model-loader-extra-config '{"distributed":true}' flag.
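
For the verification step, a small check along these lines could flag the symptom automatically. It is a sketch that assumes the response follows the OpenAI-compatible chat completions layout (`choices[0].message.content`, `usage.completion_tokens`); the helper name `is_silently_empty` and both sample dictionaries are hand-written illustrations, not captured server output. If a reasoning parser moves text into a separate field, empty content may be legitimate, so treat the result as a heuristic.

```python
from typing import Any, Dict


def is_silently_empty(response: Dict[str, Any]) -> bool:
    """Return True if tokens were generated but the visible content is empty."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    tokens = response.get("usage", {}).get("completion_tokens", 0)
    return tokens > 0 and content.strip() == ""


# Illustrative shape of the failing case described in the issue.
failing = {
    "choices": [
        {"message": {"role": "assistant", "content": ""}, "finish_reason": "length"}
    ],
    "usage": {"completion_tokens": 128},
}

# Illustrative shape of a healthy response.
healthy = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Metal brush in hand"},
            "finish_reason": "length",
        }
    ],
    "usage": {"completion_tokens": 128},
}

print(is_silently_empty(failing), is_silently_empty(healthy))
```

The parsed JSON body returned by the curl request in the reproduction steps can be passed to this function after each loader variant is started.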

Example

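def load_weights(self, weights):
    # Dispatch each tensor as soon as it is yielded, so that no reference
    # to the streamer's reusable buffer outlives the current iteration.
    for name, w in weights:
        if is_llm(name):
            # Strip the leading name component, as the buffered code does.
            llm_name = ".".join(name.split(".")[1:])
            self.language_model.load_weights([(llm_name, w)])
        elif is_vision_weights(name):
            # The buffered code stores an hf_key here; that mapping is
            # elided in the issue, so the raw name is only a placeholder.
            self.vision_model.load_weights([(name, w)])
        elif is_sound_weights(name):
            self.sound_model.load_weights([(name, w)])
        # ...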

Notes

  • The proposed fix assumes that the load_weights method is the only part of the code that needs to be modified.
  • The fix may require additional testing to ensure that it works correctly in all scenarios.

Recommendation

Apply the proposed fix to the NanoNemotronVL.load_weights method to resolve the issue. This fix should allow the model to load correctly with the runai_streamer load format and --model-loader-extra-config '{"distributed":true}' flag.
