vllm - ✅ (Solved) Fix [Bug]: Timeout when using LoRA with Nemotron Super (Nano is OK) [2 pull requests, 1 comment, 1 participant]

GitHub stats
vllm-project/vllm#40913 · Fetched 2026-04-27 05:29:21
Comments: 1 · Participants: 1 · Timeline events: 5 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, mentioned ×1

Root Cause

Debug summary (possible root cause)

Fix Action

Fixed

PR fix notes

PR #40916: Fix timeout when using LoRA adapters with Nemotron Super

Description (problem / solution / changelog)

Purpose

Fixes https://github.com/vllm-project/vllm/issues/40913

The function is_supported_lora_module was added in this PR: https://github.com/vllm-project/vllm/pull/34984

Based on PR #34984, the regex was added during code review, but I am not sure it is required for --lora-target-modules.

This PR optimizes is_supported_lora_module. This optimization alone resolved the LoRA failure with Nemotron Super.

Test Plan

Check performance of is_supported_lora_module before/after this fix.

Run the updated test:

python3 -m pytest -vs tests/lora/test_lora_utils.py

Verify that LoRA adapters work with Nemotron Super and Nemotron Nano:

  • https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
  • https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

Serving command with Nemotron Nano:

export MODEL_PATH=/my_home/hf_models/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export LORA_PATH=/my_home/hf_models/loras/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/lora_r8_all_linear_zeros

vllm serve $MODEL_PATH \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --enable-lora \
    --max-loras 1 \
    --max-lora-rank 16 \
    --lora-modules nano-lora=$LORA_PATH

And using --lora-target-modules:

vllm serve $MODEL_PATH \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --enable-lora \
    --max-loras 1 \
    --max-lora-rank 16 \
    --lora-modules nano-lora=$LORA_PATH \
    --lora-target-modules q_proj k_proj v_proj o_proj

Note: Using --lora-target-modules prints a lot of logs (very verbose):

(EngineCore pid=1736011) WARNING 04-26 06:29:24 [worker_manager.py:168] LoRA module 'model.layers.8.mixer.experts.99.down_proj' in adapter '/my_home/hf_models/loras/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/lora_r8_all_linear_zeros' is not in the deployment-time target_modules restriction [k_proj, o_proj, q_proj, v_proj]. These parameters will be ignored.
(EngineCore pid=1736011) WARNING 04-26 06:29:24 [worker_manager.py:168] LoRA module 'model.layers.8.mixer.experts.99.up_proj' in adapter '/my_home/hf_models/loras/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/lora_r8_all_linear_zeros' is not in the deployment-time target_modules restriction [k_proj, o_proj, q_proj, v_proj]. These parameters will be ignored.
...
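
Because these warnings repeat once per ignored module, it can help to condense them before reading the log. The following is a minimal sketch, not part of vLLM: the helper name summarize_ignored_modules and the adapter path '/path/to/adapter' are placeholders, and the regex assumes the warning wording shown above stays unchanged.

```python
import re
from collections import Counter

# Extracts the quoted module name from warning lines shaped like the ones above.
MODULE_RE = re.compile(r"LoRA module '([^']+)' in adapter")


def summarize_ignored_modules(log_lines):
    """Count ignored LoRA modules by their final name component (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = MODULE_RE.search(line)
        if match and "will be ignored" in line:
            # 'model.layers.8.mixer.experts.99.down_proj' -> 'down_proj'
            counts[match.group(1).rsplit(".", 1)[-1]] += 1
    return counts


# Two sample lines modelled on the log excerpt; the adapter path is a placeholder.
sample = [
    "WARNING 04-26 06:29:24 [worker_manager.py:168] LoRA module "
    "'model.layers.8.mixer.experts.99.down_proj' in adapter '/path/to/adapter' "
    "is not in the deployment-time target_modules restriction "
    "[k_proj, o_proj, q_proj, v_proj]. These parameters will be ignored.",
    "WARNING 04-26 06:29:24 [worker_manager.py:168] LoRA module "
    "'model.layers.8.mixer.experts.99.up_proj' in adapter '/path/to/adapter' "
    "is not in the deployment-time target_modules restriction "
    "[k_proj, o_proj, q_proj, v_proj]. These parameters will be ignored.",
]

print(summarize_ignored_modules(sample))
```

Grouping by the last name component keeps the summary short even when thousands of per-expert modules are skipped.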

Test Result

Microbenchmark for is_supported_lora_module:

The benchmark builds module lists for Nemotron Nano/Super (model.layers....) and then times the total calls to is_supported_lora_module in one forward pass.

Before the fix (using regex):

  model | num_layers_moe | experts | len(supported) | len(modules) | total(s) | per_call(us)
  ------------------------------------------------------------------------------------------
  nano  |             23 |     128 |            271 |         6073 |    0.113 |        18.6
  super |             40 |     512 |           1039 |        41272 |    0.760 |        18.4

After the fix (using endswith):

  model | num_layers_moe | experts | len(supported) | len(modules) | total(s) | per_call(us)
  ------------------------------------------------------------------------------------------
  nano  |             23 |     128 |            271 |         6073 |    0.004 |         0.7
  super |             40 |     512 |           1039 |        41272 |    0.027 |         0.7
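
The benchmark script is not included in the PR text. The sketch below shows one way such a comparison could be set up. It is an illustration under stated assumptions: the module naming scheme is modelled on the warning log, the supported-module list is a small hypothetical one, and match_endswith is a guess at the shape of the fix based only on the "using endswith" label. It will not reproduce the len(supported) or timing values in the tables.

```python
import re
import time


def match_regex(module_name, supported):
    # Pre-fix logic as quoted in the issue: one regex match per candidate target.
    return any(
        re.match(r".*\.{target_module}$".format(target_module=t), module_name)
        or t == module_name
        for t in supported
    )


def match_endswith(module_name, supported):
    # Assumed shape of the fix: exact name or dotted-suffix comparison.
    return any(
        module_name == t or module_name.endswith("." + t) for t in supported
    )


def build_modules(num_layers, experts):
    # Hypothetical naming scheme modelled on the warning log in the test plan.
    return [
        f"model.layers.{layer}.mixer.experts.{e}.{proj}"
        for layer in range(num_layers)
        for e in range(experts)
        for proj in ("up_proj", "down_proj")
    ]


def bench(fn, modules, supported):
    # Returns (number of matches, total seconds, microseconds per call).
    start = time.perf_counter()
    hits = sum(fn(m, supported) for m in modules)
    total = time.perf_counter() - start
    return hits, total, total / len(modules) * 1e6


if __name__ == "__main__":
    supported = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]
    for name, layers, experts in (("nano", 23, 128), ("super", 40, 512)):
        modules = build_modules(layers, experts)
        for fn in (match_regex, match_endswith):
            hits, total, per_call = bench(fn, modules, supported)
            print(f"{name:5} {fn.__name__:14} {len(modules):6} {total:.3f}s {per_call:.1f}us")
```

Absolute timings depend on the machine and on how many distinct patterns are in the supported list, so only the relative gap between the two functions is meaningful here.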

All tests in test_lora_utils passed.

LoRA adapters now work with both Nemotron Nano and Nemotron Super.

LoRA adapter for Super: https://huggingface.co/dasereb/Nemo_3_Super_120B_BF16_lora_r4_all_linear_zeros

LoRA adapter for Nano: https://huggingface.co/dasereb/Nemotron_3_Nano_BF16_r16_zeros

Note:

The full-flow test is internal (not part of the vLLM repo). With this fix, results are back to vLLM v0.18.1 behavior.



Changed files

  • tests/lora/test_lora_utils.py (modified, +36/-0)
  • vllm/lora/utils.py (modified, +6/-9)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.31.6
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.8.0-71-generic-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.41
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA B200
GPU 1: NVIDIA B200

Nvidia driver version        : 580.82.07
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.1
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            GenuineIntel
Model name:                           INTEL(R) XEON(R) PLATINUM 8568Y+
CPU family:                           6
Model:                                207
Thread(s) per core:                   2
Core(s) per socket:                   48
Socket(s):                            2
Stepping:                             2
CPU(s) scaling MHz:                   91%
CPU max MHz:                          4000.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4600.00
...
==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] mypy-extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-bfcl==25.11
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-livecodebench==25.11
[pip3] nvidia-lm-eval==25.11
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==26.4.0
[pip3] torch==2.11.0
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.11.0
[pip3] torchvision==0.26.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect
...
</details>

🐛 Describe the bug

Symptom

When running generation with a LoRA adapter on Nemotron-3-Super, vLLM hangs indefinitely on the first inference after the LoRA is requested.

After ~60 seconds, vLLM begins repeatedly logging:

No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

vLLM eventually fails (timeout).

Model: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

LoRA adapter for testing: https://huggingface.co/dasereb/Nemo_3_Super_120B_BF16_lora_r4_all_linear_zeros

For comparison, Nemotron Nano (also a NemotronHForCausalLM model) does not hit the problem:

  • Nemotron-3-Nano-30B-A3B-BF16 (n_routed_experts=128, 23 MoE layers) - works.
  • Nemotron-3-Super-120B-A12B-BF16 (n_routed_experts=512, 40 MoE layers) - hangs.
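
As a rough gauge of the size difference between the two models, the layer and expert counts above can be multiplied out. This is back-of-the-envelope arithmetic only, under the assumption that per-expert modules dominate the number of names checked during adapter loading.

```python
# Figures taken from the two bullet points above.
configs = {
    "nano": {"moe_layers": 23, "routed_experts": 128},
    "super": {"moe_layers": 40, "routed_experts": 512},
}

# Number of (layer, expert) pairs per model.
expert_instances = {
    name: cfg["moe_layers"] * cfg["routed_experts"] for name, cfg in configs.items()
}
ratio = expert_instances["super"] / expert_instances["nano"]

print(expert_instances)  # {'nano': 2944, 'super': 20480}
print(round(ratio, 2))   # 6.96
```

Super has roughly seven times as many layer-expert pairs as Nano, so any per-module cost during loading scales up by about that factor.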

Note: all tests done on B200 GPUs.

Affected versions

The issue does not appear in vLLM v0.18.1. It was first seen with vLLM v0.19.0.

It still exists in vLLM main; the last tested commit is fe57be7809672e5c4d100b55ce8649dd34d3bbc0.

Debug summary (possible root cause)

After further investigation, the issue is possibly caused by a slow WorkerLoRAManager._load_adapter.

Specifically this code in is_supported_lora_module:

    return any(
        re.match(
            r".*\.{target_module}$".format(target_module=target_module),
            module_name,
        )
        or target_module == module_name
        for target_module in supported_lora_modules
    )

The function is_supported_lora_module was added in this PR: https://github.com/vllm-project/vllm/pull/34984

The issue seems to be related to the number of experts. That would explain why Nemotron Nano does not fail (it has 128 experts) but Nemotron Super does (it has 512 experts).

The regex is probably very slow for large models with a lot of experts.
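
One way to avoid the per-call regex cost is a plain string comparison. The sketch below is not the actual patch: suffix_match is a hypothetical name, and it assumes the intent of the regex is "the module name equals the target, or ends with a dot followed by the target". It checks that both versions agree on a few names shaped like those in this report.

```python
import re


def regex_match(module_name, supported_lora_modules):
    # Logic quoted above, unchanged.
    return any(
        re.match(
            r".*\.{target_module}$".format(target_module=target_module),
            module_name,
        )
        or target_module == module_name
        for target_module in supported_lora_modules
    )


def suffix_match(module_name, supported_lora_modules):
    # Hypothetical replacement: exact match, or match on a dotted suffix.
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in supported_lora_modules
    )


supported = ["q_proj", "down_proj"]
names = [
    "model.layers.8.mixer.experts.99.down_proj",  # dotted suffix
    "q_proj",                                     # exact name
    "model.layers.0.mixer.my_q_proj",             # no dot before q_proj
    "lm_head",                                    # unrelated module
]

for name in names:
    assert regex_match(name, supported) == suffix_match(name, supported)
    print(name, suffix_match(name, supported))
```

The two versions can differ when a target name contains regex metacharacters, because the quoted code interpolates target_module without re.escape, so a "." in the target matches any character. The suffix check treats the target literally.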


extent analysis

TL;DR

The issue can be resolved by optimizing the is_supported_lora_module function to improve performance for large models with many experts.

Guidance

  • Investigate the is_supported_lora_module function and consider optimizing the regex pattern to reduce computation time for large models.
  • Test the performance of the WorkerLoRAManager._load_adapter method with different model sizes to confirm the relationship between model size and loading time.
  • Consider adding a timeout or caching mechanism to the WorkerLoRAManager._load_adapter method to prevent indefinite hanging.
  • Verify that the issue is indeed related to the number of experts in the model by testing with different models and expert counts.
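
The caching idea in the guidance above could take the form of building the lookup data once per adapter load rather than once per module name. The following is a sketch under that assumption. make_lora_module_matcher is a hypothetical helper, not a vLLM API. It relies on str.endswith accepting a tuple of suffixes.

```python
def make_lora_module_matcher(supported_lora_modules):
    """Build a reusable predicate from the supported-module list (hypothetical helper)."""
    exact = frozenset(supported_lora_modules)
    # Precompute the dotted suffixes once; str.endswith accepts a tuple.
    suffixes = tuple("." + target for target in exact)

    def matches(module_name):
        return module_name in exact or module_name.endswith(suffixes)

    return matches


is_supported = make_lora_module_matcher(["q_proj", "k_proj", "v_proj", "o_proj"])
print(is_supported("model.layers.3.mixer.q_proj"))                # True
print(is_supported("model.layers.8.mixer.experts.99.down_proj"))  # False
```

The closure keeps the per-call work to one set lookup and one endswith call, regardless of how many names are checked afterwards.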

Example

import re
import time

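def is_supported_lora_module(module_name, supported_lora_modules):
    # perf_counter is monotonic and has finer resolution than time.time.
    start_time = time.perf_counter()
    result = any(
        re.match(
            r".*\.{target_module}$".format(target_module=target_module),
            module_name,
        )
        or target_module == module_name
        for target_module in supported_lora_modules
    )
    # A single call takes microseconds, so report in that unit.
    elapsed_us = (time.perf_counter() - start_time) * 1e6
    print(f"Time taken: {elapsed_us:.1f} us")
    return result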

This example can be used to measure the time taken by the is_supported_lora_module function and verify the performance impact of the regex pattern.

Notes

The issue seems to be related to the number of experts in the model, but further investigation is needed to confirm this. The provided code snippet is a possible cause of the issue, but other factors may also be contributing to the problem.

Recommendation

Optimize the is_supported_lora_module function so that it stays fast for large models with many experts. PR #40916 does this by replacing the regex with an endswith check. Caching the function's results is another option for reducing computation time.
