vllm - ✅(Solved) Fix [CI Failure]: Kernels FusedMoE Layer Test (2 H100s): test_moe_layer.py::test_moe_layer [2 pull requests, 2 comments, 2 participants]

vllm · 2026-04-22 17:15:05


GitHub stats

vllm-project/vllm#40637•Fetched 2026-04-23 07:23:41


Author

yewentao256

Participants

HollowMan6

yewentao256

Timeline (top)

commented ×2, cross-referenced ×2, added_to_project_v2 ×1, closed ×1

Error Message

=================================================================== short test summary info ====================================================================
FAILED kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_high_throughput-2-1-True] - torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
FAILED kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True] - torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
============================================= 2 failed, 183 passed, 335 skipped, 18 warnings in 687.34s (0:11:27) ==============================================
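To rerun only the failing parametrizations, the test IDs can be pulled out of the short summary above. The following is a minimal sketch, not part of the original issue: the helper names `failed_test_ids` and `rerun_command` are hypothetical, and the IDs are relative to wherever CI invoked pytest (apparently the `tests/` directory, judging by the changed-file paths).

```python
import re
import shlex

# The two FAILED lines copied from the short test summary in this issue.
SUMMARY = """\
FAILED kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_high_throughput-2-1-True] - torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
FAILED kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True] - torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
"""

# A pytest short-summary failure line has the shape "FAILED <node id> - <reason>".
FAILED_LINE = re.compile(r"^FAILED (\S+) - ", re.MULTILINE)


def failed_test_ids(summary: str) -> list[str]:
    """Return the pytest node IDs of every FAILED line in the summary."""
    return FAILED_LINE.findall(summary)


def rerun_command(test_ids: list[str]) -> str:
    """Build a shell-safe pytest command for the given node IDs.

    -x stops at the first failure and -s disables output capture so that
    worker-process output stays visible.
    """
    return shlex.join(["pytest", "-x", "-s", *test_ids])


if __name__ == "__main__":
    print(rerun_command(failed_test_ids(SUMMARY)))
```

The node IDs contain square brackets, so `shlex.join` is used to quote them for the shell instead of joining with plain spaces.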

Root Cause

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

Fix Action

Fixed

Fixed by PR: [Bugfix] LoRA for DeepSeek V3.2 (https://github.com/vllm-project/vllm/pull/35077)
Fixed by PR: [CI Bug] Fix ci issue #40637, Kernels FusedMoE Layer Test (2 H100s): test_moe_layer.py::test_moe_layer (https://github.com/vllm-project/vllm/pull/40639)

PR fix notes

PR #35077: [Bugfix] LoRA for DeepSeek V3.2

Repository: vllm-project/vllm
Author: HollowMan6
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/35077

Description (problem / solution / changelog)

Purpose

This PR fixes LoRA regressions seen with DeepSeek V3.2/DSA:

LoRA module registration failed for fused_qkv_a_proj with an assertion that the module was not a BaseLayerWithLoRA.
After that fix, MLA weight post-processing failed with AttributeError: 'ColumnParallelLinearWithLoRA' object has no attribute 'quant_method'.

   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/lora_model_runner_mixin.py", line 46, in load_lora_model
     return self.lora_manager.create_lora_manager(model, vllm_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 227, in create_lora_manager
     lora_manager = create_lora_manager(
                    ^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 895, in create_lora_manager
     lora_manager = lora_manager_cls(
                    ^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 807, in __init__
     super().__init__(
   File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 111, in __init__
     self._create_lora_modules()
   File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 407, in _create_lora_modules
     self.register_module(module_name, new_module)
   File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 414, in register_module
     assert isinstance(module, BaseLayerWithLoRA), (
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 AssertionError: Module model.layers.0.self_attn.fused_qkv_a_proj must be a BaseLayerWithLoRA instance, got <class 'vllm.model_executor.models.deepseek_v2.DeepSeekV2FusedQkvAProj'>

File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 858, in worker_busy_loop
     output = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
   File "/mnt/data/user/songlin/verl/verl/workers/rollout/vllm_rollout/utils.py", line 273, in update_weights_from_ipc
     process_weights_after_loading(model, model_config, self.device)
   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 117, in process_weights_after_loading
     module.process_weights_after_loading(model_config.dtype)
   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/attention/mla_attention.py", line 655, in process_weights_after_loading
     kv_b_proj_weight = get_and_maybe_dequant_weights(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/quant_utils.py", line 333, in get_and_maybe_dequant_weights
     if layer.quant_method is None or isinstance(
        ^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1965, in __getattr__
     raise AttributeError(
 AttributeError: 'ColumnParallelLinearWithLoRA' object has no attribute 'quant_method'

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 771, in worker_main
    worker = WorkerProc(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 597, in __init__
    self.worker.load_model()
  File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 336, in load_model
    self.model_runner.load_model(load_dummy_weights=dummy_weights)
  File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4222, in load_model
    self.model = self.load_lora_model(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/lora_model_runner_mixin.py", line 46, in load_lora_model
    return self.lora_manager.create_lora_manager(model, vllm_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 227, in create_lora_manager
    lora_manager = create_lora_manager(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 895, in create_lora_manager
    lora_manager = lora_manager_cls(
                   ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 807, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 111, in __init__
    self._create_lora_modules()
  File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 407, in _create_lora_modules
    self.register_module(module_name, new_module)
  File "/usr/local/lib/python3.12/site-packages/vllm/lora/model_manager.py", line 414, in
    assert isinstance(module, BaseLayerWithLoRA), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Module model.layers.3.mlp.gate must be a BaseLayerWithLoRA instance, got <class 'vllm.model_executor.layers.fused_moe.router.gate_linear.GateLinear'>
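The two failure modes in these tracebacks can be reproduced in miniature with plain Python classes. This is only an illustrative sketch under stated assumptions: `ToyLoRABase`, `ToyLinear`, both wrapper classes, and `register_module` are hypothetical stand-ins, not vLLM code, and forwarding the attribute through a property is one common way to address the second error, not necessarily what PR #35077 does.

```python
class ToyLoRABase:
    """Hypothetical stand-in for a LoRA wrapper base class (not vLLM code)."""

    def __init__(self, base_layer):
        self.base_layer = base_layer


class ToyLinear:
    """Hypothetical unwrapped layer that carries a quant_method attribute."""

    def __init__(self, quant_method=None):
        self.quant_method = quant_method


class WrapperWithoutForwarding(ToyLoRABase):
    """Wraps a layer but exposes none of the wrapped layer's attributes."""


class WrapperWithForwarding(ToyLoRABase):
    """Wraps a layer and forwards quant_method to the wrapped layer."""

    @property
    def quant_method(self):
        return self.base_layer.quant_method


def register_module(registry, name, module):
    """Accept only wrapped modules, mirroring the shape of the assertion above."""
    if not isinstance(module, ToyLoRABase):
        raise TypeError(
            f"Module {name} must be a ToyLoRABase instance, got {type(module)}"
        )
    registry[name] = module


if __name__ == "__main__":
    registry = {}
    try:
        # An unwrapped layer is rejected, like fused_qkv_a_proj and mlp.gate.
        register_module(registry, "mlp.gate", ToyLinear())
    except TypeError as exc:
        print("rejected:", exc)

    # A wrapper that does not forward the attribute fails on later lookups.
    print(hasattr(WrapperWithoutForwarding(ToyLinear()), "quant_method"))

    # A forwarding wrapper registers and still answers quant_method lookups.
    register_module(registry, "kv_b_proj", WrapperWithForwarding(ToyLinear("fp8")))
    print(registry["kv_b_proj"].quant_method)
```

The value `"fp8"` is an arbitrary placeholder. The point is that code reading `layer.quant_method` after wrapping only works if the wrapper exposes that attribute.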

Test Plan

Added unit test cases and also ran an end-to-end test manually.

Test Result

All tests pass without the errors above.


Changed files

tests/kernels/moe/test_moe_layer.py (modified, +7/-12)
tests/lora/test_layers.py (modified, +270/-2)
tests/lora/test_lora_manager.py (modified, +150/-0)
tests/lora/test_lora_utils.py (modified, +21/-0)
vllm/lora/layers/base_linear.py (modified, +9/-0)
vllm/lora/layers/column_parallel_linear.py (modified, +42/-10)
vllm/lora/layers/replicated_linear.py (modified, +7/-1)
vllm/lora/model_manager.py (modified, +52/-7)
vllm/lora/utils.py (modified, +25/-3)
vllm/lora/worker_manager.py (modified, +8/-1)
vllm/model_executor/layers/fused_moe/oracle/unquantized.py (modified, +13/-0)
vllm/model_executor/layers/quantization/utils/quant_utils.py (modified, +6/-0)
vllm/v1/worker/lora_model_runner_mixin.py (modified, +4/-1)

PR #40639: [CI Bug] Fix ci issue #40637, Kernels FusedMoE Layer Test (2 H100s): test_moe_layer.py::test_moe_layer

Repository: vllm-project/vllm
Author: yewentao256
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/40639

Description (problem / solution / changelog)

Purpose

Fixes CI issue #40637, which seems to have been introduced by https://github.com/vllm-project/vllm/pull/35077

Test

Covered in CI

Changed files

tests/kernels/moe/test_moe_layer.py (modified, +12/-7)



Name of failing test

Basic information

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

https://buildkite.com/vllm/ci/builds/62456#019db5a5-fc65-4fe3-bcd4-62ead4870367


=================================================================== short test summary info ====================================================================
--
FAILED kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_high_throughput-2-1-True] - torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
FAILED kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True] - torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
============================================= 2 failed, 183 passed, 335 skipped, 18 warnings in 687.34s (0:11:27) ==============================================

📝 History of failing test

Since happened yesterday

CC List.

https://buildkite.com/vllm/ci/builds?branch=main&query=nightly Failing only today; it did not happen yesterday.

extent analysis

TL;DR

Investigate the torch.multiprocessing.spawn functionality and its interaction with the test environment to resolve the ProcessExitedException issue.
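For context on what "terminated with signal SIGABRT" means at the process level, here is a small standard-library sketch. It does not use `torch.multiprocessing.spawn` and is not how that function is implemented; it only shows, under the assumption of a POSIX system, how a parent process observes a child that aborted. The helper names `describe_exit` and `run_aborting_child` are hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a POSIX-style return code into a human-readable description.

    On POSIX, subprocess reports a child killed by signal N as returncode -N.
    """
    if returncode < 0:
        return f"terminated with signal {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


def run_aborting_child() -> int:
    """Start a child interpreter that calls os.abort() and return its code.

    os.abort() raises SIGABRT in the child, much as an abort() call or a
    failed native assertion inside a compiled extension would.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    print(describe_exit(run_aborting_child()))
```

A SIGABRT usually originates in native code rather than in a Python exception, so the Python traceback in the parent rarely shows the real cause. The worker's own stderr in the CI log is the place to look.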

Guidance

Review the test configuration and environment to ensure that torch.multiprocessing.spawn is properly set up and compatible with the test framework.
Check the Buildkite CI build logs for any additional error messages or warnings that may indicate the root cause of the SIGABRT signal.
Investigate potential issues with the test_moe_layer.py test case, such as resource constraints or incorrect test data, that may be contributing to the process termination.
Consider running the test locally to reproduce the issue and gather more detailed debugging information.

Example

No specific code snippet can be provided without more context, but reviewing the test_moe_layer.py test case and the torch.multiprocessing.spawn documentation may help identify potential issues.

Notes

The issue may be related to a specific combination of test parameters or environment settings, and further investigation is needed to determine the root cause.

Recommendation

Apply a workaround by modifying the test configuration or environment to avoid the ProcessExitedException issue, as the root cause is not yet clear.

FAIL-SAFE

If the issue persists, consider reaching out to the PyTorch or Buildkite communities for additional support and guidance.
