vllm - [CI Failure]: Kernels FusedMoE Layer Test (2 H100s) is flaky

vllm, 2026-04-10 14:53:38


vllm-project/vllm#39503 (fetched 2026-04-11 06:13:12)


Error Message

E       torch.multiprocessing.spawn.ProcessRaisedException:
E
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
E           fn(i, *args)
E         File "/home/sagemoore/git/nm-vllm/tests/kernels/moe/modular_kernel_tools/parallel_utils.py", line 120, in _worker_parallel_launch
E           cleanup_dist_env_and_memory()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1919, in cleanup_dist_env_and_memory
E           destroy_model_parallel()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1883, in destroy_model_parallel
E           _EP.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1063, in destroy
E           self.device_communicator.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/device_communicators/cuda_communicator.py", line 349, in destroy
E           self.all2all_manager.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/device_communicators/all2all.py", line 302, in destroy
E           handle.destroy()
E         File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/deep_ep/buffer.py", line 146, in destroy
E           self.runtime.destroy()
E       RuntimeError: Failed: CUDA error /home/sagemoore/git/vllm_dependency_install/deepep/csrc/deep_ep.cpp:288 'unspecified launch failure'


Code Example

DeepEP timeout check failed: rank = 1, thread = 0, value = 1024)
DeepEP timeout check failed: rank = 1, thread = 1, value = 0)
DeepEP timeout check failed: rank = 0, thread = 0, value = 0)
DeepEP timeout check failed: rank = 0, thread = 1, value = 1024)
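When comparing a local repro against a CI log, it can help to pull these timeout lines out programmatically. The following is a minimal sketch that assumes the line format shown above; the names `TimeoutRecord`, `parse_deepep_timeouts`, and `ranks_with_timeouts` are hypothetical helpers, not part of vLLM or DeepEP, and the sketch makes no claim about what the `value` field means.

```python
import re
from typing import NamedTuple


class TimeoutRecord(NamedTuple):
    """One 'DeepEP timeout check failed' line (hypothetical helper type)."""
    rank: int
    thread: int
    value: int


# Matches lines such as:
#   DeepEP timeout check failed: rank = 1, thread = 0, value = 1024)
# The trailing ')' in the log is left unmatched on purpose.
_TIMEOUT_RE = re.compile(
    r"DeepEP timeout check failed: rank = (\d+), thread = (\d+), value = (-?\d+)"
)


def parse_deepep_timeouts(log_text: str) -> list[TimeoutRecord]:
    """Return one record per timeout line found in log_text, in log order."""
    return [
        TimeoutRecord(int(rank), int(thread), int(value))
        for rank, thread, value in _TIMEOUT_RE.findall(log_text)
    ]


def ranks_with_timeouts(records: list[TimeoutRecord]) -> list[int]:
    """Return the sorted set of ranks that reported at least one timeout."""
    return sorted({rec.rank for rec in records})
```

Running `parse_deepep_timeouts` over both the local output and the nightly log would show whether the same ranks and threads report timeouts in each.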

---

===================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.12.11, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/sagemoore/git/nm-vllm
configfile: pyproject.toml
plugins: anyio-4.13.0, Faker-40.11.1
collected 0 items                                                                                                                                                           

====================================================================================================================== warnings summary =======================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================================================================== 2 warnings in 0.00s =====================================================================================================================
ERROR: file or directory not found: kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True]

sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
(nm-vllm) sagemoore@nm-automation-h100-standalone-1-preserve ~/g/nm-vllm (main) [4]> g2 pytest tests/kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True]
Reserved 2 GPU(s): [3 6] for command execution (timeout: 0h 15m 0s)
===================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.12.11, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/sagemoore/git/nm-vllm
configfile: pyproject.toml
plugins: anyio-4.13.0, Faker-40.11.1
collected 1 item                                                                                                                                                            

tests/kernels/moe/test_moe_layer.py F                                                                                                                                [100%]

================================================================================= FAILURES =================================================================================
____________________________________________________________ test_moe_layer[False-deepep_low_latency-2-1-True] _____________________________________________________________

dp_size = 2, tp_size = 1, use_ep = True, backend = 'deepep_low_latency', enable_eplb = False, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f1904d88d10>
pytestconfig = <_pytest.config.Config object at 0x7f21f33b7230>, subtests = None

    @pytest.mark.parametrize("dp_size, tp_size, use_ep", PARALLEL_COMBOS)
    @pytest.mark.parametrize("backend", BACKENDS)
    @pytest.mark.parametrize("enable_eplb", [False, True])
    def test_moe_layer(
        dp_size: int,
        tp_size: int,
        use_ep: bool,
        backend: str,
        enable_eplb: bool,
        monkeypatch,
        pytestconfig,
        subtests,
    ):
        """Test MoE layer with parallelism (multi-GPU or TP/EP enabled).

        For non-parallel cases (world_size == 1), use test_moe_layer_no_parallel instead.
        """
        num_gpus = current_platform.device_count()
        world_size = tp_size * dp_size
        ep_size = 1 if not use_ep else world_size  # or dp_size?
        assert world_size > 1

        # Check if enough GPUs available
        if world_size is not None and num_gpus is not None and world_size > num_gpus:
            pytest.skip(f"Not enough GPUs got {num_gpus}, expected {world_size}.")

        if enable_eplb and not use_ep:
            pytest.skip("EPLB requires EP.")

        verbosity = pytestconfig.getoption("verbose")

        test_env = dict()
        test_env["VLLM_MOE_DP_CHUNK_SIZE"] = "128"
        monkeypatch.setenv("VLLM_MOE_DP_CHUNK_SIZE", "128")
        if os.environ.get("VLLM_LOGGING_LEVEL") is None:
            monkeypatch.setenv("VLLM_LOGGING_LEVEL", "ERROR")

        # TODO
        # VLLM_FLASHINFER_MOE_BACKEND=latency
        # VLLM_USE_FLASHINFER_MOE_FP16=1
        # VLLM_USE_FLASHINFER_MOE_FP8
        # VLLM_USE_FLASHINFER_MOE_FP4
        # VLLM_USE_FLASHINFER_MOE_INT4

        parallel_config = ParallelConfig(
            pipeline_parallel_size=1,
            data_parallel_size=dp_size,
            tensor_parallel_size=tp_size,
            enable_expert_parallel=use_ep,
            all2all_backend=backend,
            enable_eplb=enable_eplb,
        )

        compilation_config = CompilationConfig()
        # compilation_config.mode = CompilationMode.NONE  # for now
        compilation_config.pass_config.fuse_allreduce_rms = False  # for now

        vllm_config = VllmConfig(
            parallel_config=parallel_config, compilation_config=compilation_config
        )

        test_configs = generate_valid_test_configs(
            backend, ep_size, dp_size, tp_size, enable_eplb, verbosity
        )

        if subtests is not None:
            new_test_configs = []
            for subtest in subtests.split(","):
                sub_test_config = MoETestConfig.from_id(subtest)
                if sub_test_config in test_configs:
                    new_test_configs.append(sub_test_config)
                else:
                    pytest.skip(
                        f"subtest config {subtest} does not match any valid test "
                        "configuration"
                    )
            test_configs = new_test_configs

        if len(test_configs) == 0:
            pytest.skip("No supported configs found for this testpoint.")

        try:
>           parallel_launch_with_config(
                world_size,
                _parallel_worker,
                vllm_config,
                test_env,
                test_configs,
                verbosity,
            )

tests/kernels/moe/test_moe_layer.py:1717:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/kernels/moe/modular_kernel_tools/parallel_utils.py:133: in parallel_launch_with_config
    spawn(
.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py:340: in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py:296: in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torch.multiprocessing.spawn.ProcessContext object at 0x7f1904502960>, timeout = None, grace_period = None

    def join(self, timeout: float | None = None, grace_period: float | None = None):
        r"""Join one or more processes within spawn context.

        Attempt to join one or more processes in this spawn context.
        If one of them exited with a non-zero exit status, this function
        kills the remaining processes (optionally with a grace period)
        and raises an exception with the cause of the first process exiting.

        Returns ``True`` if all processes have been joined successfully,
        ``False`` if there are more processes that need to be joined.

        Args:
            timeout (float): Wait this long (in seconds) before giving up on waiting.
            grace_period (float): When any processes fail, wait this long (in seconds)
                for others to shutdown gracefully before terminating them. If they
                still don't exit, wait another grace period before killing them.
        """
        # Ensure this function can be called even when we're done.
        if len(self.sentinels) == 0:
            return True

        # Wait for any process to fail or all of them to succeed.
        ready = multiprocessing.connection.wait(
            self.sentinels.keys(),
            timeout=timeout,
        )

        error_index = None
        for sentinel in ready:
            index = self.sentinels.pop(sentinel)
            process = self.processes[index]
            process.join()
            if process.exitcode != 0:
                error_index = index
                break

        # Return if there was no error.
        if error_index is None:
            # Return whether or not all processes have been joined.
            return len(self.sentinels) == 0
        # An error occurred. Clean-up all processes before returning.
        # First, allow a grace period for processes to shutdown themselves.
        if grace_period is not None:
            self._join_procs_with_timeout(grace_period)
        # Then, terminate processes that are still alive. Try SIGTERM first.
        for process in self.processes:
            if process.is_alive():
                log.warning("Terminating process %s via signal SIGTERM", process.pid)
                process.terminate()

        # Try SIGKILL if the process isn't going down after another grace_period.
        # The reason is related to python signal handling is limited
        # to main thread and if that is in c/c++ land and stuck it won't
        # to handle it. We have seen processes getting stuck not handling
        # SIGTERM for the above reason.
        self._join_procs_with_timeout(30 if grace_period is None else grace_period)
        for process in self.processes:
            if process.is_alive():
                log.warning(
                    "Unable to shutdown process %s via SIGTERM , forcefully exiting via SIGKILL",
                    process.pid,
                )
                process.kill()
            process.join()

        # The file will only be created if the process crashed.
        failed_process = self.processes[error_index]
        if not os.access(self.error_files[error_index], os.R_OK):
            exitcode = self.processes[error_index].exitcode
            if exitcode < 0:
                try:
                    name = signal.Signals(-exitcode).name
                except ValueError:
                    name = f"<Unknown signal {-exitcode}>"
                raise ProcessExitedException(
                    f"process {error_index:d} terminated with signal {name}",
                    error_index=error_index,
                    error_pid=failed_process.pid,
                    exit_code=exitcode,
                    signal_name=name,
                )
            else:
                raise ProcessExitedException(
                    f"process {error_index:d} terminated with exit code {exitcode:d}",
                    error_index=error_index,
                    error_pid=failed_process.pid,
                    exit_code=exitcode,
                )

        with open(self.error_files[error_index], "rb") as fh:
            original_trace = pickle.load(fh)
        msg = f"\n\n-- Process {error_index:d} terminated with the following error:\n"
        msg += original_trace
>       raise ProcessRaisedException(msg, error_index, failed_process.pid)
E       torch.multiprocessing.spawn.ProcessRaisedException:
E
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
E           fn(i, *args)
E         File "/home/sagemoore/git/nm-vllm/tests/kernels/moe/modular_kernel_tools/parallel_utils.py", line 120, in _worker_parallel_launch
E           cleanup_dist_env_and_memory()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1919, in cleanup_dist_env_and_memory
E           destroy_model_parallel()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1883, in destroy_model_parallel
E           _EP.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1063, in destroy
E           self.device_communicator.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/device_communicators/cuda_communicator.py", line 349, in destroy
E           self.all2all_manager.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/device_communicators/all2all.py", line 302, in destroy
E           handle.destroy()
E         File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/deep_ep/buffer.py", line 146, in destroy
E           self.runtime.destroy()
E       RuntimeError: Failed: CUDA error /home/sagemoore/git/vllm_dependency_install/deepep/csrc/deep_ep.cpp:288 'unspecified launch failure'

.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py:211: ProcessRaisedException
--------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------
INFO 04-10 14:07:35 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-10 14:07:35 [vllm.py:799] Asynchronous scheduling is enabled.
INFO 04-10 14:07:35 [vllm.py:809] Disabling NCCL for DP synchronization when using async scheduling.
INFO 04-10 14:07:35 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
DeepEP timeout check failed: rank = 1, thread = 0, value = 1024)
DeepEP timeout check failed: rank = 1, thread = 1, value = 0)
DeepEP timeout check failed: rank = 0, thread = 0, value = 0)
DeepEP timeout check failed: rank = 0, thread = 1, value = 1024)
................
============= 16 passed of 16 total tests =============
................
============= 16 passed of 16 total tests =============
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
--------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
W0410 14:09:32.356000 3348241 .venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:165] Terminating process 3348580 via signal SIGTERM


Name of failing test

tests/kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True]

Basic information

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

The test tests/kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True] fails occasionally in CI. The failure occurs in this run from the latest nightly: https://buildkite.com/vllm/ci/builds/60760#019d75fb-1530-448f-a743-a9270a55bc04. The test group passed when I reran it.

I was able to reproduce the failure locally, but only on my first run; the test has passed every time since. The full output is pasted below. It's unclear whether these DeepEP timeouts are caused by a hang or just a first-run slowdown.

Output snippet from my local repro. These same timeouts are in the nightly run's output as well.

DeepEP timeout check failed: rank = 1, thread = 0, value = 1024)
DeepEP timeout check failed: rank = 1, thread = 1, value = 0)
DeepEP timeout check failed: rank = 0, thread = 0, value = 0)
DeepEP timeout check failed: rank = 0, thread = 1, value = 1024)

Full output of pytest tests/kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True]

===================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.12.11, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/sagemoore/git/nm-vllm
configfile: pyproject.toml
plugins: anyio-4.13.0, Faker-40.11.1
collected 0 items                                                                                                                                                           

====================================================================================================================== warnings summary =======================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================================================================== 2 warnings in 0.00s =====================================================================================================================
ERROR: file or directory not found: kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True]

sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
(nm-vllm) sagemoore@nm-automation-h100-standalone-1-preserve ~/g/nm-vllm (main) [4]> g2 pytest tests/kernels/moe/test_moe_layer.py::test_moe_layer[False-deepep_low_latency-2-1-True]
Reserved 2 GPU(s): [3 6] for command execution (timeout: 0h 15m 0s)
===================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.12.11, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/sagemoore/git/nm-vllm
configfile: pyproject.toml
plugins: anyio-4.13.0, Faker-40.11.1
collected 1 item                                                                                                                                                            

tests/kernels/moe/test_moe_layer.py F                                                                                                                                [100%]

================================================================================= FAILURES =================================================================================
____________________________________________________________ test_moe_layer[False-deepep_low_latency-2-1-True] _____________________________________________________________

dp_size = 2, tp_size = 1, use_ep = True, backend = 'deepep_low_latency', enable_eplb = False, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f1904d88d10>
pytestconfig = <_pytest.config.Config object at 0x7f21f33b7230>, subtests = None

    @pytest.mark.parametrize("dp_size, tp_size, use_ep", PARALLEL_COMBOS)
    @pytest.mark.parametrize("backend", BACKENDS)
    @pytest.mark.parametrize("enable_eplb", [False, True])
    def test_moe_layer(
        dp_size: int,
        tp_size: int,
        use_ep: bool,
        backend: str,
        enable_eplb: bool,
        monkeypatch,
        pytestconfig,
        subtests,
    ):
        """Test MoE layer with parallelism (multi-GPU or TP/EP enabled).

        For non-parallel cases (world_size == 1), use test_moe_layer_no_parallel instead.
        """
        num_gpus = current_platform.device_count()
        world_size = tp_size * dp_size
        ep_size = 1 if not use_ep else world_size  # or dp_size?
        assert world_size > 1

        # Check if enough GPUs available
        if world_size is not None and num_gpus is not None and world_size > num_gpus:
            pytest.skip(f"Not enough GPUs got {num_gpus}, expected {world_size}.")

        if enable_eplb and not use_ep:
            pytest.skip("EPLB requires EP.")

        verbosity = pytestconfig.getoption("verbose")

        test_env = dict()
        test_env["VLLM_MOE_DP_CHUNK_SIZE"] = "128"
        monkeypatch.setenv("VLLM_MOE_DP_CHUNK_SIZE", "128")
        if os.environ.get("VLLM_LOGGING_LEVEL") is None:
            monkeypatch.setenv("VLLM_LOGGING_LEVEL", "ERROR")

        # TODO
        # VLLM_FLASHINFER_MOE_BACKEND=latency
        # VLLM_USE_FLASHINFER_MOE_FP16=1
        # VLLM_USE_FLASHINFER_MOE_FP8
        # VLLM_USE_FLASHINFER_MOE_FP4
        # VLLM_USE_FLASHINFER_MOE_INT4

        parallel_config = ParallelConfig(
            pipeline_parallel_size=1,
            data_parallel_size=dp_size,
            tensor_parallel_size=tp_size,
            enable_expert_parallel=use_ep,
            all2all_backend=backend,
            enable_eplb=enable_eplb,
        )

        compilation_config = CompilationConfig()
        # compilation_config.mode = CompilationMode.NONE  # for now
        compilation_config.pass_config.fuse_allreduce_rms = False  # for now

        vllm_config = VllmConfig(
            parallel_config=parallel_config, compilation_config=compilation_config
        )

        test_configs = generate_valid_test_configs(
            backend, ep_size, dp_size, tp_size, enable_eplb, verbosity
        )

        if subtests is not None:
            new_test_configs = []
            for subtest in subtests.split(","):
                sub_test_config = MoETestConfig.from_id(subtest)
                if sub_test_config in test_configs:
                    new_test_configs.append(sub_test_config)
                else:
                    pytest.skip(
                        f"subtest config {subtest} does not match any valid test "
                        "configuration"
                    )
            test_configs = new_test_configs

        if len(test_configs) == 0:
            pytest.skip("No supported configs found for this testpoint.")

        try:
>           parallel_launch_with_config(
                world_size,
                _parallel_worker,
                vllm_config,
                test_env,
                test_configs,
                verbosity,
            )

tests/kernels/moe/test_moe_layer.py:1717:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/kernels/moe/modular_kernel_tools/parallel_utils.py:133: in parallel_launch_with_config
    spawn(
.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py:340: in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py:296: in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torch.multiprocessing.spawn.ProcessContext object at 0x7f1904502960>, timeout = None, grace_period = None

    def join(self, timeout: float | None = None, grace_period: float | None = None):
        r"""Join one or more processes within spawn context.

        Attempt to join one or more processes in this spawn context.
        If one of them exited with a non-zero exit status, this function
        kills the remaining processes (optionally with a grace period)
        and raises an exception with the cause of the first process exiting.

        Returns ``True`` if all processes have been joined successfully,
        ``False`` if there are more processes that need to be joined.

        Args:
            timeout (float): Wait this long (in seconds) before giving up on waiting.
            grace_period (float): When any processes fail, wait this long (in seconds)
                for others to shutdown gracefully before terminating them. If they
                still don't exit, wait another grace period before killing them.
        """
        # Ensure this function can be called even when we're done.
        if len(self.sentinels) == 0:
            return True

        # Wait for any process to fail or all of them to succeed.
        ready = multiprocessing.connection.wait(
            self.sentinels.keys(),
            timeout=timeout,
        )

        error_index = None
        for sentinel in ready:
            index = self.sentinels.pop(sentinel)
            process = self.processes[index]
            process.join()
            if process.exitcode != 0:
                error_index = index
                break

        # Return if there was no error.
        if error_index is None:
            # Return whether or not all processes have been joined.
            return len(self.sentinels) == 0
        # An error occurred. Clean-up all processes before returning.
        # First, allow a grace period for processes to shutdown themselves.
        if grace_period is not None:
            self._join_procs_with_timeout(grace_period)
        # Then, terminate processes that are still alive. Try SIGTERM first.
        for process in self.processes:
            if process.is_alive():
                log.warning("Terminating process %s via signal SIGTERM", process.pid)
                process.terminate()

        # Try SIGKILL if the process isn't going down after another grace_period.
        # The reason is related to python signal handling is limited
        # to main thread and if that is in c/c++ land and stuck it won't
        # to handle it. We have seen processes getting stuck not handling
        # SIGTERM for the above reason.
        self._join_procs_with_timeout(30 if grace_period is None else grace_period)
        for process in self.processes:
            if process.is_alive():
                log.warning(
                    "Unable to shutdown process %s via SIGTERM , forcefully exiting via SIGKILL",
                    process.pid,
                )
                process.kill()
            process.join()

        # The file will only be created if the process crashed.
        failed_process = self.processes[error_index]
        if not os.access(self.error_files[error_index], os.R_OK):
            exitcode = self.processes[error_index].exitcode
            if exitcode < 0:
                try:
                    name = signal.Signals(-exitcode).name
                except ValueError:
                    name = f"<Unknown signal {-exitcode}>"
                raise ProcessExitedException(
                    f"process {error_index:d} terminated with signal {name}",
                    error_index=error_index,
                    error_pid=failed_process.pid,
                    exit_code=exitcode,
                    signal_name=name,
                )
            else:
                raise ProcessExitedException(
                    f"process {error_index:d} terminated with exit code {exitcode:d}",
                    error_index=error_index,
                    error_pid=failed_process.pid,
                    exit_code=exitcode,
                )

        with open(self.error_files[error_index], "rb") as fh:
            original_trace = pickle.load(fh)
        msg = f"\n\n-- Process {error_index:d} terminated with the following error:\n"
        msg += original_trace
>       raise ProcessRaisedException(msg, error_index, failed_process.pid)
E       torch.multiprocessing.spawn.ProcessRaisedException:
E
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py", line 87, in _wrap
E           fn(i, *args)
E         File "/home/sagemoore/git/nm-vllm/tests/kernels/moe/modular_kernel_tools/parallel_utils.py", line 120, in _worker_parallel_launch
E           cleanup_dist_env_and_memory()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1919, in cleanup_dist_env_and_memory
E           destroy_model_parallel()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1883, in destroy_model_parallel
E           _EP.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/parallel_state.py", line 1063, in destroy
E           self.device_communicator.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/device_communicators/cuda_communicator.py", line 349, in destroy
E           self.all2all_manager.destroy()
E         File "/home/sagemoore/git/nm-vllm/vllm/distributed/device_communicators/all2all.py", line 302, in destroy
E           handle.destroy()
E         File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/deep_ep/buffer.py", line 146, in destroy
E           self.runtime.destroy()
E       RuntimeError: Failed: CUDA error /home/sagemoore/git/vllm_dependency_install/deepep/csrc/deep_ep.cpp:288 'unspecified launch failure'

.venv/lib64/python3.12/site-packages/torch/multiprocessing/spawn.py:211: ProcessRaisedException
--------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------
INFO 04-10 14:07:35 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-10 14:07:35 [vllm.py:799] Asynchronous scheduling is enabled.
INFO 04-10 14:07:35 [vllm.py:809] Disabling NCCL for DP synchronization when using async scheduling.
INFO 04-10 14:07:35 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
DeepEP timeout check failed: rank = 1, thread = 0, value = 1024)
DeepEP timeout check failed: rank = 1, thread = 1, value = 0)
DeepEP timeout check failed: rank = 0, thread = 0, value = 0)
DeepEP timeout check failed: rank = 0, thread = 1, value = 1024)
................
============= 16 passed of 16 total tests =============
................
============= 16 passed of 16 total tests =============
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
--------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
W0410 14:09:32.356000 3348241 .venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:165] Terminating process 3348580 via signal SIGTERM

📝 History of failing test

Given the flaky nature of the failure, it's tough to tell when it started.

CC List.

No response

extent analysis

TL;DR

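The most likely path to a fix is to find out why the "DeepEP timeout check failed" messages appear on both ranks. In the local repro, the captured stdout reports "16 passed of 16 total tests" twice, and the traceback places the exception in teardown: cleanup_dist_env_and_memory() reaches the DeepEP runtime's destroy(), which fails with the CUDA error 'unspecified launch failure'. The timeouts may be related to resource leaks or synchronization issues.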

Guidance

Investigate the "DeepEP timeout check failed" messages and how they relate to the warning that destroy() was not called before DeepEP buffer destruction, which can leak resources.
Review the synchronization in the test setup; the log shows asynchronous scheduling enabled and NCCL disabled for DP synchronization.
Consider adding error handling or retries around the affected test to improve reliability while the root cause is unknown.
Verify that the test environment and dependencies, including CUDA and DeepEP versions, are consistent between CI and local runs. The stderr also shows NVSHMEM reporting "init failed for transport: IBGDA" twice, which may be worth ruling out.

Example

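The issue does not include a fix or a code change to illustrate. The traceback does show where to look: cleanup_dist_env_and_memory() calls destroy_model_parallel(), which reaches handle.destroy() in all2all.py and then self.runtime.destroy() in deep_ep/buffer.py. Reviewing when destroy() is called and ensuring proper resource cleanup may help resolve the issue.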
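As a rough illustration of the cleanup idea only, the sketch below releases a group of handles on exit and keeps going when one destroy() call raises, so the remaining handles are still released. FakeHandle and destroy_all_on_exit are hypothetical names invented for this example; they are not part of vLLM or DeepEP, and the sketch does not address the underlying 'unspecified launch failure'.

```python
from contextlib import contextmanager


class FakeHandle:
    """Hypothetical stand-in for a communication buffer that needs destroy()."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.destroyed = False

    def destroy(self):
        # Record that destroy() was attempted, then optionally simulate a failure.
        self.destroyed = True
        if self.fail:
            raise RuntimeError(f"destroy failed for {self.name}")


@contextmanager
def destroy_all_on_exit(handles):
    """Call destroy() on every handle on exit, even if one of them raises."""
    try:
        yield handles
    finally:
        errors = []
        for handle in handles:
            try:
                handle.destroy()
            except RuntimeError as exc:
                # Keep going so the remaining handles are still released.
                errors.append(exc)
        if errors:
            # Surface the first failure after every handle was attempted.
            raise errors[0]
```

Used as `with destroy_all_on_exit(handles): ...`, the first teardown error is still raised, but only after every handle has had destroy() called on it.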

Notes

The flaky nature of the test failure makes it challenging to identify the root cause. Further investigation and debugging are necessary to determine the exact cause of the "DeepEP timeout check failed" errors.
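To compare the timeout messages across the nightly and local logs, it may help to extract them into a structured form. The sketch below assumes the line format shown in the issue's output and nothing more; parse_deepep_timeouts is a hypothetical helper written for this example.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   DeepEP timeout check failed: rank = 1, thread = 0, value = 1024)
TIMEOUT_RE = re.compile(
    r"DeepEP timeout check failed: rank = (\d+), thread = (\d+), value = (\d+)"
)


def parse_deepep_timeouts(log_text):
    """Return {rank: [(thread, value), ...]} for every timeout line found."""
    by_rank = defaultdict(list)
    for match in TIMEOUT_RE.finditer(log_text):
        rank, thread, value = (int(group) for group in match.groups())
        by_rank[rank].append((thread, value))
    return dict(by_rank)


# The four lines reported in the issue.
sample = """\
DeepEP timeout check failed: rank = 1, thread = 0, value = 1024)
DeepEP timeout check failed: rank = 1, thread = 1, value = 0)
DeepEP timeout check failed: rank = 0, thread = 0, value = 0)
DeepEP timeout check failed: rank = 0, thread = 1, value = 1024)
"""
timeouts = parse_deepep_timeouts(sample)
print(timeouts)
```

Grouping by rank makes it easier to see whether the same thread and value pairs recur from one failing run to the next.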

Recommendation

As a stopgap, consider adding error handling or retries around the affected test to reduce CI noise, while continuing to investigate the root cause of the "DeepEP timeout check failed" messages.
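A retry wrapper along these lines is one possible shape for such a workaround. This is a generic sketch, not vLLM's test infrastructure: run_with_retries and the flaky() stand-in are hypothetical, and the choice to retry only on RuntimeError is an assumption. The wrapper returns the failures it swallowed so that a first-run-only failure, like the one described in the issue, stays visible instead of being silently hidden.

```python
import time


def run_with_retries(fn, attempts=3, delay_s=0.0, retry_on=(RuntimeError,)):
    """Call fn() up to `attempts` times; return (result, failures_seen)."""
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            return fn(), failures
        except retry_on as exc:
            failures.append(exc)
            if attempt == attempts:
                # Out of attempts: let the last error propagate.
                raise
            time.sleep(delay_s)


# Hypothetical stand-in for a test body that fails only on its first run.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("first run fails")
    return "ok"


result, failures = run_with_retries(flaky)
print(result, len(failures))
```

Logging the contents of `failures` on success would preserve the signal needed to keep tracking how often the first attempt fails.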
