vllm - ✅(Solved) Fix [CI Failure]: [Kernels (B200) [2 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#43086 (fetched 2026-05-20 03:39:58)
Comments: 0
Participants: 1
Timeline: 9
Reactions: 0
Timeline (top): mentioned ×2, subscribed ×2, added_to_project_v2 ×1, closed ×1

PR fix notes

PR #42857: [Perf] Re-enable flashinfer autotune by default and cleanup

Description (problem / solution / changelog)

Purpose

This PR re-enables flashinfer autotune by default as previous correctness issues are now fixed: https://github.com/flashinfer-ai/flashinfer/pull/3227.

In addition, this PR does some cleanup:

  • Remove the _is_fi_autotuning wrapper, as it is no longer needed.
  • Run autotuning on rank 0 only and broadcast the chosen tactics to the other ranks, so that all ranks run the same tactics.
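
The rank-0 autotune-then-broadcast pattern described in the second bullet can be sketched as follows. This is a minimal illustration, not the code from the PR: `InProcessBroadcast`, `resolve_tactics`, `fake_autotune`, and the tactic names are all hypothetical, and the in-process broadcast stands in for a real collective (for example, something like `torch.distributed.broadcast_object_list` could play that role in a multi-process setup).

```python
from typing import Callable, Dict, Optional

# Hypothetical representation: kernel name -> chosen tactic index.
Tactics = Dict[str, int]


class InProcessBroadcast:
    """Toy stand-in for a collective broadcast from rank 0 (illustration only)."""

    def __init__(self) -> None:
        self._payload: Optional[Tactics] = None

    def __call__(self, payload: Optional[Tactics]) -> Tactics:
        # Rank 0 passes its tuned tactics; other ranks pass None and get a copy.
        if payload is not None:
            self._payload = dict(payload)
        if self._payload is None:
            raise RuntimeError("rank 0 must broadcast before other ranks receive")
        return dict(self._payload)


def resolve_tactics(
    rank: int,
    autotune: Callable[[], Tactics],
    broadcast: Callable[[Optional[Tactics]], Tactics],
) -> Tactics:
    # Only rank 0 pays the autotuning cost; every rank uses the broadcast result.
    local = autotune() if rank == 0 else None
    return broadcast(local)


autotune_calls = []


def fake_autotune() -> Tactics:
    # Hypothetical tuner: records that it ran and returns made-up tactic choices.
    autotune_calls.append(1)
    return {"nvfp4_gemm": 3, "moe_gemm": 1}


broadcast = InProcessBroadcast()
per_rank = [resolve_tactics(rank, fake_autotune, broadcast) for rank in range(4)]
```

Because non-zero ranks never call the tuner, they cannot pick a different tactic from rank 0, which is the property the cleanup is meant to guarantee.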

Test Plan

  • GSM8k on Deepseek v4 TP, TEP, DEP
  • GPQA on Deepseek v4 TP

Test Result

GSM8k:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9545|±  |0.0057|
|     |       |strict-match    |     5|exact_match|↑  |0.9545|±  |0.0057|

GPQA:

nemo-run_1/0 ----------------------------------------- gpqa ----------------------------------------
nemo-run_1/0 evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1          | 198         | 12762      | 3508        | 88.38%           | 0.00%


Changed files

  • vllm/config/vllm.py (modified, +2/-6)
  • vllm/model_executor/layers/fused_moe/experts/flashinfer_cutedsl_moe.py (modified, +18/-21)
  • vllm/model_executor/layers/fused_moe/experts/trtllm_mxfp4_moe.py (modified, +31/-37)
  • vllm/model_executor/warmup/kernel_warmup.py (modified, +61/-15)
  • vllm/utils/flashinfer.py (modified, +0/-1)


Name of failing test

`tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]`
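
To target just this case when reproducing, it helps to see how the node ID is composed. The sketch below rebuilds the eight parametrized IDs that appear in the log, in `backend-eager-model` order. The `EAGER` values are an assumption inferred from those IDs (the test file only references the name `EAGER`), and `node_ids` is a hypothetical helper, not part of the test suite.

```python
from itertools import product

TEST_FILE = "tests/models/quantization/test_nvfp4.py"
MODELS = ["nvidia/Llama-3.1-8B-Instruct-NVFP4"]
EAGER = [True, False]  # assumption: inferred from the IDs in the log
BACKENDS = [
    "emulation",
    "flashinfer_cudnn",
    "flashinfer_trtllm",
    "flashinfer_cutlass",
]


def node_ids() -> list:
    # The IDs in the log list the backend first, then eager, then the model.
    return [
        f"{TEST_FILE}::test_nvfp4[{backend}-{eager}-{model}]"
        for backend, eager, model in product(BACKENDS, EAGER, MODELS)
    ]


ALL_IDS = node_ids()
FAILING_ID = ALL_IDS[4]  # flashinfer_trtllm with eager=True
```

Passing `FAILING_ID` as a single quoted argument to `pytest` runs only the failing parametrization; the quotes keep the shell from interpreting the square brackets.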

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

url

=========================================================================== FAILURES ===========================================================================
____________________________________________ test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4] _____________________________________________
vllm_runner = <class 'tests.conftest.VllmRunner'>, model = 'nvidia/Llama-3.1-8B-Instruct-NVFP4', eager = True, backend = 'flashinfer_trtllm'
    @pytest.mark.parametrize("model", ["nvidia/Llama-3.1-8B-Instruct-NVFP4"])
    @pytest.mark.parametrize("eager", EAGER)
    @pytest.mark.parametrize(
        "backend",
        [
            "emulation",
            "flashinfer_cudnn",
            "flashinfer_trtllm",  # the small seq_len ensures trtllm_8x4_layout backend is used
            "flashinfer_cutlass",
        ],
    )
    def test_nvfp4(vllm_runner, model, eager, backend):
        if (
            not current_platform.has_device_capability(100)
            and backend in SM_100_NVFP4_BACKENDS
        ):
            pytest.skip(
                f"The backend {backend} is not supported with current_platform.has_device_capability(100) == False"
            )
>       with vllm_runner(model, enforce_eager=eager, linear_backend=backend) as llm:
tests/models/quantization/test_nvfp4.py:119:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/conftest.py:923: in __init__
    self.llm = LLM(
/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py:375: in __init__
    self.llm_engine = LLMEngine.from_engine_args(
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py:170: in from_engine_args
    return cls(
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py:104: in __init__
    self.engine_core = EngineCoreClient.make_client(
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:101: in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:723: in __init__
    super().__init__(
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:535: in __init__
    with launch_core_engines(
/usr/lib/python3.12/contextlib.py:144: in __exit__
    next(self.gen)
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:1133: in launch_core_engines
    wait_for_engine_startup(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
handshake_socket = <zmq.Socket(zmq.ROUTER) at 0x7286bd7f64a0 closed>
addresses = EngineZmqAddresses(inputs=['ipc:///tmp/2339eccd-e5db-4320-a23d-4133acef7791'], outputs=['ipc:///tmp/bd46a73b-e999-4f1c-9be8-cc19a573a463'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None)
core_engines = [<vllm.v1.engine.utils.CoreEngine object at 0x7287275a3fe0>]
parallel_config = ParallelConfig(pipeline_parallel_size=1, tensor_parallel_size=1, prefill_context_parallel_size=1, data_parallel_size=1..._comm_backend='ag_rs', cp_kv_cache_interleave_size=1, data_parallel_index=0, _api_process_count=1, _api_process_rank=0)
coordinated_dp = False
cache_config = CacheConfig(block_size=16, user_specified_block_size=True, user_specified_mamba_block_size=False, hash_block_size=None...=False, kv_cache_memory_bytes=None, kv_offloading_size=None, kv_offloading_backend='native', _block_size_resolved=True)
proc_manager = <vllm.v1.engine.utils.CoreEngineProcManager object at 0x728717fa37a0>, coord_process = None
    def wait_for_engine_startup(
        handshake_socket: zmq.Socket,
        addresses: EngineZmqAddresses,
        core_engines: list[CoreEngine],
        parallel_config: ParallelConfig,
        coordinated_dp: bool,
        cache_config: CacheConfig,
        proc_manager: CoreEngineProcManager | None,
        coord_process: Process | None,
    ):
        # Wait for engine core process(es) to send ready messages.
        local_count = parallel_config.data_parallel_size_local
        remote_count = len(core_engines) - local_count
        # [local, remote] counts
        conn_pending, start_pending = [local_count, remote_count], [0, 0]
        poller = zmq.Poller()
        poller.register(handshake_socket, zmq.POLLIN)
        remote_should_be_headless = (
            not parallel_config.data_parallel_hybrid_lb
            and not parallel_config.data_parallel_external_lb
        )
        if proc_manager is not None:
            for sentinel in proc_manager.sentinels():
                poller.register(sentinel, zmq.POLLIN)
        if coord_process is not None:
            poller.register(coord_process.sentinel, zmq.POLLIN)
        while any(conn_pending) or any(start_pending):
            events = poller.poll(STARTUP_POLL_PERIOD_MS)
            if not events:
                if any(conn_pending):
                    logger.debug(
                        "Waiting for %d local, %d remote core engine proc(s) to connect.",
                        *conn_pending,
                    )
                if any(start_pending):
                    logger.debug(
                        "Waiting for %d local, %d remote core engine proc(s) to start.",
                        *start_pending,
                    )
                continue
            if len(events) > 1 or events[0][0] != handshake_socket:
                # One of the local core processes exited.
                finished = proc_manager.finished_procs() if proc_manager else {}
                if coord_process is not None and coord_process.exitcode is not None:
                    finished[coord_process.name] = coord_process.exitcode
>               raise RuntimeError(
                    "Engine core initialization failed. "
                    "See root cause above. "
                    f"Failed core proc(s): {finished}"
                )
E               RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:1192: RuntimeError
======================================================================= warnings summary =======================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
../usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:365: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:365: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(
../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305
  /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,
tests/models/quantization/test_nvfp4.py::test_nvfp4[emulation-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[emulation-False-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_cudnn-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_cudnn-False-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-False-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_cutlass-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_cutlass-False-nvidia/Llama-3.1-8B-Instruct-NVFP4]
  /usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=7512) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================================== short test summary info ====================================================================
FAILED tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4] - RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
=============================================== 1 failed, 7 passed, 3 skipped, 25 warnings in 306.65s (0:05:06) ================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
🚨 Error: The command exited with status 1
user command error: exit status 1

📝 History of failing test

Bisection:

  • Last passing build: #66633 (May 18 nightly, commit 23c15acd) — enable_flashinfer_autotune: False for O0/O1/O2
  • First failing build: #66759 (May 18 daily, commit cd49a05d) — includes 8c296de6 (PR #42857) which set enable_flashinfer_autotune: True for O1 and O2
  • Default optimization level is O2, so autotuning is now enabled by default for all users
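
The effect of the bisected change can be sketched as a lookup from optimization level to the autotune default. This is an illustration of the reasoning above, not vLLM's configuration code: the dictionaries and `autotune_enabled` are hypothetical, and the O0 value after the change is assumed to be unchanged because the bisection only mentions O1 and O2.

```python
from typing import Dict, Optional

# Defaults for enable_flashinfer_autotune per optimization level, as reported
# in the bisection above.
DEFAULTS_BEFORE: Dict[str, bool] = {"O0": False, "O1": False, "O2": False}
# Assumption: O0 stays False; only O1 and O2 are reported as changed.
DEFAULTS_AFTER: Dict[str, bool] = {"O0": False, "O1": True, "O2": True}

DEFAULT_LEVEL = "O2"


def autotune_enabled(
    defaults: Dict[str, bool],
    level: Optional[str] = None,
    override: Optional[bool] = None,
) -> bool:
    # An explicit user setting wins; otherwise fall back to the level's default.
    if override is not None:
        return override
    return defaults[level or DEFAULT_LEVEL]


before = autotune_enabled(DEFAULTS_BEFORE)
after = autotune_enabled(DEFAULTS_AFTER)
opted_out = autotune_enabled(DEFAULTS_AFTER, override=False)
```

A run that does not specify a level or an override goes from `before` (autotune off) to `after` (autotune on), which is why a test that sets neither starts exercising the autotune path.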

CC List.

@wzhao18 @mgoin

Tagging since it seems related to https://github.com/vllm-project/vllm/pull/42857.
