vllm - ✅(Solved) Fix [CI Failure]: [Kernels (B200) [2 pull requests, 1 participant]

vllm · 2026-05-19 10:52:03


GitHub stats

vllm-project/vllm#43086•Fetched 2026-05-20 03:39:58


Author

elvircrn

Participants

elvircrn

Timeline (top)

mentioned ×2, subscribed ×2, added_to_project_v2 ×1, closed ×1

Error Message

FAILED tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:1192: RuntimeError
1 failed, 7 passed, 3 skipped, 25 warnings in 306.65s (0:05:06)

The error is raised from wait_for_engine_startup while the test enters vllm_runner(model, enforce_eager=eager, linear_backend=backend) at tests/models/quantization/test_nvfp4.py:119. The full pytest output, including the traceback and warnings summary, is reproduced under Code Example.

Root Cause

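The bisection in the issue suggests the failure is related to PR #42857 (commit 8c296de6), which set enable_flashinfer_autotune: True for the O1 and O2 optimization levels. Because O2 is the default level, FlashInfer autotuning became enabled by default between the last passing build (#66633) and the first failing build (#66759). The reporter describes the link as "seems related" rather than confirmed.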

PR fix notes

PR #42857: [Perf] Re-enable flashinfer autotune by default and cleanup

Repository: vllm-project/vllm
Author: wzhao18
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/42857

Description (problem / solution / changelog)

Purpose

This PR re-enables flashinfer autotune by default as previous correctness issues are now fixed: https://github.com/flashinfer-ai/flashinfer/pull/3227.

In addition, this PR does some cleanup:

Remove the _is_fi_autotuning wrapper, as it is no longer needed.
Run autotuning on rank 0 only and broadcast the chosen tactics to the other ranks, so that all ranks run the same tactics.
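The second cleanup item can be illustrated with a small simulation. This is only a sketch of the general "tune on one rank, share the result" idea, not the code from the PR: the timing table, the tactic names, and both helper functions are hypothetical, and the list copy stands in for a real collective (in a multi-process job something like torch.distributed.broadcast_object_list could carry the choice, which is an assumption about how one might do it).

```python
from typing import Dict, List

# Hypothetical per-rank timings in milliseconds. Small measurement noise
# makes each rank prefer a different tactic when tuned independently.
MEASURED_MS: Dict[int, Dict[str, float]] = {
    0: {"tactic_a": 1.00, "tactic_b": 1.02},
    1: {"tactic_a": 1.03, "tactic_b": 1.01},
}


def tune_locally(rank: int) -> str:
    """Return the tactic with the lowest measured time on this rank."""
    timings = MEASURED_MS[rank]
    return min(timings, key=timings.get)


def tune_on_rank0_and_broadcast(world_size: int) -> List[str]:
    """Tune once on rank 0 and hand the same choice to every rank.

    The list comprehension is a stand-in for a collective broadcast.
    """
    chosen = tune_locally(0)
    return [chosen for _ in range(world_size)]


independent = [tune_locally(rank) for rank in range(2)]
shared = tune_on_rank0_and_broadcast(world_size=2)
print("independent:", independent)
print("rank-0 broadcast:", shared)
```

With independent tuning the two ranks disagree, while the rank-0 variant gives every rank the same tactic, which is the property the PR description asks for.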

Test Plan

GSM8k on Deepseek v4 TP, TEP, DEP
GPQA on Deepseek v4 TP

Test Result

GSM8k:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9545|±  |0.0057|
|     |       |strict-match    |     5|exact_match|↑  |0.9545|±  |0.0057|

GPQA:

nemo-run_1/0 ----------------------------------------- gpqa ----------------------------------------
nemo-run_1/0 evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1          | 198         | 12762      | 3508        | 88.38%           | 0.00%


Changed files

vllm/config/vllm.py (modified, +2/-6)
vllm/model_executor/layers/fused_moe/experts/flashinfer_cutedsl_moe.py (modified, +18/-21)
vllm/model_executor/layers/fused_moe/experts/trtllm_mxfp4_moe.py (modified, +31/-37)
vllm/model_executor/warmup/kernel_warmup.py (modified, +61/-15)
vllm/utils/flashinfer.py (modified, +0/-1)

Code Example

=========================================================================== FAILURES ===========================================================================
____________________________________________ test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4] _____________________________________________
vllm_runner = <class 'tests.conftest.VllmRunner'>, model = 'nvidia/Llama-3.1-8B-Instruct-NVFP4', eager = True, backend = 'flashinfer_trtllm'
    @pytest.mark.parametrize("model", ["nvidia/Llama-3.1-8B-Instruct-NVFP4"])
    @pytest.mark.parametrize("eager", EAGER)
    @pytest.mark.parametrize(
        "backend",
        [
            "emulation",
            "flashinfer_cudnn",
            "flashinfer_trtllm",  # the small seq_len ensures trtllm_8x4_layout backend is used
            "flashinfer_cutlass",
        ],
    )
    def test_nvfp4(vllm_runner, model, eager, backend):
        if (
            not current_platform.has_device_capability(100)
            and backend in SM_100_NVFP4_BACKENDS
        ):
            pytest.skip(
                f"The backend {backend} is not supported with current_platform.has_device_capability(100) == False"
            )
>       with vllm_runner(model, enforce_eager=eager, linear_backend=backend) as llm:
tests/models/quantization/test_nvfp4.py:119:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/conftest.py:923: in __init__
    self.llm = LLM(
/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py:375: in __init__
    self.llm_engine = LLMEngine.from_engine_args(
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py:170: in from_engine_args
    return cls(
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py:104: in __init__
    self.engine_core = EngineCoreClient.make_client(
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:101: in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:723: in __init__
    super().__init__(
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:535: in __init__
    with launch_core_engines(
/usr/lib/python3.12/contextlib.py:144: in __exit__
    next(self.gen)
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:1133: in launch_core_engines
    wait_for_engine_startup(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
handshake_socket = <zmq.Socket(zmq.ROUTER) at 0x7286bd7f64a0 closed>
addresses = EngineZmqAddresses(inputs=['ipc:///tmp/2339eccd-e5db-4320-a23d-4133acef7791'], outputs=['ipc:///tmp/bd46a73b-e999-4f1c-9be8-cc19a573a463'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None)
core_engines = [<vllm.v1.engine.utils.CoreEngine object at 0x7287275a3fe0>]
parallel_config = ParallelConfig(pipeline_parallel_size=1, tensor_parallel_size=1, prefill_context_parallel_size=1, data_parallel_size=1..._comm_backend='ag_rs', cp_kv_cache_interleave_size=1, data_parallel_index=0, _api_process_count=1, _api_process_rank=0)
coordinated_dp = False
cache_config = CacheConfig(block_size=16, user_specified_block_size=True, user_specified_mamba_block_size=False, hash_block_size=None...=False, kv_cache_memory_bytes=None, kv_offloading_size=None, kv_offloading_backend='native', _block_size_resolved=True)
proc_manager = <vllm.v1.engine.utils.CoreEngineProcManager object at 0x728717fa37a0>, coord_process = None
    def wait_for_engine_startup(
        handshake_socket: zmq.Socket,
        addresses: EngineZmqAddresses,
        core_engines: list[CoreEngine],
        parallel_config: ParallelConfig,
        coordinated_dp: bool,
        cache_config: CacheConfig,
        proc_manager: CoreEngineProcManager | None,
        coord_process: Process | None,
    ):
        # Wait for engine core process(es) to send ready messages.
        local_count = parallel_config.data_parallel_size_local
        remote_count = len(core_engines) - local_count
        # [local, remote] counts
        conn_pending, start_pending = [local_count, remote_count], [0, 0]
        poller = zmq.Poller()
        poller.register(handshake_socket, zmq.POLLIN)
        remote_should_be_headless = (
            not parallel_config.data_parallel_hybrid_lb
            and not parallel_config.data_parallel_external_lb
        )
        if proc_manager is not None:
            for sentinel in proc_manager.sentinels():
                poller.register(sentinel, zmq.POLLIN)
        if coord_process is not None:
            poller.register(coord_process.sentinel, zmq.POLLIN)
        while any(conn_pending) or any(start_pending):
            events = poller.poll(STARTUP_POLL_PERIOD_MS)
            if not events:
                if any(conn_pending):
                    logger.debug(
                        "Waiting for %d local, %d remote core engine proc(s) to connect.",
                        *conn_pending,
                    )
                if any(start_pending):
                    logger.debug(
                        "Waiting for %d local, %d remote core engine proc(s) to start.",
                        *start_pending,
                    )
                continue
            if len(events) > 1 or events[0][0] != handshake_socket:
                # One of the local core processes exited.
                finished = proc_manager.finished_procs() if proc_manager else {}
                if coord_process is not None and coord_process.exitcode is not None:
                    finished[coord_process.name] = coord_process.exitcode
>               raise RuntimeError(
                    "Engine core initialization failed. "
                    "See root cause above. "
                    f"Failed core proc(s): {finished}"
                )
E               RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:1192: RuntimeError
======================================================================= warnings summary =======================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
../usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:365: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:365: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(
../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305
  /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,
tests/models/quantization/test_nvfp4.py::test_nvfp4[emulation-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[emulation-False-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_cudnn-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_cudnn-False-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-False-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_cutlass-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]
tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_cutlass-False-nvidia/Llama-3.1-8B-Instruct-NVFP4]
  /usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=7512) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================================== short test summary info ====================================================================
FAILED tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4] - RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
=============================================== 1 failed, 7 passed, 3 skipped, 25 warnings in 306.65s (0:05:06) ================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
🚨 Error: The command exited with status 1
user command error: exit status 1


Name of failing test

`tests/models/quantization/test_nvfp4.py::test_nvfp4[flashinfer_trtllm-True-nvidia/Llama-3.1-8B-Instruct-NVFP4]`
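To rerun only this parametrization, it helps to see how the node ID is assembled. The sketch below rebuilds the IDs from the parameter values visible in the log and quotes them for a shell, since the square brackets would otherwise be interpreted by most shells. The EAGER values are assumed to be True and False (both appear in the warnings summary), and the helper names are hypothetical.

```python
import itertools
import shlex

TEST_FILE = "tests/models/quantization/test_nvfp4.py"
MODELS = ["nvidia/Llama-3.1-8B-Instruct-NVFP4"]
EAGER = [True, False]  # assumption: inferred from the IDs in the log
BACKENDS = [
    "emulation",
    "flashinfer_cudnn",
    "flashinfer_trtllm",
    "flashinfer_cutlass",
]


def node_id(backend: str, eager: bool, model: str) -> str:
    """Build the pytest node ID in the backend-eager-model order seen in the log."""
    return f"{TEST_FILE}::test_nvfp4[{backend}-{eager}-{model}]"


def pytest_command(node: str) -> str:
    """Return a shell-safe command line that runs a single node ID."""
    return "pytest " + shlex.quote(node)


all_ids = [
    node_id(backend, eager, model)
    for backend, eager, model in itertools.product(BACKENDS, EAGER, MODELS)
]
failing = node_id("flashinfer_trtllm", True, MODELS[0])
print(pytest_command(failing))
```

Running the printed command on a machine that meets the test's device-capability check should exercise only the failing case.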

Basic information

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

url


📝 History of failing test

Bisection:

Last passing build: #66633 (May 18 nightly, commit 23c15acd) — enable_flashinfer_autotune: False for O0/O1/O2
First failing build: #66759 (May 18 daily, commit cd49a05d) — includes 8c296de6 (PR #42857) which set enable_flashinfer_autotune: True for O1 and O2
Default optimization level is O2, so autotuning is now enabled by default for all users
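The effect of the default change can be modelled in a few lines. This is a sketch of the reasoning in the bisection, not vLLM's configuration code: the tables and the resolve_autotune helper are hypothetical, and the post-change value for O0 is assumed to stay False because the issue only mentions O1 and O2.

```python
from typing import Dict, Optional

# Hypothetical tables that encode only what the bisection states.
AUTOTUNE_BEFORE: Dict[str, bool] = {"O0": False, "O1": False, "O2": False}
AUTOTUNE_AFTER: Dict[str, bool] = {"O0": False, "O1": True, "O2": True}  # O0 assumed unchanged
DEFAULT_LEVEL = "O2"


def resolve_autotune(
    defaults: Dict[str, bool],
    level: Optional[str] = None,
    user_override: Optional[bool] = None,
) -> bool:
    """Pick the autotune setting: an explicit override wins, else the level default."""
    if user_override is not None:
        return user_override
    return defaults[level or DEFAULT_LEVEL]


# A user who sets nothing lands on O2, so the behaviour flips with the PR.
print("before:", resolve_autotune(AUTOTUNE_BEFORE))
print("after: ", resolve_autotune(AUTOTUNE_AFTER))
```

In this model only an explicit override or the O0 level keeps autotuning off after the change, which matches the issue's observation that all default-configured users are affected.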

CC List.

@wzhao18 @mgoin

Tagging since it seems related to https://github.com/vllm-project/vllm/pull/42857.
