PR #39337: [Model Runner v2] Oracle for model runner v2 - dense model by default [1/N]

Repository: vllm-project/vllm
Author: yewentao256
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39337

Description (problem / solution / changelog)

Purpose

Oracle for model runner v2 - dense model by default

The VLLM_USE_V2_MODEL_RUNNER environment variable now behaves as follows:

Not set: use the oracle
Set to 1: force v2
Set to 0: force v1
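
The three states can be sketched as a small resolver. This is a minimal illustration, not the code in vllm/config/vllm.py: the helper names `parse_tristate` and `use_v2_model_runner`, the strict "0"/"1" validation, and the `oracle` callback are all assumptions made for the example.

```python
from typing import Callable, Mapping, Optional

ENV_NAME = "VLLM_USE_V2_MODEL_RUNNER"


def parse_tristate(value: Optional[str]) -> Optional[bool]:
    """Map an env string to None (unset), True ("1"), or False ("0").

    Rejecting other values is a choice made for this sketch only.
    """
    if value is None:
        return None
    if value not in ("0", "1"):
        raise ValueError(f"{ENV_NAME} must be 0 or 1, got {value!r}")
    return value == "1"


def use_v2_model_runner(
    env: Mapping[str, str], oracle: Callable[[], bool]
) -> bool:
    """Return the forced choice if the variable is set, else ask the oracle."""
    forced = parse_tristate(env.get(ENV_NAME))
    return oracle() if forced is None else forced
```

Passing `os.environ` as `env` would read the real process environment; taking a mapping as a parameter keeps the sketch easy to test.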

We are testing "Qwen/Qwen3-0.6B" and "facebook/opt-125m" since they cover most of the current v1 unit tests.

Should land after https://github.com/vllm-project/vllm/pull/39353

Test

Covered in unit test

Changed files

tests/test_config.py (modified, +18/-0)
tests/v1/sample/test_logprobs.py (modified, +1/-2)
vllm/config/vllm.py (modified, +91/-12)
vllm/envs.py (modified, +4/-4)
vllm/v1/attention/backends/flashinfer.py (modified, +1/-1)
vllm/v1/core/sched/scheduler.py (modified, +1/-2)
vllm/v1/worker/gpu_worker.py (modified, +1/-1)

PR #39353: [Model Runner V2] Fix flex attention kv blocks calculation issue

Repository: vllm-project/vllm
Author: yewentao256
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/39353

Description (problem / solution / changelog)

Purpose

VLLM_USE_V2_MODEL_RUNNER=1 pytest -s tests/v1/e2e/general/test_async_scheduling.py

(EngineCore pid=2359877) Process EngineCore:
(EngineCore pid=2359877) Traceback (most recent call last):
(EngineCore pid=2359877)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=2359877)     self.run()
(EngineCore pid=2359877)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=2359877)     self._target(*self._args, **self._kwargs)
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 1115, in run_engine_core
(EngineCore pid=2359877)     raise e
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 1085, in run_engine_core
(EngineCore pid=2359877)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2359877)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2359877)     return func(*args, **kwargs)
(EngineCore pid=2359877)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 849, in __init__
(EngineCore pid=2359877)     super().__init__(
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 125, in __init__
(EngineCore pid=2359877)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=2359877)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2359877)     return func(*args, **kwargs)
(EngineCore pid=2359877)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 281, in _initialize_kv_caches
(EngineCore pid=2359877)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=2359877)     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore pid=2359877)                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/multiproc_executor.py", line 412, in collective_rpc
(EngineCore pid=2359877)     return future if non_block else future.result()
(EngineCore pid=2359877)                                     ^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/multiproc_executor.py", line 89, in result
(EngineCore pid=2359877)     return super().result()
(EngineCore pid=2359877)            ^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=2359877)     return self.__get_result()
(EngineCore pid=2359877)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=2359877)     raise self._exception
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/multiproc_executor.py", line 93, in _wait_for_response
(EngineCore pid=2359877)     response = self.aggregate(self.get_response())
(EngineCore pid=2359877)                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/multiproc_executor.py", line 399, in get_response
(EngineCore pid=2359877)     raise RuntimeError(
(EngineCore pid=2359877) RuntimeError: Worker failed with error 'Fail to re-stride a persistent tensor of shape torch.Size([4096, 256]) for a tensor of shape torch.Size([1024, 4096])', please check the stack trace above for the root cause

This is a bug: the calculation should use the batch token limit max_num_batched_tokens rather than max_model_len, which bounds only a single request.
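
A small arithmetic sketch shows why the two limits differ. It is not the change made in vllm/v1/attention/backends/flex_attention.py; the helper names and the numeric values (block size 128, four 512-token requests) are hypothetical and chosen only to illustrate the sizing argument.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of fixed-size blocks required to cover num_tokens."""
    return cdiv(num_tokens, block_size)


# Hypothetical configuration, for illustration only.
max_model_len = 512            # upper bound on ONE request's length
max_num_batched_tokens = 2048  # upper bound on tokens in ONE step
block_size = 128

per_request_bound = blocks_needed(max_model_len, block_size)
per_step_bound = blocks_needed(max_num_batched_tokens, block_size)

# A single step that batches four 512-token prefills.
step_tokens = 4 * 512
step_blocks = blocks_needed(step_tokens, block_size)

# A buffer sized by the per-request bound is too small for this step,
# while one sized by the per-step bound fits.
assert step_blocks > per_request_bound
assert step_blocks <= per_step_bound
```

Any structure that must cover every token in a step needs the per-step bound, because a batch can hold several requests at once.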

Test

The unit test passes when rerun.

Changed files

vllm/v1/attention/backends/flex_attention.py (modified, +6/-9)

PR #39937: [Model Runner V2] Multiple prompt logprobs support

Repository: vllm-project/vllm
Author: yewentao256
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/39937

Description (problem / solution / changelog)

Purpose

Part of https://github.com/vllm-project/vllm/pull/39337

Multiple prompt logprobs support

Test

VLLM_USE_V2_MODEL_RUNNER=1 pytest tests/v1/sample/test_logprobs.py -k prompt_logprobs_with_chunking_and_preemption

Originally

__________________________________ test_prompt_logprobs_with_chunking_and_preemption ___________________________________

    def test_prompt_logprobs_with_chunking_and_preemption():
        """Test that prompt logprobs are correctly returned when using
        both chunked prefill and preemption.
    
        This test ensures that the num_prompt_logprobs tracking persists
        across preemptions and prefill chunks.
        """
    
        # Create prompts that will trigger chunking and preemption
        prompts = [
            "The following numbers of the sequence "
            + ", ".join(str(i) for i in range(10))
            + " are:",
            "In one word, the capital of France is ",
        ] + [f"Tell me about the number {i}: " for i in range(32)]
    
        sampling_params = SamplingParams(
            temperature=0.0,
            max_tokens=40,
            min_tokens=20,
            prompt_logprobs=2,  # Request prompt logprobs
        )
    
        with VllmRunner(
            "Qwen/Qwen3-0.6B",
            max_model_len=512,
            enable_chunked_prefill=True,
            max_num_batched_tokens=48,  # Force prefill chunking
            num_gpu_blocks_override=32,  # Force preemptions
            disable_log_stats=False,
            gpu_memory_utilization=0.25,
        ) as vllm_model:
            metrics_before = vllm_model.llm.get_metrics()
    
            # Generate with prompt logprobs using generate_w_logprobs which
            # returns (output_ids, output_str, output_logprobs, prompt_logprobs)
            outputs = vllm_model.generate_w_logprobs(
                prompts, sampling_params=sampling_params, include_prompt_token_ids=True
            )
    
            # Verify that all outputs have prompt logprobs
            for i, output in enumerate(outputs):
                _, _, _, prompt_token_ids, prompt_logprobs = output
                assert prompt_logprobs is not None and len(prompt_logprobs) > 0, (
                    f"Output {i} missing prompt logprobs"
                )
                assert len(prompt_logprobs) == len(prompt_token_ids), (
                    "Unexpected number of prompt logprob positions"
                )
    
                # Each position should have the requested number of logprobs
                for pos, logprobs_dict in enumerate(prompt_logprobs):
                    if logprobs_dict is not None:  # First token may be None
>                       assert (
                            sampling_params.prompt_logprobs
                            <= len(logprobs_dict)
                            <= sampling_params.prompt_logprobs + 1
                        ), (
                            f"Output {i} position {pos} has {len(logprobs_dict)} "
                            f"logprobs, expected {sampling_params.prompt_logprobs}"
                        )
E                       AssertionError: Output 0 position 1 has 1 logprobs, expected 2
E                       assert 2 <= 1
E                        +  where 2 = SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, t...mpt_logprobs=2, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None).prompt_logprobs
E                        +  and   1 = len({2701: Logprob(logprob=-10.656400680541992, rank=5307, decoded_token=' following')})

tests/v1/sample/test_logprobs.py:1216: AssertionError

Now

======================= 1 passed, 52 deselected, 17 warnings in 14.63s =======================
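
The test above accepts between prompt_logprobs and prompt_logprobs + 1 entries per position. One way such a range arises, sketched below as an assumption rather than as the code in vllm/v1/worker/gpu/sample/prompt_logprob.py, is returning the top-k tokens plus the actual prompt token, which may or may not already be in the top-k. The function names are hypothetical.

```python
import math
from typing import Dict, List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def prompt_logprobs_at(
    logits: List[float], actual_token_id: int, k: int
) -> Dict[int, float]:
    """Return logprobs for the top-k tokens plus the actual prompt token."""
    logprobs = log_softmax(logits)
    top_ids = sorted(
        range(len(logprobs)), key=lambda i: logprobs[i], reverse=True
    )[:k]
    result = {i: logprobs[i] for i in top_ids}
    # Always include the token that actually appears in the prompt.
    result[actual_token_id] = logprobs[actual_token_id]
    return result
```

With k=2, a position whose prompt token is outside the top two yields three entries, and one whose prompt token is inside yields two. The failing run returned a single entry holding only the prompt token (rank 5307), which is consistent with the top-k entries being dropped.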

CC @WoosukKwon

Changed files

vllm/v1/worker/gpu/sample/prompt_logprob.py (modified, +48/-18)

PR #40559: [Model Runner V2] Add `logprob_token_ids` support

Repository: vllm-project/vllm
Author: yewentao256
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40559

Description (problem / solution / changelog)

Purpose

Part of https://github.com/vllm-project/vllm/pull/39337

https://buildkite.com/vllm/ci/builds/62340#019db17f-d9c4-413f-b8d1-f6368454ce53 fails because of this

Test

VLLM_USE_V2_MODEL_RUNNER=1 pytest -v -s tests/entrypoints/openai/generative_scoring/test_generative_scoring_e2e.py -k test_basic_score_and_response_structure

Now

===================================== 1 passed, 5 deselected, 16 warnings in 34.78s =====================================

Main

================================================ short test summary info ================================================
FAILED tests/entrypoints/openai/generative_scoring/test_generative_scoring_e2e.py::TestGenerativeScoringAPI::test_basic_score_and_response_structure - AssertionError: Response: {"error":{"message":"Token IDs [9454, 2753] not found in logprobs for item 0. This might i...
===================================== 1 failed, 5 deselected, 16 warnings in 32.02s =====================================
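
The failure on main reports that the requested token IDs were missing from the returned logprobs. The idea of returning logprobs for an explicit list of token IDs, regardless of their rank, can be sketched as follows. This is an illustration under assumptions, not the implementation in vllm/v1/worker/gpu/sample/logprob.py; the function name `logprobs_for_token_ids` is hypothetical.

```python
import math
from typing import Dict, List, Sequence


def logprobs_for_token_ids(
    logits: List[float], token_ids: Sequence[int]
) -> Dict[int, float]:
    """Return the log-probability of each requested token ID."""
    m = max(logits)
    # log-sum-exp, computed stably by subtracting the max logit
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    out: Dict[int, float] = {}
    for tid in token_ids:
        if not 0 <= tid < len(logits):
            raise IndexError(f"token id {tid} is outside the vocabulary")
        out[tid] = logits[tid] - lse
    return out
```

Because the lookup is by ID rather than by rank, a low-probability token still appears in the result as long as it was requested.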

Changed files

vllm/sampling_params.py (modified, +25/-0)
vllm/v1/core/sched/scheduler.py (modified, +1/-1)
vllm/v1/engine/logprobs.py (modified, +1/-1)
vllm/v1/worker/gpu/sample/logprob.py (modified, +133/-9)
vllm/v1/worker/gpu/sample/sampler.py (modified, +19/-3)

PR #40648: [Model Runner v2] Fix block table IMA issue

Repository: vllm-project/vllm
Author: yewentao256
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/40648

Description (problem / solution / changelog)

Purpose

Part of https://github.com/vllm-project/vllm/pull/39337

VLLM_USE_V2_MODEL_RUNNER=1 pytest tests/basic_correctness/test_cumem.py -k "test_end_to_end and opt-125m" -sv

Originally

(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 1205, in _process_engine_step
(EngineCore pid=2694707)     outputs, model_executed = self.step_fn()
(EngineCore pid=2694707)                               ^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 475, in step_with_batch_queue
(EngineCore pid=2694707)     exec_future = self.model_executor.execute_model(
(EngineCore pid=2694707)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/executor/uniproc_executor.py", line 114, in execute_model
(EngineCore pid=2694707)     output.result()
(EngineCore pid=2694707)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=2694707)     return self.__get_result()
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=2694707)     raise self._exception
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=2694707)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2694707)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2694707)     return func(*args, **kwargs)
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/worker_base.py", line 337, in execute_model
(EngineCore pid=2694707)     return self.worker.execute_model(scheduler_output)
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2694707)     return func(*args, **kwargs)
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu_worker.py", line 814, in execute_model
(EngineCore pid=2694707)     output = self.model_runner.execute_model(
(EngineCore pid=2694707)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2694707)     return func(*args, **kwargs)
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu/model_runner.py", line 1020, in execute_model
(EngineCore pid=2694707)     attn_metadata = self.model_state.prepare_attn(
(EngineCore pid=2694707)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu/model_states/default.py", line 176, in prepare_attn
(EngineCore pid=2694707)     attn_metadata = build_attn_metadata(
(EngineCore pid=2694707)                     ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu/attn_utils.py", line 263, in build_attn_metadata
(EngineCore pid=2694707)     metadata = attn_metadata_builder.build(
(EngineCore pid=2694707)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/attention/backends/flashinfer.py", line 1096, in build
(EngineCore pid=2694707)     paged_kv_indptr_prefill_gpu[0] = 0
(EngineCore pid=2694707)     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(EngineCore pid=2694707) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore pid=2694707) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=2694707) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=2694707) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=2694707) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=2694707) 
[rank0]:[W422 15:45:41.364245121 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
FAILED

Now

========================= 1 passed, 7 deselected, 17 warnings in 20.25s ==========================

CC @njhill

Changed files

vllm/v1/worker/gpu/block_table.py (modified, +21/-12)
vllm/v1/worker/gpu/model_runner.py (modified, +3/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +3/-0)
vllm/v1/worker/gpu_worker.py (modified, +3/-10)

PR #41285: [Model Runner v2] Fix v2 compile counter `num_gpu_runner_capture_triggers` and `num_cudagraph_captured`

Repository: vllm-project/vllm
Author: yewentao256
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41285

Description (problem / solution / changelog)

Purpose

Part of https://github.com/vllm-project/vllm/pull/39337

VLLM_USE_V2_MODEL_RUNNER=1 pytest tests/compile/test_config.py::test_use_cudagraphs[FULL_DECODE_ONLY-1] -xvs

Originally

tests/compile/test_config.py::test_use_cudagraphs[FULL_DECODE_ONLY-1] FAILED

====================================================== FAILURES =======================================================
_______________________________________ test_use_cudagraphs[FULL_DECODE_ONLY-1] _______________________________________
vllm_runner = <class 'tests.conftest.VllmRunner'>
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7545e4927a10>
cudagraph_mode = <CUDAGraphMode.FULL_DECODE_ONLY: (2, 0)>
num_cudagraph_captured = 1

    @pytest.mark.forked
    @pytest.mark.parametrize(
        "cudagraph_mode,num_cudagraph_captured",
        [
            (CUDAGraphMode.NONE, 0),
            (CUDAGraphMode.FULL_DECODE_ONLY, 1),
            (CUDAGraphMode.PIECEWISE, 13),
            (CUDAGraphMode.FULL_AND_PIECEWISE, 14),
        ],
    )
    def test_use_cudagraphs(
        vllm_runner, monkeypatch, cudagraph_mode, num_cudagraph_captured
    ):
        # Disable multiprocessing so that the counter is in the same process
        monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
    
        compilation_config = {
            "cudagraph_capture_sizes": [100],
            "cudagraph_mode": cudagraph_mode,
        }
        num_gpu_runner_capture_triggers = 1 if cudagraph_mode != CUDAGraphMode.NONE else 0
>       with (
            compilation_counter.expect(
                num_graphs_seen=1,
                num_gpu_runner_capture_triggers=num_gpu_runner_capture_triggers,
                num_cudagraph_captured=num_cudagraph_captured,
            ),
            # loading the model causes compilation (if enabled) to happen
            vllm_runner(
                "facebook/opt-125m",
                compilation_config=compilation_config,
                gpu_memory_utilization=0.4,
            ) as _,
        ):

tests/compile/test_config.py:141: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3.12/contextlib.py:144: in __exit__
    next(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = CompilationCounter(num_models_seen=1, num_graphs_seen=1, num_piecewise_graphs_seen=25, num_piecewise_capturable_graphs...facts_loaded=0, num_aot_compiles=1, num_aot_artifacts_saved=1, num_aot_artifacts_loaded=0, stock_torch_compile_count=0)
kwargs = {'num_cudagraph_captured': 1, 'num_gpu_runner_capture_triggers': 1, 'num_graphs_seen': 1}
old = CompilationCounter(num_models_seen=0, num_graphs_seen=0, num_piecewise_graphs_seen=0, num_piecewise_capturable_graphs_...facts_loaded=0, num_aot_compiles=0, num_aot_artifacts_saved=0, num_aot_artifacts_loaded=0, stock_torch_compile_count=0)
k = 'num_gpu_runner_capture_triggers', v = 1

    @contextmanager
    def expect(self, **kwargs: Any) -> Generator[None, None, None]:
        old = self.clone()
        yield
        for k, v in kwargs.items():
>           assert getattr(self, k) - getattr(old, k) == v, (
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                f"{k} not as expected, before it is {getattr(old, k)}"
                f", after it is {getattr(self, k)}, "
                f"expected diff is {v}"
            )
E           AssertionError: num_gpu_runner_capture_triggers not as expected, before it is 0, after it is 0, expected diff is 1

vllm/compilation/counter.py:51: AssertionError
================================================== warnings summary ===================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: 14 warnings
  /home/yewentao256/.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/compile/test_config.py::test_VLLM_DISABLE_COMPILE_CACHE[1]
tests/compile/test_config.py::test_use_cudagraphs[NONE-0]
tests/compile/test_config.py::test_use_cudagraphs[FULL_DECODE_ONLY-1]
  /home/yewentao256/.venv/lib/python3.12/site-packages/py/_process/forkedfunc.py:45: DeprecationWarning: This process (pid=2502479) is multi-threaded, use of fork() may lead to deadlocks in the child.
    pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============================================== short test summary info ===============================================
FAILED tests/compile/test_config.py::test_use_cudagraphs[FULL_DECODE_ONLY-1] - vllm_runner = <class 'tests.conftest.VllmRunner'>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
====================================== 1 failed, 6 passed, 19 warnings in 31.86s ======================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Now

================================== 1 passed, 17 warnings in 6.72s ==================================
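
The failing assertion comes from a counter whose `expect` context manager compares field deltas before and after a block. A stripped-down sketch of that pattern is below. The `Counter` class and the `capture_graphs` function are hypothetical stand-ins, not vLLM's CompilationCounter or the v2 runner's capture path; only the two field names are taken from the traceback.

```python
import copy
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Generator


@dataclass
class Counter:
    num_gpu_runner_capture_triggers: int = 0
    num_cudagraph_captured: int = 0

    def clone(self) -> "Counter":
        return copy.deepcopy(self)

    @contextmanager
    def expect(self, **kwargs: Any) -> Generator[None, None, None]:
        """Assert that each named field grows by the given amount."""
        old = self.clone()
        yield
        for k, v in kwargs.items():
            diff = getattr(self, k) - getattr(old, k)
            assert diff == v, f"{k}: expected diff {v}, got {diff}"


counter = Counter()


def capture_graphs(num_sizes: int) -> None:
    """Hypothetical capture path that records its work in the counter."""
    counter.num_gpu_runner_capture_triggers += 1
    counter.num_cudagraph_captured += num_sizes


# Passes only because the capture path updates both fields.
with counter.expect(
    num_gpu_runner_capture_triggers=1, num_cudagraph_captured=1
):
    capture_graphs(1)
```

If a code path captures graphs without incrementing the counter, the delta stays at 0 and `expect` raises, which matches the shape of the failure shown above.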

Changed files

vllm/v1/worker/gpu/cudagraph_utils.py (modified, +2/-0)
vllm/v1/worker/gpu/model_runner.py (modified, +3/-0)

TL;DR

The migration to model runner v2 can be achieved by completing the outstanding tasks listed in the roadmap, starting with the dense models "Qwen/Qwen3-0.6B" and "facebook/opt-125m".

Guidance

Complete the pending pull requests (https://github.com/vllm-project/vllm/pull/40559 and https://github.com/vllm-project/vllm/pull/41285) to progress with the migration.
Test the migration with the MoE model "deepseek-ai/DeepSeek-V2-lite" after completing the dense model tasks.
Verify the migration with popular models like "deepseek-ai/DeepSeek-V4-Pro" once the MoE model testing is complete.
Switch to model runner v2 by default after all tasks are completed.

Notes

The issue is a feature roadmap rather than a bug report; the specific failures are documented in the individual PRs above, so the guidance follows the roadmap and its task list.

Recommendation

Apply workaround: Complete the outstanding tasks in the roadmap to progress with the migration, as there is no clear indication of a need to upgrade to a fixed version.

vllm - ✅(Solved) Fix [Feature]: Migration from Model Runner v1 to Model Runner v2 [6 pull requests, 1 participants]
