vllm - ✅(Solved) Fix [Feature]: Migration from Model Runner v1 to Model Runner v2 [6 pull requests, 1 participant]

GitHub stats
vllm-project/vllm#41286 · Fetched 2026-04-30 06:19:03
Comments: 0 · Participants: 1 · Timeline: 2 · Reactions: 0
Timeline (top): assigned ×1, labeled ×1

PR fix notes

PR #39337: [Model Runner v2] Oracle for model runner v2 - dense model by default [1/N]

Description (problem / solution / changelog)

Purpose

Oracle for model runner v2 - dense model by default

The environment variable now behaves as follows:

  • Not set: using our oracle
  • set to 1: force v2
  • set to 0: force v1
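The three cases above can be sketched as a small tri-state lookup. This is only an illustration, not the code in vllm/config/vllm.py: the variable name VLLM_USE_V2_MODEL_RUNNER is taken from the test commands in these PR notes, while `read_tristate`, `oracle_prefers_v2`, and `use_v2_model_runner` are hypothetical names, and the oracle is reduced to "dense models default to v2".

```python
import os
from typing import Mapping, Optional

# Name taken from the test commands quoted in the PR notes.
ENV_NAME = "VLLM_USE_V2_MODEL_RUNNER"


def read_tristate(environ: Mapping[str, str] = os.environ) -> Optional[bool]:
    """Return None when the variable is unset, True for "1", False for "0"."""
    raw = environ.get(ENV_NAME)
    if raw is None:
        return None
    return bool(int(raw))


def oracle_prefers_v2(model_is_dense: bool) -> bool:
    # Hypothetical stand-in for the oracle: dense models default to v2.
    return model_is_dense


def use_v2_model_runner(
    model_is_dense: bool, environ: Mapping[str, str] = os.environ
) -> bool:
    forced = read_tristate(environ)
    if forced is not None:
        return forced  # An explicit 0 or 1 overrides the oracle.
    return oracle_prefers_v2(model_is_dense)
```

Keeping "unset" distinct from "0" is what lets the oracle pick a default without taking away the ability to force v1.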

We test with "Qwen/Qwen3-0.6B" and "facebook/opt-125m" because they cover most of the current v1 unit tests.

Should land after https://github.com/vllm-project/vllm/pull/39353

Test

Covered by unit tests

Changed files

  • tests/test_config.py (modified, +18/-0)
  • tests/v1/sample/test_logprobs.py (modified, +1/-2)
  • vllm/config/vllm.py (modified, +91/-12)
  • vllm/envs.py (modified, +4/-4)
  • vllm/v1/attention/backends/flashinfer.py (modified, +1/-1)
  • vllm/v1/core/sched/scheduler.py (modified, +1/-2)
  • vllm/v1/worker/gpu_worker.py (modified, +1/-1)

PR #39353: [Model Runner V2] Fix flex attention kv blocks calculation issue

Description (problem / solution / changelog)

Purpose

VLLM_USE_V2_MODEL_RUNNER=1 pytest -s tests/v1/e2e/general/test_async_scheduling.py

(EngineCore pid=2359877) Process EngineCore:
(EngineCore pid=2359877) Traceback (most recent call last):
(EngineCore pid=2359877)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=2359877)     self.run()
(EngineCore pid=2359877)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=2359877)     self._target(*self._args, **self._kwargs)
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 1115, in run_engine_core
(EngineCore pid=2359877)     raise e
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 1085, in run_engine_core
(EngineCore pid=2359877)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2359877)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2359877)     return func(*args, **kwargs)
(EngineCore pid=2359877)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 849, in __init__
(EngineCore pid=2359877)     super().__init__(
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 125, in __init__
(EngineCore pid=2359877)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=2359877)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2359877)     return func(*args, **kwargs)
(EngineCore pid=2359877)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 281, in _initialize_kv_caches
(EngineCore pid=2359877)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=2359877)     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore pid=2359877)                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/multiproc_executor.py", line 412, in collective_rpc
(EngineCore pid=2359877)     return future if non_block else future.result()
(EngineCore pid=2359877)                                     ^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/multiproc_executor.py", line 89, in result
(EngineCore pid=2359877)     return super().result()
(EngineCore pid=2359877)            ^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=2359877)     return self.__get_result()
(EngineCore pid=2359877)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=2359877)     raise self._exception
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/multiproc_executor.py", line 93, in _wait_for_response
(EngineCore pid=2359877)     response = self.aggregate(self.get_response())
(EngineCore pid=2359877)                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2359877)   File "/home/yewentao256/vllm-source/vllm/v1/executor/multiproc_executor.py", line 399, in get_response
(EngineCore pid=2359877)     raise RuntimeError(
(EngineCore pid=2359877) RuntimeError: Worker failed with error 'Fail to re-stride a persistent tensor of shape torch.Size([4096, 256]) for a tensor of shape torch.Size([1024, 4096])', please check the stack trace above for the root cause

This is a bug: the KV block calculation should use the batch token budget, max_num_batched_tokens, rather than max_model_len, which bounds only a single request.
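To illustrate why the choice of bound matters, here is a minimal arithmetic sketch. It is not the code in flex_attention.py; the block size and both limits are hypothetical values chosen only to show that a buffer sized from a single request's limit can be smaller than what a full batch needs.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return -(-a // b)


def num_blocks(num_tokens: int, block_size: int) -> int:
    """Number of fixed-size blocks needed to cover num_tokens."""
    return cdiv(num_tokens, block_size)


# Hypothetical configuration, for illustration only.
max_model_len = 512            # upper bound on one request's length
max_num_batched_tokens = 2048  # upper bound on tokens in one batch
block_size = 128

blocks_for_one_request = num_blocks(max_model_len, block_size)
blocks_for_full_batch = num_blocks(max_num_batched_tokens, block_size)

print(blocks_for_one_request, blocks_for_full_batch)
```

With these numbers, a buffer sized from max_model_len holds 4 blocks while a full batch needs 16. A mismatch of this kind is consistent with the re-stride failure in the traceback above, where the persistent tensor is smaller than the requested one.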

Test

Reran the unit test; it now passes.

Changed files

  • vllm/v1/attention/backends/flex_attention.py (modified, +6/-9)

PR #39937: [Model Runner V2] Multiple prompt logprobs support

Description (problem / solution / changelog)

Purpose

Part of https://github.com/vllm-project/vllm/pull/39337

Multiple prompt logprobs support

Test

VLLM_USE_V2_MODEL_RUNNER=1 pytest tests/v1/sample/test_logprobs.py -k prompt_logprobs_with_chunking_and_preemption

Originally

__________________________________ test_prompt_logprobs_with_chunking_and_preemption ___________________________________

    def test_prompt_logprobs_with_chunking_and_preemption():
        """Test that prompt logprobs are correctly returned when using
        both chunked prefill and preemption.
    
        This test ensures that the num_prompt_logprobs tracking persists
        across preemptions and prefill chunks.
        """
    
        # Create prompts that will trigger chunking and preemption
        prompts = [
            "The following numbers of the sequence "
            + ", ".join(str(i) for i in range(10))
            + " are:",
            "In one word, the capital of France is ",
        ] + [f"Tell me about the number {i}: " for i in range(32)]
    
        sampling_params = SamplingParams(
            temperature=0.0,
            max_tokens=40,
            min_tokens=20,
            prompt_logprobs=2,  # Request prompt logprobs
        )
    
        with VllmRunner(
            "Qwen/Qwen3-0.6B",
            max_model_len=512,
            enable_chunked_prefill=True,
            max_num_batched_tokens=48,  # Force prefill chunking
            num_gpu_blocks_override=32,  # Force preemptions
            disable_log_stats=False,
            gpu_memory_utilization=0.25,
        ) as vllm_model:
            metrics_before = vllm_model.llm.get_metrics()
    
            # Generate with prompt logprobs using generate_w_logprobs which
            # returns (output_ids, output_str, output_logprobs, prompt_logprobs)
            outputs = vllm_model.generate_w_logprobs(
                prompts, sampling_params=sampling_params, include_prompt_token_ids=True
            )
    
            # Verify that all outputs have prompt logprobs
            for i, output in enumerate(outputs):
                _, _, _, prompt_token_ids, prompt_logprobs = output
                assert prompt_logprobs is not None and len(prompt_logprobs) > 0, (
                    f"Output {i} missing prompt logprobs"
                )
                assert len(prompt_logprobs) == len(prompt_token_ids), (
                    "Unexpected number of prompt logprob positions"
                )
    
                # Each position should have the requested number of logprobs
                for pos, logprobs_dict in enumerate(prompt_logprobs):
                    if logprobs_dict is not None:  # First token may be None
>                       assert (
                            sampling_params.prompt_logprobs
                            <= len(logprobs_dict)
                            <= sampling_params.prompt_logprobs + 1
                        ), (
                            f"Output {i} position {pos} has {len(logprobs_dict)} "
                            f"logprobs, expected {sampling_params.prompt_logprobs}"
                        )
E                       AssertionError: Output 0 position 1 has 1 logprobs, expected 2
E                       assert 2 <= 1
E                        +  where 2 = SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, t...mpt_logprobs=2, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None).prompt_logprobs
E                        +  and   1 = len({2701: Logprob(logprob=-10.656400680541992, rank=5307, decoded_token=' following')})

tests/v1/sample/test_logprobs.py:1216: AssertionError

Now

======================= 1 passed, 52 deselected, 17 warnings in 14.63s =======================
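The assertion in the test above allows between prompt_logprobs and prompt_logprobs + 1 entries per position. One way to read that bound, sketched here in plain Python, is that each position reports the top-k candidates plus the actual prompt token when it falls outside the top k. This is an assumption about the intended semantics, not the implementation in prompt_logprob.py; `log_softmax` and `topk_with_actual` are hypothetical helpers.

```python
import math
from typing import Dict, List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def topk_with_actual(
    logits: List[float], actual_token_id: int, k: int
) -> Dict[int, float]:
    """Return logprobs for the k most likely tokens plus the actual token."""
    logprobs = log_softmax(logits)
    ranked = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    chosen = ranked[:k]
    if actual_token_id not in chosen:
        chosen.append(actual_token_id)  # gives k + 1 entries
    return {tok: logprobs[tok] for tok in chosen}


logits = [2.0, 1.0, 0.5, -1.0]
inside = topk_with_actual(logits, actual_token_id=0, k=2)   # actual token is in the top 2
outside = topk_with_actual(logits, actual_token_id=3, k=2)  # actual token is not
print(len(inside), len(outside))
```

Under this reading, the original failure (one entry where two were requested) looks like only the actual token's logprob being returned, with the top-k candidates missing.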

CC @WoosukKwon

Changed files

  • vllm/v1/worker/gpu/sample/prompt_logprob.py (modified, +48/-18)

PR #40559: [Model Runner V2] Add logprob_token_ids support

Description (problem / solution / changelog)

Purpose

Part of https://github.com/vllm-project/vllm/pull/39337

https://buildkite.com/vllm/ci/builds/62340#019db17f-d9c4-413f-b8d1-f6368454ce53 fails because of this

Test

VLLM_USE_V2_MODEL_RUNNER=1 pytest -v -s tests/entrypoints/openai/generative_scoring/test_generative_scoring_e2e.py -k test_basic_score_and_response_structure

Now

===================================== 1 passed, 5 deselected, 16 warnings in 34.78s =====================================

Main

================================================ short test summary info ================================================
FAILED tests/entrypoints/openai/generative_scoring/test_generative_scoring_e2e.py::TestGenerativeScoringAPI::test_basic_score_and_response_structure - AssertionError: Response: {"error":{"message":"Token IDs [9454, 2753] not found in logprobs for item 0. This might i...
===================================== 1 failed, 5 deselected, 16 warnings in 32.02s =====================================
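The failure message on main ("Token IDs [9454, 2753] not found in logprobs") suggests that the caller names specific token IDs and expects their logprobs back regardless of rank. A minimal sketch of that lookup follows; it assumes this is what logprob_token_ids means, and `logprobs_for_token_ids` is a hypothetical helper rather than anything from vllm/v1/worker/gpu/sample/logprob.py.

```python
import math
from typing import Dict, List


def logprobs_for_token_ids(
    logits: List[float], token_ids: List[int]
) -> Dict[int, float]:
    """Return the log-probability of each requested token ID."""
    m = max(logits)
    # Log of the softmax normalizer, computed stably.
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return {t: logits[t] - lse for t in token_ids}


# Toy vocabulary of five tokens; IDs 4 and 1 are requested explicitly.
toy_logits = [0.1, 3.0, -2.0, 0.0, -5.0]
requested = logprobs_for_token_ids(toy_logits, [4, 1])
print(sorted(requested))
```

Unlike a top-k selection, every requested ID appears in the result even when its probability is very low, which is the property the scoring test relies on.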

Changed files

  • vllm/sampling_params.py (modified, +25/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +1/-1)
  • vllm/v1/engine/logprobs.py (modified, +1/-1)
  • vllm/v1/worker/gpu/sample/logprob.py (modified, +133/-9)
  • vllm/v1/worker/gpu/sample/sampler.py (modified, +19/-3)

PR #40648: [Model Runner v2] Fix block table IMA issue

Description (problem / solution / changelog)

Purpose

Part of https://github.com/vllm-project/vllm/pull/39337

VLLM_USE_V2_MODEL_RUNNER=1 pytest tests/basic_correctness/test_cumem.py -k "test_end_to_end and opt-125m" -sv

Originally

(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 1205, in _process_engine_step
(EngineCore pid=2694707)     outputs, model_executed = self.step_fn()
(EngineCore pid=2694707)                               ^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/engine/core.py", line 475, in step_with_batch_queue
(EngineCore pid=2694707)     exec_future = self.model_executor.execute_model(
(EngineCore pid=2694707)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/executor/uniproc_executor.py", line 114, in execute_model
(EngineCore pid=2694707)     output.result()
(EngineCore pid=2694707)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=2694707)     return self.__get_result()
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=2694707)     raise self._exception
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=2694707)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2694707)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2694707)     return func(*args, **kwargs)
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/worker_base.py", line 337, in execute_model
(EngineCore pid=2694707)     return self.worker.execute_model(scheduler_output)
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2694707)     return func(*args, **kwargs)
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu_worker.py", line 814, in execute_model
(EngineCore pid=2694707)     output = self.model_runner.execute_model(
(EngineCore pid=2694707)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2694707)     return func(*args, **kwargs)
(EngineCore pid=2694707)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu/model_runner.py", line 1020, in execute_model
(EngineCore pid=2694707)     attn_metadata = self.model_state.prepare_attn(
(EngineCore pid=2694707)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu/model_states/default.py", line 176, in prepare_attn
(EngineCore pid=2694707)     attn_metadata = build_attn_metadata(
(EngineCore pid=2694707)                     ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu/attn_utils.py", line 263, in build_attn_metadata
(EngineCore pid=2694707)     metadata = attn_metadata_builder.build(
(EngineCore pid=2694707)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2694707)   File "/home/yewentao256/vllm-source/vllm/v1/attention/backends/flashinfer.py", line 1096, in build
(EngineCore pid=2694707)     paged_kv_indptr_prefill_gpu[0] = 0
(EngineCore pid=2694707)     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(EngineCore pid=2694707) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore pid=2694707) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=2694707) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=2694707) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=2694707) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=2694707) 
[rank0]:[W422 15:45:41.364245121 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
FAILED

Now

========================= 1 passed, 7 deselected, 17 warnings in 20.25s ==========================

CC @njhill

Changed files

  • vllm/v1/worker/gpu/block_table.py (modified, +21/-12)
  • vllm/v1/worker/gpu/model_runner.py (modified, +3/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +3/-0)
  • vllm/v1/worker/gpu_worker.py (modified, +3/-10)

PR #41285: [Model Runner v2] Fix v2 compile counter num_gpu_runner_capture_triggers and num_cudagraph_captured

Description (problem / solution / changelog)

Purpose

Part of https://github.com/vllm-project/vllm/pull/39337

VLLM_USE_V2_MODEL_RUNNER=1 pytest tests/compile/test_config.py::test_use_cudagraphs[FULL_DECODE_ONLY-1] -xvs

Originally

tests/compile/test_config.py::test_use_cudagraphs[FULL_DECODE_ONLY-1] FAILED

====================================================== FAILURES =======================================================
_______________________________________ test_use_cudagraphs[FULL_DECODE_ONLY-1] _______________________________________
vllm_runner = <class 'tests.conftest.VllmRunner'>
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7545e4927a10>
cudagraph_mode = <CUDAGraphMode.FULL_DECODE_ONLY: (2, 0)>
num_cudagraph_captured = 1

    @pytest.mark.forked
    @pytest.mark.parametrize(
        "cudagraph_mode,num_cudagraph_captured",
        [
            (CUDAGraphMode.NONE, 0),
            (CUDAGraphMode.FULL_DECODE_ONLY, 1),
            (CUDAGraphMode.PIECEWISE, 13),
            (CUDAGraphMode.FULL_AND_PIECEWISE, 14),
        ],
    )
    def test_use_cudagraphs(
        vllm_runner, monkeypatch, cudagraph_mode, num_cudagraph_captured
    ):
        # Disable multiprocessing so that the counter is in the same process
        monkeypatch.setenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
    
        compilation_config = {
            "cudagraph_capture_sizes": [100],
            "cudagraph_mode": cudagraph_mode,
        }
        num_gpu_runner_capture_triggers = 1 if cudagraph_mode != CUDAGraphMode.NONE else 0
>       with (
            compilation_counter.expect(
                num_graphs_seen=1,
                num_gpu_runner_capture_triggers=num_gpu_runner_capture_triggers,
                num_cudagraph_captured=num_cudagraph_captured,
            ),
            # loading the model causes compilation (if enabled) to happen
            vllm_runner(
                "facebook/opt-125m",
                compilation_config=compilation_config,
                gpu_memory_utilization=0.4,
            ) as _,
        ):

tests/compile/test_config.py:141: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3.12/contextlib.py:144: in __exit__
    next(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = CompilationCounter(num_models_seen=1, num_graphs_seen=1, num_piecewise_graphs_seen=25, num_piecewise_capturable_graphs...facts_loaded=0, num_aot_compiles=1, num_aot_artifacts_saved=1, num_aot_artifacts_loaded=0, stock_torch_compile_count=0)
kwargs = {'num_cudagraph_captured': 1, 'num_gpu_runner_capture_triggers': 1, 'num_graphs_seen': 1}
old = CompilationCounter(num_models_seen=0, num_graphs_seen=0, num_piecewise_graphs_seen=0, num_piecewise_capturable_graphs_...facts_loaded=0, num_aot_compiles=0, num_aot_artifacts_saved=0, num_aot_artifacts_loaded=0, stock_torch_compile_count=0)
k = 'num_gpu_runner_capture_triggers', v = 1

    @contextmanager
    def expect(self, **kwargs: Any) -> Generator[None, None, None]:
        old = self.clone()
        yield
        for k, v in kwargs.items():
>           assert getattr(self, k) - getattr(old, k) == v, (
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                f"{k} not as expected, before it is {getattr(old, k)}"
                f", after it is {getattr(self, k)}, "
                f"expected diff is {v}"
            )
E           AssertionError: num_gpu_runner_capture_triggers not as expected, before it is 0, after it is 0, expected diff is 1

vllm/compilation/counter.py:51: AssertionError
================================================== warnings summary ===================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: 14 warnings
  /home/yewentao256/.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/compile/test_config.py::test_VLLM_DISABLE_COMPILE_CACHE[1]
tests/compile/test_config.py::test_use_cudagraphs[NONE-0]
tests/compile/test_config.py::test_use_cudagraphs[FULL_DECODE_ONLY-1]
  /home/yewentao256/.venv/lib/python3.12/site-packages/py/_process/forkedfunc.py:45: DeprecationWarning: This process (pid=2502479) is multi-threaded, use of fork() may lead to deadlocks in the child.
    pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============================================== short test summary info ===============================================
FAILED tests/compile/test_config.py::test_use_cudagraphs[FULL_DECODE_ONLY-1] - vllm_runner = <class 'tests.conftest.VllmRunner'>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
====================================== 1 failed, 6 passed, 19 warnings in 31.86s ======================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Now

================================== 1 passed, 17 warnings in 6.72s ==================================
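The failure above comes from a counter that was never incremented on the v2 path, so the expected diff of 1 was observed as 0. The pattern can be reproduced with a stripped-down counter. This is a sketch modelled on the `expect` context manager shown in the traceback; `Counter` and `capture_graphs` are hypothetical stand-ins, not vLLM classes.

```python
import copy
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Counter:
    num_gpu_runner_capture_triggers: int = 0
    num_cudagraph_captured: int = 0

    @contextmanager
    def expect(self, **kwargs):
        """Assert that each named field changes by the given amount."""
        old = copy.copy(self)
        yield
        for k, v in kwargs.items():
            diff = getattr(self, k) - getattr(old, k)
            assert diff == v, f"{k}: expected diff {v}, got {diff}"


counter = Counter()


def capture_graphs(record: bool) -> None:
    # Hypothetical capture path: a runner that forgets to record its work
    # leaves both counters untouched.
    if record:
        counter.num_gpu_runner_capture_triggers += 1
        counter.num_cudagraph_captured += 1


# A runner that records its captures satisfies the expectation.
with counter.expect(num_gpu_runner_capture_triggers=1, num_cudagraph_captured=1):
    capture_graphs(record=True)
```

Calling `capture_graphs(record=False)` inside the same `expect` block raises an AssertionError when the block exits, which mirrors the original test failure.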

Changed files

  • vllm/v1/worker/gpu/cudagraph_utils.py (modified, +2/-0)
  • vllm/v1/worker/gpu/model_runner.py (modified, +3/-0)

🚀 The feature, motivation and pitch

We are going to migrate from model runner v1 to model runner v2 gradually. Here is the roadmap:

  • Start with dense models, namely "Qwen/Qwen3-0.6B" and "facebook/opt-125m", as they cover most of the CI tests
  • Then move to MoE models, such as "deepseek-ai/DeepSeek-V2-lite"
  • Finally, test with popular models, such as "deepseek-ai/DeepSeek-V4-Pro"

Tasks:

Alternatives

No response

Additional context

No response


extent analysis

TL;DR

The migration to model runner v2 can be achieved by completing the outstanding tasks listed in the roadmap, starting with the dense models "Qwen/Qwen3-0.6B" and "facebook/opt-125m".

Guidance

Notes

The provided information does not mention any specific technical issues or errors, so the guidance is based on the roadmap and tasks listed.

Recommendation

Apply workaround: Complete the outstanding tasks in the roadmap to progress with the migration, as there is no clear indication of a need to upgrade to a fixed version.
