vllm: [CI Failure]: Spec Decode Draft Model fails during graph capture


Error Message

(EngineCore pid=3374923) RuntimeError: CUDA graphs must be captured on a non-default stream. (However, after capture, it's ok to replay them on the default stream.)
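When reading a long worker log like this one, it can help to pull out only the final exception line. The following is a minimal sketch using the Python standard library; the helper name `final_exception` and the prefix pattern are illustrative assumptions based on the log format shown in this issue, not part of vLLM.

```python
import re

# Assumed log prefix, as seen in this issue: "(EngineCore pid=<number>) "
PREFIX = re.compile(r"^\(EngineCore pid=\d+\)\s?")
# A line such as "RuntimeError: message" with no leading indentation
EXC_LINE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception): ")


def final_exception(log_text):
    """Return the last exception line in the log with the prefix stripped, or None."""
    last = None
    for raw in log_text.splitlines():
        line = PREFIX.sub("", raw)
        if EXC_LINE.match(line):
            last = line
    return last


sample = """(EngineCore pid=3374923) Traceback (most recent call last):
(EngineCore pid=3374923)     super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
(EngineCore pid=3374923) RuntimeError: CUDA graphs must be captured on a non-default stream."""

print(final_exception(sample))
```

Indented source lines and the `Traceback` header do not match the exception pattern, so only the last `...Error:` or `...Exception:` line is returned.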



Name of failing test

tests/v1/e2e/spec_decode/test_lora_with_spec_decode.py::test_batch_inference_correctness[model_setup0]
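To try reproducing the failure locally, the test ID above can be passed to pytest. The sketch below assumes it is run from the root of a vLLM checkout on a machine with a CUDA GPU; the helper names `build_pytest_command` and `run_failing_test` are hypothetical. The command is built as an argument list so the square brackets in the test ID need no shell quoting.

```python
import subprocess
import sys

# Test ID copied from this issue.
TEST_ID = (
    "tests/v1/e2e/spec_decode/test_lora_with_spec_decode.py"
    "::test_batch_inference_correctness[model_setup0]"
)


def build_pytest_command(test_id=TEST_ID):
    """Build the pytest invocation: stop at first failure (-x), show output (-s)."""
    return [sys.executable, "-m", "pytest", "-x", "-s", test_id]


def run_failing_test():
    """Run the test and return pytest's exit code (non-zero means failure)."""
    return subprocess.run(build_pytest_command(), check=False).returncode
```

Calling `run_failing_test()` starts the actual test run, so it is left uncalled here.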

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

Here's the full call stack:

Capturing CUDA graphs (PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.36it/s]
Capturing CUDA graphs (FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 16.54it/s]
(EngineCore pid=3374923) Process EngineCore:
(EngineCore pid=3374923) Traceback (most recent call last):
(EngineCore pid=3374923)   File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=3374923)     self.run()
(EngineCore pid=3374923)   File "/usr/lib64/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=3374923)     self._target(*self._args, **self._kwargs)
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 1163, in run_engine_core
(EngineCore pid=3374923)     raise e
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 1133, in run_engine_core
(EngineCore pid=3374923)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=3374923)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 899, in __init__
(EngineCore pid=3374923)     super().__init__(
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=3374923)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=3374923)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=3374923)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=3374923)     compilation_times: list[CompilationTimes] = self.collective_rpc(
(EngineCore pid=3374923)                                                 ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=3374923)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=3374923)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu_worker.py", line 689, in compile_or_warm_up_model
(EngineCore pid=3374923)     warmup_kernels(self.model_runner, self.execute_model, self.sample_tokens)
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu/warmup.py", line 100, in warmup_kernels
(EngineCore pid=3374923)     worker_execute_model(prefill_output)
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu_worker.py", line 843, in execute_model
(EngineCore pid=3374923)     output = self.model_runner.execute_model(
(EngineCore pid=3374923)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu/model_runner.py", line 1167, in execute_model
(EngineCore pid=3374923)     model_output = self.model(**model_inputs)
(EngineCore pid=3374923)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=3374923)     return self._call_impl(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=3374923)     return forward_call(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/model_executor/models/qwen3.py", line 323, in forward
(EngineCore pid=3374923)     hidden_states = self.model(
(EngineCore pid=3374923)                     ^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/compilation/decorators.py", line 520, in __call__
(EngineCore pid=3374923)     return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(EngineCore pid=3374923)     return self.fn(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/model_executor/models/qwen2.py", line 389, in forward
(EngineCore pid=3374923)     def forward(
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/compilation/caching.py", line 217, in __call__
(EngineCore pid=3374923)     return self.optimized_call(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "<string>", line 145, in execution_fn
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/compilation/cuda_graph.py", line 313, in __call__
(EngineCore pid=3374923)     with torch.cuda.graph(
(EngineCore pid=3374923)          ^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/cuda/graphs.py", line 257, in __enter__
(EngineCore pid=3374923)     self.cuda_graph.capture_begin(
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/cuda/graphs.py", line 115, in capture_begin
(EngineCore pid=3374923)     super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
(EngineCore pid=3374923) RuntimeError: CUDA graphs must be captured on a non-default stream. (However, after capture, it's ok to replay them on the default stream.)
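For context on the final error, PyTorch requires graph capture to happen on a stream other than the default one. The sketch below shows that general pattern in plain PyTorch by passing an explicit side stream to `torch.cuda.graph`. It is an illustration of the constraint in the error message, not vLLM's code and not a confirmed fix for this issue; the function name `capture_on_side_stream` is hypothetical, and it returns `None` when PyTorch or CUDA is unavailable.

```python
import importlib.util


def capture_on_side_stream():
    """Capture and replay a tiny CUDA graph on a non-default stream.

    Returns the replayed output as a list, or None if torch/CUDA is missing.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None

    side = torch.cuda.Stream()  # a non-default stream used for capture
    static_in = torch.zeros(4, device="cuda")

    # Warm up on the side stream before capturing.
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            static_out = static_in * 2 + 1
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    # Capture happens on `side`, never on the default stream.
    with torch.cuda.graph(graph, stream=side):
        static_out = static_in * 2 + 1

    # Replay is allowed on the default stream, as the error message notes.
    static_in.fill_(3.0)
    graph.replay()
    torch.cuda.synchronize()
    return static_out.tolist()


result = capture_on_side_stream()
print(result)
```

If the stream handed to the capture call is the default stream, `capture_begin` raises the `RuntimeError` seen above; whether that is what happens in the draft-model path here is not established by the issue.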

📝 History of failing test

According to the CI dashboard, this test has been failing 100% of the time since commit f887aa1a53.

https://buildkite.com/vllm/ci/builds/66298

CC List.

@MatthewBonanni @LucasWilkinson @benchislett
