vllm [Bug]: NVML_SUCCESS == r INTERNAL ASSERT FAILED and OOM

GitHub stats
vllm-project/vllm#39048, fetched 2026-04-08 02:52:49
Comments: 0 · Participants: 1 · Timeline: 1 (labeled ×1) · Reactions: 0

Error Message

(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099] EngineCore failed to start.
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     super().__init__(
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self._init_executor()
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 50, in _init_executor
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self.driver_worker.load_model()
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 335, in load_model
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4497, in load_model
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self.model = model_loader.load_model(
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return func(*args, **kwargs)

Root Cause

LOG OOM

(APIServer pid=1) INFO 04-02 17:45:05 [utils.py:233] non-default args: {'model_tag': 'openai/gpt-oss-120b', 'enable_auto_tool_choice': True, 'tool_call_parser': 'openai', 'api_key': ['gmB9YGTnDjGDaqhGvWgFAUZ1kcBvNbZt'], 'model': 'openai/gpt-oss-120b', 'max_model_len': 16384, 'gpu_memory_utilization': 0.5, 'enable_prefix_caching': False, 'max_num_batched_tokens': 8192, 'max_num_seqs': 32, 'stream_interval': 20, 'max_cudagraph_capture_size': 2048}
(APIServer pid=1) INFO 04-02 17:45:13 [model.py:533] Resolved architecture: GptOssForCausalLM
Parse safetensors files: 100%|██████████| 15/15 [00:05<00:00,  2.61it/s]
(APIServer pid=1) INFO 04-02 17:45:20 [model.py:1582] Using max model len 16384
(APIServer pid=1) INFO 04-02 17:45:20 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-02 17:45:20 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=159) INFO 04-02 17:45:28 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024, 1040, 1056, 1072, 1088, 1104, 1120, 1136, 1152, 1168, 1184, 1200, 1216, 1232, 1248, 1264, 1280, 1296, 1312, 1328, 1344, 1360, 1376, 1392, 1408, 1424, 1440, 1456, 1472, 1488, 1504, 1520, 1536, 1552, 1568, 1584, 1600, 1616, 1632, 1648, 1664, 1680, 1696, 1712, 1728, 1744, 1760, 1776, 1792, 1808, 1824, 1840, 1856, 1872, 1888, 1904, 1920, 1936, 1952, 1968, 1984, 2000, 2016, 2032, 2048], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 2048, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=159) INFO 04-02 17:45:29 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.89.1.2:47807 backend=nccl
[W402 17:45:29.050330764 socket.cpp:764] [c10d] The client socket cannot be initialized to connect to [vllm-gpt]:47807 (errno: 97 - Address family not supported by protocol).
(EngineCore pid=159) INFO 04-02 17:45:29 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099] EngineCore failed to start.
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     super().__init__(
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self._init_executor()
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self.driver_worker.init_device()
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 312, in init_device
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self.worker.init_device()  # type: ignore
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 283, in init_device
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 413, in request_memory
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     raise ValueError(
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099] ValueError: Free memory on device cuda:0 (67.94/140.55 GiB) on startup is less than desired GPU memory utilization (0.5, 70.27 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(EngineCore pid=159) Process EngineCore:
(EngineCore pid=159) Traceback (most recent call last):
(EngineCore pid=159)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=159)     self.run()
(EngineCore pid=159)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=159)     self._target(*self._args, **self._kwargs)
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=159)     raise e
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=159)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=159)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159)     return func(*args, **kwargs)
(EngineCore pid=159)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=159)     super().__init__(
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=159)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=159)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159)     return func(*args, **kwargs)
(EngineCore pid=159)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=159)     self._init_executor()
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore pid=159)     self.driver_worker.init_device()
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 312, in init_device
(EngineCore pid=159)     self.worker.init_device()  # type: ignore
(EngineCore pid=159)     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159)     return func(*args, **kwargs)
(EngineCore pid=159)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 283, in init_device
(EngineCore pid=159)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=159)                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 413, in request_memory
(EngineCore pid=159)     raise ValueError(
(EngineCore pid=159) ValueError: Free memory on device cuda:0 (67.94/140.55 GiB) on startup is less than desired GPU memory utilization (0.5, 70.27 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
[rank0]:[W402 17:45:30.747444575 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
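
The ValueError in the log above comes down to simple arithmetic: 0.5 of 140.55 GiB is 70.27 GiB, but only 67.94 GiB was free at startup. The following is a minimal sketch of that check, not vLLM's own code. The regex and the helper names `parse_memory_error` and `max_utilization` are hypothetical and written only for this one message format.

```python
import math
import re

# The ValueError text copied from the log above.
MESSAGE = (
    "Free memory on device cuda:0 (67.94/140.55 GiB) on startup is less than "
    "desired GPU memory utilization (0.5, 70.27 GiB)."
)

# Hypothetical pattern for this message: "(free/total GiB) ... (util, requested GiB)".
PATTERN = re.compile(
    r"\((?P<free>[\d.]+)/(?P<total>[\d.]+) GiB\).*"
    r"\((?P<util>[\d.]+), (?P<requested>[\d.]+) GiB\)"
)


def parse_memory_error(message: str) -> dict[str, float]:
    """Extract free, total, util and requested values (GiB or fraction)."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not match the expected format")
    return {name: float(value) for name, value in match.groupdict().items()}


def max_utilization(free_gib: float, total_gib: float) -> float:
    """Largest two-decimal fraction of total memory that still fits in free memory."""
    return math.floor(free_gib / total_gib * 100) / 100


info = parse_memory_error(MESSAGE)
shortfall = info["requested"] - info["free"]
limit = max_utilization(info["free"], info["total"])

print(f"shortfall: {shortfall:.2f} GiB")
print(f"largest fitting value: --gpu-memory-utilization {limit:.2f}")
```

The free-memory figure is a snapshot from one startup, so a value derived this way only holds while the other processes on cuda:0 keep using the same amount of memory.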

Fix Action

Fix / Workaround

The startup check quoted under Root Cause names its own remedy: decrease GPU memory utilization or reduce GPU memory used by other processes. With 67.94 GiB free out of 140.55 GiB on cuda:0, the configured gpu_memory_utilization of 0.5 requests 70.27 GiB. Either lower the value below roughly 0.483 (67.94 / 140.55), or free at least 2.33 GiB on the device before starting the server.

This addresses only the ValueError raised in request_memory during init_device (17:45:30, EngineCore pid=159). The truncated load_model traceback under Error Message comes from a later run (19:29:23, EngineCore pid=161), and this log does not show its final exception.
(EngineCore pid=159) ValueError: Free memory on device cuda:0 (67.94/140.55 GiB) on startup is less than desired GPU memory utilization (0.5, 70.27 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
[rank0]:[W402 17:45:30.747444575 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
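
The `ValueError` in the traceback above comes down to simple arithmetic: the requested budget is the utilization fraction times the total device memory, and startup is refused when the free memory is smaller than that budget. The sketch below reproduces the comparison with the numbers from the log (140.55 GiB total, 67.94 GiB free, utilization 0.5). The function names are hypothetical and this is not vLLM's actual implementation, only an illustration of the check the error message describes.

```python
# Hypothetical re-creation of the startup check described by the error
# message; the names below are illustrative and are not vLLM's source code.
GIB_TOTAL = 140.55  # total device memory reported in the log
GIB_FREE = 67.94    # free memory at startup reported in the log


def requested_gib(total_gib: float, utilization: float) -> float:
    """Memory budget implied by a utilization fraction of the whole device."""
    return total_gib * utilization


def fits(free_gib: float, total_gib: float, utilization: float) -> bool:
    """True when the free memory covers the requested budget."""
    return free_gib >= requested_gib(total_gib, utilization)


def max_utilization(free_gib: float, total_gib: float) -> float:
    """Largest fraction that the currently free memory could satisfy."""
    return free_gib / total_gib


if __name__ == "__main__":
    print(f"requested at 0.5: {requested_gib(GIB_TOTAL, 0.5):.3f} GiB")
    print(f"fits at 0.5: {fits(GIB_FREE, GIB_TOTAL, 0.5)}")
    print(f"largest fraction that fits: {max_utilization(GIB_FREE, GIB_TOTAL):.3f}")
```

With these inputs the request of roughly 70.27 GiB exceeds the 67.94 GiB that was free, so about 72.6 GiB was already in use on the device when this engine started. Under the same assumptions, a fraction below about 0.48 would have passed the check.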

Your current environment

- GPU: H200
- NVIDIA driver: 580.126.09
- NVIDIA CUDA: 13.0
- OS: Red Hat
- Container runtime: Podman

🐛 Describe the bug

I don't understand why I get all these new errors, including OOM, when I try to launch multiple models on my H200 GPU. In theory, this configuration should fit:



```yaml
version: "3.9"
services:
  vllm-gpt:
    image: vllm/vllm-openai:v0.18.1
    container_name: vllm-gpt
    devices:
      - nvidia.com/gpu=all
    ports:
      - "127.0.0.1:8001:8000"
    volumes:
      - /data/huggingface:/root/.cache/huggingface
    ipc: host
    environment:
      - HF_TOKEN="TOKEN"
    command: >
      --model openai/gpt-oss-120b
      --no-enable-prefix-caching
      --max-cudagraph-capture-size 2048
      --max-num-batched-tokens 8192
      --stream-interval 20
      --max-model-len 16384
      --max-num-seqs 10
      --tensor-parallel-size 1
      --enable-auto-tool-choice
      --tool-call-parser openai
      --gpu-memory-utilization 0.4
      --port 8000
      --api-key YYYYYYYYYYYYYYYYYYYYYYYYYY


  vllm-qwen:
    image: vllm/vllm-openai:v0.18.1
    container_name: vllm-qwen
    devices:
      - nvidia.com/gpu=all
    ports:
      - "127.0.0.1:8002:8000"
    volumes:
      - /data/huggingface:/root/.cache/huggingface
    ipc: host
    command: >
      --model Qwen/Qwen3.5-27B-FP8
      --gpu-memory-utilization 0.4
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-model-len 16384
      --kv-cache-dtype fp8
      --tensor-parallel-size 1
      --max-num-batched-tokens 8192
      --max-num-seqs 32
      --port 8000
      --api-key XXXXXXXXXXXXXXXXXXXXXXX
  vllm-qwen-emb:
    image: vllm/vllm-openai:v0.18.0
    container_name: vllm-qwen-emb
    devices:
      - nvidia.com/gpu=all
    ports:
      - "127.0.0.1:8003:8000"
    volumes:
      - /data/huggingface:/root/.cache/huggingface
    ipc: host
    command: >
      --model Qwen/Qwen3-Embedding-0.6B
      --gpu-memory-utilization 0.1
      --runner pooling
      --port 8000
      --api-key ZZZZZZZZZZZZZZZZZZZ

```
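
As a rough sanity check of the plan above, the three `--gpu-memory-utilization` values sum to 0.9 of the device. The sketch below simulates starting the services one after another, assuming each instance consumes exactly its budget (an idealization: real processes also hold CUDA context and other memory outside that budget) and assuming the 140.55 GiB total reported in the logs. The `simulate_startup` helper and the `already_used_gib` parameter are hypothetical names introduced for illustration.

```python
# Illustrative budget simulation; assumes every instance uses exactly
# total * fraction and nothing more, which real deployments may exceed.
TOTAL_GIB = 140.55  # total device memory taken from the logs

# (service name, --gpu-memory-utilization) from the compose file
PLAN = [("vllm-gpt", 0.4), ("vllm-qwen", 0.4), ("vllm-qwen-emb", 0.1)]


def simulate_startup(total_gib, plan, already_used_gib=0.0):
    """Return (name, budget_gib, started) for each service, in order.

    A service starts only if the memory still free covers its budget,
    mirroring the comparison described in the ValueError message.
    """
    free = total_gib - already_used_gib
    results = []
    for name, fraction in plan:
        budget = total_gib * fraction
        started = free >= budget
        results.append((name, budget, started))
        if started:
            free -= budget
    return results


if __name__ == "__main__":
    for name, budget, started in simulate_startup(TOTAL_GIB, PLAN):
        print(f"{name}: budget {budget:.2f} GiB, started={started}")
```

On an otherwise empty GPU all three services pass the check in this model, which matches the expectation stated in the report. Passing a non-zero `already_used_gib` (for example memory held by a container that was not fully stopped) shows how the later services could be refused even though the fractions sum to less than 1.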

# LOGS

```
[root@lrllmhub1 vllm]# cd gptoss/
[root@lrllmhub1 gptoss]# ls
compose.yaml
(failed reverse-i-search)`ci': ^C gptoss/
[root@lrllmhub1 gptoss]# podman compose up -d --force-recreate
>>>> Executing external compose provider "/bin/podman-compose". Please see podman-compose(1) for how to disable this message. <<<<

vllm-gpt
vllm-gpt
5011e15d0903bc5c1088f1c531b75b5bec3a9de5c6b8f3f361670a13516274f2
gptoss_default
46be60dd8a653d9a0729ae79992613a18c20f22ce336602730abb143780b91b7
3c8218bd10b2db942e1c565977a6a23ebe8083eed4719117c0bb83db6eff3cfa
vllm-gpt
(reverse-i-search)`gpt': cd ^Ctoss/
[root@lrllmhub1 gptoss]# ^C
[root@lrllmhub1 gptoss]# podman compose up -d --force-recrea^C
[root@lrllmhub1 gptoss]#
[root@lrllmhub1 gptoss]# podman logs vllm-gpt
WARNING 04-02 19:28:24 [argparse_utils.py:193] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 04-02 19:28:24 [utils.py:297]
(APIServer pid=1) INFO 04-02 19:28:24 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-02 19:28:24 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.1
(APIServer pid=1) INFO 04-02 19:28:24 [utils.py:297]   █▄█▀ █     █     █     █  model   openai/gpt-oss-120b
(APIServer pid=1) INFO 04-02 19:28:24 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-02 19:28:24 [utils.py:297]
(APIServer pid=1) INFO 04-02 19:28:24 [utils.py:233] non-default args: {'model_tag': 'openai/gpt-oss-120b', 'enable_auto_tool_choice': True, 'tool_call_parser': 'openai', 'api_key': ['gmB9YGTnDjGDaqhGvWgFAUZ1kcBvNbZt'], 'model': 'openai/gpt-oss-120b', 'max_model_len': 16384, 'gpu_memory_utilization': 0.4, 'enable_prefix_caching': False, 'max_num_batched_tokens': 8192, 'max_num_seqs': 10, 'stream_interval': 20, 'max_cudagraph_capture_size': 2048}
(APIServer pid=1) INFO 04-02 19:28:32 [model.py:533] Resolved architecture: GptOssForCausalLM
Parse safetensors files: 100%|██████████| 15/15 [00:05<00:00,  2.59it/s]
(APIServer pid=1) INFO 04-02 19:28:38 [model.py:1582] Using max model len 16384
(APIServer pid=1) INFO 04-02 19:28:39 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-02 19:28:39 [vllm.py:775] Asynchronous scheduling is enabled.
(EngineCore pid=161) INFO 04-02 19:28:47 [core.py:103] Initializing a V1 LLM engine (v0.18.1) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024, 1040, 1056, 1072, 1088, 1104, 1120, 1136, 1152, 1168, 1184, 1200, 1216, 1232, 1248, 1264, 1280, 1296, 1312, 1328, 1344, 1360, 1376, 1392, 1408, 1424, 1440, 1456, 1472, 1488, 1504, 1520, 1536, 1552, 1568, 1584, 1600, 1616, 1632, 1648, 1664, 1680, 1696, 1712, 1728, 1744, 1760, 1776, 1792, 1808, 1824, 1840, 1856, 1872, 1888, 1904, 1920, 1936, 1952, 1968, 1984, 2000, 2016, 2032, 2048], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 2048, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=161) INFO 04-02 19:28:47 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.89.0.2:59763 backend=nccl
[W402 19:28:47.862967742 socket.cpp:764] [c10d] The client socket cannot be initialized to connect to [vllm-gpt]:59763 (errno: 97 - Address family not supported by protocol).
(EngineCore pid=161) INFO 04-02 19:28:47 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=161) INFO 04-02 19:28:48 [gpu_model_runner.py:4481] Starting to load model openai/gpt-oss-120b...
(EngineCore pid=161) INFO 04-02 19:28:49 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'TRITON_ATTN'].
(EngineCore pid=161) INFO 04-02 19:28:49 [flash_attn.py:598] Using FlashAttention version 3
(EngineCore pid=161) INFO 04-02 19:28:49 [mxfp4.py:169] Using Triton backend
(EngineCore pid=161) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=161) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:01<00:25,  1.82s/it]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:03<00:19,  1.48s/it]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:04<00:16,  1.42s/it]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:05<00:14,  1.35s/it]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:07<00:15,  1.59s/it]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:09<00:15,  1.67s/it]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:11<00:14,  1.87s/it]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:13<00:13,  1.93s/it]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:16<00:13,  2.22s/it]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:18<00:11,  2.23s/it]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [00:21<00:09,  2.44s/it]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:24<00:07,  2.45s/it]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [00:26<00:04,  2.23s/it]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:28<00:02,  2.28s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:30<00:00,  2.19s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:30<00:00,  2.03s/it]
(EngineCore pid=161)
(EngineCore pid=161) INFO 04-02 19:29:21 [default_loader.py:384] Loading weights took 30.51 seconds
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099] EngineCore failed to start.
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     super().__init__(
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self._init_executor()
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 50, in _init_executor
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self.driver_worker.load_model()
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 335, in load_model
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4497, in load_model
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     self.model = model_loader.load_model(
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 74, in load_model
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     process_weights_after_loading(model, model_config, target_device)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 106, in process_weights_after_loading
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     quant_method.process_weights_after_loading(module)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/mxfp4.py", line 906, in process_weights_after_loading
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     w13_weight, w13_flex, w13_scale = _swizzle_mxfp4(
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                                       ^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/mxfp4_utils.py", line 83, in _swizzle_mxfp4
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     quant_tensor = convert_layout(
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                    ^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/third_party/triton_kernels/tensor.py", line 261, in convert_layout
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     new_data = new_layout.swizzle_data(old_data)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/third_party/triton_kernels/tensor_details/layout_details/hopper_value.py", line 188, in swizzle_data
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     data = _pack_bits(data, self.mx_axis)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/third_party/triton_kernels/tensor_details/layout_details/hopper_value.py", line 37, in _pack_bits
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     first = _compress_fp4(x[..., 0]) | (_compress_fp4(x[..., 0] >> 4) << 16)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/third_party/triton_kernels/tensor_details/layout_details/hopper_value.py", line 25, in _compress_fp4
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]     return ((x & 0x8) << 12) | ((x & 0x7) << 6)
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099]                                 ~~~~~~~~~~^~~~
(EngineCore pid=161) ERROR 04-02 19:29:23 [core.py:1099] RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":1154, please report a bug to PyTorch.
(EngineCore pid=161) Process EngineCore:
(EngineCore pid=161) Traceback (most recent call last):
(EngineCore pid=161)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=161)     self.run()
(EngineCore pid=161)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=161)     self._target(*self._args, **self._kwargs)
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=161)     raise e
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=161)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=161)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161)     return func(*args, **kwargs)
(EngineCore pid=161)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=161)     super().__init__(
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=161)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=161)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161)     return func(*args, **kwargs)
(EngineCore pid=161)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=161)     self._init_executor()
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 50, in _init_executor
(EngineCore pid=161)     self.driver_worker.load_model()
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 335, in load_model
(EngineCore pid=161)     self.model_runner.load_model(load_dummy_weights=dummy_weights)
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161)     return func(*args, **kwargs)
(EngineCore pid=161)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4497, in load_model
(EngineCore pid=161)     self.model = model_loader.load_model(
(EngineCore pid=161)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=161)     return func(*args, **kwargs)
(EngineCore pid=161)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 74, in load_model
(EngineCore pid=161)     process_weights_after_loading(model, model_config, target_device)
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 106, in process_weights_after_loading
(EngineCore pid=161)     quant_method.process_weights_after_loading(module)
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/mxfp4.py", line 906, in process_weights_after_loading
(EngineCore pid=161)     w13_weight, w13_flex, w13_scale = _swizzle_mxfp4(
(EngineCore pid=161)                                       ^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/mxfp4_utils.py", line 83, in _swizzle_mxfp4
(EngineCore pid=161)     quant_tensor = convert_layout(
(EngineCore pid=161)                    ^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/third_party/triton_kernels/tensor.py", line 261, in convert_layout
(EngineCore pid=161)     new_data = new_layout.swizzle_data(old_data)
(EngineCore pid=161)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/third_party/triton_kernels/tensor_details/layout_details/hopper_value.py", line 188, in swizzle_data
(EngineCore pid=161)     data = _pack_bits(data, self.mx_axis)
(EngineCore pid=161)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/third_party/triton_kernels/tensor_details/layout_details/hopper_value.py", line 37, in _pack_bits
(EngineCore pid=161)     first = _compress_fp4(x[..., 0]) | (_compress_fp4(x[..., 0] >> 4) << 16)
(EngineCore pid=161)                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=161)   File "/usr/local/lib/python3.12/dist-packages/vllm/third_party/triton_kernels/tensor_details/layout_details/hopper_value.py", line 25, in _compress_fp4
(EngineCore pid=161)     return ((x & 0x8) << 12) | ((x & 0x7) << 6)
(EngineCore pid=161)                                 ~~~~~~~~~~^~~~
(EngineCore pid=161) RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":1154, please report a bug to PyTorch.

LOG OOM

APIServer pid=1) INFO 04-02 17:45:05 [utils.py:233] non-default args: {'model_tag': 'openai/gpt-oss-120b', 'enable_auto_tool_choice': True, 'tool_call_parser': 'openai', 'api_key': ['gmB9YGTnDjGDaqhGvWgFAUZ1kcBvNbZt'], 'model': 'openai/gpt-oss-120b', 'max_model_len': 16384, 'gpu_memory_utilization': 0.5, 'enable_prefix_caching': False, 'max_num_batched_tokens': 8192, 'max_num_seqs': 32, 'stream_interval': 20, 'max_cudagraph_capture_size': 2048}
(APIServer pid=1) INFO 04-02 17:45:13 [model.py:533] Resolved architecture: GptOssForCausalLM
Parse safetensors files: 100%|██████████| 15/15 [00:05<00:00,  2.61it/s]
(APIServer pid=1) INFO 04-02 17:45:20 [model.py:1582] Using max model len 16384
(APIServer pid=1) INFO 04-02 17:45:20 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-02 17:45:20 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=159) INFO 04-02 17:45:28 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024, 1040, 1056, 1072, 1088, 1104, 1120, 1136, 1152, 1168, 1184, 1200, 1216, 1232, 1248, 1264, 1280, 1296, 1312, 1328, 1344, 1360, 1376, 1392, 1408, 1424, 1440, 1456, 1472, 1488, 1504, 1520, 1536, 1552, 1568, 1584, 1600, 1616, 1632, 1648, 1664, 1680, 1696, 1712, 1728, 1744, 1760, 1776, 1792, 1808, 1824, 1840, 1856, 1872, 1888, 1904, 1920, 1936, 1952, 1968, 1984, 2000, 2016, 2032, 2048], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 2048, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=159) INFO 04-02 17:45:29 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.89.1.2:47807 backend=nccl
[W402 17:45:29.050330764 socket.cpp:764] [c10d] The client socket cannot be initialized to connect to [vllm-gpt]:47807 (errno: 97 - Address family not supported by protocol).
(EngineCore pid=159) INFO 04-02 17:45:29 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099] EngineCore failed to start.
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     super().__init__(
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self._init_executor()
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self.driver_worker.init_device()
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 312, in init_device
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self.worker.init_device()  # type: ignore
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 283, in init_device
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 413, in request_memory
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099]     raise ValueError(
(EngineCore pid=159) ERROR 04-02 17:45:30 [core.py:1099] ValueError: Free memory on device cuda:0 (67.94/140.55 GiB) on startup is less than desired GPU memory utilization (0.5, 70.27 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(EngineCore pid=159) Process EngineCore:
(EngineCore pid=159) Traceback (most recent call last):
(EngineCore pid=159)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=159)     self.run()
(EngineCore pid=159)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=159)     self._target(*self._args, **self._kwargs)
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=159)     raise e
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=159)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=159)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159)     return func(*args, **kwargs)
(EngineCore pid=159)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=159)     super().__init__(
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=159)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=159)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159)     return func(*args, **kwargs)
(EngineCore pid=159)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=159)     self._init_executor()
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore pid=159)     self.driver_worker.init_device()
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 312, in init_device
(EngineCore pid=159)     self.worker.init_device()  # type: ignore
(EngineCore pid=159)     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=159)     return func(*args, **kwargs)
(EngineCore pid=159)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 283, in init_device
(EngineCore pid=159)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=159)                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/utils.py", line 413, in request_memory
(EngineCore pid=159)     raise ValueError(
(EngineCore pid=159) ValueError: Free memory on device cuda:0 (67.94/140.55 GiB) on startup is less than desired GPU memory utilization (0.5, 70.27 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
[rank0]:[W402 17:45:30.747444575 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}


extent analysis

TL;DR

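The second log is a failed startup check rather than a runtime out-of-memory (OOM) error: only 67.94 GiB of 140.55 GiB was free on cuda:0, less than the 70.27 GiB requested by gpu_memory_utilization=0.5. Lower gpu_memory_utilization so the requested budget fits in free memory, or free the GPU memory held by other processes. The NVML_SUCCESS == r assertion in the first log is raised from PyTorch's CUDA caching allocator during MXFP4 weight swizzling, which is consistent with the GPU running out of memory while the weights are processed.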
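The startup check behind the ValueError in the log can be reproduced with simple arithmetic. The sketch below is not vLLM's code. The function names are hypothetical, and the numbers are the ones reported in the log (67.94 GiB free, 140.55 GiB total, utilization 0.5).

```python
def requested_gib(total_gib: float, utilization: float) -> float:
    """Memory budget implied by a utilization fraction of total device memory."""
    return total_gib * utilization


def passes_startup_check(free_gib: float, total_gib: float, utilization: float) -> bool:
    """True when the free memory covers the requested budget."""
    return free_gib >= requested_gib(total_gib, utilization)


def max_utilization(free_gib: float, total_gib: float) -> float:
    """Largest fraction that the currently free memory could still satisfy."""
    return free_gib / total_gib


if __name__ == "__main__":
    free, total = 67.94, 140.55  # values reported in the ValueError
    print(f"requested at 0.5: {requested_gib(total, 0.5):.2f} GiB")
    print(f"passes at 0.5:    {passes_startup_check(free, total, 0.5)}")
    print(f"passes at 0.48:   {passes_startup_check(free, total, 0.48)}")
    print(f"upper bound:      {max_utilization(free, total):.3f}")
```

With these inputs the check fails at 0.5 and passes at 0.48, which matches the error message in the log.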

Guidance

  1. Check GPU memory usage: Run nvidia-smi before starting vLLM to see how much memory is free and which processes hold the rest.
  2. Adjust gpu_memory_utilization: Choose a fraction whose share of total memory fits in free memory. Here 67.94 / 140.55 ≈ 0.483, so any value at or below 0.48 passes the startup check.
  3. Reduce other process memory usage: About 72.6 GiB was already in use on cuda:0 at startup. If possible, stop or move those processes to free memory for the model.
  4. Monitor GPU memory: Keep watching nvidia-smi while the model loads and adjust gpu_memory_utilization as needed.
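The steps above can be sketched as a small helper that reads total and free memory from nvidia-smi and proposes a starting value. This is only an illustration. The helper names and the 0.02 headroom are assumptions, not vLLM settings, and the parser assumes nvidia-smi reports plain integers in MiB for every GPU.

```python
import subprocess

# nvidia-smi query that prints one "total, free" row (in MiB) per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse 'total, free' rows into a list of (total_mib, free_mib)."""
    rows = []
    for line in text.strip().splitlines():
        total, free = (int(field.strip()) for field in line.split(","))
        rows.append((total, free))
    return rows


def suggest_utilization(total_mib: int, free_mib: int, headroom: float = 0.02) -> float:
    """Free fraction minus a small headroom (assumed), floored to two decimals."""
    fraction = free_mib / total_mib - headroom
    return max(0.0, int(fraction * 100) / 100)


def read_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return (total_mib, free_mib) per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

On a machine with an NVIDIA driver installed, calling `read_gpu_memory()` and passing each pair to `suggest_utilization` gives a conservative value to try for `--gpu-memory-utilization`.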

Example

For example, lower gpu_memory_utilization to 0.3 in the Compose service definition (the failing run in the log above used 0.5):

services:
  vllm-gpt:
    ...
    command: >
      --model openai/gpt-oss-120b
      --no-enable-prefix-caching
      --max-cudagraph-capture-size 2048
      --max-num-batched-tokens 8192
      --stream-interval 20
      --max-model-len 16384
      --max-num-seqs 10
      --tensor-parallel-size 1
      --enable-auto-tool-choice
      --tool-call-parser openai
      --gpu-memory-utilization 0.3
      --port 8000
      --api-key YYYYYYYYYYYYYYYYYYYYYYYYYY

Notes

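  • gpu_memory_utilization is the fraction of total GPU memory (0 to 1) that this vLLM instance may use for weights, activations, and KV cache. Lowering it lets the startup check pass but leaves less room for KV cache, which can reduce throughput and the number of concurrent sequences.
  • The budget must still hold the model weights. At 0.3 on this 140.55 GiB device the budget is about 42 GiB, which may be too small for openai/gpt-oss-120b. If loading then fails, free the memory held by other processes rather than lowering the value further.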

Recommendation

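Set gpu_memory_utilization so that its share of total memory is below what is actually free at startup (under about 0.48 for the 67.94 GiB free in this log), or stop the other processes occupying the GPU. Because roughly half of the device was already in use, freeing that memory is likely the more robust fix.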
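When triaging logs like the one above, the numbers can be pulled straight out of the ValueError text. The following is a minimal sketch. The regular expression is written against the message as it appears in this log, and the function name and result keys are hypothetical.

```python
import re
from typing import Optional

# Matches the startup-check message exactly as it appears in the log above.
PATTERN = re.compile(
    r"Free memory on device (?P<device>\S+) "
    r"\((?P<free>[\d.]+)/(?P<total>[\d.]+) GiB\) on startup is less than "
    r"desired GPU memory utilization \((?P<util>[\d.]+), (?P<requested>[\d.]+) GiB\)"
)


def explain_startup_failure(message: str) -> Optional[dict]:
    """Extract the figures from the message; return None if it does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    free = float(match["free"])
    total = float(match["total"])
    requested = float(match["requested"])
    return {
        "device": match["device"],
        "configured_utilization": float(match["util"]),
        "shortfall_gib": requested - free,   # memory to free for the check to pass
        "max_utilization": free / total,     # largest fraction that would pass as-is
    }
```

The result gives both options at once: how much memory other processes would need to release, and the highest utilization that would pass without freeing anything.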
