vllm - 💡(How to fix) Fix [Bug]: Gemma-4-31B-IT-NVFP4 (modelopt) causing OOM on single RTX 5090, suspect full BF16 weights during init [1 participants]

GitHub stats
vllm-project/vllm#40291 · Fetched 2026-04-20 11:59:33
Comments: 0
Participants: 1
Timeline: 1
Reactions: 0
Timeline (top): labeled ×1

Error Message

uv run vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --max-model-len 8192
WARNING 04-19 15:32:24 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.1
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] █▄█▀ █ █ █ █ model nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:233] non-default args: {'model_tag': 'nvidia/Gemma-4-31B-IT-NVFP4', 'model': 'nvidia/Gemma-4-31B-IT-NVFP4', 'max_model_len': 8192, 'quantization': 'modelopt'}
(APIServer pid=6771) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=6771) Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 13] Permission denied: '/home/christian/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4/.no_exist/61521ee452a45ae05ca99b3b19fb44df64d36824/preprocessor_config.json'
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:1678] Using max model len 8192
(APIServer pid=6771) INFO 04-19 15:32:29 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=6771) INFO 04-19 15:32:29 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=6771) WARNING 04-19 15:32:29 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=6771) INFO 04-19 15:32:29 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=6771) INFO 04-19 15:32:29 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:37 [core.py:105] Initializing a V1 LLM engine (v0.19.1) with config: model='nvidia/Gemma-4-31B-IT-NVFP4', speculative_config=None, tokenizer='nvidia/Gemma-4-31B-IT-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=6849) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.177.60:41027 backend=nccl
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=6849) INFO 04-19 15:32:42 [gpu_model_runner.py:4735] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4...
(EngineCore pid=6849) INFO 04-19 15:32:42 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=6849) INFO 04-19 15:32:42 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) INFO 04-19 15:32:42 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [gpu_model_runner.py:4818] Failed to load model - not enough GPU memory. Try lowering --gpu-memory-utilization to free memory for weights, increasing --tensor-parallel-size, or using --quantization. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more tips. (original error: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables))
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] EngineCore failed to start.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] Traceback (most recent call last): (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return func(*args, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] super().init( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.model_executor = executor_class(vllm_config) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return func(*args, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] 
self._init_executor() (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.driver_worker.load_model() (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.model_runner.load_model(load_dummy_weights=load_dummy_weights) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return func(*args, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] raise e (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.model = model_loader.load_model( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return func(*args, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File 
"/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] model = initialize_model( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return func(*args, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] model = model_class(vllm_config=vllm_config, prefix=prefix) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.language_model: Gemma4ForCausalLM = init_vllm_registered_model( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return initialize_model(vllm_config=vllm_config, prefix=prefix) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File 
"/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return func(*args, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] model = model_class(vllm_config=vllm_config, prefix=prefix) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.model = Gemma4Model( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] old_init(self, *args, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.start_layer, self.end_layer, self.layers = make_layers( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] + get_offloader().wrap_modules( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return list(modules_generator) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr> (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda> (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] lambda prefix: Gemma4DecoderLayer( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.self_attn = Gemma4Attention( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.o_proj = RowParallelLinear( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 
1436, in init (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] self.quant_method.create_weights( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] data=torch.empty( (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in torch_function (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] return func(*args, **kwargs) (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) (EngineCore pid=6849) Process EngineCore: (EngineCore pid=6849) Traceback (most recent call last): (EngineCore pid=6849) File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap (EngineCore pid=6849) self.run() (EngineCore pid=6849) File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run (EngineCore pid=6849) self._target(*self._args, **self._kwargs) (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core (EngineCore pid=6849) raise e (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core (EngineCore pid=6849) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) return func(*args, **kwargs) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in init (EngineCore pid=6849) super().init( (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in init (EngineCore pid=6849) self.model_executor = executor_class(vllm_config) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) return func(*args, **kwargs) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore 
pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in init (EngineCore pid=6849) self._init_executor() (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor (EngineCore pid=6849) self.driver_worker.load_model() (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model (EngineCore pid=6849) self.model_runner.load_model(load_dummy_weights=load_dummy_weights) (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) return func(*args, **kwargs) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model (EngineCore pid=6849) raise e (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model (EngineCore pid=6849) self.model = model_loader.load_model( (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) return func(*args, **kwargs) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model (EngineCore pid=6849) model = initialize_model( (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) return func(*args, **kwargs) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File 
"/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model (EngineCore pid=6849) model = model_class(vllm_config=vllm_config, prefix=prefix) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in init (EngineCore pid=6849) self.language_model: Gemma4ForCausalLM = init_vllm_registered_model( (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model (EngineCore pid=6849) return initialize_model(vllm_config=vllm_config, prefix=prefix) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=6849) return func(*args, **kwargs) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model (EngineCore pid=6849) model = model_class(vllm_config=vllm_config, prefix=prefix) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in init (EngineCore pid=6849) self.model = Gemma4Model( (EngineCore pid=6849) ^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in init (EngineCore pid=6849) old_init(self, *args, **kwargs) (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in init 
(EngineCore pid=6849) self.start_layer, self.end_layer, self.layers = make_layers( (EngineCore pid=6849) ^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers (EngineCore pid=6849) + get_offloader().wrap_modules( (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules (EngineCore pid=6849) return list(modules_generator) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr> (EngineCore pid=6849) layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda> (EngineCore pid=6849) lambda prefix: Gemma4DecoderLayer( (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in init (EngineCore pid=6849) self.self_attn = Gemma4Attention( (EngineCore pid=6849) ^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in init (EngineCore pid=6849) self.o_proj = RowParallelLinear( (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in init (EngineCore pid=6849) self.quant_method.create_weights( (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights (EngineCore 
pid=6849) data=torch.empty( (EngineCore pid=6849) ^^^^^^^^^^^^ (EngineCore pid=6849) File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in torch_function (EngineCore pid=6849) return func(*args, **kwargs) (EngineCore pid=6849) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=6849) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank0]:[W419 15:32:43.387655627 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. 
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=6771) Traceback (most recent call last):
(APIServer pid=6771)   File "/home/christian/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=6771)     sys.exit(main())
(APIServer pid=6771)              ^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=6771)     args.dispatch_function(args)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=6771)     uvloop.run(run_server(args))
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=6771)     return __asyncio.run(
(APIServer pid=6771)            ^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=6771)     return runner.run(main)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=6771)     return self._loop.run_until_complete(task)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=6771)     return await main
(APIServer pid=6771)            ^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 672, in run_server
(APIServer pid=6771)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 686, in run_server_worker
(APIServer pid=6771)     async with build_async_engine_client(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=6771)     async with build_async_engine_client_from_engine_args(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=6771)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=6771)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=6771)     return cls(
(APIServer pid=6771)            ^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=6771)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=6771)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=6771)     return AsyncMPClient(*client_args)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=6771)     super().__init__(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=6771)     with launch_core_engines(
(APIServer pid=6771)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=6771)     next(self.gen)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=6771)     wait_for_engine_startup(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=6771)     raise RuntimeError(
(APIServer pid=6771) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Root Cause
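
The issue title suspects that the weights are being allocated in full BF16 during model construction. A back-of-the-envelope check of that suspicion against the figures in the OOM message can be sketched as follows. This is an estimate under stated assumptions, not a confirmed diagnosis: the parameter count (31e9) is read off the "31B" in the model name, and the NVFP4 cost of about 4.5 bits per weight assumes 4-bit values plus one 8-bit scale per 16-element block. Real checkpoints usually keep some tensors in higher precision, so the NVFP4 figure should be read as a lower bound.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def weights_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given average bit width."""
    return num_params * bits_per_param / 8 / GIB


# Assumption: about 31e9 parameters, taken from the "31B" in the model name.
NUM_PARAMS = 31e9

# Figures copied from the OOM message in the trace.
GPU_CAPACITY_GIB = 31.35
FAILED_ALLOC_MIB = 84.00

bf16_gib = weights_gib(NUM_PARAMS, 16)
# Assumption: 4-bit values plus one 8-bit scale per 16-element block.
nvfp4_gib = weights_gib(NUM_PARAMS, 4 + 8 / 16)

# How many 2-byte (BF16) elements the failed 84 MiB allocation would hold.
bf16_elems = int(FAILED_ALLOC_MIB * MIB) // 2

print(f"BF16 weights  : {bf16_gib:.2f} GiB")
print(f"NVFP4 weights : {nvfp4_gib:.2f} GiB (lower bound)")
print(f"GPU capacity  : {GPU_CAPACITY_GIB:.2f} GiB")
print(f"84 MiB as BF16: {bf16_elems} elements")
```

Under these assumptions the BF16 footprint (roughly 58 GiB) cannot fit in 31.35 GiB, while the NVFP4 footprint (roughly 16 GiB) can. The trace reports 29.39 GiB already allocated by PyTorch while layers were still being constructed, which is well above the NVFP4 estimate. That is consistent with the suspicion in the title but does not prove it. Whether the 84 MiB request matches the BF16 shape of `o_proj` depends on the model config, which is not shown here.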

Full error trace

uv run vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --max-model-len 8192  
WARNING 04-19 15:32:24 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] 
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]   █▄█▀ █     █     █     █  model   nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] 
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:233] non-default args: {'model_tag': 'nvidia/Gemma-4-31B-IT-NVFP4', 'model': 'nvidia/Gemma-4-31B-IT-NVFP4', 'max_model_len': 8192, 'quantization': 'modelopt'}
(APIServer pid=6771) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=6771) Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 13] Permission denied: '/home/christian/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4/.no_exist/61521ee452a45ae05ca99b3b19fb44df64d36824/preprocessor_config.json'
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:1678] Using max model len 8192
(APIServer pid=6771) INFO 04-19 15:32:29 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=6771) INFO 04-19 15:32:29 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=6771) WARNING 04-19 15:32:29 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=6771) INFO 04-19 15:32:29 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=6771) INFO 04-19 15:32:29 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:37 [core.py:105] Initializing a V1 LLM engine (v0.19.1) with config: model='nvidia/Gemma-4-31B-IT-NVFP4', speculative_config=None, tokenizer='nvidia/Gemma-4-31B-IT-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=6849) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.177.60:41027 backend=nccl
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=6849) INFO 04-19 15:32:42 [gpu_model_runner.py:4735] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4...
(EngineCore pid=6849) INFO 04-19 15:32:42 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=6849) INFO 04-19 15:32:42 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) INFO 04-19 15:32:42 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [gpu_model_runner.py:4818] Failed to load model - not enough GPU memory. Try lowering --gpu-memory-utilization to free memory for weights, increasing --tensor-parallel-size, or using --quantization. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more tips. (original error: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables))
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] EngineCore failed to start.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     super().__init__(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self._init_executor()
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.driver_worker.load_model()
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     raise e
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model = model_loader.load_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = initialize_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.language_model: Gemma4ForCausalLM = init_vllm_registered_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model = Gemma4Model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                  ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     old_init(self, *args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                                                     ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     + get_offloader().wrap_modules(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return list(modules_generator)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda>
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     lambda prefix: Gemma4DecoderLayer(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                    ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.self_attn = Gemma4Attention(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                      ^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.o_proj = RowParallelLinear(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.quant_method.create_weights(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     data=torch.empty(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]          ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(EngineCore pid=6849) Process EngineCore:
(EngineCore pid=6849) Traceback (most recent call last):
(EngineCore pid=6849)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=6849)     self.run()
(EngineCore pid=6849)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=6849)     self._target(*self._args, **self._kwargs)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=6849)     raise e
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=6849)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=6849)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=6849)     super().__init__(
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=6849)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=6849)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=6849)     self._init_executor()
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=6849)     self.driver_worker.load_model()
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=6849)     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model
(EngineCore pid=6849)     raise e
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=6849)     self.model = model_loader.load_model(
(EngineCore pid=6849)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=6849)     model = initialize_model(
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in __init__
(EngineCore pid=6849)     self.language_model: Gemma4ForCausalLM = init_vllm_registered_model(
(EngineCore pid=6849)                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model
(EngineCore pid=6849)     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in __init__
(EngineCore pid=6849)     self.model = Gemma4Model(
(EngineCore pid=6849)                  ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in __init__
(EngineCore pid=6849)     old_init(self, *args, **kwargs)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in __init__
(EngineCore pid=6849)     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=6849)                                                     ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=6849)     + get_offloader().wrap_modules(
(EngineCore pid=6849)       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=6849)     return list(modules_generator)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=6849)     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=6849)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda>
(EngineCore pid=6849)     lambda prefix: Gemma4DecoderLayer(
(EngineCore pid=6849)                    ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in __init__
(EngineCore pid=6849)     self.self_attn = Gemma4Attention(
(EngineCore pid=6849)                      ^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in __init__
(EngineCore pid=6849)     self.o_proj = RowParallelLinear(
(EngineCore pid=6849)                   ^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=6849)     self.quant_method.create_weights(
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights
(EngineCore pid=6849)     data=torch.empty(
(EngineCore pid=6849)          ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W419 15:32:43.387655627 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=6771) Traceback (most recent call last):
(APIServer pid=6771)   File "/home/christian/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=6771)     sys.exit(main())
(APIServer pid=6771)              ^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=6771)     args.dispatch_function(args)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=6771)     uvloop.run(run_server(args))
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=6771)     return __asyncio.run(
(APIServer pid=6771)            ^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=6771)     return runner.run(main)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=6771)     return self._loop.run_until_complete(task)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=6771)     return await main
(APIServer pid=6771)            ^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 672, in run_server
(APIServer pid=6771)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 686, in run_server_worker
(APIServer pid=6771)     async with build_async_engine_client(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=6771)     async with build_async_engine_client_from_engine_args(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=6771)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=6771)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=6771)     return cls(
(APIServer pid=6771)            ^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=6771)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=6771)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=6771)     return AsyncMPClient(*client_args)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=6771)     super().__init__(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=6771)     with launch_core_engines(
(APIServer pid=6771)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=6771)     next(self.gen)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=6771)     wait_for_engine_startup(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=6771)     raise RuntimeError(
(APIServer pid=6771) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
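
The figures in the OOM message above can be sanity-checked with a little arithmetic. The sketch below is not part of the original report. It parses the sizes quoted in the message and asks how many elements an 84.00 MiB buffer holds at 2 bytes per element (BF16), and how large the same element count would be if packed two 4-bit values per byte. The names `parse_sizes`, `NOMINAL_PARAMS`, and `NVFP4_BYTES_PER_PARAM` are illustrative. The 31e9 parameter count is an assumption taken from the "31B" in the model name. The per-parameter cost assumes 4-bit values with one 8-bit scale per 16-element block and ignores any layers the checkpoint leaves unquantized.

```python
import re

# OOM text copied from the traceback above (trailing advice omitted).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total "
    "capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch "
    "memory, this process has 30.33 GiB memory in use. Of the allocated memory "
    "29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch "
    "but unallocated."
)

MIB = 1024 ** 2
GIB = 1024 ** 3
_UNIT_BYTES = {"MiB": MIB, "GiB": GIB}


def parse_sizes(message):
    """Return every '<number> MiB|GiB' figure in the message as bytes, in order."""
    pairs = re.findall(r"(\d+(?:\.\d+)?) (MiB|GiB)", message)
    return [float(number) * _UNIT_BYTES[unit] for number, unit in pairs]


requested, capacity, free, in_use, allocated, reserved = parse_sizes(OOM_MESSAGE)

# Element count of the failing buffer if it is BF16 (2 bytes per element).
elements_bf16 = int(requested // 2)

# Size of the same element count packed as two 4-bit values per byte.
fp4_mib = (elements_bf16 / 2) / MIB

# Whole-model estimates. ASSUMPTION: 31e9 parameters, from the model name only.
NOMINAL_PARAMS = 31e9
# ASSUMPTION: 0.5 byte per value plus one 1-byte scale per 16-value block.
NVFP4_BYTES_PER_PARAM = 0.5 + 1 / 16

bf16_gib = NOMINAL_PARAMS * 2 / GIB
nvfp4_gib = NOMINAL_PARAMS * NVFP4_BYTES_PER_PARAM / GIB

print(f"failing allocation: {requested / MIB:.2f} MiB = {elements_bf16:,} BF16 elements")
print(f"same elements packed as 4-bit: {fp4_mib:.2f} MiB")
print(f"allocated when it failed: {allocated / GIB:.2f} of {capacity / GIB:.2f} GiB")
print(f"rough full-model size: BF16 {bf16_gib:.1f} GiB, NVFP4 {nvfp4_gib:.1f} GiB")
```

Under these assumptions, a 4-bit copy of the weights would be expected to fit comfortably in 31.35 GiB, while a BF16 copy would not, so running out of memory at 29.39 GiB during layer construction is at least consistent with the suspicion in the issue title. The allocating frame is also in `vllm/model_executor/layers/linear.py` (`create_weights`, line 201) rather than in a ModelOpt-specific module, which may point the same way. Both observations are inferences from the log, not a confirmed diagnosis. The 309.08 MiB of reserved-but-unallocated memory is small next to the total, so the fragmentation hint in the message seems unlikely to be the main factor here.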

Fix Action

Fix / Workaround

============================== CPU Info

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 7800X3D 8-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 79%
CPU max MHz: 5053,3770
CPU min MHz: 426,1890
BogoMIPS: 8383,74
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
Virtualization: AMD-V
L1d cache: 256 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 8 MiB (8 instances)
L3 cache: 96 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Ghostwrite: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Old microcode: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsa: Mitigation; Clear CPU buffers
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Mitigation; IBPB before exit to userspace

Full error trace

uv run vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --max-model-len 8192  
WARNING 04-19 15:32:24 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] 
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]   █▄█▀ █     █     █     █  model   nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] 
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:233] non-default args: {'model_tag': 'nvidia/Gemma-4-31B-IT-NVFP4', 'model': 'nvidia/Gemma-4-31B-IT-NVFP4', 'max_model_len': 8192, 'quantization': 'modelopt'}
(APIServer pid=6771) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=6771) Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 13] Permission denied: '/home/christian/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4/.no_exist/61521ee452a45ae05ca99b3b19fb44df64d36824/preprocessor_config.json'
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:1678] Using max model len 8192
(APIServer pid=6771) INFO 04-19 15:32:29 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=6771) INFO 04-19 15:32:29 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=6771) WARNING 04-19 15:32:29 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=6771) INFO 04-19 15:32:29 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=6771) INFO 04-19 15:32:29 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:37 [core.py:105] Initializing a V1 LLM engine (v0.19.1) with config: model='nvidia/Gemma-4-31B-IT-NVFP4', speculative_config=None, tokenizer='nvidia/Gemma-4-31B-IT-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=6849) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.177.60:41027 backend=nccl
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=6849) INFO 04-19 15:32:42 [gpu_model_runner.py:4735] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4...
(EngineCore pid=6849) INFO 04-19 15:32:42 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=6849) INFO 04-19 15:32:42 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) INFO 04-19 15:32:42 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [gpu_model_runner.py:4818] Failed to load model - not enough GPU memory. Try lowering --gpu-memory-utilization to free memory for weights, increasing --tensor-parallel-size, or using --quantization. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more tips. (original error: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables))
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] EngineCore failed to start.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     super().__init__(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self._init_executor()
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.driver_worker.load_model()
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     raise e
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model = model_loader.load_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = initialize_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.language_model: Gemma4ForCausalLM = init_vllm_registered_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model = Gemma4Model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                  ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     old_init(self, *args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                                                     ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     + get_offloader().wrap_modules(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return list(modules_generator)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda>
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     lambda prefix: Gemma4DecoderLayer(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                    ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.self_attn = Gemma4Attention(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                      ^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.o_proj = RowParallelLinear(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.quant_method.create_weights(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     data=torch.empty(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]          ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(EngineCore pid=6849) Process EngineCore:
(EngineCore pid=6849) Traceback (most recent call last):
(EngineCore pid=6849)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=6849)     self.run()
(EngineCore pid=6849)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=6849)     self._target(*self._args, **self._kwargs)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=6849)     raise e
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=6849)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=6849)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=6849)     super().__init__(
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=6849)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=6849)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=6849)     self._init_executor()
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=6849)     self.driver_worker.load_model()
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=6849)     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model
(EngineCore pid=6849)     raise e
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=6849)     self.model = model_loader.load_model(
(EngineCore pid=6849)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=6849)     model = initialize_model(
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in __init__
(EngineCore pid=6849)     self.language_model: Gemma4ForCausalLM = init_vllm_registered_model(
(EngineCore pid=6849)                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model
(EngineCore pid=6849)     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in __init__
(EngineCore pid=6849)     self.model = Gemma4Model(
(EngineCore pid=6849)                  ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in __init__
(EngineCore pid=6849)     old_init(self, *args, **kwargs)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in __init__
(EngineCore pid=6849)     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=6849)                                                     ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=6849)     + get_offloader().wrap_modules(
(EngineCore pid=6849)       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=6849)     return list(modules_generator)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=6849)     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=6849)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda>
(EngineCore pid=6849)     lambda prefix: Gemma4DecoderLayer(
(EngineCore pid=6849)                    ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in __init__
(EngineCore pid=6849)     self.self_attn = Gemma4Attention(
(EngineCore pid=6849)                      ^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in __init__
(EngineCore pid=6849)     self.o_proj = RowParallelLinear(
(EngineCore pid=6849)                   ^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=6849)     self.quant_method.create_weights(
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights
(EngineCore pid=6849)     data=torch.empty(
(EngineCore pid=6849)          ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W419 15:32:43.387655627 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=6771) Traceback (most recent call last):
(APIServer pid=6771)   File "/home/christian/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=6771)     sys.exit(main())
(APIServer pid=6771)              ^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=6771)     args.dispatch_function(args)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=6771)     uvloop.run(run_server(args))
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=6771)     return __asyncio.run(
(APIServer pid=6771)            ^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=6771)     return runner.run(main)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=6771)     return self._loop.run_until_complete(task)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=6771)     return await main
(APIServer pid=6771)            ^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 672, in run_server
(APIServer pid=6771)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 686, in run_server_worker
(APIServer pid=6771)     async with build_async_engine_client(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=6771)     async with build_async_engine_client_from_engine_args(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=6771)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=6771)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=6771)     return cls(
(APIServer pid=6771)            ^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=6771)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=6771)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=6771)     return AsyncMPClient(*client_args)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=6771)     super().__init__(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=6771)     with launch_core_engines(
(APIServer pid=6771)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=6771)     next(self.gen)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=6771)     wait_for_engine_startup(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=6771)     raise RuntimeError(
(APIServer pid=6771) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
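
The numbers in the torch.OutOfMemoryError message above are easier to reason about once they are in one unit. The sketch below parses the message text, copied from the log, and derives how much of the process footprint lies outside the PyTorch allocator. The helper names (parse_oom, non_pytorch_mib) are hypothetical and are not part of vLLM or PyTorch.

```python
import re

# Message text copied from the torch.OutOfMemoryError line in the log above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 84.00 MiB. "
    "GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. "
    "Including non-PyTorch memory, this process has 30.33 GiB memory in use. "
    "Of the allocated memory 29.39 GiB is allocated by PyTorch, "
    "and 309.08 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Order in which the sizes appear in the message.
FIELDS = [
    "requested",
    "total",
    "free",
    "process_in_use",
    "allocated",
    "reserved_unallocated",
]


def parse_oom(message: str) -> dict[str, float]:
    """Return the six sizes in the message, converted to MiB."""
    sizes = [
        float(value) * UNIT_TO_MIB[unit]
        for value, unit in re.findall(r"(\d+(?:\.\d+)?) (MiB|GiB)", message)
    ]
    return dict(zip(FIELDS, sizes))


def non_pytorch_mib(sizes: dict[str, float]) -> float:
    """Memory held by the process outside the PyTorch caching allocator."""
    return (
        sizes["process_in_use"]
        - sizes["allocated"]
        - sizes["reserved_unallocated"]
    )


if __name__ == "__main__":
    sizes = parse_oom(OOM_MESSAGE)
    for name, mib in sizes.items():
        print(f"{name:>22}: {mib:10.2f} MiB")
    print(f"{'non_pytorch':>22}: {non_pytorch_mib(sizes):10.2f} MiB")
    print(f"allocated / total = {sizes['allocated'] / sizes['total']:.1%}")
```

Read this way, live PyTorch tensors already occupy 29.39 GiB of the 31.35 GiB card while the model is still being constructed, and only about 309 MiB is reserved but unallocated. That suggests the fragmentation hint in the message (PYTORCH_ALLOC_CONF=expandable_segments:True) is unlikely to be enough by itself, although the log does not prove this.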

Code Example

uv is set
==============================
        System Info
==============================
OS                           : Ubuntu 25.10 (x86_64)
GCC version                  : (Ubuntu 15.2.0-4ubuntu4) 15.2.0
Clang version                : Could not collect
CMake version                : version 3.31.6
Libc version                 : glibc-2.42

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Jan 14 2026, 19:35:58) [Clang 21.1.4 ] (64-bit runtime)
Python platform              : Linux-6.17.0-22-generic-x86_64-with-glibc2.42
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.4.131
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               AuthenticAMD
Model name:                              AMD Ryzen 7 7800X3D 8-Core Processor
CPU family:                              25
Model:                                   97
Thread(s) per core:                      2
Core(s) per socket:                      8
Socket(s):                               1
Stepping:                                2
Frequency boost:                         enabled
CPU(s) scaling MHz:                      79%
CPU max MHz:                             5053,3770
CPU min MHz:                             426,1890
BogoMIPS:                                8383,74
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
Virtualization:                          AMD-V
L1d cache:                               256 KiB (8 instances)
L1i cache:                               256 KiB (8 instances)
L2 cache:                                8 MiB (8 instances)
L3 cache:                                96 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Mitigation; Clear CPU buffers
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.5.4
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_christian

---

vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --max-model-len 8192
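
A back-of-envelope estimate helps frame the suspicion in the issue title for the command above. The sketch below compares the weight footprint of a nominal 31B-parameter model in BF16 and in NVFP4. It rests on loud assumptions: the parameter count is taken from the model name only, every weight is treated as quantized, and NVFP4 is costed as 4-bit values plus one 1-byte FP8 scale per 16-element block. Per-tensor scales, embeddings, the vision tower, and any layers the checkpoint leaves unquantized are ignored.

```python
GIB = 1024 ** 3

# Assumption: nominal parameter count taken from the "31B" in the model name.
NUM_PARAMS = 31e9

BF16_BYTES_PER_PARAM = 2.0
# Assumption: 4-bit values (0.5 byte) plus one 1-byte FP8 scale
# shared by each 16-element block.
NVFP4_BYTES_PER_PARAM = 0.5 + 1.0 / 16.0

# Figures copied from the OOM message in the log.
GPU_TOTAL_GIB = 31.35
ALLOCATED_AT_FAILURE_GIB = 29.39


def weight_gib(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in GiB for a given per-parameter cost."""
    return num_params * bytes_per_param / GIB


def bf16_params_equivalent(allocated_gib: float) -> float:
    """How many BF16 parameters would fill the given allocation."""
    return allocated_gib * GIB / BF16_BYTES_PER_PARAM


if __name__ == "__main__":
    bf16 = weight_gib(NUM_PARAMS, BF16_BYTES_PER_PARAM)
    nvfp4 = weight_gib(NUM_PARAMS, NVFP4_BYTES_PER_PARAM)
    print(f"BF16 weights  : {bf16:6.2f} GiB")
    print(f"NVFP4 weights : {nvfp4:6.2f} GiB")
    print(f"GPU capacity  : {GPU_TOTAL_GIB:6.2f} GiB")
    print(f"Allocated at failure: {ALLOCATED_AT_FAILURE_GIB:6.2f} GiB")
    equiv = bf16_params_equivalent(ALLOCATED_AT_FAILURE_GIB)
    print(f"Equivalent BF16 parameters: {equiv / 1e9:.1f}B")
```

Under these assumptions the fully quantized weights would need roughly 16 GiB, well inside the card's 31.35 GiB, whereas BF16 weights would need roughly 58 GiB. The 29.39 GiB already allocated when construction failed exceeds the all-NVFP4 estimate, which is consistent with at least some layers being allocated at a wider dtype during init. It does not establish which layers, or why.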

---

uv run vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --max-model-len 8192  
WARNING 04-19 15:32:24 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] 
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]   █▄█▀ █     █     █     █  model   nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] 
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:233] non-default args: {'model_tag': 'nvidia/Gemma-4-31B-IT-NVFP4', 'model': 'nvidia/Gemma-4-31B-IT-NVFP4', 'max_model_len': 8192, 'quantization': 'modelopt'}
(APIServer pid=6771) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=6771) Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 13] Permission denied: '/home/christian/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4/.no_exist/61521ee452a45ae05ca99b3b19fb44df64d36824/preprocessor_config.json'
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:1678] Using max model len 8192
(APIServer pid=6771) INFO 04-19 15:32:29 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=6771) INFO 04-19 15:32:29 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=6771) WARNING 04-19 15:32:29 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=6771) INFO 04-19 15:32:29 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=6771) INFO 04-19 15:32:29 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:37 [core.py:105] Initializing a V1 LLM engine (v0.19.1) with config: model='nvidia/Gemma-4-31B-IT-NVFP4', speculative_config=None, tokenizer='nvidia/Gemma-4-31B-IT-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=6849) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.177.60:41027 backend=nccl
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=6849) INFO 04-19 15:32:42 [gpu_model_runner.py:4735] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4...
(EngineCore pid=6849) INFO 04-19 15:32:42 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=6849) INFO 04-19 15:32:42 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) INFO 04-19 15:32:42 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [gpu_model_runner.py:4818] Failed to load model - not enough GPU memory. Try lowering --gpu-memory-utilization to free memory for weights, increasing --tensor-parallel-size, or using --quantization. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more tips. (original error: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables))
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] EngineCore failed to start.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     super().__init__(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self._init_executor()
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.driver_worker.load_model()
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     raise e
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model = model_loader.load_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = initialize_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.language_model: Gemma4ForCausalLM = init_vllm_registered_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model = Gemma4Model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                  ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     old_init(self, *args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                                                     ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     + get_offloader().wrap_modules(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return list(modules_generator)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda>
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     lambda prefix: Gemma4DecoderLayer(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                    ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.self_attn = Gemma4Attention(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                      ^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.o_proj = RowParallelLinear(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.quant_method.create_weights(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     data=torch.empty(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]          ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(EngineCore pid=6849) Process EngineCore:
(EngineCore pid=6849) Traceback (most recent call last):
(EngineCore pid=6849)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=6849)     self.run()
(EngineCore pid=6849)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=6849)     self._target(*self._args, **self._kwargs)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=6849)     raise e
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=6849)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=6849)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=6849)     super().__init__(
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=6849)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=6849)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=6849)     self._init_executor()
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=6849)     self.driver_worker.load_model()
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=6849)     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model
(EngineCore pid=6849)     raise e
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=6849)     self.model = model_loader.load_model(
(EngineCore pid=6849)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=6849)     model = initialize_model(
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in __init__
(EngineCore pid=6849)     self.language_model: Gemma4ForCausalLM = init_vllm_registered_model(
(EngineCore pid=6849)                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model
(EngineCore pid=6849)     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in __init__
(EngineCore pid=6849)     self.model = Gemma4Model(
(EngineCore pid=6849)                  ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in __init__
(EngineCore pid=6849)     old_init(self, *args, **kwargs)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in __init__
(EngineCore pid=6849)     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=6849)                                                     ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=6849)     + get_offloader().wrap_modules(
(EngineCore pid=6849)       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=6849)     return list(modules_generator)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=6849)     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=6849)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda>
(EngineCore pid=6849)     lambda prefix: Gemma4DecoderLayer(
(EngineCore pid=6849)                    ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in __init__
(EngineCore pid=6849)     self.self_attn = Gemma4Attention(
(EngineCore pid=6849)                      ^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in __init__
(EngineCore pid=6849)     self.o_proj = RowParallelLinear(
(EngineCore pid=6849)                   ^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=6849)     self.quant_method.create_weights(
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights
(EngineCore pid=6849)     data=torch.empty(
(EngineCore pid=6849)          ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W419 15:32:43.387655627 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=6771) Traceback (most recent call last):
(APIServer pid=6771)   File "/home/christian/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=6771)     sys.exit(main())
(APIServer pid=6771)              ^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=6771)     args.dispatch_function(args)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=6771)     uvloop.run(run_server(args))
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=6771)     return __asyncio.run(
(APIServer pid=6771)            ^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=6771)     return runner.run(main)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=6771)     return self._loop.run_until_complete(task)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=6771)     return await main
(APIServer pid=6771)            ^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 672, in run_server
(APIServer pid=6771)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 686, in run_server_worker
(APIServer pid=6771)     async with build_async_engine_client(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=6771)     async with build_async_engine_client_from_engine_args(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=6771)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=6771)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=6771)     return cls(
(APIServer pid=6771)            ^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=6771)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=6771)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=6771)     return AsyncMPClient(*client_args)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=6771)     super().__init__(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=6771)     with launch_core_engines(
(APIServer pid=6771)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=6771)     next(self.gen)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=6771)     wait_for_engine_startup(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=6771)     raise RuntimeError(
(APIServer pid=6771) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
uv is set
==============================
        System Info
==============================
OS                           : Ubuntu 25.10 (x86_64)
GCC version                  : (Ubuntu 15.2.0-4ubuntu4) 15.2.0
Clang version                : Could not collect
CMake version                : version 3.31.6
Libc version                 : glibc-2.42

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Jan 14 2026, 19:35:58) [Clang 21.1.4 ] (64-bit runtime)
Python platform              : Linux-6.17.0-22-generic-x86_64-with-glibc2.42
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.4.131
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               AuthenticAMD
Model name:                              AMD Ryzen 7 7800X3D 8-Core Processor
CPU family:                              25
Model:                                   97
Thread(s) per core:                      2
Core(s) per socket:                      8
Socket(s):                               1
Stepping:                                2
Frequency boost:                         enabled
CPU(s) scaling MHz:                      79%
CPU max MHz:                             5053,3770
CPU min MHz:                             426,1890
BogoMIPS:                                8383,74
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
Virtualization:                          AMD-V
L1d cache:                               256 KiB (8 instances)
L1i cache:                               256 KiB (8 instances)
L2 cache:                                8 MiB (8 instances)
L3 cache:                                96 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Mitigation; Clear CPU buffers
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.5.4
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_christian
</details>

🐛 Describe the bug

When trying to run

vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --max-model-len 8192

I get an OOM error during weight creation (create_weights()) at model initialization. At the time of the crash, PyTorch has already allocated 29.39 GiB, far more than a 31B model at true NVFP4 should require (~15-16 GiB). The weights appear to be allocated at a higher precision than their quantized size, possibly BF16, though I am not experienced enough to say this with certainty.
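The size estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a measurement: it assumes a nominal 31e9 parameters (taken from the model name), that every weight is quantized, and that NVFP4 stores one 1-byte FP8 scale per block of 16 values. Embeddings, the vision tower, and any layers excluded from quantization are ignored.

```python
GIB = 1024 ** 3


def weight_footprint_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the weight tensors in GiB."""
    return num_params * bytes_per_param / GIB


NUM_PARAMS = 31e9  # nominal count from the "31B" model name (assumption)

# BF16 stores 2 bytes per parameter.
bf16_gib = weight_footprint_gib(NUM_PARAMS, 2.0)

# NVFP4: 4-bit values (0.5 byte) plus one 1-byte FP8 scale per 16-value block.
# The block size of 16 is an assumption about the checkpoint format.
nvfp4_gib = weight_footprint_gib(NUM_PARAMS, 0.5 + 1.0 / 16)

print(f"BF16 : {bf16_gib:.1f} GiB")
print(f"NVFP4: {nvfp4_gib:.1f} GiB")
```

Under these assumptions the BF16 figure is well above the card's 31.35 GiB capacity, while the NVFP4 figure is roughly half of it. Running out of memory at 29.39 GiB is therefore consistent with the layers being allocated at BF16 size partway through construction, although it does not prove it.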

Crash location
linear.py:201 create_weights() → torch.empty(...) ← OOM here
← RowParallelLinear.__init__ (linear.py:1436) [o_proj]
← Gemma4Attention.__init__ (gemma4.py:309)
← Gemma4DecoderLayer.__init__ (gemma4.py:472)
← Gemma4Model.__init__ / make_layers (gemma4.py:931)
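The innermost frame sits in layers/linear.py rather than in a quantization module, which may indicate that this o_proj layer was built with an unquantized linear method. One way to probe that is to check which element width the failed 84 MiB request corresponds to. The sketch below uses a hypothetical o_proj shape of 8192 x 5376; these dimensions are an assumption for illustration and were not read from the model config.

```python
MIB = 1024 ** 2

failed_alloc_bytes = 84 * MIB  # from "Tried to allocate 84.00 MiB"

# Hypothetical o_proj weight shape (assumption, not taken from config.json).
rows, cols = 8192, 5376

bf16_bytes = rows * cols * 2         # 2 bytes per BF16 element
packed_fp4_bytes = rows * cols // 2  # two 4-bit values packed per byte

print("BF16 size matches failed allocation:", bf16_bytes == failed_alloc_bytes)
print("Packed FP4 size in MiB:", packed_fp4_bytes / MIB)
```

If the real layer has this shape, the request matches a BF16 tensor exactly and is four times larger than a packed 4-bit tensor would be. Checking the actual dimensions in the checkpoint's config would confirm or rule this out.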

Key error
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB.
GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free.
Including non-PyTorch memory, this process has 30.33 GiB memory in use.
Of the allocated memory 29.39 GiB is allocated by PyTorch.
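To see whether the fragmentation hint in the error message applies here, the figures can be pulled out of the message and compared. This is a small sketch using only the standard library; the regular expressions are written against the wording of this particular message and may need adjusting for other PyTorch versions.

```python
import re

OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 84.00 MiB. "
    "GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. "
    "Including non-PyTorch memory, this process has 30.33 GiB memory in use. "
    "Of the allocated memory 29.39 GiB is allocated by PyTorch, "
    "and 309.08 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}

PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "capacity": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "in_use": r"this process has ([\d.]+) (MiB|GiB) memory in use",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def parse_oom(message: str) -> dict[str, float]:
    """Extract the memory figures from an OOM message, normalized to GiB."""
    stats = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            stats[key] = float(value) * UNIT_TO_GIB[unit]
    return stats


stats = parse_oom(OOM_MESSAGE)
# Share of process memory that is reserved by the allocator but unused.
fragmentation = stats["reserved_unallocated"] / stats["in_use"]
print(f"reserved but unallocated: {fragmentation:.1%} of process memory")
```

The reserved-but-unallocated share is only about 1% of the memory in use, so setting PYTORCH_ALLOC_CONF=expandable_segments:True is unlikely to change the outcome. Nearly all of the memory is held by live allocations.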

Expected behavior
The 31B NVFP4 model should allocate approximately 15-16 GiB of weight tensors during initialization, fitting comfortably within the RTX 5090's 32 GiB of VRAM. The RTX 5090 is a Blackwell GPU and natively supports FP4 on its Tensor Cores.

Notes

  • vLLM correctly identifies the quantization as modelopt_fp4 and selects vFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM — the OOM happens in weight allocation before any inference
  • tensor_parallel_size=1 (single GPU)
  • Reproduced across v0.19.1 (local install) and vllm/vllm-openai:cu130-nightly (Docker)
  • The model card for nvidia/Gemma-4-31B-IT-NVFP4 references vLLM as the supported runtime and targets Blackwell architecture
  • Other frequently recommended options such as --enforce-eager, --gpu-memory-utilization, and --max-model-len did not change the outcome. The reported figure (29.39 GiB of VRAM allocated by PyTorch) was identical in every run.
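Since the allocated size never changes with these flags, one thing worth checking is whether the decoder layers are being matched against the checkpoint's list of modules excluded from quantization. The sketch below is purely illustrative: the config layout (a "quantization" section with an "exclude_modules" list of glob patterns), the sample patterns, and the module names are all assumptions, and `is_excluded` is a hypothetical helper, not vLLM's actual matching logic.

```python
from fnmatch import fnmatchcase

# Hypothetical excerpt in the layout ModelOpt exports are assumed to use.
# The real file shipped with this checkpoint may differ.
sample_quant_config = {
    "quantization": {
        "quant_algo": "NVFP4",
        "exclude_modules": ["lm_head", "model.vision_tower*"],
    }
}


def is_excluded(module_name: str, quant_config: dict) -> bool:
    """Return True if module_name matches any exclusion glob pattern."""
    patterns = quant_config.get("quantization", {}).get("exclude_modules", [])
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


# Hypothetical module names, for illustration only.
candidates = [
    "model.language_model.layers.0.self_attn.o_proj",
    "model.vision_tower.encoder.layers.0.mlp.fc1",
    "lm_head",
]
results = {name: is_excluded(name, sample_quant_config) for name in candidates}
for name, excluded in results.items():
    print(f"{name}: {'unquantized' if excluded else 'quantized'}")
```

Running the same kind of check against the real exclusion list and the prefixes vLLM passes to each layer would show whether language-model projections such as o_proj are unintentionally treated as excluded.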

Full error trace

uv run vllm serve --model nvidia/Gemma-4-31B-IT-NVFP4 --quantization modelopt --max-model-len 8192  
WARNING 04-19 15:32:24 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] 
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]   █▄█▀ █     █     █     █  model   nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:299] 
(APIServer pid=6771) INFO 04-19 15:32:24 [utils.py:233] non-default args: {'model_tag': 'nvidia/Gemma-4-31B-IT-NVFP4', 'model': 'nvidia/Gemma-4-31B-IT-NVFP4', 'max_model_len': 8192, 'quantization': 'modelopt'}
(APIServer pid=6771) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=6771) Could not cache non-existence of file. Will ignore error and continue. Error: [Errno 13] Permission denied: '/home/christian/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4/.no_exist/61521ee452a45ae05ca99b3b19fb44df64d36824/preprocessor_config.json'
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=6771) INFO 04-19 15:32:29 [model.py:1678] Using max model len 8192
(APIServer pid=6771) INFO 04-19 15:32:29 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=6771) INFO 04-19 15:32:29 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=6771) WARNING 04-19 15:32:29 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=6771) INFO 04-19 15:32:29 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=6771) INFO 04-19 15:32:29 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:37 [core.py:105] Initializing a V1 LLM engine (v0.19.1) with config: model='nvidia/Gemma-4-31B-IT-NVFP4', speculative_config=None, tokenizer='nvidia/Gemma-4-31B-IT-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=6849) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.177.60:41027 backend=nccl
(EngineCore pid=6849) INFO 04-19 15:32:41 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=6849) INFO 04-19 15:32:42 [gpu_model_runner.py:4735] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4...
(EngineCore pid=6849) INFO 04-19 15:32:42 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=6849) INFO 04-19 15:32:42 [compilation.py:292] Enabled custom fusions: act_quant
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) INFO 04-19 15:32:42 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore pid=6849) INFO 04-19 15:32:42 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [gpu_model_runner.py:4818] Failed to load model - not enough GPU memory. Try lowering --gpu-memory-utilization to free memory for weights, increasing --tensor-parallel-size, or using --quantization. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more tips. (original error: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables))
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] EngineCore failed to start.
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     super().__init__(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self._init_executor()
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.driver_worker.load_model()
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     raise e
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model = model_loader.load_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = initialize_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.language_model: Gemma4ForCausalLM = init_vllm_registered_model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.model = Gemma4Model(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                  ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     old_init(self, *args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                                                     ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     + get_offloader().wrap_modules(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return list(modules_generator)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda>
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     lambda prefix: Gemma4DecoderLayer(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                    ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.self_attn = Gemma4Attention(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                      ^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.o_proj = RowParallelLinear(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     self.quant_method.create_weights(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     data=torch.empty(
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]          ^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) ERROR 04-19 15:32:42 [core.py:1108] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(EngineCore pid=6849) Process EngineCore:
(EngineCore pid=6849) Traceback (most recent call last):
(EngineCore pid=6849)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=6849)     self.run()
(EngineCore pid=6849)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=6849)     self._target(*self._args, **self._kwargs)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=6849)     raise e
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=6849)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=6849)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=6849)     super().__init__(
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=6849)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=6849)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=6849)     self._init_executor()
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=6849)     self.driver_worker.load_model()
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=6849)     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4819, in load_model
(EngineCore pid=6849)     raise e
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=6849)     self.model = model_loader.load_model(
(EngineCore pid=6849)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=6849)     model = initialize_model(
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 947, in __init__
(EngineCore pid=6849)     self.language_model: Gemma4ForCausalLM = init_vllm_registered_model(
(EngineCore pid=6849)                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 379, in init_vllm_registered_model
(EngineCore pid=6849)     return initialize_model(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=6849)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=6849)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1439, in __init__
(EngineCore pid=6849)     self.model = Gemma4Model(
(EngineCore pid=6849)                  ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 379, in __init__
(EngineCore pid=6849)     old_init(self, *args, **kwargs)
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 931, in __init__
(EngineCore pid=6849)     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=6849)                                                     ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=6849)     + get_offloader().wrap_modules(
(EngineCore pid=6849)       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=6849)     return list(modules_generator)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=6849)     layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=6849)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 933, in <lambda>
(EngineCore pid=6849)     lambda prefix: Gemma4DecoderLayer(
(EngineCore pid=6849)                    ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 472, in __init__
(EngineCore pid=6849)     self.self_attn = Gemma4Attention(
(EngineCore pid=6849)                      ^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 309, in __init__
(EngineCore pid=6849)     self.o_proj = RowParallelLinear(
(EngineCore pid=6849)                   ^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1436, in __init__
(EngineCore pid=6849)     self.quant_method.create_weights(
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 201, in create_weights
(EngineCore pid=6849)     data=torch.empty(
(EngineCore pid=6849)          ^^^^^^^^^^^^
(EngineCore pid=6849)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
(EngineCore pid=6849)     return func(*args, **kwargs)
(EngineCore pid=6849)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=6849) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 31.35 GiB of which 93.06 MiB is free. Including non-PyTorch memory, this process has 30.33 GiB memory in use. Of the allocated memory 29.39 GiB is allocated by PyTorch, and 309.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W419 15:32:43.387655627 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=6771) Traceback (most recent call last):
(APIServer pid=6771)   File "/home/christian/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=6771)     sys.exit(main())
(APIServer pid=6771)              ^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=6771)     args.dispatch_function(args)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=6771)     uvloop.run(run_server(args))
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=6771)     return __asyncio.run(
(APIServer pid=6771)            ^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=6771)     return runner.run(main)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=6771)     return self._loop.run_until_complete(task)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=6771)     return await main
(APIServer pid=6771)            ^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 672, in run_server
(APIServer pid=6771)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 686, in run_server_worker
(APIServer pid=6771)     async with build_async_engine_client(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=6771)     async with build_async_engine_client_from_engine_args(
(APIServer pid=6771)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=6771)     return await anext(self.gen)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=6771)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=6771)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=6771)     return cls(
(APIServer pid=6771)            ^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=6771)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=6771)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=6771)     return AsyncMPClient(*client_args)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=6771)     return func(*args, **kwargs)
(APIServer pid=6771)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=6771)     super().__init__(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=6771)     with launch_core_engines(
(APIServer pid=6771)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=6771)   File "/home/christian/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=6771)     next(self.gen)
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=6771)     wait_for_engine_startup(
(APIServer pid=6771)   File "/home/christian/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=6771)     raise RuntimeError(
(APIServer pid=6771) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}


extent analysis

TL;DR

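A low-cost first step is to set the PYTORCH_ALLOC_CONF environment variable to expandable_segments:True, which reduces fragmentation in PyTorch's CUDA caching allocator. If the weights really are being materialized at a higher precision than NVFP4 during loading, this setting alone will not be enough.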

Guidance

  1. Memory Fragmentation: The error message reports that PyTorch already holds 29.39 GiB when an 84.00 MiB allocation fails. One possible cause is fragmentation, where the remaining free memory is split into small, non-contiguous chunks. Note, however, that 29.39 GiB is already close to the 32 GiB capacity of the card, so the total footprint may simply be too large.
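  2. Environment Variable: Setting PYTORCH_ALLOC_CONF=expandable_segments:True lets the caching allocator grow existing segments instead of reserving new fixed-size ones, which can reduce fragmentation. It must be set in the environment before the vLLM process starts.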
  3. Model Size: The quantized checkpoint is approximately 15-16 GiB, which should fit within the 32 GiB of VRAM on the RTX 5090. The observed allocation is nearly twice that, which suggests a problem in how the model is loaded, for example weights being held at a higher precision during initialization, as the issue title suspects.
  4. Quantization: The model uses modelopt_fp4 quantization, which should reduce memory usage. The error nevertheless occurs during weight allocation, before any inference is performed, so the quantized format is not yet delivering its expected savings at that point.
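
The size argument in items 3 and 4 can be checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes 31 billion parameters (read off the model name), and it assumes every weight is quantized uniformly, which real checkpoints usually are not, since embeddings and some layers often stay in BF16. NVFP4 is modeled as 4-bit values plus one 8-bit scale per block of 16 values, so about 4.5 bits per weight. The helper name `weight_gib` is made up for this illustration.

```python
GIB = 2**30

def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Estimated weight storage in GiB for a given average bit width."""
    return num_params * bits_per_weight / 8 / GIB

# Assumption: parameter count taken from the "31B" in the model name.
NUM_PARAMS = 31e9
VRAM_GIB = 32  # RTX 5090, as stated in the issue

# Assumption: 4-bit values plus one 8-bit scale per 16-value block.
nvfp4_gib = weight_gib(NUM_PARAMS, 4 + 8 / 16)
bf16_gib = weight_gib(NUM_PARAMS, 16)

print(f"NVFP4 weights: ~{nvfp4_gib:.1f} GiB (fits in {VRAM_GIB} GiB: {nvfp4_gib < VRAM_GIB})")
print(f"BF16 weights:  ~{bf16_gib:.1f} GiB (fits in {VRAM_GIB} GiB: {bf16_gib < VRAM_GIB})")
```

Under these assumptions the quantized weights leave roughly half the card free, while a full BF16 copy would not fit at all. An allocation near 29 GiB during loading therefore sits between the two, which is consistent with part of the model being held at higher precision, though it does not prove it.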

Example

The workaround is an environment variable that is set before vLLM is launched; no change to application code is required.
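
As a minimal sketch of how the variable might be applied, the snippet below builds the launch command and environment from the values reported in the issue. The function name `build_launch` is hypothetical. The model is passed as a positional argument, as the deprecation warning in the log recommends, and the process is not actually started here.

```python
import os
import shlex

def build_launch(model="nvidia/Gemma-4-31B-IT-NVFP4", max_model_len=8192, base_env=None):
    """Return (cmd, env) for launching vLLM with expandable segments enabled."""
    # Copy the environment so the caller's os.environ is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
    cmd = [
        "vllm", "serve", model,
        "--quantization", "modelopt",
        "--max-model-len", str(max_model_len),
    ]
    return cmd, env

cmd, env = build_launch()
# Show the equivalent shell invocation for inspection.
print("PYTORCH_ALLOC_CONF=" + env["PYTORCH_ALLOC_CONF"], shlex.join(cmd))
# To start the server: subprocess.run(cmd, env=env, check=True)
```

If vLLM is run through uv, as in the original report, prepend `"uv", "run"` to the command list. The same effect can be had by exporting the variable in the shell before running the original command.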

Notes

  • PYTORCH_ALLOC_CONF is read by PyTorch's allocator only; older PyTorch releases read the same setting from PYTORCH_CUDA_ALLOC_CONF.
  • The issue may be specific to vLLM's loading path for the Gemma4 model with modelopt quantization and may not reproduce with other models or frameworks.
  • The allocator setting is a workaround. The root cause, namely why the allocation reaches roughly 29 GiB for a 15-16 GiB checkpoint, still needs investigation.

Recommendation

Try the workaround first: set PYTORCH_ALLOC_CONF=expandable_segments:True and relaunch. If the out-of-memory error persists, fragmentation is probably not the main cause, and the weight-loading path should be examined to see whether the weights are being materialized at a higher precision than NVFP4.
