vllm - How to fix [Bug]: 0.17.1 - vllm serve deepseek-ai/DeepSeek-OCR-2 on H100 crashes during Capturing CUDA graphs (decode, FULL) [9 comments, 5 participants]

GitHub stats

vllm-project/vllm#37451 (fetched 2026-04-08 00:58:35)
Comments: 9
Participants: 5
Timeline events: 10 (commented ×9, labeled ×1)
Reactions: 0

Error Message

Key lines from the startup log (the complete log is reproduced under Root Cause):

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:02<00:00, 21.63it/s]
Capturing CUDA graphs (decode, FULL):   0%|          | 0/51 [00:00<?, ?it/s]
(APIServer pid=1016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
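
The engine config in the log lists `cudagraph_capture_sizes` from 1 to 512, and both capture phases report 51 steps. As a quick sanity check, the sketch below rebuilds that list from its apparent pattern (1, 2, 4, then multiples of 8 up to 256, then multiples of 16 up to 512) and confirms that it has 51 entries. The stepping pattern is inferred from the logged values only; it is not taken from vLLM's source.

```python
# Capture sizes copied from the 'cudagraph_capture_sizes' entry in the logged config.
logged_sizes = [
    1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120,
    128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232,
    240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432,
    448, 464, 480, 496, 512,
]

# Inferred pattern (an assumption based on the values above):
# 1, 2, 4, then steps of 8 up to 256, then steps of 16 up to 512.
rebuilt_sizes = [1, 2, 4] + list(range(8, 257, 8)) + list(range(272, 513, 16))

print("sizes match the inferred pattern:", rebuilt_sizes == logged_sizes)
print("number of capture sizes:", len(logged_sizes))
print("largest capture size:", max(logged_sizes))
```

The count lines up with the `51/51` PIECEWISE progress bar and the `0/51` FULL decode progress bar, so the FULL phase appears to have failed before finishing any capture.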

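Root Cause

The captured log does not show the underlying exception. The only traceback comes from the API server process (pid 1016), which reports "Engine core initialization failed" with an empty list of failed core processes; the engine core (pid 1315) logs no error of its own. The last activity before the failure is the start of FULL decode CUDA graph capture (0/51), right after the PIECEWISE capture completed (51/51).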

(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:238] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor']}
(APIServer pid=1016) INFO 03-18 15:30:24 [model.py:531] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=1016) INFO 03-18 15:30:24 [model.py:1554] Using max model len 8192
(APIServer pid=1016) INFO 03-18 15:30:24 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1016) INFO 03-18 15:30:24 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=1016) /usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(APIServer pid=1016)   warnings.warn(
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:34 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=1315) /usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(EngineCore_DP0 pid=1315)   warnings.warn(
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:35 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.15.0.2:53461 backend=nccl
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:35 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:38 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:38 [gpu_model_runner.py:4281] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [flash_attn.py:587] Using FlashAttention version 3
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=1315) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=1315) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
(EngineCore_DP0 pid=1315)
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:41 [default_loader.py:293] Loading weights took 1.67 seconds
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:41 [gpu_model_runner.py:4364] Model loading took 6.33 GiB memory and 2.390824 seconds
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:42 [gpu_model_runner.py:5280] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [decorators.py:465] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/8ce8d1a75e9ad02dbca24b480c233d7361a1f20f776e837dbc034337035d5ca8/rank_0_0/model
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/e3f9d7a251/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:976] Dynamo bytecode transform time: 1.07 s
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:350] Cache the graph of compile range (1, 8192) for later use
(EngineCore_DP0 pid=1315) WARNING 03-18 15:30:45 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [backends.py:366] Compiling a graph for compile range (1, 8192) takes 1.89 s
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [monitor.py:35] torch.compile takes 3.08 s in total
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [gpu_worker.py:424] Available KV cache memory: 62.69 GiB
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1314] GPU KV cache size: 1,095,648 tokens
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1319] Maximum concurrency for 8,192 tokens per request: 133.75x
(EngineCore_DP0 pid=1315) 2026-03-18 15:30:47,007 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1315) 2026-03-18 15:30:47,024 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:02<00:00, 21.63it/s]
Capturing CUDA graphs (decode, FULL):   0%|                                                                                                                                                                                                                          | 0/51 [00:00<?, ?it/s](APIServer pid=1016) Traceback (most recent call last):
(APIServer pid=1016)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1016)     sys.exit(main())
(APIServer pid=1016)              ^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1016)     args.dispatch_function(args)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1016)     uvloop.run(run_server(args))
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1016)     return __asyncio.run(
(APIServer pid=1016)            ^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1016)     return runner.run(main)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1016)     return self._loop.run_until_complete(task)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1016)     return await main
(APIServer pid=1016)            ^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1016)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1016)     async with build_async_engine_client(
(APIServer pid=1016)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1016)     return await anext(self.gen)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1016)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1016)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1016)     return await anext(self.gen)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1016)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1016)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1016)     return cls(
(APIServer pid=1016)            ^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1016)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1016)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1016)     return func(*args, **kwargs)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=1016)     return AsyncMPClient(*client_args)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1016)     return func(*args, **kwargs)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=1016)     super().__init__(
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=1016)     with launch_core_engines(
(APIServer pid=1016)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1016)     next(self.gen)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=1016)     wait_for_engine_startup(
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=1016)     raise RuntimeError(
(APIServer pid=1016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
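
To run this kind of check on a longer log, the sketch below scans log text for the CUDA graph capture progress bars and for error lines tagged with a process name. The sample lines are copied from the log above with the progress bars shortened; the `triage` helper and its regular expressions are illustrative and not part of vLLM.

```python
import re
from collections import defaultdict

# A few lines copied from the log above (progress bars shortened).
# In practice, read the full log from a file instead.
SAMPLE_LOG = """\
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [gpu_worker.py:424] Available KV cache memory: 62.69 GiB
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:02<00:00, 21.63it/s]
Capturing CUDA graphs (decode, FULL):   0%|          | 0/51 [00:00<?, ?it/s](APIServer pid=1016) Traceback (most recent call last):
(APIServer pid=1016)     raise RuntimeError(
(APIServer pid=1016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
"""

# "(APIServer pid=1016) message" or "(EngineCore_DP0 pid=1315) message"
PROC_TAG = re.compile(r"\((APIServer|EngineCore_DP\d+) pid=\d+\)\s?(.*)")
# "Capturing CUDA graphs (<phase>): <pct>%|<bar>| <done>/<total>"
CAPTURE = re.compile(
    r"Capturing CUDA graphs \(([^)]*)\):\s+(\d+)%\|[^|]*\|\s*(\d+)/(\d+)"
)
# Lines that look like the start or the end of a Python traceback, or an ERROR log level.
ERROR_HINT = re.compile(r"Traceback|\b\w*Error:|\bERROR\b")


def triage(log_text):
    """Return (capture progress per phase, error lines per process)."""
    capture_progress = {}
    errors_by_proc = defaultdict(list)
    for line in log_text.splitlines():
        for match in CAPTURE.finditer(line):
            phase, _pct, done, total = match.groups()
            capture_progress[phase] = (int(done), int(total))
        # A progress bar has no trailing newline, so a tagged message can
        # share its line; search anywhere in the line rather than at the start.
        tagged = PROC_TAG.search(line)
        if tagged and ERROR_HINT.search(tagged.group(2)):
            errors_by_proc[tagged.group(1)].append(tagged.group(2).strip())
    return capture_progress, dict(errors_by_proc)


progress, errors = triage(SAMPLE_LOG)
for phase, (done, total) in progress.items():
    print(f"{phase}: {done}/{total}")
for proc, lines in errors.items():
    print(proc, "->", lines)
```

If error lines show up only under `APIServer`, as they do for this log, the engine core's own exception was probably not captured, and collecting the full stderr of the engine core process is a reasonable next step.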

Fix Action

Fix / Workaround
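
The text captured on this page does not include a confirmed fix. Because the crash happens while capturing FULL decode CUDA graphs, two things that may be worth trying are skipping CUDA graph capture entirely with `--enforce-eager`, or restricting capture to the piecewise mode that did complete in the log. The sketch below only assembles the two command lines. Whether either one avoids the crash for this model on 0.17.1 is an unverified assumption, and the `--logits-processors` spelling and the `"PIECEWISE"` value for `cudagraph_mode` should be checked against `vllm serve --help` for the installed version.

```python
import json
import shlex

# Model and logits processor as shown in the "non-default args" line of the log.
MODEL = "deepseek-ai/DeepSeek-OCR-2"
LOGITS_PROCESSOR = (
    "vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor"
)


def build_serve_command(workaround):
    """Build a `vllm serve` argument list for one candidate workaround.

    "eager"     -> disable CUDA graph capture (the log shows enforce_eager=False).
    "piecewise" -> request piecewise-only capture instead of FULL_AND_PIECEWISE
                   (assumed value name; verify for your vLLM version).
    """
    cmd = ["vllm", "serve", MODEL, "--logits-processors", LOGITS_PROCESSOR]
    if workaround == "eager":
        cmd.append("--enforce-eager")
    elif workaround == "piecewise":
        cmd += ["--compilation-config", json.dumps({"cudagraph_mode": "PIECEWISE"})]
    else:
        raise ValueError(f"unknown workaround: {workaround!r}")
    return cmd


for name in ("eager", "piecewise"):
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(build_serve_command(name)))
```

Eager mode usually costs decode throughput, so the piecewise variant is the less invasive of the two if it starts cleanly.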

============================== CPU Info

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0-111
Off-line CPU(s) list: 112-223
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 1
Core(s) per socket: 56
Socket(s): 2
Stepping: 8
CPU max MHz: 3800.0000
CPU min MHz: 0.0000
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 5.3 MiB (112 instances)
L1i cache: 3.5 MiB (112 instances)
L2 cache: 224 MiB (112 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-55
NUMA node1 CPU(s): 56-111
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:238] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor']}
(APIServer pid=1016) INFO 03-18 15:30:24 [model.py:531] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=1016) INFO 03-18 15:30:24 [model.py:1554] Using max model len 8192
(APIServer pid=1016) INFO 03-18 15:30:24 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1016) INFO 03-18 15:30:24 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=1016) /usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(APIServer pid=1016)   warnings.warn(
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:34 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=1315) /usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(EngineCore_DP0 pid=1315)   warnings.warn(
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:35 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.15.0.2:53461 backend=nccl
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:35 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:38 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:38 [gpu_model_runner.py:4281] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [flash_attn.py:587] Using FlashAttention version 3
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=1315) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=1315) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
(EngineCore_DP0 pid=1315)
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:41 [default_loader.py:293] Loading weights took 1.67 seconds
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:41 [gpu_model_runner.py:4364] Model loading took 6.33 GiB memory and 2.390824 seconds
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:42 [gpu_model_runner.py:5280] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [decorators.py:465] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/8ce8d1a75e9ad02dbca24b480c233d7361a1f20f776e837dbc034337035d5ca8/rank_0_0/model
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/e3f9d7a251/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:976] Dynamo bytecode transform time: 1.07 s
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:350] Cache the graph of compile range (1, 8192) for later use
(EngineCore_DP0 pid=1315) WARNING 03-18 15:30:45 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [backends.py:366] Compiling a graph for compile range (1, 8192) takes 1.89 s
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [monitor.py:35] torch.compile takes 3.08 s in total
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [gpu_worker.py:424] Available KV cache memory: 62.69 GiB
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1314] GPU KV cache size: 1,095,648 tokens
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1319] Maximum concurrency for 8,192 tokens per request: 133.75x
(EngineCore_DP0 pid=1315) 2026-03-18 15:30:47,007 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1315) 2026-03-18 15:30:47,024 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:02<00:00, 21.63it/s]
Capturing CUDA graphs (decode, FULL):   0%|          | 0/51 [00:00<?, ?it/s]
(APIServer pid=1016) Traceback (most recent call last):
(APIServer pid=1016)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1016)     sys.exit(main())
(APIServer pid=1016)              ^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1016)     args.dispatch_function(args)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1016)     uvloop.run(run_server(args))
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1016)     return __asyncio.run(
(APIServer pid=1016)            ^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1016)     return runner.run(main)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1016)     return self._loop.run_until_complete(task)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1016)     return await main
(APIServer pid=1016)            ^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1016)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1016)     async with build_async_engine_client(
(APIServer pid=1016)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1016)     return await anext(self.gen)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1016)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1016)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1016)     return await anext(self.gen)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1016)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1016)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1016)     return cls(
(APIServer pid=1016)            ^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1016)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1016)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1016)     return func(*args, **kwargs)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=1016)     return AsyncMPClient(*client_args)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1016)     return func(*args, **kwargs)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=1016)     super().__init__(
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=1016)     with launch_core_engines(
(APIServer pid=1016)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1016)     next(self.gen)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=1016)     wait_for_engine_startup(
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=1016)     raise RuntimeError(
(APIServer pid=1016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
</details>
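
The traceback above comes from the APIServer process and only says "See root cause above", while `Failed core proc(s): {}` names no process. One way to see where the engine core stopped is to split the log on its `(Name pid=N)` prefix and read the last line the EngineCore process wrote. This is a minimal sketch and not part of vLLM: `split_by_process` and `last_engine_line` are hypothetical helper names, and the regex assumes the prefix format seen in this log.

```python
import re
from collections import defaultdict

# Matches the "(ProcessName pid=1234)" prefix seen on each log line in this issue.
PREFIX = re.compile(r"^\((?P<proc>\w+) pid=(?P<pid>\d+)\)\s?(?P<rest>.*)$")


def split_by_process(log_text):
    """Group log lines by their '(Name pid=N)' prefix; unprefixed lines go under 'other'."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        m = PREFIX.match(line)
        if m:
            groups[f"{m['proc']} pid={m['pid']}"].append(m["rest"])
        else:
            groups["other"].append(line)
    return dict(groups)


def last_engine_line(log_text):
    """Return the last non-empty line written by any EngineCore* process, or None."""
    last = None
    for line in log_text.splitlines():
        m = PREFIX.match(line)
        if m and m["proc"].startswith("EngineCore") and m["rest"].strip():
            last = m["rest"]
    return last


# Three lines copied from the 0.17.1 log above.
sample = "\n".join([
    "(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1314] GPU KV cache size: 1,095,648 tokens",
    "(APIServer pid=1016) Traceback (most recent call last):",
    "(APIServer pid=1016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}",
])
print(last_engine_line(sample))
```

Run on the full log, this points at the last stage the engine core reported before the API server gave up. Here that is the flashinfer autotuner line for 0.17.1, since the CUDA graph progress bars carry no process prefix.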
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.2rc1.dev49+g8b6325758
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:233] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor']}
(APIServer pid=9) INFO 03-18 15:41:40 [model.py:533] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=9) INFO 03-18 15:41:40 [model.py:1582] Using max model len 8192
(APIServer pid=9) INFO 03-18 15:41:40 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=9) INFO 03-18 15:41:40 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=483) INFO 03-18 15:41:50 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev49+g8b6325758) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=483) INFO 03-18 15:41:51 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.15.0.2:58493 backend=nccl
(EngineCore pid=483) INFO 03-18 15:41:51 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=483) INFO 03-18 15:41:55 [gpu_model_runner.py:4506] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore pid=483) INFO 03-18 15:41:55 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=483) INFO 03-18 15:41:55 [cuda.py:333] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=483) INFO 03-18 15:41:55 [flash_attn.py:598] Using FlashAttention version 3
(EngineCore pid=483) INFO 03-18 15:41:55 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore pid=483) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=483) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
(EngineCore pid=483)
(EngineCore pid=483) INFO 03-18 15:41:58 [default_loader.py:377] Loading weights took 1.67 seconds
(EngineCore pid=483) INFO 03-18 15:41:58 [gpu_model_runner.py:4591] Model loading took 6.33 GiB memory and 2.484376 seconds
(EngineCore pid=483) INFO 03-18 15:41:58 [gpu_model_runner.py:5513] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore pid=483) INFO 03-18 15:42:02 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/5d22c33852/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=483) INFO 03-18 15:42:02 [backends.py:1048] Dynamo bytecode transform time: 2.38 s
(EngineCore pid=483) INFO 03-18 15:42:04 [backends.py:371] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=483) INFO 03-18 15:42:08 [backends.py:387] Compiling a graph for compile range (1, 8192) takes 5.73 s
(EngineCore pid=483) INFO 03-18 15:42:09 [decorators.py:627] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/32fa4ed99d9f179225d7ad412f48f3e75ad592038cbe0ebb3c346fe7f9e79266/rank_0_0/model
(EngineCore pid=483) INFO 03-18 15:42:09 [monitor.py:48] torch.compile took 8.62 s in total
(EngineCore pid=483) /usr/local/lib/python3.12/dist-packages/torch/_inductor/lowering.py:7627: UserWarning:
(EngineCore pid=483) Online softmax is disabled on the fly since Inductor decides to
(EngineCore pid=483) split the reduction. Cut an issue to PyTorch if this is an
(EngineCore pid=483) important use case and you want to speed it up with online
(EngineCore pid=483) softmax.
(EngineCore pid=483)
(EngineCore pid=483)   warnings.warn(
(EngineCore pid=483) WARNING 03-18 15:42:11 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore pid=483) INFO 03-18 15:42:12 [monitor.py:76] Initial profiling/warmup run took 3.32 s
(EngineCore pid=483) INFO 03-18 15:42:18 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=483) INFO 03-18 15:42:18 [gpu_model_runner.py:5632] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(APIServer pid=9) Traceback (most recent call last):
(APIServer pid=9)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=9)     sys.exit(main())
(APIServer pid=9)              ^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=9)     args.dispatch_function(args)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=9)     uvloop.run(run_server(args))
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=9)     return __asyncio.run(
(APIServer pid=9)            ^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=9)     return runner.run(main)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=9)     return self._loop.run_until_complete(task)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=9)     return await main
(APIServer pid=9)            ^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 675, in run_server
(APIServer pid=9)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 689, in run_server_worker
(APIServer pid=9)     async with build_async_engine_client(
(APIServer pid=9)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=9)     return await anext(self.gen)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 104, in build_async_engine_client
(APIServer pid=9)     async with build_async_engine_client_from_engine_args(
(APIServer pid=9)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=9)     return await anext(self.gen)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 145, in build_async_engine_client_from_engine_args
(APIServer pid=9)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=9)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=9)     return cls(
(APIServer pid=9)            ^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=9)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=9)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=9)     return func(*args, **kwargs)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=9)     return AsyncMPClient(*client_args)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=9)     return func(*args, **kwargs)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=9)     super().__init__(
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=9)     with launch_core_engines(
(APIServer pid=9)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=9)     next(self.gen)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=9)     wait_for_engine_startup(
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=9)     raise RuntimeError(
(APIServer pid=9) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
</details>
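
In the 0.17.1 log the PIECEWISE capture finishes (51/51) and the crash appears as soon as the FULL decode capture starts, with `cudagraph_mode` set to `FULL_AND_PIECEWISE` and `enforce_eager=False`. A possible way to narrow the problem down is to start the server with CUDA graphs disabled, or with only piecewise graphs. The sketch below builds those two command lines. It assumes the `--enforce-eager` and `--compilation-config` options of `vllm serve` and the `--logits-processors` spelling of the option behind the `logits_processors` argument in the log. Neither variant is confirmed by this issue to avoid the crash, and `serve_command` is a hypothetical helper name.

```python
import json
import shlex

MODEL = "deepseek-ai/DeepSeek-OCR-2"
# Taken from the "non-default args" line in the log.
LOGITS_PROCESSOR = "vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor"


def serve_command(workaround):
    """Build an argv list for `vllm serve` with one candidate workaround applied.

    workaround: "eager" disables CUDA graph capture entirely;
                "piecewise" asks for piecewise graphs only (no FULL decode capture).
    """
    cmd = ["vllm", "serve", MODEL, "--logits-processors", LOGITS_PROCESSOR]
    if workaround == "eager":
        cmd.append("--enforce-eager")
    elif workaround == "piecewise":
        # Assumption: cudagraph_mode accepts the mode name as a string in the JSON config.
        cmd += ["--compilation-config", json.dumps({"cudagraph_mode": "PIECEWISE"})]
    else:
        raise ValueError(f"unknown workaround: {workaround!r}")
    return cmd


for name in ("eager", "piecewise"):
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(serve_command(name)))
```

If the server starts with `piecewise` but not with the defaults, that would suggest the failure is specific to the FULL decode capture path. Expect eager mode to cost decode throughput, so treat it as a diagnostic step rather than a fix.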

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version        : 560.35.03
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             224
On-line CPU(s) list:                0-111
Off-line CPU(s) list:               112-223
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8480+
CPU family:                         6
Model:                              143
Thread(s) per core:                 1
Core(s) per socket:                 56
Socket(s):                          2
Stepping:                           8
CPU max MHz:                        3800.0000
CPU min MHz:                        0.0000
BogoMIPS:                           4000.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          5.3 MiB (112 instances)
L1i cache:                          3.5 MiB (112 instances)
L2 cache:                           224 MiB (112 instances)
L3 cache:                           210 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-55
NUMA node1 CPU(s):                  56-111
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu129
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0+cu129
[pip3] torchvision==0.25.0+cu129
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.1
vLLM Build Flags:
  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	NIC9	NIC10	NIC11	NIC12	NIC13	NIC14	NIC15	NIC16	NIC17	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PXB	PXB	NODE	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	NODE	PXB	PXB	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	NODE	NODE	NODE	NODE	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	NODE	NODE	NODE	NODE	NODE	NODE	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	NODE	NODE	NODE	NODE	NODE	NODE	NODE	56-111	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PXB	PXB	NODE	NODE	NODE	NODE	56-111	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	PXB	PXB	NODE	NODE	56-111	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	PXB	PXB	56-111	1		N/A
NIC0	PXB	NODE	NODE	NODE	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC1	PXB	NODE	NODE	NODE	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC2	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC3	NODE	PXB	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	PIX	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC4	NODE	PXB	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	 X 	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC5	NODE	NODE	PXB	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	 X 	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC6	NODE	NODE	PXB	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	PIX	 X 	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC7	NODE	NODE	NODE	PXB	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC8	NODE	NODE	NODE	PXB	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC9	SYS	SYS	SYS	SYS	PXB	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE	NODE	NODE	NODE	NODE	NODE				
NIC10	SYS	SYS	SYS	SYS	PXB	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE	NODE	NODE	NODE	NODE	NODE				
NIC11	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	NODE	NODE	NODE	NODE	NODE				
NIC12	SYS	SYS	SYS	SYS	NODE	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	PIX	NODE	NODE	NODE	NODE				
NIC13	SYS	SYS	SYS	SYS	NODE	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	 X 	NODE	NODE	NODE	NODE				
NIC14	SYS	SYS	SYS	SYS	NODE	NODE	PXB	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	 X 	PIX	NODE	NODE				
NIC15	SYS	SYS	SYS	SYS	NODE	NODE	PXB	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	PIX	 X 	NODE	NODE				
NIC16	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	 X 	PIX				
NIC17	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_15
  NIC16: mlx5_16
  NIC17: mlx5_17

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 
brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571
TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.9.1
VLLM_ENABLE_CUDA_COMPATIBILITY=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

---

vllm serve deepseek-ai/DeepSeek-OCR-2 --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor
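
The engine config logged for this command reports `enforce_eager=False` and `cudagraph_mode` set to `FULL_AND_PIECEWISE`, and the crash appears while the FULL decode graphs are being captured. One way to check whether the failure is specific to that capture step is to rerun the same command with graph capture narrowed or disabled. The sketch below only builds the two command lines. `--enforce-eager` and `--compilation-config` are existing vLLM options, but the report does not confirm that either variant avoids the crash, and the helper name `serve_command` is hypothetical.

```python
import json
import shlex

MODEL = "deepseek-ai/DeepSeek-OCR-2"
LOGITS_PROCESSOR = (
    "vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor"
)


def serve_command(cudagraph_mode=None, enforce_eager=False):
    """Return the `vllm serve` argv from the report, optionally limiting CUDA graphs.

    Hypothetical helper for diagnosis only; it does not start a server.
    """
    argv = ["vllm", "serve", MODEL, "--logits_processors", LOGITS_PROCESSOR]
    if enforce_eager:
        # Disables CUDA graph capture entirely (slower decode, simplest check).
        argv.append("--enforce-eager")
    elif cudagraph_mode is not None:
        # Keeps compilation but overrides the capture mode chosen by default.
        argv += [
            "--compilation-config",
            json.dumps({"cudagraph_mode": cudagraph_mode}),
        ]
    return argv


variants = {
    "piecewise_only": serve_command(cudagraph_mode="PIECEWISE"),
    "eager": serve_command(enforce_eager=True),
}

for name, argv in variants.items():
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(f"{name}: {shlex.join(argv)}")
```

If the `piecewise_only` variant starts cleanly while the original command does not, that would point at the FULL decode capture path rather than at model loading or compilation.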

---

(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:238] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor']}
(APIServer pid=1016) INFO 03-18 15:30:24 [model.py:531] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=1016) INFO 03-18 15:30:24 [model.py:1554] Using max model len 8192
(APIServer pid=1016) INFO 03-18 15:30:24 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1016) INFO 03-18 15:30:24 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=1016) /usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(APIServer pid=1016)   warnings.warn(
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:34 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=1315) /usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(EngineCore_DP0 pid=1315)   warnings.warn(
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:35 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.15.0.2:53461 backend=nccl
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:35 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:38 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:38 [gpu_model_runner.py:4281] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [flash_attn.py:587] Using FlashAttention version 3
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=1315) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=1315) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
(EngineCore_DP0 pid=1315)
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:41 [default_loader.py:293] Loading weights took 1.67 seconds
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:41 [gpu_model_runner.py:4364] Model loading took 6.33 GiB memory and 2.390824 seconds
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:42 [gpu_model_runner.py:5280] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [decorators.py:465] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/8ce8d1a75e9ad02dbca24b480c233d7361a1f20f776e837dbc034337035d5ca8/rank_0_0/model
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/e3f9d7a251/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:976] Dynamo bytecode transform time: 1.07 s
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:350] Cache the graph of compile range (1, 8192) for later use
(EngineCore_DP0 pid=1315) WARNING 03-18 15:30:45 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [backends.py:366] Compiling a graph for compile range (1, 8192) takes 1.89 s
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [monitor.py:35] torch.compile takes 3.08 s in total
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [gpu_worker.py:424] Available KV cache memory: 62.69 GiB
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1314] GPU KV cache size: 1,095,648 tokens
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1319] Maximum concurrency for 8,192 tokens per request: 133.75x
(EngineCore_DP0 pid=1315) 2026-03-18 15:30:47,007 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1315) 2026-03-18 15:30:47,024 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:02<00:00, 21.63it/s]
Capturing CUDA graphs (decode, FULL):   0%|                                                                                                                                                                                                                          | 0/51 [00:00<?, ?it/s](APIServer pid=1016) Traceback (most recent call last):
(APIServer pid=1016)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1016)     sys.exit(main())
(APIServer pid=1016)              ^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1016)     args.dispatch_function(args)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1016)     uvloop.run(run_server(args))
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1016)     return __asyncio.run(
(APIServer pid=1016)            ^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1016)     return runner.run(main)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1016)     return self._loop.run_until_complete(task)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1016)     return await main
(APIServer pid=1016)            ^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1016)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1016)     async with build_async_engine_client(
(APIServer pid=1016)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1016)     return await anext(self.gen)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1016)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1016)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1016)     return await anext(self.gen)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1016)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1016)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1016)     return cls(
(APIServer pid=1016)            ^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1016)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1016)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1016)     return func(*args, **kwargs)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=1016)     return AsyncMPClient(*client_args)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1016)     return func(*args, **kwargs)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=1016)     super().__init__(
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=1016)     with launch_core_engines(
(APIServer pid=1016)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1016)     next(self.gen)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=1016)     wait_for_engine_startup(
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=1016)     raise RuntimeError(
(APIServer pid=1016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
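
The traceback above comes from the `APIServer` process and only reports that engine core initialization failed; the underlying exception from the engine core process is not shown in this log. When triaging similar output, it can help to extract which CUDA graph capture phase was still unfinished and which process printed the traceback. The following is a minimal sketch of such a parser, assuming the log format shown in this report. The function name `summarize_startup_log` and the abbreviated `sample` text are illustrative and not part of vLLM.

```python
import re

# Process prefix as it appears in this report, e.g. "(APIServer pid=1016)".
PROC_PREFIX = re.compile(r"\((APIServer|EngineCore(?:_DP\d+)?) pid=(\d+)\)")
# tqdm line for graph capture, e.g. "Capturing CUDA graphs (decode, FULL):   0%|".
CAPTURE_BAR = re.compile(r"Capturing CUDA graphs \(([^)]*)\):\s*(\d+)%")


def summarize_startup_log(text):
    """Report capture progress per phase and the process that printed a traceback."""
    capture_progress = {}  # phase name -> last percentage seen
    traceback_proc = None  # process name of the first traceback, if any
    for line in text.splitlines():
        bar = CAPTURE_BAR.search(line)
        if bar:
            capture_progress[bar.group(1)] = int(bar.group(2))
        if traceback_proc is None and "Traceback (most recent call last)" in line:
            proc = PROC_PREFIX.search(line)
            traceback_proc = proc.group(1) if proc else "unknown"
    unfinished = [phase for phase, pct in capture_progress.items() if pct < 100]
    return {
        "capture_progress": capture_progress,
        "unfinished": unfinished,
        "traceback_proc": traceback_proc,
    }


# Abbreviated from the log in this report (progress bars shortened).
sample = """\
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|####| 51/51 [00:02<00:00, 21.63it/s]
Capturing CUDA graphs (decode, FULL):   0%|    | 0/51 [00:00<?, ?it/s](APIServer pid=1016) Traceback (most recent call last):
(APIServer pid=1016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
"""

summary = summarize_startup_log(sample)
print(summary["unfinished"])
print(summary["traceback_proc"])
```

The parser tolerates the case seen here where the progress bar and the traceback header share one line, because tqdm does not emit a newline before the other process writes to the terminal.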

---

(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.2rc1.dev49+g8b6325758
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:233] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor']}
(APIServer pid=9) INFO 03-18 15:41:40 [model.py:533] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=9) INFO 03-18 15:41:40 [model.py:1582] Using max model len 8192
(APIServer pid=9) INFO 03-18 15:41:40 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=9) INFO 03-18 15:41:40 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=483) INFO 03-18 15:41:50 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev49+g8b6325758) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=483) INFO 03-18 15:41:51 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.15.0.2:58493 backend=nccl
(EngineCore pid=483) INFO 03-18 15:41:51 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=483) INFO 03-18 15:41:55 [gpu_model_runner.py:4506] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore pid=483) INFO 03-18 15:41:55 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=483) INFO 03-18 15:41:55 [cuda.py:333] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=483) INFO 03-18 15:41:55 [flash_attn.py:598] Using FlashAttention version 3
(EngineCore pid=483) INFO 03-18 15:41:55 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore pid=483) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=483) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
(EngineCore pid=483)
(EngineCore pid=483) INFO 03-18 15:41:58 [default_loader.py:377] Loading weights took 1.67 seconds
(EngineCore pid=483) INFO 03-18 15:41:58 [gpu_model_runner.py:4591] Model loading took 6.33 GiB memory and 2.484376 seconds
(EngineCore pid=483) INFO 03-18 15:41:58 [gpu_model_runner.py:5513] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore pid=483) INFO 03-18 15:42:02 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/5d22c33852/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=483) INFO 03-18 15:42:02 [backends.py:1048] Dynamo bytecode transform time: 2.38 s
(EngineCore pid=483) INFO 03-18 15:42:04 [backends.py:371] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=483) INFO 03-18 15:42:08 [backends.py:387] Compiling a graph for compile range (1, 8192) takes 5.73 s
(EngineCore pid=483) INFO 03-18 15:42:09 [decorators.py:627] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/32fa4ed99d9f179225d7ad412f48f3e75ad592038cbe0ebb3c346fe7f9e79266/rank_0_0/model
(EngineCore pid=483) INFO 03-18 15:42:09 [monitor.py:48] torch.compile took 8.62 s in total
(EngineCore pid=483) /usr/local/lib/python3.12/dist-packages/torch/_inductor/lowering.py:7627: UserWarning:
(EngineCore pid=483) Online softmax is disabled on the fly since Inductor decides to
(EngineCore pid=483) split the reduction. Cut an issue to PyTorch if this is an
(EngineCore pid=483) important use case and you want to speed it up with online
(EngineCore pid=483) softmax.
(EngineCore pid=483)
(EngineCore pid=483)   warnings.warn(
(EngineCore pid=483) WARNING 03-18 15:42:11 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore pid=483) INFO 03-18 15:42:12 [monitor.py:76] Initial profiling/warmup run took 3.32 s
(EngineCore pid=483) INFO 03-18 15:42:18 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=483) INFO 03-18 15:42:18 [gpu_model_runner.py:5632] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(APIServer pid=9) Traceback (most recent call last):
(APIServer pid=9)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=9)     sys.exit(main())
(APIServer pid=9)              ^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=9)     args.dispatch_function(args)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=9)     uvloop.run(run_server(args))
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=9)     return __asyncio.run(
(APIServer pid=9)            ^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=9)     return runner.run(main)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=9)     return self._loop.run_until_complete(task)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=9)     return await main
(APIServer pid=9)            ^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 675, in run_server
(APIServer pid=9)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 689, in run_server_worker
(APIServer pid=9)     async with build_async_engine_client(
(APIServer pid=9)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=9)     return await anext(self.gen)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 104, in build_async_engine_client
(APIServer pid=9)     async with build_async_engine_client_from_engine_args(
(APIServer pid=9)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=9)     return await anext(self.gen)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 145, in build_async_engine_client_from_engine_args
(APIServer pid=9)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=9)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=9)     return cls(
(APIServer pid=9)            ^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=9)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=9)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=9)     return func(*args, **kwargs)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=9)     return AsyncMPClient(*client_args)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=9)     return func(*args, **kwargs)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=9)     super().__init__(
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=9)     with launch_core_engines(
(APIServer pid=9)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=9)     next(self.gen)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=9)     wait_for_engine_startup(
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=9)     raise RuntimeError(
(APIServer pid=9) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
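
Note for readers: the `RuntimeError` above is raised by the APIServer process and only says "See root cause above", so the useful information is whatever the EngineCore process printed last. A minimal sketch for separating the interleaved output by process prefix follows. The helper names `split_by_process` and `last_engine_lines` are made up for illustration and are not part of vLLM.

```python
import re
from collections import defaultdict

# Matches prefixes such as "(APIServer pid=9)", "(EngineCore pid=483)"
# and "(EngineCore_DP0 pid=5925)" as they appear in the logs above.
PREFIX = re.compile(r"^\((?P<proc>[A-Za-z_0-9]+) pid=(?P<pid>\d+)\)\s?(?P<msg>.*)$")


def split_by_process(lines):
    """Group log lines by (process name, pid); untagged lines go under ("untagged", None)."""
    groups = defaultdict(list)
    for line in lines:
        text = line.rstrip("\n")
        m = PREFIX.match(text)
        if m:
            groups[(m["proc"], int(m["pid"]))].append(m["msg"])
        else:
            groups[("untagged", None)].append(text)
    return dict(groups)


def last_engine_lines(lines, n=5):
    """Return the last n messages printed by any EngineCore* process."""
    matches = (PREFIX.match(line.rstrip("\n")) for line in lines)
    engine = [m["msg"] for m in matches if m and m["proc"].startswith("EngineCore")]
    return engine[-n:]


# Four lines copied from the 0.17.1 log above.
sample = [
    "(EngineCore pid=483) INFO 03-18 15:42:18 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512",
    "(EngineCore pid=483) INFO 03-18 15:42:18 [gpu_model_runner.py:5632] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)",
    "(APIServer pid=9) Traceback (most recent call last):",
    "(APIServer pid=9) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}",
]

groups = split_by_process(sample)
tail = last_engine_lines(sample, n=1)
print(tail[0])
```

In the portion of the log shown here, the last EngineCore message before the APIServer traceback is the "Profiling CUDA graph memory" line, and no EngineCore traceback appears after it.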

---

(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]        █     █     █▄   ▄█
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.16.0
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:223] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'api_server_count': 1, 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor'], 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': False, 'mm_processor_cache_gb': 0.0}
(APIServer pid=5323) WARNING 03-18 08:09:30 [system_utils.py:271] Found ulimit of 65356 and failed to automatically increase with error current limit exceeds maximum limit. This can cause fd limit errors like `OSError: [Errno 24] Too many open files`. Consider increasing with ulimit -n
(APIServer pid=5323) INFO 03-18 08:09:37 [model.py:529] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=5323) INFO 03-18 08:09:37 [model.py:1549] Using max model len 8192
(APIServer pid=5323) INFO 03-18 08:09:37 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=5323) INFO 03-18 08:09:37 [vllm.py:689] Asynchronous scheduling is enabled.
(APIServer pid=5323) /home/ray/anaconda3/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(APIServer pid=5323)   warnings.warn(
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:49 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': 
False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=5925) /home/ray/anaconda3/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(EngineCore_DP0 pid=5925)   warnings.warn(
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:50 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.5.191:49763 backend=nccl
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:51 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:54 [gpu_model_runner.py:4124] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:54 [vllm.py:689] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:54 [cuda.py:367] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:54 [unquantized.py:131] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=5925) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=5925) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.43s/it]
(EngineCore_DP0 pid=5925)
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:56 [default_loader.py:293] Loading weights took 1.63 seconds
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:57 [gpu_model_runner.py:4221] Model loading took 6.33 GiB memory and 2.357183 seconds
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:57 [gpu_model_runner.py:5140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:01 [backends.py:916] Using cache directory: /home/ray/.cache/vllm/torch_compile_cache/758d034a64/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:01 [backends.py:976] Dynamo bytecode transform time: 2.25 s
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:03 [backends.py:351] Cache the graph of compile range (1, 8192) for later use
(EngineCore_DP0 pid=5925) WARNING 03-18 08:10:07 [fused_moe.py:1087] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:09 [backends.py:368] Compiling a graph for compile range (1, 8192) takes 7.08 s
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:09 [monitor.py:34] torch.compile takes 9.33 s in total
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:10 [gpu_worker.py:373] Available KV cache memory: 43.91 GiB
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:10 [kv_cache_utils.py:1307] GPU KV cache size: 767,392 tokens
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:10 [kv_cache_utils.py:1312] Maximum concurrency for 8,192 tokens per request: 93.68x
(EngineCore_DP0 pid=5925) 2026-03-18 08:10:10,101 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=5925) 2026-03-18 08:10:10,120 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████| 51/51 [00:03<00:00, 13.06it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:01<00:00, 31.86it/s]
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:16 [gpu_model_runner.py:5246] Graph capturing finished in 6 secs, took 0.24 GiB
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:16 [core.py:278] init engine (profile, create kv cache, warmup model) took 19.02 seconds
(APIServer pid=5323) INFO 03-18 08:10:16 [api_server.py:481] Supported tasks: ['generate']
(APIServer pid=5323) INFO 03-18 08:10:17 [serving.py:188] Warming up chat template processing...
(APIServer pid=5323) INFO 03-18 08:10:18 [hf.py:138] Loading chat template fallback for deepseek-ai/DeepSeek-OCR-2 as there isn't one defined on HF Hub.
(APIServer pid=5323) INFO 03-18 08:10:18 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=5323) INFO 03-18 08:10:18 [serving.py:213] Chat template warmup completed in 1428.3ms
(APIServer pid=5323) INFO 03-18 08:10:18 [api_server.py:486] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:38] Available routes are:
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=5323) INFO:     Started server process [5323]
(APIServer pid=5323) INFO:     Waiting for application startup.
(APIServer pid=5323) INFO:     Application startup complete.
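
Note for readers: the working 0.16.0 run and the failing 0.17.1 run were not started with the same arguments, and the paths in the logs (`/home/ray/anaconda3/...` versus `/usr/local/lib/python3.12/dist-packages/...`) suggest they also ran in different environments. The sketch below diffs the two `non-default args` dictionaries copied from the log lines. `diff_args` is a hypothetical helper, and whether any of these differences is related to the crash is not established by the logs.

```python
# "non-default args" as printed by the 0.16.0 run (works).
args_0_16_0 = {
    "model_tag": "deepseek-ai/DeepSeek-OCR-2",
    "api_server_count": 1,
    "model": "deepseek-ai/DeepSeek-OCR-2",
    "logits_processors": ["vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor"],
    "gpu_memory_utilization": 0.7,
    "enable_prefix_caching": False,
    "mm_processor_cache_gb": 0.0,
}

# "non-default args" as printed by the 0.17.1 run (crashes).
args_0_17_1 = {
    "model_tag": "deepseek-ai/DeepSeek-OCR-2",
    "model": "deepseek-ai/DeepSeek-OCR-2",
    "logits_processors": ["vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor"],
}


def diff_args(old, new):
    """Return (only_in_old, only_in_new, changed) for two argument dictionaries."""
    only_in_old = {k: v for k, v in old.items() if k not in new}
    only_in_new = {k: v for k, v in new.items() if k not in old}
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
    return only_in_old, only_in_new, changed


only_old, only_new, changed = diff_args(args_0_16_0, args_0_17_1)
print("only in 0.16.0 run:", only_old)
print("only in 0.17.1 run:", only_new)
print("changed:", changed)
```

Repeating the 0.17.1 run with the four extra arguments from the 0.16.0 run (or the reverse) would make the two logs directly comparable.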

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version        : 560.35.03
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             224
On-line CPU(s) list:                0-111
Off-line CPU(s) list:               112-223
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8480+
CPU family:                         6
Model:                              143
Thread(s) per core:                 1
Core(s) per socket:                 56
Socket(s):                          2
Stepping:                           8
CPU max MHz:                        3800.0000
CPU min MHz:                        0.0000
BogoMIPS:                           4000.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          5.3 MiB (112 instances)
L1i cache:                          3.5 MiB (112 instances)
L2 cache:                           224 MiB (112 instances)
L3 cache:                           210 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-55
NUMA node1 CPU(s):                  56-111
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu129
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0+cu129
[pip3] torchvision==0.25.0+cu129
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.1
vLLM Build Flags:
  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	NIC9	NIC10	NIC11	NIC12	NIC13	NIC14	NIC15	NIC16	NIC17	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PXB	PXB	NODE	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	NODE	PXB	PXB	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	NODE	NODE	NODE	NODE	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	NODE	NODE	NODE	NODE	NODE	NODE	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	0-55	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	NODE	NODE	NODE	NODE	NODE	NODE	NODE	56-111	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PXB	PXB	NODE	NODE	NODE	NODE	56-111	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	PXB	PXB	NODE	NODE	56-111	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	PXB	PXB	56-111	1		N/A
NIC0	PXB	NODE	NODE	NODE	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC1	PXB	NODE	NODE	NODE	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC2	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC3	NODE	PXB	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	PIX	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC4	NODE	PXB	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	 X 	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC5	NODE	NODE	PXB	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	 X 	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC6	NODE	NODE	PXB	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	PIX	 X 	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC7	NODE	NODE	NODE	PXB	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC8	NODE	NODE	NODE	PXB	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS				
NIC9	SYS	SYS	SYS	SYS	PXB	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE	NODE	NODE	NODE	NODE	NODE				
NIC10	SYS	SYS	SYS	SYS	PXB	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE	NODE	NODE	NODE	NODE	NODE				
NIC11	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	NODE	NODE	NODE	NODE	NODE				
NIC12	SYS	SYS	SYS	SYS	NODE	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	PIX	NODE	NODE	NODE	NODE				
NIC13	SYS	SYS	SYS	SYS	NODE	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	 X 	NODE	NODE	NODE	NODE				
NIC14	SYS	SYS	SYS	SYS	NODE	NODE	PXB	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	 X 	PIX	NODE	NODE				
NIC15	SYS	SYS	SYS	SYS	NODE	NODE	PXB	NODE	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	PIX	 X 	NODE	NODE				
NIC16	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	 X 	PIX				
NIC17	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	NODE	NODE	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_15
  NIC16: mlx5_16
  NIC17: mlx5_17

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.9 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 
brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571
TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.9.1
VLLM_ENABLE_CUDA_COMPATIBILITY=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

I'm trying to experiment with deepseek-ai/DeepSeek-OCR-2 on vLLM 0.17.1.

Running the following command from the vllm/vllm-openai:v0.17.1 Docker image fails during "Capturing CUDA graphs (decode, FULL)":

vllm serve deepseek-ai/DeepSeek-OCR-2 --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor
<details> <summary>0.17.1 output</summary>
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:302]
(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:238] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor']}
(APIServer pid=1016) INFO 03-18 15:30:24 [model.py:531] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=1016) INFO 03-18 15:30:24 [model.py:1554] Using max model len 8192
(APIServer pid=1016) INFO 03-18 15:30:24 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1016) INFO 03-18 15:30:24 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=1016) /usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(APIServer pid=1016)   warnings.warn(
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:34 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=1315) /usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(EngineCore_DP0 pid=1315)   warnings.warn(
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:35 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.15.0.2:53461 backend=nccl
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:35 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:38 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:38 [gpu_model_runner.py:4281] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [flash_attn.py:587] Using FlashAttention version 3
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:39 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=1315) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=1315) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
(EngineCore_DP0 pid=1315)
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:41 [default_loader.py:293] Loading weights took 1.67 seconds
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:41 [gpu_model_runner.py:4364] Model loading took 6.33 GiB memory and 2.390824 seconds
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:42 [gpu_model_runner.py:5280] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [decorators.py:465] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/8ce8d1a75e9ad02dbca24b480c233d7361a1f20f776e837dbc034337035d5ca8/rank_0_0/model
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/e3f9d7a251/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:976] Dynamo bytecode transform time: 1.07 s
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:44 [backends.py:350] Cache the graph of compile range (1, 8192) for later use
(EngineCore_DP0 pid=1315) WARNING 03-18 15:30:45 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [backends.py:366] Compiling a graph for compile range (1, 8192) takes 1.89 s
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [monitor.py:35] torch.compile takes 3.08 s in total
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [gpu_worker.py:424] Available KV cache memory: 62.69 GiB
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1314] GPU KV cache size: 1,095,648 tokens
(EngineCore_DP0 pid=1315) INFO 03-18 15:30:46 [kv_cache_utils.py:1319] Maximum concurrency for 8,192 tokens per request: 133.75x
(EngineCore_DP0 pid=1315) 2026-03-18 15:30:47,007 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1315) 2026-03-18 15:30:47,024 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:02<00:00, 21.63it/s]
Capturing CUDA graphs (decode, FULL):   0%|                                                                                                                                                                                                                          | 0/51 [00:00<?, ?it/s](APIServer pid=1016) Traceback (most recent call last):
(APIServer pid=1016)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1016)     sys.exit(main())
(APIServer pid=1016)              ^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1016)     args.dispatch_function(args)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1016)     uvloop.run(run_server(args))
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1016)     return __asyncio.run(
(APIServer pid=1016)            ^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1016)     return runner.run(main)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1016)     return self._loop.run_until_complete(task)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1016)     return await main
(APIServer pid=1016)            ^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1016)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1016)     async with build_async_engine_client(
(APIServer pid=1016)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1016)     return await anext(self.gen)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1016)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1016)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1016)     return await anext(self.gen)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1016)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1016)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1016)     return cls(
(APIServer pid=1016)            ^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1016)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1016)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1016)     return func(*args, **kwargs)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=1016)     return AsyncMPClient(*client_args)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1016)     return func(*args, **kwargs)
(APIServer pid=1016)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=1016)     super().__init__(
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=1016)     with launch_core_engines(
(APIServer pid=1016)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1016)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1016)     next(self.gen)
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=1016)     wait_for_engine_startup(
(APIServer pid=1016)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=1016)     raise RuntimeError(
(APIServer pid=1016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
</details>
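
The APIServer traceback above only says "Engine core initialization failed. See root cause above", so the useful signal is whatever was printed just before it. A minimal sketch for pulling that tail out of a saved log follows; `engine_tail` is a hypothetical helper name, and it assumes the log uses the `(APIServer pid=...)` line prefix seen in this report.

```python
TRACEBACK = "Traceback (most recent call last)"


def engine_tail(log_text, keep=3):
    """Return the last `keep` non-APIServer lines printed before the first
    traceback. Untagged lines (such as progress bars) are kept as well."""
    tail = []
    for line in log_text.splitlines():
        if TRACEBACK in line:
            # A progress bar can share a line with the traceback header,
            # so keep whatever precedes the APIServer tag on that line.
            head = line.split("(APIServer", 1)[0].strip()
            if head:
                tail.append(head)
            break
        if line.startswith("(APIServer"):
            continue  # front-end process output, not the engine
        if line.strip():
            tail.append(line.strip())
    return tail[-keep:]


if __name__ == "__main__":
    import sys

    # Usage: python engine_tail.py vllm.log
    with open(sys.argv[1], encoding="utf-8") as fh:
        for entry in engine_tail(fh.read()):
            print(entry)
```

On the 0.17.1 log above, this would end with the "Capturing CUDA graphs (decode, FULL)" progress line, which matches the stage named in the title.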

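On 0.17.2rc1.dev49+g8b6325758 (today's nightly Docker image), startup also fails with the same "Engine core initialization failed" error, but earlier: the traceback appears right after the "Profiling CUDA graph memory" log line instead of during graph capture: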

<details> <summary> nightly output</summary>
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.2rc1.dev49+g8b6325758
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:297]
(APIServer pid=9) INFO 03-18 15:41:34 [utils.py:233] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor']}
(APIServer pid=9) INFO 03-18 15:41:40 [model.py:533] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=9) INFO 03-18 15:41:40 [model.py:1582] Using max model len 8192
(APIServer pid=9) INFO 03-18 15:41:40 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=9) INFO 03-18 15:41:40 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=483) INFO 03-18 15:41:50 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev49+g8b6325758) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=483) INFO 03-18 15:41:51 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.15.0.2:58493 backend=nccl
(EngineCore pid=483) INFO 03-18 15:41:51 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=483) INFO 03-18 15:41:55 [gpu_model_runner.py:4506] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore pid=483) INFO 03-18 15:41:55 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=483) INFO 03-18 15:41:55 [cuda.py:333] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=483) INFO 03-18 15:41:55 [flash_attn.py:598] Using FlashAttention version 3
(EngineCore pid=483) INFO 03-18 15:41:55 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore pid=483) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=483) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.47s/it]
(EngineCore pid=483)
(EngineCore pid=483) INFO 03-18 15:41:58 [default_loader.py:377] Loading weights took 1.67 seconds
(EngineCore pid=483) INFO 03-18 15:41:58 [gpu_model_runner.py:4591] Model loading took 6.33 GiB memory and 2.484376 seconds
(EngineCore pid=483) INFO 03-18 15:41:58 [gpu_model_runner.py:5513] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore pid=483) INFO 03-18 15:42:02 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/5d22c33852/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=483) INFO 03-18 15:42:02 [backends.py:1048] Dynamo bytecode transform time: 2.38 s
(EngineCore pid=483) INFO 03-18 15:42:04 [backends.py:371] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=483) INFO 03-18 15:42:08 [backends.py:387] Compiling a graph for compile range (1, 8192) takes 5.73 s
(EngineCore pid=483) INFO 03-18 15:42:09 [decorators.py:627] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/32fa4ed99d9f179225d7ad412f48f3e75ad592038cbe0ebb3c346fe7f9e79266/rank_0_0/model
(EngineCore pid=483) INFO 03-18 15:42:09 [monitor.py:48] torch.compile took 8.62 s in total
(EngineCore pid=483) /usr/local/lib/python3.12/dist-packages/torch/_inductor/lowering.py:7627: UserWarning:
(EngineCore pid=483) Online softmax is disabled on the fly since Inductor decides to
(EngineCore pid=483) split the reduction. Cut an issue to PyTorch if this is an
(EngineCore pid=483) important use case and you want to speed it up with online
(EngineCore pid=483) softmax.
(EngineCore pid=483)
(EngineCore pid=483)   warnings.warn(
(EngineCore pid=483) WARNING 03-18 15:42:11 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore pid=483) INFO 03-18 15:42:12 [monitor.py:76] Initial profiling/warmup run took 3.32 s
(EngineCore pid=483) INFO 03-18 15:42:18 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=483) INFO 03-18 15:42:18 [gpu_model_runner.py:5632] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(APIServer pid=9) Traceback (most recent call last):
(APIServer pid=9)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=9)     sys.exit(main())
(APIServer pid=9)              ^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=9)     args.dispatch_function(args)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=9)     uvloop.run(run_server(args))
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=9)     return __asyncio.run(
(APIServer pid=9)            ^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=9)     return runner.run(main)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=9)     return self._loop.run_until_complete(task)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=9)     return await main
(APIServer pid=9)            ^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 675, in run_server
(APIServer pid=9)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 689, in run_server_worker
(APIServer pid=9)     async with build_async_engine_client(
(APIServer pid=9)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=9)     return await anext(self.gen)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 104, in build_async_engine_client
(APIServer pid=9)     async with build_async_engine_client_from_engine_args(
(APIServer pid=9)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=9)     return await anext(self.gen)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 145, in build_async_engine_client_from_engine_args
(APIServer pid=9)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=9)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=9)     return cls(
(APIServer pid=9)            ^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=9)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=9)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=9)     return func(*args, **kwargs)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=9)     return AsyncMPClient(*client_args)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=9)     return func(*args, **kwargs)
(APIServer pid=9)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=9)     super().__init__(
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=9)     with launch_core_engines(
(APIServer pid=9)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=9)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=9)     next(self.gen)
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=9)     wait_for_engine_startup(
(APIServer pid=9)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=9)     raise RuntimeError(
(APIServer pid=9) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
</details>
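
Both failing runs stop at a CUDA graph step, so one way to narrow this down is to rerun the same command with CUDA graphs disabled or restricted. The sketch below only builds the command lines; `serve_command` and the mode names are hypothetical. `--enforce-eager` and `--compilation-config` with a `cudagraph_mode` key are not taken from this report. They are assumed to be available in these vLLM versions, and neither has been confirmed here to avoid the crash.

```python
import json
import shlex

MODEL = "deepseek-ai/DeepSeek-OCR-2"
LOGITS_PROCESSOR = (
    "vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor"
)


def serve_command(mode="default"):
    """Build a `vllm serve` argv for one isolation experiment.

    mode="default"   -> the command from this report, unchanged
    mode="eager"     -> assumption: --enforce-eager skips CUDA graph capture
    mode="piecewise" -> assumption: cudagraph_mode=PIECEWISE skips FULL graphs
    """
    argv = ["vllm", "serve", MODEL, "--logits_processors", LOGITS_PROCESSOR]
    if mode == "eager":
        argv.append("--enforce-eager")
    elif mode == "piecewise":
        argv += [
            "--compilation-config",
            json.dumps({"cudagraph_mode": "PIECEWISE"}),
        ]
    elif mode != "default":
        raise ValueError(f"unknown mode: {mode}")
    return argv


if __name__ == "__main__":
    # Print shell-quoted commands so they can be pasted into the container.
    for mode in ("default", "eager", "piecewise"):
        print(shlex.join(serve_command(mode)))
```

If the "eager" variant starts and the "default" one does not, that would point at CUDA graph capture itself instead of model loading or compilation.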

Serving the same model with the same logits processor on 0.16.0 works.
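
Note that the "non-default args" lines in the logs differ between the 0.16.0 run and the 0.17.1 run, so the two runs are not an exact like-for-like comparison. A small sketch for diffing those lines follows; `non_default_args` and `args_diff` are hypothetical helper names, and the two log lines are copied from this report. Whether any of these differences is related to the crash is not established here.

```python
import ast

MARKER = "non-default args: "
UNSET = "<unset>"

LINE_0_16_0 = (
    "(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:223] non-default args: "
    "{'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'api_server_count': 1, "
    "'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': "
    "['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor'], "
    "'gpu_memory_utilization': 0.7, 'enable_prefix_caching': False, "
    "'mm_processor_cache_gb': 0.0}"
)
LINE_0_17_1 = (
    "(APIServer pid=1016) INFO 03-18 15:30:23 [utils.py:238] non-default args: "
    "{'model_tag': 'deepseek-ai/DeepSeek-OCR-2', "
    "'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': "
    "['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor']}"
)


def non_default_args(log_line):
    """Parse the dict literal that follows 'non-default args: '."""
    _, _, payload = log_line.partition(MARKER)
    if not payload:
        raise ValueError("line does not contain a non-default args dict")
    return ast.literal_eval(payload.strip())


def args_diff(old, new):
    """Return {key: (old_value, new_value)} for keys whose values differ."""
    keys = sorted(set(old) | set(new))
    return {
        k: (old.get(k, UNSET), new.get(k, UNSET))
        for k in keys
        if old.get(k, UNSET) != new.get(k, UNSET)
    }


if __name__ == "__main__":
    diff = args_diff(non_default_args(LINE_0_16_0), non_default_args(LINE_0_17_1))
    for key, (old, new) in diff.items():
        print(f"{key}: 0.16.0={old!r} 0.17.1={new!r}")
```

Rerunning 0.17.1 with the 0.16.0 arguments, or the reverse, would make the version comparison cleaner.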

<details> <summary> 0.16.0 output</summary>
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]        █     █     █▄   ▄█
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.16.0
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-OCR-2
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:287]
(APIServer pid=5323) INFO 03-18 08:09:30 [utils.py:223] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-OCR-2', 'api_server_count': 1, 'model': 'deepseek-ai/DeepSeek-OCR-2', 'logits_processors': ['vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor'], 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': False, 'mm_processor_cache_gb': 0.0}
(APIServer pid=5323) WARNING 03-18 08:09:30 [system_utils.py:271] Found ulimit of 65356 and failed to automatically increase with error current limit exceeds maximum limit. This can cause fd limit errors like `OSError: [Errno 24] Too many open files`. Consider increasing with ulimit -n
(APIServer pid=5323) INFO 03-18 08:09:37 [model.py:529] Resolved architecture: DeepseekOCR2ForCausalLM
(APIServer pid=5323) INFO 03-18 08:09:37 [model.py:1549] Using max model len 8192
(APIServer pid=5323) INFO 03-18 08:09:37 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=5323) INFO 03-18 08:09:37 [vllm.py:689] Asynchronous scheduling is enabled.
(APIServer pid=5323) /home/ray/anaconda3/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(APIServer pid=5323)   warnings.warn(
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:49 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='deepseek-ai/DeepSeek-OCR-2', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-OCR-2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-OCR-2, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': 
False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=5925) /home/ray/anaconda3/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
(EngineCore_DP0 pid=5925)   warnings.warn(
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:50 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.5.191:49763 backend=nccl
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:51 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:54 [gpu_model_runner.py:4124] Starting to load model deepseek-ai/DeepSeek-OCR-2...
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:54 [vllm.py:689] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:54 [cuda.py:367] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:54 [unquantized.py:131] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=5925) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=5925) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.43s/it]
(EngineCore_DP0 pid=5925)
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:56 [default_loader.py:293] Loading weights took 1.63 seconds
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:57 [gpu_model_runner.py:4221] Model loading took 6.33 GiB memory and 2.357183 seconds
(EngineCore_DP0 pid=5925) INFO 03-18 08:09:57 [gpu_model_runner.py:5140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 9 image items of the maximum feature size.
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:01 [backends.py:916] Using cache directory: /home/ray/.cache/vllm/torch_compile_cache/758d034a64/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:01 [backends.py:976] Dynamo bytecode transform time: 2.25 s
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:03 [backends.py:351] Cache the graph of compile range (1, 8192) for later use
(EngineCore_DP0 pid=5925) WARNING 03-18 08:10:07 [fused_moe.py:1087] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/ray/anaconda3/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=896,device_name=NVIDIA_H100_80GB_HBM3.json
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:09 [backends.py:368] Compiling a graph for compile range (1, 8192) takes 7.08 s
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:09 [monitor.py:34] torch.compile takes 9.33 s in total
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:10 [gpu_worker.py:373] Available KV cache memory: 43.91 GiB
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:10 [kv_cache_utils.py:1307] GPU KV cache size: 767,392 tokens
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:10 [kv_cache_utils.py:1312] Maximum concurrency for 8,192 tokens per request: 93.68x
(EngineCore_DP0 pid=5925) 2026-03-18 08:10:10,101 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=5925) 2026-03-18 08:10:10,120 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████| 51/51 [00:03<00:00, 13.06it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:01<00:00, 31.86it/s]
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:16 [gpu_model_runner.py:5246] Graph capturing finished in 6 secs, took 0.24 GiB
(EngineCore_DP0 pid=5925) INFO 03-18 08:10:16 [core.py:278] init engine (profile, create kv cache, warmup model) took 19.02 seconds
(APIServer pid=5323) INFO 03-18 08:10:16 [api_server.py:481] Supported tasks: ['generate']
(APIServer pid=5323) INFO 03-18 08:10:17 [serving.py:188] Warming up chat template processing...
(APIServer pid=5323) INFO 03-18 08:10:18 [hf.py:138] Loading chat template fallback for deepseek-ai/DeepSeek-OCR-2 as there isn't one defined on HF Hub.
(APIServer pid=5323) INFO 03-18 08:10:18 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=5323) INFO 03-18 08:10:18 [serving.py:213] Chat template warmup completed in 1428.3ms
(APIServer pid=5323) INFO 03-18 08:10:18 [api_server.py:486] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:38] Available routes are:
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=5323) INFO 03-18 08:10:18 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=5323) INFO:     Started server process [5323]
(APIServer pid=5323) INFO:     Waiting for application startup.
(APIServer pid=5323) INFO:     Application startup complete.
</details>

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

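The crash happens while vLLM captures CUDA graphs, in the "decode, FULL" phase named in the issue title. The v0.16.0 log above completes both the PIECEWISE and the FULL capture phases on the same H100, so the steps below focus on the capture configuration first:

  • Update CUDA and cuDNN versions: Confirm that the installed CUDA and cuDNN versions are compatible with the PyTorch build that vLLM is using.
  • Disable CUDA graph capture: Set cudagraph_mode to NONE in the compilation_config. This skips the failing capture step and usually costs some decode throughput.
  • Reduce the maximum sequence length: Lowering the maximum model length (8192 in the log above) reduces the memory each request can claim. It does not change the size of the model weights.
  • Check for GPU memory issues: Confirm that the GPU has enough free memory for the weights, the KV cache, and the captured graphs. The working log above reports 6.33 GiB for model loading, 43.91 GiB available for the KV cache, and 0.24 GiB for graph capture.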

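Here's an example of how to disable CUDA graph capture. The dictionary below is the value vLLM receives as compilation_config, for example through the --compilation-config option of vllm serve: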

compilation_config = {
    # ... other config options ...
    'cudagraph_mode': 'NONE',
}

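You can also try reducing the maximum sequence length. Note that this limits the context window rather than the model size, and that the user-facing setting is max_model_len (the engine log prints it as max_seq_len):

model_config = {
    # ... other config options ...
    'max_model_len': 4096,  # the log above shows the default of 8192 for this model
}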
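
The two overrides above can be combined into a single command line. The following is a minimal sketch, not an official recipe: `build_serve_command` is a hypothetical helper, and it assumes that the installed vLLM accepts `--compilation-config` with a JSON value, `--max-model-len`, and `--enforce-eager`. Check `vllm serve --help` for the version you run before relying on it.

```python
import json
import shlex


def build_serve_command(model, cudagraph_mode=None, max_model_len=None,
                        enforce_eager=False):
    """Return the argv list for a `vllm serve` call with optional overrides.

    Hypothetical helper for illustration; the flag names are assumptions
    about the vLLM CLI and should be checked against `vllm serve --help`.
    """
    argv = ["vllm", "serve", model]
    if enforce_eager:
        # Eager mode avoids CUDA graph capture entirely.
        argv.append("--enforce-eager")
    elif cudagraph_mode is not None:
        # Pass the compilation config as a JSON string.
        argv += ["--compilation-config",
                 json.dumps({"cudagraph_mode": cudagraph_mode})]
    if max_model_len is not None:
        argv += ["--max-model-len", str(max_model_len)]
    return argv


# Print a shell-safe command that keeps PIECEWISE graphs only.
print(shlex.join(build_serve_command("deepseek-ai/DeepSeek-OCR-2",
                                     cudagraph_mode="PIECEWISE")))
```

Building the argument list in code keeps the JSON quoting correct, which is easy to get wrong when typing the option by hand in a shell.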

Verification

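To verify the fix, rerun the vllm serve command with the updated configuration. A successful startup logs "Graph capturing finished" from the engine core and then "Application startup complete" from the API server, as in the working v0.16.0 log above. After that, the /health route listed in the log should respond.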
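
The startup check can be automated by polling the /health route that appears in the route list of the log. This is a sketch under assumptions: `fetch_status` and `wait_until_healthy` are hypothetical helpers, and the base URL `http://localhost:8000` is assumed from the "Starting vLLM API server 0 on http://0.0.0.0:8000" line.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return None


def wait_until_healthy(base_url="http://localhost:8000", attempts=60,
                       delay=5.0, fetch=fetch_status, sleep=time.sleep):
    """Poll base_url + '/health' until it returns 200 or attempts run out."""
    for _ in range(attempts):
        if fetch(base_url + "/health") == 200:
            return True
        sleep(delay)
    return False
```

Calling `wait_until_healthy()` after launching the server returns `True` once the API server answers. If the engine crashes during graph capture, the route never comes up and the function returns `False` after the last attempt.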

Extra Tips

  • Check GPU memory usage during startup and lower the maximum model length if the engine runs out of memory.
  • If memory remains tight, consider a smaller model or an architecture with a lower memory footprint.
  • Try a different CUDA graph mode, such as PIECEWISE. The log above shows the default FULL_AND_PIECEWISE mode, and the reported crash is in the FULL decode phase, so a PIECEWISE-only configuration should skip that phase while keeping some of the graph benefits.
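
To check the memory tips against a real run, the figures can be read from the startup log. The sketch below is an illustration only: `summarize_startup_log` is a hypothetical helper, and its patterns are copied from the v0.16.0 log lines on this page, so other vLLM versions may word these lines differently.

```python
import re

# Patterns taken from the v0.16.0 log lines shown above (assumption: the
# wording is the same in the version you are debugging).
PATTERNS = {
    "model_gib": re.compile(r"Model loading took ([\d.]+) GiB"),
    "kv_cache_gib": re.compile(r"Available KV cache memory: ([\d.]+) GiB"),
    "graph_gib": re.compile(
        r"Graph capturing finished in \d+ secs, took ([\d.]+) GiB"),
}


def summarize_startup_log(text):
    """Extract memory figures (GiB) and whether graph capture completed."""
    summary = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        summary[key] = float(match.group(1)) if match else None
    # The "finished" line only appears when capture did not crash.
    summary["capture_finished"] = summary["graph_gib"] is not None
    return summary
```

A log from a crashed run would yield `capture_finished` set to `False` while still reporting the model and KV cache figures, which helps separate a capture failure from an out-of-memory condition earlier in startup.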
