vllm - [Bug]: gemma4 31B MTP Avg Draft acceptance rate: 0.2% [6 comments, 4 participants]

vllm-project/vllm#41789 · Fetched 2026-05-07 03:32:54

Root Cause

WARNING 05-06 06:05:03 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.2rc1.dev49+g9b4e83934
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]   █▄█▀ █     █     █     █  model   /model
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:233] non-default args: {'model_tag': '/model', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'host': '0.0.0.0' , 'model': '/model', 'max_model_len': 65536, 'served_model_name': ['gpt'], 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.93, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'video': 0, 'image': 0, 'audio': 0}, 'max_num_seqs': 32, 'enable_chunked_prefill': True, 'async_scheduling': True, 'speculative_config': {'model': 'google/gemma-4-31B-it-assistant', 'num_speculative_tokens': 4}, 'performance_mode': 'throughput'}
(APIServer pid=1) WARNING 05-06 06:05:03 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 05-06 06:05:03 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 05-06 06:05:03 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 05-06 06:05:03 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) INFO 05-06 06:05:14 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1) INFO 05-06 06:05:14 [nixl_utils.py:32] NIXL is available
(APIServer pid=1) INFO 05-06 06:05:14 [model.py:563] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=1) INFO 05-06 06:05:14 [model.py:1692] Using max model len 65536
(APIServer pid=1) INFO 05-06 06:05:14 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 05-06 06:05:23 [model.py:563] Resolved architecture: Gemma4MTPModel
(APIServer pid=1) INFO 05-06 06:05:23 [model.py:1692] Using max model len 262144
(APIServer pid=1) WARNING 05-06 06:05:23 [speculative.py:671] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 05-06 06:05:23 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) INFO 05-06 06:05:23 [vllm.py:723] Performance mode set to 'throughput'.
(APIServer pid=1) INFO 05-06 06:05:23 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=1) INFO 05-06 06:05:23 [vllm.py:844] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-06 06:05:23 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) WARNING 05-06 06:05:23 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=1) WARNING 05-06 06:05:23 [cuda.py:233] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(APIServer pid=1) INFO 05-06 06:05:26 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
INFO 05-06 06:05:38 [nixl_utils.py:32] NIXL is available
(EngineCore pid=129) INFO 05-06 06:05:38 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev49+g9b4e83934) with config: model='/model', speculative_config=SpeculativeConfig(method='mtp', model='google/gemma-4-31B-it-assistant', num_spec_tokens=4), tokenizer='/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=gpt, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 
'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 320, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=129) INFO 05-06 06:05:41 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=129) INFO 05-06 06:05:41 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.109:60739 backend=nccl
(EngineCore pid=129) INFO 05-06 06:05:41 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=129) INFO 05-06 06:05:42 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=129) WARNING 05-06 06:05:42 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=129) INFO 05-06 06:05:42 [gpu_model_runner.py:4828] Starting to load model /model...
(EngineCore pid=129) INFO 05-06 06:05:42 [vllm.py:723] Performance mode set to 'throughput'.
(EngineCore pid=129) INFO 05-06 06:05:42 [vllm.py:844] Asynchronous scheduling is enabled.
(EngineCore pid=129) INFO 05-06 06:05:42 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(EngineCore pid=129) WARNING 05-06 06:05:42 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(EngineCore pid=129) INFO 05-06 06:05:42 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(EngineCore pid=129) INFO 05-06 06:05:42 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=129) INFO 05-06 06:05:42 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=129) INFO 05-06 06:05:43 [weight_utils.py:904] Filesystem type for checkpoints: BTRFS. Checkpoint size: 19.04 GiB. Available RAM: 27.04 GiB.
(EngineCore pid=129) INFO 05-06 06:05:43 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (BTRFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:07,  2.47s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:05,  2.63s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:07<00:02,  2.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.57s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.58s/it]
(EngineCore pid=129)
(EngineCore pid=129) INFO 05-06 06:05:53 [default_loader.py:391] Loading weights took 10.40 seconds
(EngineCore pid=129) INFO 05-06 06:05:54 [gpu_model_runner.py:4852] Loading drafter model...
(EngineCore pid=129) INFO 05-06 06:05:54 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(EngineCore pid=129) INFO 05-06 06:05:54 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(EngineCore pid=129) INFO 05-06 06:05:54 [weight_utils.py:659] No model.safetensors.index.json found in remote.
(EngineCore pid=129) INFO 05-06 06:05:54 [weight_utils.py:904] Filesystem type for checkpoints: BTRFS. Checkpoint size: 0.87 GiB. Available RAM: 27.03 GiB.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.78it/s]
(EngineCore pid=129)
(EngineCore pid=129) INFO 05-06 06:05:55 [default_loader.py:391] Loading weights took 0.37 seconds
(EngineCore pid=129) INFO 05-06 06:05:55 [llm_base_proposer.py:1486] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:171] Gemma4 MTP: keeping draft model's own lm_head (draft_dim != backbone_dim).
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:330] Gemma4 MTP: draft layer 0 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:330] Gemma4 MTP: draft layer 1 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:330] Gemma4 MTP: draft layer 2 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:330] Gemma4 MTP: draft layer 3 (full_attention) -> language_model.model.layers.59.self_attn.attn
(EngineCore pid=129) INFO 05-06 06:05:55 [gpu_model_runner.py:4930] Model loading took 19.33 GiB memory and 12.725887 seconds
(EngineCore pid=129) INFO 05-06 06:06:16 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/3b56cf98a7/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=129) INFO 05-06 06:06:16 [backends.py:1148] Dynamo bytecode transform time: 20.55 s
(EngineCore pid=129) INFO 05-06 06:06:34 [backends.py:378] Cache the graph of compile range (1, 4096) for later use
(EngineCore pid=129) INFO 05-06 06:07:06 [backends.py:393] Compiling a graph for compile range (1, 4096) takes 48.36 s
(EngineCore pid=129) INFO 05-06 06:07:17 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/16e2c5ceda6327de3d24e565ce7c36d843ca9ff0868b5cfb12568b6ecdc2ad7b/rank_0_0/model
(EngineCore pid=129) INFO 05-06 06:07:17 [monitor.py:53] torch.compile took 81.85 s in total
(EngineCore pid=129) INFO 05-06 06:07:49 [monitor.py:81] Initial profiling/warmup run took 31.65 s
(EngineCore pid=129) INFO 05-06 06:07:51 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/3b56cf98a7/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=129) INFO 05-06 06:07:51 [backends.py:1148] Dynamo bytecode transform time: 1.48 s
(EngineCore pid=129) INFO 05-06 06:07:59 [backends.py:393] Compiling a graph for compile range (1, 4096) takes 8.49 s
(EngineCore pid=129) INFO 05-06 06:08:00 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/693b2504aba733248032e0e2aa5342b64e1055d0109c878d1a356a5cbd6ffc64/rank_0_0/model
(EngineCore pid=129) INFO 05-06 06:08:00 [monitor.py:53] torch.compile took 10.66 s in total
(EngineCore pid=129) INFO 05-06 06:08:00 [monitor.py:81] Initial profiling/warmup run took 0.17 s
(EngineCore pid=129) INFO 05-06 06:08:12 [gpu_model_runner.py:6034] Profiling CUDA graph memory: PIECEWISE=37 (largest=320), FULL=21 (largest=160)
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_model_runner.py:6113] Estimated CUDA graph memory: 0.50 GiB total
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_worker.py:460] Available KV cache memory: 6.26 GiB
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_worker.py:475] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9300 is equivalent to --gpu-memory-utilization=0.9141 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9459. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=129) INFO 05-06 06:08:19 [kv_cache_utils.py:1710] GPU KV cache size: 91,962 tokens
(EngineCore pid=129) INFO 05-06 06:08:19 [kv_cache_utils.py:1711] Maximum concurrency for 65,536 tokens per request: 1.40x
(EngineCore pid=129) INFO 05-06 06:08:19 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [00:04<00:00,  8.39it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:03<00:00,  5.48it/s]
(EngineCore pid=129) INFO 05-06 06:08:29 [gpu_model_runner.py:6204] Graph capturing finished in 9 secs, took 0.46 GiB
(EngineCore pid=129) INFO 05-06 06:08:29 [gpu_worker.py:619] CUDA graph pool memory: 0.46 GiB (actual), 0.5 GiB (estimated), difference: 0.04 GiB (8.0%).
(EngineCore pid=129) INFO 05-06 06:08:29 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=129) INFO 05-06 06:08:29 [core.py:299] init engine (profile, create kv cache, warmup model) took 153.19 s (compilation: 92.51 s)
(EngineCore pid=129) INFO 05-06 06:08:29 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) INFO 05-06 06:08:29 [api_server.py:613] Supported tasks: ['generate']
(APIServer pid=1) INFO 05-06 06:08:30 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 05-06 06:08:30 [model.py:1449] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 05-06 06:08:32 [hf.py:483] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 05-06 06:08:32 [api_server.py:617] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
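
The startup log above is long, and its warnings are easy to miss among the INFO lines. Below is a minimal sketch for pulling them out, assuming the line layout seen in this log (`(Process pid=N) LEVEL MM-DD HH:MM:SS [file.py:line] message`, with the process prefix sometimes absent). The function name `extract_warnings` is hypothetical and not part of vLLM.

```python
import re

# Matches lines such as:
#   (APIServer pid=1) WARNING 05-06 06:05:23 [speculative.py:671] message...
# The process prefix is optional because some lines in the log lack it.
LOG_LINE = re.compile(
    r"^(?:\((?P<proc>\w+) pid=\d+\) )?"
    r"(?P<level>[A-Z]+) \d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"\[(?P<source>[^\]]+)\] (?P<message>.*)$"
)


def extract_warnings(log_text):
    """Return unique (source, message) pairs for WARNING lines, in log order."""
    seen = set()
    warnings = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None or match.group("level") != "WARNING":
            continue
        key = (match.group("source"), match.group("message"))
        if key not in seen:  # APIServer and EngineCore repeat some warnings
            seen.add(key)
            warnings.append(key)
    return warnings


# Shortened excerpt of the log above, used only to exercise the parser.
sample = """\
(APIServer pid=1) INFO 05-06 06:05:23 [model.py:563] Resolved architecture: Gemma4MTPModel
(APIServer pid=1) WARNING 05-06 06:05:23 [speculative.py:671] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) WARNING 05-06 06:05:23 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings.
(EngineCore pid=129) WARNING 05-06 06:05:42 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings.
"""

for source, message in extract_warnings(sample):
    print(f"{source}: {message}")
```

In this log, the warning from `speculative.py:671` is the only one that mentions acceptance rate, though the log alone does not establish it as the cause of the reported 0.2%.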

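To put the 0.2% figure from the issue title in context, here is a rough sketch of what such a rate means per decoding step with `num_speculative_tokens=4`, as configured in the log. It assumes the rate is accepted draft tokens divided by proposed draft tokens, and that each verification step emits one token from the target model plus any accepted draft tokens. Both definitions are assumptions made here, not statements from the issue, and the function name is hypothetical.

```python
def expected_tokens_per_step(acceptance_rate, num_speculative_tokens):
    """Expected tokens emitted per verification step.

    Assumes acceptance_rate = accepted draft tokens / proposed draft tokens
    (an assumption, not confirmed by the source), and that the target model
    always contributes exactly one token per step.
    """
    accepted_per_step = acceptance_rate * num_speculative_tokens
    return 1.0 + accepted_per_step


# 0.2% is the rate in the issue title; 4 is num_speculative_tokens in the log.
reported = expected_tokens_per_step(0.002, 4)
print(f"tokens per step at 0.2%: {reported:.3f}")
```

Under these assumptions, almost every step yields only the target model's own token, so the drafting work is nearly all overhead.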
(EngineCore pid=129) INFO 05-06 06:06:16 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/3b56cf98a7/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=129) INFO 05-06 06:06:16 [backends.py:1148] Dynamo bytecode transform time: 20.55 s
(EngineCore pid=129) INFO 05-06 06:06:34 [backends.py:378] Cache the graph of compile range (1, 4096) for later use
(EngineCore pid=129) INFO 05-06 06:07:06 [backends.py:393] Compiling a graph for compile range (1, 4096) takes 48.36 s
(EngineCore pid=129) INFO 05-06 06:07:17 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/16e2c5ceda6327de3d24e565ce7c36d843ca9ff0868b5cfb12568b6ecdc2ad7b/rank_0_0/model
(EngineCore pid=129) INFO 05-06 06:07:17 [monitor.py:53] torch.compile took 81.85 s in total
(EngineCore pid=129) INFO 05-06 06:07:49 [monitor.py:81] Initial profiling/warmup run took 31.65 s
(EngineCore pid=129) INFO 05-06 06:07:51 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/3b56cf98a7/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=129) INFO 05-06 06:07:51 [backends.py:1148] Dynamo bytecode transform time: 1.48 s
(EngineCore pid=129) INFO 05-06 06:07:59 [backends.py:393] Compiling a graph for compile range (1, 4096) takes 8.49 s
(EngineCore pid=129) INFO 05-06 06:08:00 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/693b2504aba733248032e0e2aa5342b64e1055d0109c878d1a356a5cbd6ffc64/rank_0_0/model
(EngineCore pid=129) INFO 05-06 06:08:00 [monitor.py:53] torch.compile took 10.66 s in total
(EngineCore pid=129) INFO 05-06 06:08:00 [monitor.py:81] Initial profiling/warmup run took 0.17 s
(EngineCore pid=129) INFO 05-06 06:08:12 [gpu_model_runner.py:6034] Profiling CUDA graph memory: PIECEWISE=37 (largest=320), FULL=21 (largest=160)
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_model_runner.py:6113] Estimated CUDA graph memory: 0.50 GiB total
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_worker.py:460] Available KV cache memory: 6.26 GiB
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_worker.py:475] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9300 is equivalent to --gpu-memory-utilization=0.9141 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9459. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=129) INFO 05-06 06:08:19 [kv_cache_utils.py:1710] GPU KV cache size: 91,962 tokens
(EngineCore pid=129) INFO 05-06 06:08:19 [kv_cache_utils.py:1711] Maximum concurrency for 65,536 tokens per request: 1.40x
(EngineCore pid=129) INFO 05-06 06:08:19 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [00:04<00:00,  8.39it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:03<00:00,  5.48it/s]
(EngineCore pid=129) INFO 05-06 06:08:29 [gpu_model_runner.py:6204] Graph capturing finished in 9 secs, took 0.46 GiB
(EngineCore pid=129) INFO 05-06 06:08:29 [gpu_worker.py:619] CUDA graph pool memory: 0.46 GiB (actual), 0.5 GiB (estimated), difference: 0.04 GiB (8.0%).
(EngineCore pid=129) INFO 05-06 06:08:29 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=129) INFO 05-06 06:08:29 [core.py:299] init engine (profile, create kv cache, warmup model) took 153.19 s (compilation: 92.51 s)
(EngineCore pid=129) INFO 05-06 06:08:29 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) INFO 05-06 06:08:29 [api_server.py:613] Supported tasks: ['generate']
(APIServer pid=1) INFO 05-06 06:08:30 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 05-06 06:08:30 [model.py:1449] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 05-06 06:08:32 [hf.py:483] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 05-06 06:08:32 [api_server.py:617] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.



(EngineCore pid=129) WARNING 05-06 06:09:38 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:39 [jit_monitor.py:103] Triton kernel JIT compilation during inference: eagle_prepare_next_token_padded_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:40 [jit_monitor.py:103] Triton kernel JIT compilation during inference: kernel_unified_attention. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:40 [jit_monitor.py:103] Triton kernel JIT compilation during inference: reduce_segments. This causes a latency spike; consider extending warmup to cover this shape/config.


(EngineCore pid=129) WARNING 05-06 06:09:48 [jit_monitor.py:103] Triton kernel JIT compilation during inference: expand_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:56 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _topk_topp_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:56 [jit_monitor.py:103] Triton kernel JIT compilation during inference: eagle_prepare_inputs_padded_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=1) INFO 05-06 06:10:03 [loggers.py:271] Engine 000: Avg prompt throughput: 1245.6 tokens/s, Avg generation throughput: 133.0 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 39.7%, Prefix cache hit rate: 35.3%
(APIServer pid=1) INFO 05-06 06:10:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.02, Accepted throughput: 0.31 tokens/s, Drafted throughput: 55.47 tokens/s, Accepted: 29 tokens, Drafted: 5184 tokens, Per-position acceptance rate: 0.022, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.6%
(APIServer pid=1) INFO:     192.168.1.108:39410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.1.108:39404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.1.108:39410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:10:13 [loggers.py:271] Engine 000: Avg prompt throughput: 162.6 tokens/s, Avg generation throughput: 187.8 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.3%, Prefix cache hit rate: 39.2%
(APIServer pid=1) INFO 05-06 06:10:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.01, Accepted throughput: 2.30 tokens/s, Drafted throughput: 740.73 tokens/s, Accepted: 23 tokens, Drafted: 7408 tokens, Per-position acceptance rate: 0.012, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.3%
(APIServer pid=1) INFO:     192.168.1.108:39404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.1.108:39410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:10:23 [loggers.py:271] Engine 000: Avg prompt throughput: 28.7 tokens/s, Avg generation throughput: 187.1 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.7%, Prefix cache hit rate: 44.2%
(APIServer pid=1) INFO 05-06 06:10:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.01, Accepted throughput: 2.60 tokens/s, Drafted throughput: 737.56 tokens/s, Accepted: 26 tokens, Drafted: 7376 tokens, Per-position acceptance rate: 0.014, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.4%
(APIServer pid=1) INFO:     192.168.1.108:39404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:10:33 [loggers.py:271] Engine 000: Avg prompt throughput: 142.6 tokens/s, Avg generation throughput: 148.9 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.8%, Prefix cache hit rate: 41.9%
(APIServer pid=1) INFO 05-06 06:10:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.50 tokens/s, Drafted throughput: 593.19 tokens/s, Accepted: 5 tokens, Drafted: 5932 tokens, Per-position acceptance rate: 0.003, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.1%
(APIServer pid=1) INFO:     192.168.1.108:39404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:10:43 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 126.1 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.0%, Prefix cache hit rate: 41.9%
(APIServer pid=1) INFO 05-06 06:10:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.10 tokens/s, Drafted throughput: 503.95 tokens/s, Accepted: 1 tokens, Drafted: 5040 tokens, Per-position acceptance rate: 0.001, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=1) INFO 05-06 06:10:53 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 117.1 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.0%, Prefix cache hit rate: 41.9%
(APIServer pid=1) INFO 05-06 06:10:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.10 tokens/s, Drafted throughput: 467.96 tokens/s, Accepted: 1 tokens, Drafted: 4680 tokens, Per-position acceptance rate: 0.001, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%

(APIServer pid=1) INFO 05-06 06:11:03 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 116.1 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.6%, Prefix cache hit rate: 41.9%
(APIServer pid=1) INFO 05-06 06:11:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 464.38 tokens/s, Accepted: 0 tokens, Drafted: 4644 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=1) INFO:     192.168.1.108:39396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:11:13 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.4 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 59.1%, Prefix cache hit rate: 27.4%
(APIServer pid=1) INFO 05-06 06:11:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.10 tokens/s, Drafted throughput: 233.20 tokens/s, Accepted: 1 tokens, Drafted: 2332 tokens, Per-position acceptance rate: 0.002, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=1) INFO 05-06 06:11:23 [loggers.py:271] Engine 000: Avg prompt throughput: 1436.3 tokens/s, Avg generation throughput: 93.9 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 36.1%, Prefix cache hit rate: 27.4%
(APIServer pid=1) INFO 05-06 06:11:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 2.50 tokens/s, Drafted throughput: 365.14 tokens/s, Accepted: 25 tokens, Drafted: 3652 tokens, Per-position acceptance rate: 0.027, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.7%
(APIServer pid=1) INFO:     192.168.1.108:39394 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:11:33 [loggers.py:271] Engine 000: Avg prompt throughput: 1401.3 tokens/s, Avg generation throughput: 44.7 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.5%, Prefix cache hit rate: 20.5%
(APIServer pid=1) INFO 05-06 06:11:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 1.50 tokens/s, Drafted throughput: 172.41 tokens/s, Accepted: 15 tokens, Drafted: 1724 tokens, Per-position acceptance rate: 0.035, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.9%
(APIServer pid=1) INFO 05-06 06:11:43 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 102.6 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 42.2%, Prefix cache hit rate: 20.5%
(APIServer pid=1) INFO 05-06 06:11:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 3.30 tokens/s, Drafted throughput: 397.18 tokens/s, Accepted: 33 tokens, Drafted: 3972 tokens, Per-position acceptance rate: 0.033, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.8%
(APIServer pid=1) INFO:     192.168.1.108:39400 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:11:53 [loggers.py:271] Engine 000: Avg prompt throughput: 1437.5 tokens/s, Avg generation throughput: 42.5 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 46.4%, Prefix cache hit rate: 16.3%
(APIServer pid=1) INFO 05-06 06:11:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 1.40 tokens/s, Drafted throughput: 163.99 tokens/s, Accepted: 14 tokens, Drafted: 1640 tokens, Per-position acceptance rate: 0.034, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.9%
(APIServer pid=1) INFO 05-06 06:12:03 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 101.2 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 47.0%, Prefix cache hit rate: 16.3%
(APIServer pid=1) INFO 05-06 06:12:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 3.10 tokens/s, Drafted throughput: 392.34 tokens/s, Accepted: 31 tokens, Drafted: 3924 tokens, Per-position acceptance rate: 0.032, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.8%
(APIServer pid=1) INFO 05-06 06:12:13 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 98.8 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 47.5%, Prefix cache hit rate: 16.3%
(APIServer pid=1) INFO 05-06 06:12:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.02, Accepted throughput: 1.60 tokens/s, Drafted throughput: 388.79 tokens/s, Accepted: 16 tokens, Drafted: 3888 tokens, Per-position acceptance rate: 0.016, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.4%
(APIServer pid=1) INFO 05-06 06:12:23 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 97.6 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.0%, Prefix cache hit rate: 16.3%
(APIServer pid=1) INFO 05-06 06:12:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.01, Accepted throughput: 0.70 tokens/s, Drafted throughput: 387.55 tokens/s, Accepted: 7 tokens, Drafted: 3876 tokens, Per-position acceptance rate: 0.007, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.2%
(APIServer pid=1) INFO:     192.168.1.108:39394 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Your current environment

GPU: RTX 5090 (32 GB)

Model: cyankiwi/gemma-4-31B-it-AWQ-4bit

docker run -itd --name gemma4 \
  --ipc=host \
  --network host \
  --shm-size 16G \
  --gpus '"device=0"' \
  -v /home/ma/work/models/gemma-4-31B-it-AWQ-4bit:/model \
  -v /home/ma/.cache/huggingface:/root/.cache/huggingface \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_ENDPOINT=https://hf-mirror.com/ \
  vllm/vllm-openai:gemma4-0505-cu130 \
  --model /model \
  --served-model-name gpt \
  --tensor-parallel-size 1 \
  --max-num-seqs 32 \
  --max-model-len 65536 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --gpu-memory-utilization 0.93 \
  --async-scheduling \
  --performance-mode throughput \
  --enable-chunked-prefill \
  --host 0.0.0.0 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --limit-mm-per-prompt '{"video":0,"image":0,"audio":0}' \
  --port 8000 \
  --speculative-config '{"model": "google/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'

🐛 Describe the bug

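Avg Draft acceptance rate: 0.2% with Gemma4 MTP speculative decoding (num_speculative_tokens = 4). In the SpecDecoding metrics lines, only the first draft position is ever accepted, and rarely; positions 2 to 4 stay at 0.000.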
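
The reported percentage can be cross-checked against the raw counters in a SpecDecoding metrics line. The sketch below parses one such line (copied from the log in this report) and recomputes the derived figures. The formulas are an assumption about how the logged values relate (acceptance rate = accepted / drafted, mean acceptance length = 1 + accepted / number of draft rounds); they reproduce the logged numbers here but are not taken from the vLLM source. The helper names `parse_spec_metrics` and `summarize` are hypothetical.

```python
import re

# One SpecDecoding line copied from the log in this report.
LOG_LINE = (
    "SpecDecoding metrics: Mean acceptance length: 1.02, "
    "Accepted throughput: 0.31 tokens/s, Drafted throughput: 55.47 tokens/s, "
    "Accepted: 29 tokens, Drafted: 5184 tokens, "
    "Per-position acceptance rate: 0.022, 0.000, 0.000, 0.000, "
    "Avg Draft acceptance rate: 0.6%"
)

NUM_SPECULATIVE_TOKENS = 4  # from --speculative-config in this report


def parse_spec_metrics(line: str) -> dict:
    """Extract the raw counters and per-position rates from one log line."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", line).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", line).group(1))
    rates_text = re.search(
        r"Per-position acceptance rate: ([\d., ]+?), Avg", line
    ).group(1)
    per_position = [float(x) for x in rates_text.split(", ")]
    return {"accepted": accepted, "drafted": drafted, "per_position": per_position}


def summarize(metrics: dict, num_spec_tokens: int) -> dict:
    """Recompute derived figures (assumed formulas, see the note above)."""
    # Each draft round proposes num_spec_tokens tokens.
    num_drafts = metrics["drafted"] / num_spec_tokens
    return {
        "acceptance_rate_pct": 100.0 * metrics["accepted"] / metrics["drafted"],
        "mean_acceptance_length": 1.0 + metrics["accepted"] / num_drafts,
    }


if __name__ == "__main__":
    parsed = parse_spec_metrics(LOG_LINE)
    summary = summarize(parsed, NUM_SPECULATIVE_TOKENS)
    print(parsed)
    print(summary)
```

A mean acceptance length this close to 1 means almost every decoding step emits only the token the target model would have produced anyway, so the drafter adds overhead without speeding up generation.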

WARNING 05-06 06:05:03 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.2rc1.dev49+g9b4e83934
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]   █▄█▀ █     █     █     █  model   /model
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:299]
(APIServer pid=1) INFO 05-06 06:05:03 [utils.py:233] non-default args: {'model_tag': '/model', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'host': '0.0.0.0' , 'model': '/model', 'max_model_len': 65536, 'served_model_name': ['gpt'], 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.93, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'video': 0, 'image': 0, 'audio': 0}, 'max_num_seqs': 32, 'enable_chunked_prefill': True, 'async_scheduling': True, 'speculative_config': {'model': 'google/gemma-4-31B-it-assistant', 'num_speculative_tokens': 4}, 'performance_mode': 'throughput'}
(APIServer pid=1) WARNING 05-06 06:05:03 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 05-06 06:05:03 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 05-06 06:05:03 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 05-06 06:05:03 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) INFO 05-06 06:05:14 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1) INFO 05-06 06:05:14 [nixl_utils.py:32] NIXL is available
(APIServer pid=1) INFO 05-06 06:05:14 [model.py:563] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=1) INFO 05-06 06:05:14 [model.py:1692] Using max model len 65536
(APIServer pid=1) INFO 05-06 06:05:14 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 05-06 06:05:23 [model.py:563] Resolved architecture: Gemma4MTPModel
(APIServer pid=1) INFO 05-06 06:05:23 [model.py:1692] Using max model len 262144
(APIServer pid=1) WARNING 05-06 06:05:23 [speculative.py:671] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 05-06 06:05:23 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) INFO 05-06 06:05:23 [vllm.py:723] Performance mode set to 'throughput'.
(APIServer pid=1) INFO 05-06 06:05:23 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=1) INFO 05-06 06:05:23 [vllm.py:844] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-06 06:05:23 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) WARNING 05-06 06:05:23 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=1) WARNING 05-06 06:05:23 [cuda.py:233] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(APIServer pid=1) INFO 05-06 06:05:26 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
INFO 05-06 06:05:38 [nixl_utils.py:32] NIXL is available
(EngineCore pid=129) INFO 05-06 06:05:38 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev49+g9b4e83934) with config: model='/model', speculative_config=SpeculativeConfig(method='mtp', model='google/gemma-4-31B-it-assistant', num_spec_tokens=4), tokenizer='/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=gpt, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 
'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 320, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=129) INFO 05-06 06:05:41 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=129) INFO 05-06 06:05:41 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.109:60739 backend=nccl
(EngineCore pid=129) INFO 05-06 06:05:41 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=129) INFO 05-06 06:05:42 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=129) WARNING 05-06 06:05:42 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=129) INFO 05-06 06:05:42 [gpu_model_runner.py:4828] Starting to load model /model...
(EngineCore pid=129) INFO 05-06 06:05:42 [vllm.py:723] Performance mode set to 'throughput'.
(EngineCore pid=129) INFO 05-06 06:05:42 [vllm.py:844] Asynchronous scheduling is enabled.
(EngineCore pid=129) INFO 05-06 06:05:42 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(EngineCore pid=129) WARNING 05-06 06:05:42 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(EngineCore pid=129) INFO 05-06 06:05:42 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(EngineCore pid=129) INFO 05-06 06:05:42 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=129) INFO 05-06 06:05:42 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=129) INFO 05-06 06:05:43 [weight_utils.py:904] Filesystem type for checkpoints: BTRFS. Checkpoint size: 19.04 GiB. Available RAM: 27.04 GiB.
(EngineCore pid=129) INFO 05-06 06:05:43 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (BTRFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:07,  2.47s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:05,  2.63s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:07<00:02,  2.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.57s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.58s/it]
(EngineCore pid=129)
(EngineCore pid=129) INFO 05-06 06:05:53 [default_loader.py:391] Loading weights took 10.40 seconds
(EngineCore pid=129) INFO 05-06 06:05:54 [gpu_model_runner.py:4852] Loading drafter model...
(EngineCore pid=129) INFO 05-06 06:05:54 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(EngineCore pid=129) INFO 05-06 06:05:54 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(EngineCore pid=129) INFO 05-06 06:05:54 [weight_utils.py:659] No model.safetensors.index.json found in remote.
(EngineCore pid=129) INFO 05-06 06:05:54 [weight_utils.py:904] Filesystem type for checkpoints: BTRFS. Checkpoint size: 0.87 GiB. Available RAM: 27.03 GiB.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.78it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.78it/s]
(EngineCore pid=129)
(EngineCore pid=129) INFO 05-06 06:05:55 [default_loader.py:391] Loading weights took 0.37 seconds
(EngineCore pid=129) INFO 05-06 06:05:55 [llm_base_proposer.py:1486] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:171] Gemma4 MTP: keeping draft model's own lm_head (draft_dim != backbone_dim).
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:330] Gemma4 MTP: draft layer 0 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:330] Gemma4 MTP: draft layer 1 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:330] Gemma4 MTP: draft layer 2 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=129) INFO 05-06 06:05:55 [gemma4.py:330] Gemma4 MTP: draft layer 3 (full_attention) -> language_model.model.layers.59.self_attn.attn
(EngineCore pid=129) INFO 05-06 06:05:55 [gpu_model_runner.py:4930] Model loading took 19.33 GiB memory and 12.725887 seconds
(EngineCore pid=129) INFO 05-06 06:06:16 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/3b56cf98a7/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=129) INFO 05-06 06:06:16 [backends.py:1148] Dynamo bytecode transform time: 20.55 s
(EngineCore pid=129) INFO 05-06 06:06:34 [backends.py:378] Cache the graph of compile range (1, 4096) for later use
(EngineCore pid=129) INFO 05-06 06:07:06 [backends.py:393] Compiling a graph for compile range (1, 4096) takes 48.36 s
(EngineCore pid=129) INFO 05-06 06:07:17 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/16e2c5ceda6327de3d24e565ce7c36d843ca9ff0868b5cfb12568b6ecdc2ad7b/rank_0_0/model
(EngineCore pid=129) INFO 05-06 06:07:17 [monitor.py:53] torch.compile took 81.85 s in total
(EngineCore pid=129) INFO 05-06 06:07:49 [monitor.py:81] Initial profiling/warmup run took 31.65 s
(EngineCore pid=129) INFO 05-06 06:07:51 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/3b56cf98a7/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=129) INFO 05-06 06:07:51 [backends.py:1148] Dynamo bytecode transform time: 1.48 s
(EngineCore pid=129) INFO 05-06 06:07:59 [backends.py:393] Compiling a graph for compile range (1, 4096) takes 8.49 s
(EngineCore pid=129) INFO 05-06 06:08:00 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/693b2504aba733248032e0e2aa5342b64e1055d0109c878d1a356a5cbd6ffc64/rank_0_0/model
(EngineCore pid=129) INFO 05-06 06:08:00 [monitor.py:53] torch.compile took 10.66 s in total
(EngineCore pid=129) INFO 05-06 06:08:00 [monitor.py:81] Initial profiling/warmup run took 0.17 s
(EngineCore pid=129) INFO 05-06 06:08:12 [gpu_model_runner.py:6034] Profiling CUDA graph memory: PIECEWISE=37 (largest=320), FULL=21 (largest=160)
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_model_runner.py:6113] Estimated CUDA graph memory: 0.50 GiB total
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_worker.py:460] Available KV cache memory: 6.26 GiB
(EngineCore pid=129) INFO 05-06 06:08:19 [gpu_worker.py:475] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9300 is equivalent to --gpu-memory-utilization=0.9141 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9459. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=129) INFO 05-06 06:08:19 [kv_cache_utils.py:1710] GPU KV cache size: 91,962 tokens
(EngineCore pid=129) INFO 05-06 06:08:19 [kv_cache_utils.py:1711] Maximum concurrency for 65,536 tokens per request: 1.40x
(EngineCore pid=129) INFO 05-06 06:08:19 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [00:04<00:00,  8.39it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:03<00:00,  5.48it/s]
(EngineCore pid=129) INFO 05-06 06:08:29 [gpu_model_runner.py:6204] Graph capturing finished in 9 secs, took 0.46 GiB
(EngineCore pid=129) INFO 05-06 06:08:29 [gpu_worker.py:619] CUDA graph pool memory: 0.46 GiB (actual), 0.5 GiB (estimated), difference: 0.04 GiB (8.0%).
(EngineCore pid=129) INFO 05-06 06:08:29 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=129) INFO 05-06 06:08:29 [core.py:299] init engine (profile, create kv cache, warmup model) took 153.19 s (compilation: 92.51 s)
(EngineCore pid=129) INFO 05-06 06:08:29 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) INFO 05-06 06:08:29 [api_server.py:613] Supported tasks: ['generate']
(APIServer pid=1) INFO 05-06 06:08:30 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 05-06 06:08:30 [model.py:1449] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 05-06 06:08:32 [hf.py:483] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 05-06 06:08:32 [api_server.py:617] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 05-06 06:08:32 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.



(EngineCore pid=129) WARNING 05-06 06:09:38 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:39 [jit_monitor.py:103] Triton kernel JIT compilation during inference: eagle_prepare_next_token_padded_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:40 [jit_monitor.py:103] Triton kernel JIT compilation during inference: kernel_unified_attention. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:40 [jit_monitor.py:103] Triton kernel JIT compilation during inference: reduce_segments. This causes a latency spike; consider extending warmup to cover this shape/config.


(EngineCore pid=129) WARNING 05-06 06:09:48 [jit_monitor.py:103] Triton kernel JIT compilation during inference: expand_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:56 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _topk_topp_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=129) WARNING 05-06 06:09:56 [jit_monitor.py:103] Triton kernel JIT compilation during inference: eagle_prepare_inputs_padded_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=1) INFO 05-06 06:10:03 [loggers.py:271] Engine 000: Avg prompt throughput: 1245.6 tokens/s, Avg generation throughput: 133.0 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 39.7%, Prefix cache hit rate: 35.3%
(APIServer pid=1) INFO 05-06 06:10:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.02, Accepted throughput: 0.31 tokens/s, Drafted throughput: 55.47 tokens/s, Accepted: 29 tokens, Drafted: 5184 tokens, Per-position acceptance rate: 0.022, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.6%
(APIServer pid=1) INFO:     192.168.1.108:39410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.1.108:39404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.1.108:39410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:10:13 [loggers.py:271] Engine 000: Avg prompt throughput: 162.6 tokens/s, Avg generation throughput: 187.8 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.3%, Prefix cache hit rate: 39.2%
(APIServer pid=1) INFO 05-06 06:10:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.01, Accepted throughput: 2.30 tokens/s, Drafted throughput: 740.73 tokens/s, Accepted: 23 tokens, Drafted: 7408 tokens, Per-position acceptance rate: 0.012, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.3%
(APIServer pid=1) INFO:     192.168.1.108:39404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.1.108:39410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:10:23 [loggers.py:271] Engine 000: Avg prompt throughput: 28.7 tokens/s, Avg generation throughput: 187.1 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.7%, Prefix cache hit rate: 44.2%
(APIServer pid=1) INFO 05-06 06:10:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.01, Accepted throughput: 2.60 tokens/s, Drafted throughput: 737.56 tokens/s, Accepted: 26 tokens, Drafted: 7376 tokens, Per-position acceptance rate: 0.014, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.4%
(APIServer pid=1) INFO:     192.168.1.108:39404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:10:33 [loggers.py:271] Engine 000: Avg prompt throughput: 142.6 tokens/s, Avg generation throughput: 148.9 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 34.8%, Prefix cache hit rate: 41.9%
(APIServer pid=1) INFO 05-06 06:10:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.50 tokens/s, Drafted throughput: 593.19 tokens/s, Accepted: 5 tokens, Drafted: 5932 tokens, Per-position acceptance rate: 0.003, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.1%
(APIServer pid=1) INFO:     192.168.1.108:39404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:10:43 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 126.1 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.0%, Prefix cache hit rate: 41.9%
(APIServer pid=1) INFO 05-06 06:10:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.10 tokens/s, Drafted throughput: 503.95 tokens/s, Accepted: 1 tokens, Drafted: 5040 tokens, Per-position acceptance rate: 0.001, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=1) INFO 05-06 06:10:53 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 117.1 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.0%, Prefix cache hit rate: 41.9%
(APIServer pid=1) INFO 05-06 06:10:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.10 tokens/s, Drafted throughput: 467.96 tokens/s, Accepted: 1 tokens, Drafted: 4680 tokens, Per-position acceptance rate: 0.001, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%

(APIServer pid=1) INFO 05-06 06:11:03 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 116.1 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.6%, Prefix cache hit rate: 41.9%
(APIServer pid=1) INFO 05-06 06:11:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 464.38 tokens/s, Accepted: 0 tokens, Drafted: 4644 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=1) INFO:     192.168.1.108:39396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:11:13 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.4 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 59.1%, Prefix cache hit rate: 27.4%
(APIServer pid=1) INFO 05-06 06:11:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.10 tokens/s, Drafted throughput: 233.20 tokens/s, Accepted: 1 tokens, Drafted: 2332 tokens, Per-position acceptance rate: 0.002, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=1) INFO 05-06 06:11:23 [loggers.py:271] Engine 000: Avg prompt throughput: 1436.3 tokens/s, Avg generation throughput: 93.9 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 36.1%, Prefix cache hit rate: 27.4%
(APIServer pid=1) INFO 05-06 06:11:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 2.50 tokens/s, Drafted throughput: 365.14 tokens/s, Accepted: 25 tokens, Drafted: 3652 tokens, Per-position acceptance rate: 0.027, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.7%
(APIServer pid=1) INFO:     192.168.1.108:39394 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:11:33 [loggers.py:271] Engine 000: Avg prompt throughput: 1401.3 tokens/s, Avg generation throughput: 44.7 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.5%, Prefix cache hit rate: 20.5%
(APIServer pid=1) INFO 05-06 06:11:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 1.50 tokens/s, Drafted throughput: 172.41 tokens/s, Accepted: 15 tokens, Drafted: 1724 tokens, Per-position acceptance rate: 0.035, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.9%
(APIServer pid=1) INFO 05-06 06:11:43 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 102.6 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 42.2%, Prefix cache hit rate: 20.5%
(APIServer pid=1) INFO 05-06 06:11:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 3.30 tokens/s, Drafted throughput: 397.18 tokens/s, Accepted: 33 tokens, Drafted: 3972 tokens, Per-position acceptance rate: 0.033, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.8%
(APIServer pid=1) INFO:     192.168.1.108:39400 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-06 06:11:53 [loggers.py:271] Engine 000: Avg prompt throughput: 1437.5 tokens/s, Avg generation throughput: 42.5 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 46.4%, Prefix cache hit rate: 16.3%
(APIServer pid=1) INFO 05-06 06:11:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 1.40 tokens/s, Drafted throughput: 163.99 tokens/s, Accepted: 14 tokens, Drafted: 1640 tokens, Per-position acceptance rate: 0.034, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.9%
(APIServer pid=1) INFO 05-06 06:12:03 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 101.2 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 47.0%, Prefix cache hit rate: 16.3%
(APIServer pid=1) INFO 05-06 06:12:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.03, Accepted throughput: 3.10 tokens/s, Drafted throughput: 392.34 tokens/s, Accepted: 31 tokens, Drafted: 3924 tokens, Per-position acceptance rate: 0.032, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.8%
(APIServer pid=1) INFO 05-06 06:12:13 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 98.8 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 47.5%, Prefix cache hit rate: 16.3%
(APIServer pid=1) INFO 05-06 06:12:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.02, Accepted throughput: 1.60 tokens/s, Drafted throughput: 388.79 tokens/s, Accepted: 16 tokens, Drafted: 3888 tokens, Per-position acceptance rate: 0.016, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.4%
(APIServer pid=1) INFO 05-06 06:12:23 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 97.6 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.0%, Prefix cache hit rate: 16.3%
(APIServer pid=1) INFO 05-06 06:12:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.01, Accepted throughput: 0.70 tokens/s, Drafted throughput: 387.55 tokens/s, Accepted: 7 tokens, Drafted: 3876 tokens, Per-position acceptance rate: 0.007, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.2%
(APIServer pid=1) INFO:     192.168.1.108:39394 - "POST /v1/chat/completions HTTP/1.1" 200 OK
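
Each `SpecDecoding metrics` line above covers only one logging interval, so any single line can be noisy. Below is a minimal sketch, assuming the log format shown above, that sums `Accepted` and `Drafted` across intervals and recomputes the overall draft acceptance rate. The `summarize` helper and its regex are hypothetical and not part of vLLM. The mean acceptance length formula, 1 + accepted / (drafted / num_speculative_tokens), is an assumption inferred from the logged numbers (29 accepted of 5184 drafted with 4 speculative tokens gives 1.02), not taken from vLLM's source.

```python
import re

# Matches the token counters in a "SpecDecoding metrics" log line.
# The pattern is an assumption based on the log format shown in this issue.
METRICS_RE = re.compile(
    r"Accepted: (?P<accepted>\d+) tokens, Drafted: (?P<drafted>\d+) tokens"
)


def summarize(log_text: str, num_speculative_tokens: int = 4) -> dict:
    """Aggregate accepted/drafted token counts over all metrics lines."""
    accepted = 0
    drafted = 0
    for match in METRICS_RE.finditer(log_text):
        accepted += int(match.group("accepted"))
        drafted += int(match.group("drafted"))
    if drafted == 0:
        raise ValueError("no SpecDecoding metrics lines found")
    # Assumption: every draft step proposes exactly num_speculative_tokens tokens.
    num_drafts = drafted / num_speculative_tokens
    return {
        "accepted": accepted,
        "drafted": drafted,
        "acceptance_rate_pct": 100.0 * accepted / drafted,
        "mean_acceptance_length": 1.0 + accepted / num_drafts,
    }


# Two metrics lines copied from the log above (06:10:03 and 06:10:13).
sample = """\
SpecDecoding metrics: Mean acceptance length: 1.02, Accepted throughput: 0.31 tokens/s, Drafted throughput: 55.47 tokens/s, Accepted: 29 tokens, Drafted: 5184 tokens, Per-position acceptance rate: 0.022, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.6%
SpecDecoding metrics: Mean acceptance length: 1.01, Accepted throughput: 2.30 tokens/s, Drafted throughput: 740.73 tokens/s, Accepted: 23 tokens, Drafted: 7408 tokens, Per-position acceptance rate: 0.012, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.3%
"""

summary = summarize(sample)
print(f"accepted={summary['accepted']} drafted={summary['drafted']}")
print(f"acceptance rate: {summary['acceptance_rate_pct']:.2f}%")
print(f"mean acceptance length: {summary['mean_acceptance_length']:.2f}")
```

Piping the full server log through `summarize` gives one figure for the whole run, which is easier to compare across configurations (for example, different `num_speculative_tokens` values) than the per-interval lines.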

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
