vllm - [Performance]: MTP seems to be very slow

Root Cause

In the benchmark results, Experiment 1 (no MTP) beats Experiment 2 (MTP) on every metric listed except mean TPOT:

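| Metric | Exp 1 (No MTP) | Exp 2 (MTP) | Winner |
|---|---|---|---|
| Benchmark duration | 393 s | 464 s | Exp 1 |
| Output tok/s | 1042 | 881 | Exp 1 |
| Total tok/s | 5214 | 4410 | Exp 1 |
| Mean TTFT | 153 s | 213 s | Exp 1 |
| Mean TPOT | 17.41 ms | 9.38 ms | Exp 2 |
| Peak output tok/s | 1659 | 458 | Exp 1 |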
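
To make the gaps in the table concrete, the relative change from Exp 1 to Exp 2 can be computed directly from the listed values. This is a minimal sketch: the `metrics` dictionary and the `relative_change` helper are illustrative names, not part of vLLM, and the numbers are the rounded ones from the table.

```python
# Values copied from the comparison table (Exp 1 = no MTP, Exp 2 = MTP).
metrics = {
    "Benchmark duration (s)": (393, 464),
    "Output tok/s": (1042, 881),
    "Total tok/s": (5214, 4410),
    "Mean TTFT (s)": (153, 213),
    "Mean TPOT (ms)": (17.41, 9.38),
    "Peak output tok/s": (1659, 458),
}


def relative_change(baseline, candidate):
    """Percent change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# Hypothetical helper output: one signed percentage per metric.
changes = {name: relative_change(a, b) for name, (a, b) in metrics.items()}
for name, pct in changes.items():
    print(f"{name:<24} {pct:+6.1f}%")
```

With these inputs, MTP cuts mean TPOT by roughly 46% but lowers output throughput by roughly 15% and lengthens the run by roughly 18%.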

Code Example

model: RedHatAI/Qwen3.6-35B-A3B-NVFP4
dtype: bfloat16
kv-cache-dtype: fp8
gpu-memory-utilization: 0.95
max-model-len: 262144
max-num-batched-tokens: 4096
max-num-seqs: 200
max-cudagraph-capture-size: 209
enable-prefix-caching: true
reasoning-parser: qwen3
trust-remote-code: true
enable-auto-tool-choice: true
tool-call-parser: qwen3_coder
default-chat-template-kwargs: '{"enable_thinking": false}'
#speculative-config: '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' # with or without this line
download-dir: /workspace/models
host: 0.0.0.0
port: 18000

---

vllm bench serve --base-url "http://0.0.0.0:18000" --backend openai-chat --endpoint "/v1/chat/completions" --model "RedHatAI/Qwen3.6-35B-A3B-NVFP4" --dataset-name random --random-input-len 16384 --random-output-len 4096 --num-prompts 100 --request-rate 20

---

(APIServer pid=3646) INFO 05-19 10:10:01 [loggers.py:271] Engine 000: Avg prompt throughput: 26227.4 tokens/s, Avg generation throughput: 107.0 tokens/s, Running: 19 reqs, Waiting: 81 reqs, GPU KV cache usage: 84.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:11 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.9 tokens/s, Avg generation throughput: 930.2 tokens/s, Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1533.6 tokens/s, Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1511.5 tokens/s, Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1130.3 tokens/s, Running: 18 reqs, Waiting: 82 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1371.2 tokens/s, Running: 18 reqs, Waiting: 82 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1207.7 tokens/s, Running: 18 reqs, Waiting: 80 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:11 [loggers.py:271] Engine 000: Avg prompt throughput: 21313.2 tokens/s, Avg generation throughput: 171.2 tokens/s, Running: 18 reqs, Waiting: 65 reqs, GPU KV cache usage: 82.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:21 [loggers.py:271] Engine 000: Avg prompt throughput: 8196.1 tokens/s, Avg generation throughput: 837.2 tokens/s, Running: 19 reqs, Waiting: 63 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1458.8 tokens/s, Running: 19 reqs, Waiting: 63 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:41 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.0 tokens/s, Avg generation throughput: 1274.1 tokens/s, Running: 20 reqs, Waiting: 60 reqs, GPU KV cache usage: 98.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1249.5 tokens/s, Running: 19 reqs, Waiting: 61 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1258.3 tokens/s, Running: 18 reqs, Waiting: 62 reqs, GPU KV cache usage: 95.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1260.4 tokens/s, Running: 18 reqs, Waiting: 61 reqs, GPU KV cache usage: 95.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:21 [loggers.py:271] Engine 000: Avg prompt throughput: 16394.9 tokens/s, Avg generation throughput: 408.8 tokens/s, Running: 17 reqs, Waiting: 49 reqs, GPU KV cache usage: 81.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:31 [loggers.py:271] Engine 000: Avg prompt throughput: 13115.9 tokens/s, Avg generation throughput: 741.6 tokens/s, Running: 21 reqs, Waiting: 42 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1357.4 tokens/s, Running: 19 reqs, Waiting: 44 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1279.0 tokens/s, Running: 19 reqs, Waiting: 43 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1352.3 tokens/s, Running: 20 reqs, Waiting: 42 reqs, GPU KV cache usage: 96.7%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1055.3 tokens/s, Running: 18 reqs, Waiting: 43 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1226.3 tokens/s, Running: 18 reqs, Waiting: 42 reqs, GPU KV cache usage: 95.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:31 [loggers.py:271] Engine 000: Avg prompt throughput: 12068.8 tokens/s, Avg generation throughput: 734.9 tokens/s, Running: 18 reqs, Waiting: 33 reqs, GPU KV cache usage: 87.7%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:41 [loggers.py:271] Engine 000: Avg prompt throughput: 18032.5 tokens/s, Avg generation throughput: 589.8 tokens/s, Running: 21 reqs, Waiting: 23 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1242.7 tokens/s, Running: 20 reqs, Waiting: 23 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1517.5 tokens/s, Running: 20 reqs, Waiting: 23 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1491.6 tokens/s, Running: 19 reqs, Waiting: 24 reqs, GPU KV cache usage: 95.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 878.6 tokens/s, Running: 18 reqs, Waiting: 24 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1381.9 tokens/s, Running: 18 reqs, Waiting: 24 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:41 [loggers.py:271] Engine 000: Avg prompt throughput: 9838.1 tokens/s, Avg generation throughput: 764.5 tokens/s, Running: 18 reqs, Waiting: 16 reqs, GPU KV cache usage: 88.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:51 [loggers.py:271] Engine 000: Avg prompt throughput: 22953.9 tokens/s, Avg generation throughput: 241.6 tokens/s, Running: 22 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1350.0 tokens/s, Running: 20 reqs, Waiting: 4 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1503.7 tokens/s, Running: 20 reqs, Waiting: 4 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1400.9 tokens/s, Running: 20 reqs, Waiting: 3 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1108.9 tokens/s, Running: 18 reqs, Waiting: 5 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1371.1 tokens/s, Running: 18 reqs, Waiting: 5 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:51 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.3 tokens/s, Avg generation throughput: 933.7 tokens/s, Running: 13 reqs, Waiting: 0 reqs, GPU KV cache usage: 66.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:16:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 619.8 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:16:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 577.1 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 21.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:16:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 486.2 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO:     127.0.0.1:34394 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=3646) INFO 05-19 10:16:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:16:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
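
Reading these periodic stats lines by eye is tedious, so a small parser can help when comparing runs. The following is a rough sketch that assumes the `loggers.py` line format shown above stays as-is; the regular expression and the `summarize` function are illustrative and not a vLLM API.

```python
import re

# Matches the periodic engine stats lines in the format shown in the log above.
STATS_RE = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)


def summarize(log_text):
    """Average generation throughput, running requests, and KV usage
    over all stats lines found in log_text. Returns None if none match."""
    rows = [m.groupdict() for m in STATS_RE.finditer(log_text)]
    if not rows:
        return None
    n = len(rows)
    return {
        "samples": n,
        "mean_gen_tok_s": sum(float(r["gen"]) for r in rows) / n,
        "mean_running": sum(int(r["running"]) for r in rows) / n,
        "mean_kv_pct": sum(float(r["kv"]) for r in rows) / n,
    }


# Two lines copied verbatim from the no-MTP log above.
sample = (
    "(APIServer pid=3646) INFO 05-19 10:10:21 [loggers.py:271] Engine 000: "
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1533.6 tokens/s, "
    "Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%\n"
    "(APIServer pid=3646) INFO 05-19 10:10:31 [loggers.py:271] Engine 000: "
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1511.5 tokens/s, "
    "Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%\n"
)

summary = summarize(sample)
print(summary)
```

Feeding each run's full log through `summarize` makes the `Running` counts easy to compare: this log mostly shows 18 to 20 running requests, while the MTP log shows 9 at similar KV cache usage.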

---

Maximum request concurrency: None
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [06:33<00:00,  3.93s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Request rate configured (RPS):           20.00     
Benchmark duration (s):                  393.02    
Total input tokens:                      1639637   
Total generated tokens:                  409600    
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         1042.18   
Peak output token throughput (tok/s):    1659.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          5214.04   
---------------Time to First Token----------------
Mean TTFT (ms):                          153886.28 
Median TTFT (ms):                        149339.37 
P99 TTFT (ms):                           356431.02 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.41     
Median TPOT (ms):                        17.08     
P99 TPOT (ms):                           26.87     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.41     
Median ITL (ms):                         13.17     
P99 ITL (ms):                            93.00     
==================================================
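
As a quick sanity check, the throughput figures in this result block can be reproduced from the totals, assuming each throughput is simply a token or request count divided by the wall-clock benchmark duration. The variable names below are illustrative; small differences from the reported values are expected because the duration is printed rounded to two decimals.

```python
# Figures copied from the "Serving Benchmark Result" block above.
duration_s = 393.02
total_input_tokens = 1_639_637
total_generated_tokens = 409_600
successful_requests = 100

# Assumed definitions: count divided by wall-clock duration.
output_tok_s = total_generated_tokens / duration_s
total_tok_s = (total_input_tokens + total_generated_tokens) / duration_s
req_s = successful_requests / duration_s

print(f"output tok/s: {output_tok_s:.2f}")  # benchmark reported 1042.18
print(f"total tok/s:  {total_tok_s:.2f}")   # benchmark reported 5214.04
print(f"req/s:        {req_s:.2f}")         # benchmark reported 0.25
```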

---

(APIServer pid=5020) INFO 05-19 10:24:13 [loggers.py:271] Engine 000: Avg prompt throughput: 6558.4 tokens/s, Avg generation throughput: 12.3 tokens/s, Running: 5 reqs, Waiting: 95 reqs, GPU KV cache usage: 50.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.92, Accepted throughput: 0.46 tokens/s, Drafted throughput: 1.00 tokens/s, Accepted: 57 tokens, Drafted: 124 tokens, Per-position acceptance rate: 0.613, 0.306, Avg Draft acceptance rate: 46.0%
(EngineCore pid=5469) WARNING 05-19 10:24:21 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _topk_topp_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=5020) INFO 05-19 10:24:23 [loggers.py:271] Engine 000: Avg prompt throughput: 8196.9 tokens/s, Avg generation throughput: 84.2 tokens/s, Running: 9 reqs, Waiting: 91 reqs, GPU KV cache usage: 89.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.16, Accepted throughput: 44.90 tokens/s, Drafted throughput: 77.59 tokens/s, Accepted: 449 tokens, Drafted: 776 tokens, Per-position acceptance rate: 0.698, 0.459, Avg Draft acceptance rate: 57.9%
(APIServer pid=5020) INFO 05-19 10:24:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1108.0 tokens/s, Running: 9 reqs, Waiting: 91 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.49, Accepted throughput: 662.59 tokens/s, Drafted throughput: 890.85 tokens/s, Accepted: 6627 tokens, Drafted: 8910 tokens, Per-position acceptance rate: 0.819, 0.668, Avg Draft acceptance rate: 74.4%
(APIServer pid=5020) INFO 05-19 10:24:43 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1139.0 tokens/s, Running: 9 reqs, Waiting: 91 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.60, Accepted throughput: 700.86 tokens/s, Drafted throughput: 876.42 tokens/s, Accepted: 7010 tokens, Drafted: 8766 tokens, Per-position acceptance rate: 0.862, 0.737, Avg Draft acceptance rate: 80.0%
(APIServer pid=5020) INFO 05-19 10:24:53 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 933.0 tokens/s, Running: 9 reqs, Waiting: 90 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.74, Accepted throughput: 591.89 tokens/s, Drafted throughput: 682.19 tokens/s, Accepted: 5919 tokens, Drafted: 6822 tokens, Per-position acceptance rate: 0.916, 0.820, Avg Draft acceptance rate: 86.8%
(APIServer pid=5020) INFO 05-19 10:25:03 [loggers.py:271] Engine 000: Avg prompt throughput: 8558.6 tokens/s, Avg generation throughput: 620.5 tokens/s, Running: 9 reqs, Waiting: 85 reqs, GPU KV cache usage: 93.0%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.47, Accepted throughput: 369.51 tokens/s, Drafted throughput: 501.68 tokens/s, Accepted: 3696 tokens, Drafted: 5018 tokens, Per-position acceptance rate: 0.824, 0.649, Avg Draft acceptance rate: 73.7%
(APIServer pid=5020) INFO 05-19 10:25:13 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.0 tokens/s, Avg generation throughput: 843.6 tokens/s, Running: 9 reqs, Waiting: 82 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.44, Accepted throughput: 497.70 tokens/s, Drafted throughput: 691.66 tokens/s, Accepted: 4978 tokens, Drafted: 6918 tokens, Per-position acceptance rate: 0.805, 0.634, Avg Draft acceptance rate: 72.0%
(APIServer pid=5020) INFO 05-19 10:25:23 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1137.0 tokens/s, Running: 9 reqs, Waiting: 82 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 697.17 tokens/s, Drafted throughput: 880.03 tokens/s, Accepted: 6973 tokens, Drafted: 8802 tokens, Per-position acceptance rate: 0.858, 0.726, Avg Draft acceptance rate: 79.2%
(APIServer pid=5020) INFO 05-19 10:25:33 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.2 tokens/s, Avg generation throughput: 972.4 tokens/s, Running: 9 reqs, Waiting: 81 reqs, GPU KV cache usage: 98.2%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.66, Accepted throughput: 607.57 tokens/s, Drafted throughput: 729.85 tokens/s, Accepted: 6077 tokens, Drafted: 7300 tokens, Per-position acceptance rate: 0.897, 0.768, Avg Draft acceptance rate: 83.2%
(APIServer pid=5020) INFO 05-19 10:25:43 [loggers.py:271] Engine 000: Avg prompt throughput: 8197.2 tokens/s, Avg generation throughput: 683.6 tokens/s, Running: 9 reqs, Waiting: 76 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.45, Accepted throughput: 405.16 tokens/s, Drafted throughput: 557.15 tokens/s, Accepted: 4052 tokens, Drafted: 5572 tokens, Per-position acceptance rate: 0.808, 0.646, Avg Draft acceptance rate: 72.7%
(APIServer pid=5020) INFO 05-19 10:25:53 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.9 tokens/s, Avg generation throughput: 895.6 tokens/s, Running: 8 reqs, Waiting: 74 reqs, GPU KV cache usage: 83.6%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.42, Accepted throughput: 525.53 tokens/s, Drafted throughput: 740.10 tokens/s, Accepted: 5256 tokens, Drafted: 7402 tokens, Per-position acceptance rate: 0.792, 0.628, Avg Draft acceptance rate: 71.0%
(APIServer pid=5020) INFO 05-19 10:26:03 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.2 tokens/s, Avg generation throughput: 1026.6 tokens/s, Running: 9 reqs, Waiting: 73 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.52, Accepted throughput: 619.72 tokens/s, Drafted throughput: 813.69 tokens/s, Accepted: 6198 tokens, Drafted: 8138 tokens, Per-position acceptance rate: 0.827, 0.696, Avg Draft acceptance rate: 76.2%
(APIServer pid=5020) INFO 05-19 10:26:13 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.7 tokens/s, Avg generation throughput: 978.9 tokens/s, Running: 9 reqs, Waiting: 71 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.63, Accepted throughput: 607.14 tokens/s, Drafted throughput: 743.73 tokens/s, Accepted: 6072 tokens, Drafted: 7438 tokens, Per-position acceptance rate: 0.868, 0.765, Avg Draft acceptance rate: 81.6%
(APIServer pid=5020) INFO 05-19 10:26:23 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.7 tokens/s, Avg generation throughput: 802.3 tokens/s, Running: 9 reqs, Waiting: 67 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.50, Accepted throughput: 480.89 tokens/s, Drafted throughput: 642.38 tokens/s, Accepted: 4809 tokens, Drafted: 6424 tokens, Per-position acceptance rate: 0.826, 0.672, Avg Draft acceptance rate: 74.9%
(APIServer pid=5020) INFO 05-19 10:26:33 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.5 tokens/s, Avg generation throughput: 940.9 tokens/s, Running: 9 reqs, Waiting: 65 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.62, Accepted throughput: 581.05 tokens/s, Drafted throughput: 719.42 tokens/s, Accepted: 5812 tokens, Drafted: 7196 tokens, Per-position acceptance rate: 0.868, 0.748, Avg Draft acceptance rate: 80.8%
(APIServer pid=5020) INFO 05-19 10:26:43 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.1 tokens/s, Avg generation throughput: 1048.9 tokens/s, Running: 9 reqs, Waiting: 64 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.62, Accepted throughput: 649.24 tokens/s, Drafted throughput: 799.20 tokens/s, Accepted: 6494 tokens, Drafted: 7994 tokens, Per-position acceptance rate: 0.875, 0.750, Avg Draft acceptance rate: 81.2%
(APIServer pid=5020) INFO 05-19 10:26:53 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.6 tokens/s, Avg generation throughput: 868.0 tokens/s, Running: 9 reqs, Waiting: 61 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.57, Accepted throughput: 530.56 tokens/s, Drafted throughput: 675.02 tokens/s, Accepted: 5307 tokens, Drafted: 6752 tokens, Per-position acceptance rate: 0.863, 0.709, Avg Draft acceptance rate: 78.6%
(APIServer pid=5020) INFO 05-19 10:27:03 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.8 tokens/s, Avg generation throughput: 889.9 tokens/s, Running: 9 reqs, Waiting: 58 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.51, Accepted throughput: 535.49 tokens/s, Drafted throughput: 708.78 tokens/s, Accepted: 5355 tokens, Drafted: 7088 tokens, Per-position acceptance rate: 0.843, 0.668, Avg Draft acceptance rate: 75.6%
(APIServer pid=5020) INFO 05-19 10:27:13 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.3 tokens/s, Avg generation throughput: 929.6 tokens/s, Running: 9 reqs, Waiting: 56 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 569.05 tokens/s, Drafted throughput: 721.21 tokens/s, Accepted: 5692 tokens, Drafted: 7214 tokens, Per-position acceptance rate: 0.859, 0.719, Avg Draft acceptance rate: 78.9%
(APIServer pid=5020) INFO 05-19 10:27:23 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.1 tokens/s, Avg generation throughput: 1021.1 tokens/s, Running: 9 reqs, Waiting: 55 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.54, Accepted throughput: 619.69 tokens/s, Drafted throughput: 802.86 tokens/s, Accepted: 6198 tokens, Drafted: 8030 tokens, Per-position acceptance rate: 0.841, 0.702, Avg Draft acceptance rate: 77.2%
(APIServer pid=5020) INFO 05-19 10:27:33 [loggers.py:271] Engine 000: Avg prompt throughput: 3641.5 tokens/s, Avg generation throughput: 858.6 tokens/s, Running: 9 reqs, Waiting: 52 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.52, Accepted throughput: 518.23 tokens/s, Drafted throughput: 680.71 tokens/s, Accepted: 5183 tokens, Drafted: 6808 tokens, Per-position acceptance rate: 0.840, 0.682, Avg Draft acceptance rate: 76.1%
(APIServer pid=5020) INFO 05-19 10:27:43 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.6 tokens/s, Avg generation throughput: 904.2 tokens/s, Running: 9 reqs, Waiting: 49 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.56, Accepted throughput: 550.93 tokens/s, Drafted throughput: 706.91 tokens/s, Accepted: 5510 tokens, Drafted: 7070 tokens, Per-position acceptance rate: 0.851, 0.708, Avg Draft acceptance rate: 77.9%
(APIServer pid=5020) INFO 05-19 10:27:53 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.9 tokens/s, Avg generation throughput: 973.2 tokens/s, Running: 9 reqs, Waiting: 47 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.69, Accepted throughput: 611.62 tokens/s, Drafted throughput: 723.10 tokens/s, Accepted: 6117 tokens, Drafted: 7232 tokens, Per-position acceptance rate: 0.902, 0.790, Avg Draft acceptance rate: 84.6%
(APIServer pid=5020) INFO 05-19 10:28:03 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.5 tokens/s, Avg generation throughput: 1047.1 tokens/s, Running: 9 reqs, Waiting: 45 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.77, Accepted throughput: 669.79 tokens/s, Drafted throughput: 754.99 tokens/s, Accepted: 6698 tokens, Drafted: 7550 tokens, Per-position acceptance rate: 0.929, 0.845, Avg Draft acceptance rate: 88.7%
(APIServer pid=5020) INFO 05-19 10:28:13 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.8 tokens/s, Avg generation throughput: 922.1 tokens/s, Running: 9 reqs, Waiting: 42 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.66, Accepted throughput: 575.39 tokens/s, Drafted throughput: 693.47 tokens/s, Accepted: 5755 tokens, Drafted: 6936 tokens, Per-position acceptance rate: 0.892, 0.768, Avg Draft acceptance rate: 83.0%
(APIServer pid=5020) INFO 05-19 10:28:23 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.0 tokens/s, Avg generation throughput: 983.1 tokens/s, Running: 9 reqs, Waiting: 40 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.65, Accepted throughput: 612.31 tokens/s, Drafted throughput: 741.49 tokens/s, Accepted: 6124 tokens, Drafted: 7416 tokens, Per-position acceptance rate: 0.902, 0.749, Avg Draft acceptance rate: 82.6%
(APIServer pid=5020) INFO 05-19 10:28:33 [loggers.py:271] Engine 000: Avg prompt throughput: 3279.0 tokens/s, Avg generation throughput: 920.4 tokens/s, Running: 9 reqs, Waiting: 37 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.67, Accepted throughput: 575.25 tokens/s, Drafted throughput: 690.33 tokens/s, Accepted: 5753 tokens, Drafted: 6904 tokens, Per-position acceptance rate: 0.907, 0.760, Avg Draft acceptance rate: 83.3%
(APIServer pid=5020) INFO 05-19 10:28:43 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.0 tokens/s, Avg generation throughput: 1037.9 tokens/s, Running: 9 reqs, Waiting: 35 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.75, Accepted throughput: 660.70 tokens/s, Drafted throughput: 754.08 tokens/s, Accepted: 6608 tokens, Drafted: 7542 tokens, Per-position acceptance rate: 0.940, 0.812, Avg Draft acceptance rate: 87.6%
(APIServer pid=5020) INFO 05-19 10:28:53 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.5 tokens/s, Avg generation throughput: 994.8 tokens/s, Running: 9 reqs, Waiting: 33 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.68, Accepted throughput: 623.11 tokens/s, Drafted throughput: 743.69 tokens/s, Accepted: 6232 tokens, Drafted: 7438 tokens, Per-position acceptance rate: 0.908, 0.768, Avg Draft acceptance rate: 83.8%
(APIServer pid=5020) INFO 05-19 10:29:03 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.3 tokens/s, Avg generation throughput: 929.4 tokens/s, Running: 9 reqs, Waiting: 30 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.72, Accepted throughput: 588.14 tokens/s, Drafted throughput: 682.73 tokens/s, Accepted: 5882 tokens, Drafted: 6828 tokens, Per-position acceptance rate: 0.922, 0.801, Avg Draft acceptance rate: 86.1%
(APIServer pid=5020) INFO 05-19 10:29:13 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.5 tokens/s, Avg generation throughput: 987.3 tokens/s, Running: 9 reqs, Waiting: 28 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.66, Accepted throughput: 615.45 tokens/s, Drafted throughput: 743.62 tokens/s, Accepted: 6156 tokens, Drafted: 7438 tokens, Per-position acceptance rate: 0.898, 0.757, Avg Draft acceptance rate: 82.8%
(APIServer pid=5020) INFO 05-19 10:29:23 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.9 tokens/s, Avg generation throughput: 901.5 tokens/s, Running: 9 reqs, Waiting: 25 reqs, GPU KV cache usage: 93.6%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.65, Accepted throughput: 561.21 tokens/s, Drafted throughput: 680.89 tokens/s, Accepted: 5613 tokens, Drafted: 6810 tokens, Per-position acceptance rate: 0.886, 0.762, Avg Draft acceptance rate: 82.4%
(APIServer pid=5020) INFO 05-19 10:29:33 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.4 tokens/s, Avg generation throughput: 1035.9 tokens/s, Running: 9 reqs, Waiting: 24 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.55, Accepted throughput: 630.02 tokens/s, Drafted throughput: 811.90 tokens/s, Accepted: 6301 tokens, Drafted: 8120 tokens, Per-position acceptance rate: 0.843, 0.709, Avg Draft acceptance rate: 77.6%
(APIServer pid=5020) INFO 05-19 10:29:43 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.9 tokens/s, Avg generation throughput: 876.8 tokens/s, Running: 9 reqs, Waiting: 21 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.56, Accepted throughput: 534.51 tokens/s, Drafted throughput: 684.49 tokens/s, Accepted: 5346 tokens, Drafted: 6846 tokens, Per-position acceptance rate: 0.850, 0.712, Avg Draft acceptance rate: 78.1%
(APIServer pid=5020) INFO 05-19 10:29:53 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.4 tokens/s, Avg generation throughput: 941.9 tokens/s, Running: 9 reqs, Waiting: 19 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.54, Accepted throughput: 570.99 tokens/s, Drafted throughput: 741.85 tokens/s, Accepted: 5711 tokens, Drafted: 7420 tokens, Per-position acceptance rate: 0.845, 0.695, Avg Draft acceptance rate: 77.0%
(APIServer pid=5020) INFO 05-19 10:30:03 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.4 tokens/s, Avg generation throughput: 995.7 tokens/s, Running: 9 reqs, Waiting: 17 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.70, Accepted throughput: 626.75 tokens/s, Drafted throughput: 738.22 tokens/s, Accepted: 6269 tokens, Drafted: 7384 tokens, Per-position acceptance rate: 0.906, 0.792, Avg Draft acceptance rate: 84.9%
(APIServer pid=5020) INFO 05-19 10:30:13 [loggers.py:271] Engine 000: Avg prompt throughput: 3279.0 tokens/s, Avg generation throughput: 923.9 tokens/s, Running: 9 reqs, Waiting: 14 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.63, Accepted throughput: 572.58 tokens/s, Drafted throughput: 703.37 tokens/s, Accepted: 5726 tokens, Drafted: 7034 tokens, Per-position acceptance rate: 0.876, 0.752, Avg Draft acceptance rate: 81.4%
(APIServer pid=5020) INFO 05-19 10:30:23 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.6 tokens/s, Avg generation throughput: 939.6 tokens/s, Running: 9 reqs, Waiting: 12 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.53, Accepted throughput: 568.72 tokens/s, Drafted throughput: 741.89 tokens/s, Accepted: 5688 tokens, Drafted: 7420 tokens, Per-position acceptance rate: 0.840, 0.694, Avg Draft acceptance rate: 76.7%
(APIServer pid=5020) INFO 05-19 10:30:33 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.8 tokens/s, Avg generation throughput: 913.8 tokens/s, Running: 9 reqs, Waiting: 10 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.53, Accepted throughput: 553.18 tokens/s, Drafted throughput: 721.24 tokens/s, Accepted: 5533 tokens, Drafted: 7214 tokens, Per-position acceptance rate: 0.832, 0.702, Avg Draft acceptance rate: 76.7%
(APIServer pid=5020) INFO 05-19 10:30:43 [loggers.py:271] Engine 000: Avg prompt throughput: 3279.0 tokens/s, Avg generation throughput: 1004.4 tokens/s, Running: 8 reqs, Waiting: 7 reqs, GPU KV cache usage: 82.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.77, Accepted throughput: 641.77 tokens/s, Drafted throughput: 725.37 tokens/s, Accepted: 6418 tokens, Drafted: 7254 tokens, Per-position acceptance rate: 0.929, 0.840, Avg Draft acceptance rate: 88.5%
(APIServer pid=5020) INFO 05-19 10:30:53 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.9 tokens/s, Avg generation throughput: 1024.8 tokens/s, Running: 9 reqs, Waiting: 6 reqs, GPU KV cache usage: 97.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.67, Accepted throughput: 640.46 tokens/s, Drafted throughput: 768.35 tokens/s, Accepted: 6405 tokens, Drafted: 7684 tokens, Per-position acceptance rate: 0.901, 0.767, Avg Draft acceptance rate: 83.4%
(APIServer pid=5020) INFO 05-19 10:31:03 [loggers.py:271] Engine 000: Avg prompt throughput: 6558.0 tokens/s, Avg generation throughput: 776.7 tokens/s, Running: 9 reqs, Waiting: 1 reqs, GPU KV cache usage: 90.6%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.57, Accepted throughput: 474.09 tokens/s, Drafted throughput: 605.39 tokens/s, Accepted: 4741 tokens, Drafted: 6054 tokens, Per-position acceptance rate: 0.863, 0.704, Avg Draft acceptance rate: 78.3%
(APIServer pid=5020) INFO 05-19 10:31:13 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.4 tokens/s, Avg generation throughput: 1046.5 tokens/s, Running: 9 reqs, Waiting: 1 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.54, Accepted throughput: 634.07 tokens/s, Drafted throughput: 824.76 tokens/s, Accepted: 6341 tokens, Drafted: 8248 tokens, Per-position acceptance rate: 0.842, 0.695, Avg Draft acceptance rate: 76.9%
(APIServer pid=5020) INFO 05-19 10:31:23 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.3 tokens/s, Avg generation throughput: 1020.4 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 75.4%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.70, Accepted throughput: 642.56 tokens/s, Drafted throughput: 756.44 tokens/s, Accepted: 6427 tokens, Drafted: 7566 tokens, Per-position acceptance rate: 0.906, 0.792, Avg Draft acceptance rate: 84.9%
(APIServer pid=5020) INFO 05-19 10:31:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 788.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.84, Accepted throughput: 511.33 tokens/s, Drafted throughput: 555.53 tokens/s, Accepted: 5114 tokens, Drafted: 5556 tokens, Per-position acceptance rate: 0.971, 0.870, Avg Draft acceptance rate: 92.0%
(APIServer pid=5020) INFO 05-19 10:31:43 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 206.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.87, Accepted throughput: 134.69 tokens/s, Drafted throughput: 144.19 tokens/s, Accepted: 1347 tokens, Drafted: 1442 tokens, Per-position acceptance rate: 0.988, 0.881, Avg Draft acceptance rate: 93.4%
(APIServer pid=5020) INFO:     127.0.0.1:49326 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=5020) INFO 05-19 10:31:53 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.92, Accepted throughput: 40.20 tokens/s, Drafted throughput: 41.80 tokens/s, Accepted: 402 tokens, Drafted: 418 tokens, Per-position acceptance rate: 1.000, 0.923, Avg Draft acceptance rate: 96.2%
(APIServer pid=5020) INFO 05-19 10:32:03 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

---

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Request rate configured (RPS):           20.00     
Benchmark duration (s):                  464.66    
Total input tokens:                      1639637   
Total generated tokens:                  409600    
Request throughput (req/s):              0.22      
Output token throughput (tok/s):         881.51    
Peak output token throughput (tok/s):    458.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4410.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          213962.97 
Median TTFT (ms):                        216368.88 
P99 TTFT (ms):                           416589.59 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.38      
Median TPOT (ms):                        9.14      
P99 TPOT (ms):                           13.21     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.44     
Median ITL (ms):                         20.58     
P99 ITL (ms):                            114.47    
---------------Speculative Decoding---------------
Acceptance rate (%):                     80.31     
Acceptance length:                       2.61      
Drafts:                                  157158    
Draft tokens:                            314316    
Accepted tokens:                         252431    
Per-position acceptance (%):
  Position 0:                            87.04     
  Position 1:                            73.58     
==================================================
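
The speculative-decoding summary above can be cross-checked from its own counts. This is a minimal sketch, assuming (not confirmed by the benchmark source) that the acceptance rate is accepted tokens over draft tokens, and that the acceptance length is one verified token plus the mean number of accepted draft tokens per draft step:

```python
# Counts copied from the "Speculative Decoding" block of the benchmark result.
drafts = 157158          # number of draft steps
draft_tokens = 314316    # 2 speculative tokens per draft step
accepted_tokens = 252431

# Assumed definitions; they reproduce the reported values to two decimals.
acceptance_rate = 100 * accepted_tokens / draft_tokens
acceptance_length = 1 + accepted_tokens / drafts

# Per-position acceptance rates from the same block, as fractions.
per_position = [0.8704, 0.7358]
acceptance_length_from_positions = 1 + sum(per_position)

print(f"acceptance rate:   {acceptance_rate:.2f} %")
print(f"acceptance length: {acceptance_length:.2f}")
print(f"from positions:    {acceptance_length_from_positions:.2f}")
```

Under these assumptions the counts agree with the reported 80.31 % and 2.61, so the draft head itself appears to be accepting tokens at a healthy rate in this run.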

---

The output of `python collect_env.py`

Proposal to improve performance

I have tried your quantized model and found other issues regarding MTP.

The run with MTP is clearly slower. Am I doing something wrong?

config.yaml used:

model: RedHatAI/Qwen3.6-35B-A3B-NVFP4
dtype: bfloat16
kv-cache-dtype: fp8
gpu-memory-utilization: 0.95
max-model-len: 262144
max-num-batched-tokens: 4096
max-num-seqs: 200
max-cudagraph-capture-size: 209
enable-prefix-caching: true
reasoning-parser: qwen3
trust-remote-code: true
enable-auto-tool-choice: true
tool-call-parser: qwen3_coder
default-chat-template-kwargs: '{"enable_thinking": false}'
#speculative-config: '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' # with or without this line
download-dir: /workspace/models
host: 0.0.0.0
port: 18000

Command for benchmarks:

vllm bench serve --base-url "http://0.0.0.0:18000" --backend openai-chat --endpoint "/v1/chat/completions" --model "RedHatAI/Qwen3.6-35B-A3B-NVFP4" --dataset-name random --random-input-len 16384 --random-output-len 4096 --num-prompts 100 --request-rate 20
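
For orientation, the load this command generates can be estimated from its flags alone. This is a rough sketch using plain arithmetic (no vLLM calls; the variable names are illustrative). It suggests that all requests arrive within a few seconds, so much of the measured TTFT is probably time spent waiting in the queue:

```python
num_prompts = 100       # --num-prompts
request_rate = 20.0     # --request-rate, requests per second
input_len = 16384       # --random-input-len
output_len = 4096       # --random-output-len

# Average time needed to submit every request at the configured rate.
arrival_window_s = num_prompts / request_rate

# Nominal token volume. The reported input total (1639637) is slightly
# higher than this nominal figure; the reason is not stated in the output.
nominal_input_tokens = num_prompts * input_len
total_output_tokens = num_prompts * output_len

print(arrival_window_s)       # 5.0
print(nominal_input_tokens)   # 1638400
print(total_output_tokens)    # 409600
```

The output total matches the "Total generated tokens" line in both benchmark results.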

Experiment 1 (NO MTP):

(APIServer pid=3646) INFO 05-19 10:10:01 [loggers.py:271] Engine 000: Avg prompt throughput: 26227.4 tokens/s, Avg generation throughput: 107.0 tokens/s, Running: 19 reqs, Waiting: 81 reqs, GPU KV cache usage: 84.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:11 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.9 tokens/s, Avg generation throughput: 930.2 tokens/s, Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1533.6 tokens/s, Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1511.5 tokens/s, Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1130.3 tokens/s, Running: 18 reqs, Waiting: 82 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1371.2 tokens/s, Running: 18 reqs, Waiting: 82 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1207.7 tokens/s, Running: 18 reqs, Waiting: 80 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:11 [loggers.py:271] Engine 000: Avg prompt throughput: 21313.2 tokens/s, Avg generation throughput: 171.2 tokens/s, Running: 18 reqs, Waiting: 65 reqs, GPU KV cache usage: 82.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:21 [loggers.py:271] Engine 000: Avg prompt throughput: 8196.1 tokens/s, Avg generation throughput: 837.2 tokens/s, Running: 19 reqs, Waiting: 63 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1458.8 tokens/s, Running: 19 reqs, Waiting: 63 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:41 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.0 tokens/s, Avg generation throughput: 1274.1 tokens/s, Running: 20 reqs, Waiting: 60 reqs, GPU KV cache usage: 98.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:11:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1249.5 tokens/s, Running: 19 reqs, Waiting: 61 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1258.3 tokens/s, Running: 18 reqs, Waiting: 62 reqs, GPU KV cache usage: 95.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1260.4 tokens/s, Running: 18 reqs, Waiting: 61 reqs, GPU KV cache usage: 95.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:21 [loggers.py:271] Engine 000: Avg prompt throughput: 16394.9 tokens/s, Avg generation throughput: 408.8 tokens/s, Running: 17 reqs, Waiting: 49 reqs, GPU KV cache usage: 81.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:31 [loggers.py:271] Engine 000: Avg prompt throughput: 13115.9 tokens/s, Avg generation throughput: 741.6 tokens/s, Running: 21 reqs, Waiting: 42 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1357.4 tokens/s, Running: 19 reqs, Waiting: 44 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:12:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1279.0 tokens/s, Running: 19 reqs, Waiting: 43 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1352.3 tokens/s, Running: 20 reqs, Waiting: 42 reqs, GPU KV cache usage: 96.7%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1055.3 tokens/s, Running: 18 reqs, Waiting: 43 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1226.3 tokens/s, Running: 18 reqs, Waiting: 42 reqs, GPU KV cache usage: 95.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:31 [loggers.py:271] Engine 000: Avg prompt throughput: 12068.8 tokens/s, Avg generation throughput: 734.9 tokens/s, Running: 18 reqs, Waiting: 33 reqs, GPU KV cache usage: 87.7%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:41 [loggers.py:271] Engine 000: Avg prompt throughput: 18032.5 tokens/s, Avg generation throughput: 589.8 tokens/s, Running: 21 reqs, Waiting: 23 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:13:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1242.7 tokens/s, Running: 20 reqs, Waiting: 23 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1517.5 tokens/s, Running: 20 reqs, Waiting: 23 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1491.6 tokens/s, Running: 19 reqs, Waiting: 24 reqs, GPU KV cache usage: 95.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 878.6 tokens/s, Running: 18 reqs, Waiting: 24 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1381.9 tokens/s, Running: 18 reqs, Waiting: 24 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:41 [loggers.py:271] Engine 000: Avg prompt throughput: 9838.1 tokens/s, Avg generation throughput: 764.5 tokens/s, Running: 18 reqs, Waiting: 16 reqs, GPU KV cache usage: 88.5%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:14:51 [loggers.py:271] Engine 000: Avg prompt throughput: 22953.9 tokens/s, Avg generation throughput: 241.6 tokens/s, Running: 22 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1350.0 tokens/s, Running: 20 reqs, Waiting: 4 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1503.7 tokens/s, Running: 20 reqs, Waiting: 4 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1400.9 tokens/s, Running: 20 reqs, Waiting: 3 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1108.9 tokens/s, Running: 18 reqs, Waiting: 5 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1371.1 tokens/s, Running: 18 reqs, Waiting: 5 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:15:51 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.3 tokens/s, Avg generation throughput: 933.7 tokens/s, Running: 13 reqs, Waiting: 0 reqs, GPU KV cache usage: 66.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:16:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 619.8 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:16:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 577.1 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 21.3%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:16:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 486.2 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO:     127.0.0.1:34394 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=3646) INFO 05-19 10:16:31 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:16:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Maximum request concurrency: None
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [06:33<00:00,  3.93s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Request rate configured (RPS):           20.00     
Benchmark duration (s):                  393.02    
Total input tokens:                      1639637   
Total generated tokens:                  409600    
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         1042.18   
Peak output token throughput (tok/s):    1659.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          5214.04   
---------------Time to First Token----------------
Mean TTFT (ms):                          153886.28 
Median TTFT (ms):                        149339.37 
P99 TTFT (ms):                           356431.02 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.41     
Median TPOT (ms):                        17.08     
P99 TPOT (ms):                           26.87     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.41     
Median ITL (ms):                         13.17     
P99 ITL (ms):                            93.00     
==================================================
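
To compare the engine logs of the two experiments without reading them line by line, the periodic `loggers.py` lines can be summarised with a small script. This is a sketch only: the regular expression is written against the log format shown in this issue, and `summarize` is a hypothetical helper, not part of vLLM.

```python
import re
from statistics import mean

# Matches the periodic engine lines in the format shown in the logs above.
LINE_RE = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs"
)

def summarize(log_text: str) -> dict:
    """Return the sample count, mean generation throughput and mean running requests."""
    rows = [m.groupdict() for m in LINE_RE.finditer(log_text)]
    return {
        "samples": len(rows),
        "mean_gen_tok_s": mean(float(r["gen"]) for r in rows),
        "mean_running": mean(int(r["running"]) for r in rows),
    }

# Two lines copied from Experiment 1; paste a full log to summarise a whole run.
sample = """
(APIServer pid=3646) INFO 05-19 10:10:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1533.6 tokens/s, Running: 20 reqs, Waiting: 80 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 0.0%
(APIServer pid=3646) INFO 05-19 10:10:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1130.3 tokens/s, Running: 18 reqs, Waiting: 82 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
"""

print(summarize(sample))
```

In the logs pasted here, `Running` stays at roughly 18 to 22 requests without MTP and at about 9 with MTP, while GPU KV cache usage is above 90% in both cases. A summary like this makes that difference in concurrency easy to see, although it does not explain the cause by itself.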

Experiment 2 (with MTP):

(APIServer pid=5020) INFO 05-19 10:24:13 [loggers.py:271] Engine 000: Avg prompt throughput: 6558.4 tokens/s, Avg generation throughput: 12.3 tokens/s, Running: 5 reqs, Waiting: 95 reqs, GPU KV cache usage: 50.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.92, Accepted throughput: 0.46 tokens/s, Drafted throughput: 1.00 tokens/s, Accepted: 57 tokens, Drafted: 124 tokens, Per-position acceptance rate: 0.613, 0.306, Avg Draft acceptance rate: 46.0%
(EngineCore pid=5469) WARNING 05-19 10:24:21 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _topk_topp_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=5020) INFO 05-19 10:24:23 [loggers.py:271] Engine 000: Avg prompt throughput: 8196.9 tokens/s, Avg generation throughput: 84.2 tokens/s, Running: 9 reqs, Waiting: 91 reqs, GPU KV cache usage: 89.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.16, Accepted throughput: 44.90 tokens/s, Drafted throughput: 77.59 tokens/s, Accepted: 449 tokens, Drafted: 776 tokens, Per-position acceptance rate: 0.698, 0.459, Avg Draft acceptance rate: 57.9%
(APIServer pid=5020) INFO 05-19 10:24:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1108.0 tokens/s, Running: 9 reqs, Waiting: 91 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.49, Accepted throughput: 662.59 tokens/s, Drafted throughput: 890.85 tokens/s, Accepted: 6627 tokens, Drafted: 8910 tokens, Per-position acceptance rate: 0.819, 0.668, Avg Draft acceptance rate: 74.4%
(APIServer pid=5020) INFO 05-19 10:24:43 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1139.0 tokens/s, Running: 9 reqs, Waiting: 91 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.60, Accepted throughput: 700.86 tokens/s, Drafted throughput: 876.42 tokens/s, Accepted: 7010 tokens, Drafted: 8766 tokens, Per-position acceptance rate: 0.862, 0.737, Avg Draft acceptance rate: 80.0%
(APIServer pid=5020) INFO 05-19 10:24:53 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 933.0 tokens/s, Running: 9 reqs, Waiting: 90 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:24:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.74, Accepted throughput: 591.89 tokens/s, Drafted throughput: 682.19 tokens/s, Accepted: 5919 tokens, Drafted: 6822 tokens, Per-position acceptance rate: 0.916, 0.820, Avg Draft acceptance rate: 86.8%
(APIServer pid=5020) INFO 05-19 10:25:03 [loggers.py:271] Engine 000: Avg prompt throughput: 8558.6 tokens/s, Avg generation throughput: 620.5 tokens/s, Running: 9 reqs, Waiting: 85 reqs, GPU KV cache usage: 93.0%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.47, Accepted throughput: 369.51 tokens/s, Drafted throughput: 501.68 tokens/s, Accepted: 3696 tokens, Drafted: 5018 tokens, Per-position acceptance rate: 0.824, 0.649, Avg Draft acceptance rate: 73.7%
(APIServer pid=5020) INFO 05-19 10:25:13 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.0 tokens/s, Avg generation throughput: 843.6 tokens/s, Running: 9 reqs, Waiting: 82 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.44, Accepted throughput: 497.70 tokens/s, Drafted throughput: 691.66 tokens/s, Accepted: 4978 tokens, Drafted: 6918 tokens, Per-position acceptance rate: 0.805, 0.634, Avg Draft acceptance rate: 72.0%
(APIServer pid=5020) INFO 05-19 10:25:23 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1137.0 tokens/s, Running: 9 reqs, Waiting: 82 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 697.17 tokens/s, Drafted throughput: 880.03 tokens/s, Accepted: 6973 tokens, Drafted: 8802 tokens, Per-position acceptance rate: 0.858, 0.726, Avg Draft acceptance rate: 79.2%
(APIServer pid=5020) INFO 05-19 10:25:33 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.2 tokens/s, Avg generation throughput: 972.4 tokens/s, Running: 9 reqs, Waiting: 81 reqs, GPU KV cache usage: 98.2%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.66, Accepted throughput: 607.57 tokens/s, Drafted throughput: 729.85 tokens/s, Accepted: 6077 tokens, Drafted: 7300 tokens, Per-position acceptance rate: 0.897, 0.768, Avg Draft acceptance rate: 83.2%
(APIServer pid=5020) INFO 05-19 10:25:43 [loggers.py:271] Engine 000: Avg prompt throughput: 8197.2 tokens/s, Avg generation throughput: 683.6 tokens/s, Running: 9 reqs, Waiting: 76 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.45, Accepted throughput: 405.16 tokens/s, Drafted throughput: 557.15 tokens/s, Accepted: 4052 tokens, Drafted: 5572 tokens, Per-position acceptance rate: 0.808, 0.646, Avg Draft acceptance rate: 72.7%
(APIServer pid=5020) INFO 05-19 10:25:53 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.9 tokens/s, Avg generation throughput: 895.6 tokens/s, Running: 8 reqs, Waiting: 74 reqs, GPU KV cache usage: 83.6%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:25:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.42, Accepted throughput: 525.53 tokens/s, Drafted throughput: 740.10 tokens/s, Accepted: 5256 tokens, Drafted: 7402 tokens, Per-position acceptance rate: 0.792, 0.628, Avg Draft acceptance rate: 71.0%
(APIServer pid=5020) INFO 05-19 10:26:03 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.2 tokens/s, Avg generation throughput: 1026.6 tokens/s, Running: 9 reqs, Waiting: 73 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.52, Accepted throughput: 619.72 tokens/s, Drafted throughput: 813.69 tokens/s, Accepted: 6198 tokens, Drafted: 8138 tokens, Per-position acceptance rate: 0.827, 0.696, Avg Draft acceptance rate: 76.2%
(APIServer pid=5020) INFO 05-19 10:26:13 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.7 tokens/s, Avg generation throughput: 978.9 tokens/s, Running: 9 reqs, Waiting: 71 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.63, Accepted throughput: 607.14 tokens/s, Drafted throughput: 743.73 tokens/s, Accepted: 6072 tokens, Drafted: 7438 tokens, Per-position acceptance rate: 0.868, 0.765, Avg Draft acceptance rate: 81.6%
(APIServer pid=5020) INFO 05-19 10:26:23 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.7 tokens/s, Avg generation throughput: 802.3 tokens/s, Running: 9 reqs, Waiting: 67 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.50, Accepted throughput: 480.89 tokens/s, Drafted throughput: 642.38 tokens/s, Accepted: 4809 tokens, Drafted: 6424 tokens, Per-position acceptance rate: 0.826, 0.672, Avg Draft acceptance rate: 74.9%
(APIServer pid=5020) INFO 05-19 10:26:33 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.5 tokens/s, Avg generation throughput: 940.9 tokens/s, Running: 9 reqs, Waiting: 65 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.62, Accepted throughput: 581.05 tokens/s, Drafted throughput: 719.42 tokens/s, Accepted: 5812 tokens, Drafted: 7196 tokens, Per-position acceptance rate: 0.868, 0.748, Avg Draft acceptance rate: 80.8%
(APIServer pid=5020) INFO 05-19 10:26:43 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.1 tokens/s, Avg generation throughput: 1048.9 tokens/s, Running: 9 reqs, Waiting: 64 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.62, Accepted throughput: 649.24 tokens/s, Drafted throughput: 799.20 tokens/s, Accepted: 6494 tokens, Drafted: 7994 tokens, Per-position acceptance rate: 0.875, 0.750, Avg Draft acceptance rate: 81.2%
(APIServer pid=5020) INFO 05-19 10:26:53 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.6 tokens/s, Avg generation throughput: 868.0 tokens/s, Running: 9 reqs, Waiting: 61 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:26:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.57, Accepted throughput: 530.56 tokens/s, Drafted throughput: 675.02 tokens/s, Accepted: 5307 tokens, Drafted: 6752 tokens, Per-position acceptance rate: 0.863, 0.709, Avg Draft acceptance rate: 78.6%
(APIServer pid=5020) INFO 05-19 10:27:03 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.8 tokens/s, Avg generation throughput: 889.9 tokens/s, Running: 9 reqs, Waiting: 58 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.51, Accepted throughput: 535.49 tokens/s, Drafted throughput: 708.78 tokens/s, Accepted: 5355 tokens, Drafted: 7088 tokens, Per-position acceptance rate: 0.843, 0.668, Avg Draft acceptance rate: 75.6%
(APIServer pid=5020) INFO 05-19 10:27:13 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.3 tokens/s, Avg generation throughput: 929.6 tokens/s, Running: 9 reqs, Waiting: 56 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 569.05 tokens/s, Drafted throughput: 721.21 tokens/s, Accepted: 5692 tokens, Drafted: 7214 tokens, Per-position acceptance rate: 0.859, 0.719, Avg Draft acceptance rate: 78.9%
(APIServer pid=5020) INFO 05-19 10:27:23 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.1 tokens/s, Avg generation throughput: 1021.1 tokens/s, Running: 9 reqs, Waiting: 55 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.54, Accepted throughput: 619.69 tokens/s, Drafted throughput: 802.86 tokens/s, Accepted: 6198 tokens, Drafted: 8030 tokens, Per-position acceptance rate: 0.841, 0.702, Avg Draft acceptance rate: 77.2%
(APIServer pid=5020) INFO 05-19 10:27:33 [loggers.py:271] Engine 000: Avg prompt throughput: 3641.5 tokens/s, Avg generation throughput: 858.6 tokens/s, Running: 9 reqs, Waiting: 52 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.52, Accepted throughput: 518.23 tokens/s, Drafted throughput: 680.71 tokens/s, Accepted: 5183 tokens, Drafted: 6808 tokens, Per-position acceptance rate: 0.840, 0.682, Avg Draft acceptance rate: 76.1%
(APIServer pid=5020) INFO 05-19 10:27:43 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.6 tokens/s, Avg generation throughput: 904.2 tokens/s, Running: 9 reqs, Waiting: 49 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.56, Accepted throughput: 550.93 tokens/s, Drafted throughput: 706.91 tokens/s, Accepted: 5510 tokens, Drafted: 7070 tokens, Per-position acceptance rate: 0.851, 0.708, Avg Draft acceptance rate: 77.9%
(APIServer pid=5020) INFO 05-19 10:27:53 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.9 tokens/s, Avg generation throughput: 973.2 tokens/s, Running: 9 reqs, Waiting: 47 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:27:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.69, Accepted throughput: 611.62 tokens/s, Drafted throughput: 723.10 tokens/s, Accepted: 6117 tokens, Drafted: 7232 tokens, Per-position acceptance rate: 0.902, 0.790, Avg Draft acceptance rate: 84.6%
(APIServer pid=5020) INFO 05-19 10:28:03 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.5 tokens/s, Avg generation throughput: 1047.1 tokens/s, Running: 9 reqs, Waiting: 45 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.77, Accepted throughput: 669.79 tokens/s, Drafted throughput: 754.99 tokens/s, Accepted: 6698 tokens, Drafted: 7550 tokens, Per-position acceptance rate: 0.929, 0.845, Avg Draft acceptance rate: 88.7%
(APIServer pid=5020) INFO 05-19 10:28:13 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.8 tokens/s, Avg generation throughput: 922.1 tokens/s, Running: 9 reqs, Waiting: 42 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.66, Accepted throughput: 575.39 tokens/s, Drafted throughput: 693.47 tokens/s, Accepted: 5755 tokens, Drafted: 6936 tokens, Per-position acceptance rate: 0.892, 0.768, Avg Draft acceptance rate: 83.0%
(APIServer pid=5020) INFO 05-19 10:28:23 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.0 tokens/s, Avg generation throughput: 983.1 tokens/s, Running: 9 reqs, Waiting: 40 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.65, Accepted throughput: 612.31 tokens/s, Drafted throughput: 741.49 tokens/s, Accepted: 6124 tokens, Drafted: 7416 tokens, Per-position acceptance rate: 0.902, 0.749, Avg Draft acceptance rate: 82.6%
(APIServer pid=5020) INFO 05-19 10:28:33 [loggers.py:271] Engine 000: Avg prompt throughput: 3279.0 tokens/s, Avg generation throughput: 920.4 tokens/s, Running: 9 reqs, Waiting: 37 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.67, Accepted throughput: 575.25 tokens/s, Drafted throughput: 690.33 tokens/s, Accepted: 5753 tokens, Drafted: 6904 tokens, Per-position acceptance rate: 0.907, 0.760, Avg Draft acceptance rate: 83.3%
(APIServer pid=5020) INFO 05-19 10:28:43 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.0 tokens/s, Avg generation throughput: 1037.9 tokens/s, Running: 9 reqs, Waiting: 35 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.75, Accepted throughput: 660.70 tokens/s, Drafted throughput: 754.08 tokens/s, Accepted: 6608 tokens, Drafted: 7542 tokens, Per-position acceptance rate: 0.940, 0.812, Avg Draft acceptance rate: 87.6%
(APIServer pid=5020) INFO 05-19 10:28:53 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.5 tokens/s, Avg generation throughput: 994.8 tokens/s, Running: 9 reqs, Waiting: 33 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:28:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.68, Accepted throughput: 623.11 tokens/s, Drafted throughput: 743.69 tokens/s, Accepted: 6232 tokens, Drafted: 7438 tokens, Per-position acceptance rate: 0.908, 0.768, Avg Draft acceptance rate: 83.8%
(APIServer pid=5020) INFO 05-19 10:29:03 [loggers.py:271] Engine 000: Avg prompt throughput: 4918.3 tokens/s, Avg generation throughput: 929.4 tokens/s, Running: 9 reqs, Waiting: 30 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.72, Accepted throughput: 588.14 tokens/s, Drafted throughput: 682.73 tokens/s, Accepted: 5882 tokens, Drafted: 6828 tokens, Per-position acceptance rate: 0.922, 0.801, Avg Draft acceptance rate: 86.1%
(APIServer pid=5020) INFO 05-19 10:29:13 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.5 tokens/s, Avg generation throughput: 987.3 tokens/s, Running: 9 reqs, Waiting: 28 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.66, Accepted throughput: 615.45 tokens/s, Drafted throughput: 743.62 tokens/s, Accepted: 6156 tokens, Drafted: 7438 tokens, Per-position acceptance rate: 0.898, 0.757, Avg Draft acceptance rate: 82.8%
(APIServer pid=5020) INFO 05-19 10:29:23 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.9 tokens/s, Avg generation throughput: 901.5 tokens/s, Running: 9 reqs, Waiting: 25 reqs, GPU KV cache usage: 93.6%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.65, Accepted throughput: 561.21 tokens/s, Drafted throughput: 680.89 tokens/s, Accepted: 5613 tokens, Drafted: 6810 tokens, Per-position acceptance rate: 0.886, 0.762, Avg Draft acceptance rate: 82.4%
(APIServer pid=5020) INFO 05-19 10:29:33 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.4 tokens/s, Avg generation throughput: 1035.9 tokens/s, Running: 9 reqs, Waiting: 24 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.55, Accepted throughput: 630.02 tokens/s, Drafted throughput: 811.90 tokens/s, Accepted: 6301 tokens, Drafted: 8120 tokens, Per-position acceptance rate: 0.843, 0.709, Avg Draft acceptance rate: 77.6%
(APIServer pid=5020) INFO 05-19 10:29:43 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.9 tokens/s, Avg generation throughput: 876.8 tokens/s, Running: 9 reqs, Waiting: 21 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.56, Accepted throughput: 534.51 tokens/s, Drafted throughput: 684.49 tokens/s, Accepted: 5346 tokens, Drafted: 6846 tokens, Per-position acceptance rate: 0.850, 0.712, Avg Draft acceptance rate: 78.1%
(APIServer pid=5020) INFO 05-19 10:29:53 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.4 tokens/s, Avg generation throughput: 941.9 tokens/s, Running: 9 reqs, Waiting: 19 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:29:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.54, Accepted throughput: 570.99 tokens/s, Drafted throughput: 741.85 tokens/s, Accepted: 5711 tokens, Drafted: 7420 tokens, Per-position acceptance rate: 0.845, 0.695, Avg Draft acceptance rate: 77.0%
(APIServer pid=5020) INFO 05-19 10:30:03 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.4 tokens/s, Avg generation throughput: 995.7 tokens/s, Running: 9 reqs, Waiting: 17 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.70, Accepted throughput: 626.75 tokens/s, Drafted throughput: 738.22 tokens/s, Accepted: 6269 tokens, Drafted: 7384 tokens, Per-position acceptance rate: 0.906, 0.792, Avg Draft acceptance rate: 84.9%
(APIServer pid=5020) INFO 05-19 10:30:13 [loggers.py:271] Engine 000: Avg prompt throughput: 3279.0 tokens/s, Avg generation throughput: 923.9 tokens/s, Running: 9 reqs, Waiting: 14 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.63, Accepted throughput: 572.58 tokens/s, Drafted throughput: 703.37 tokens/s, Accepted: 5726 tokens, Drafted: 7034 tokens, Per-position acceptance rate: 0.876, 0.752, Avg Draft acceptance rate: 81.4%
(APIServer pid=5020) INFO 05-19 10:30:23 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.6 tokens/s, Avg generation throughput: 939.6 tokens/s, Running: 9 reqs, Waiting: 12 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.53, Accepted throughput: 568.72 tokens/s, Drafted throughput: 741.89 tokens/s, Accepted: 5688 tokens, Drafted: 7420 tokens, Per-position acceptance rate: 0.840, 0.694, Avg Draft acceptance rate: 76.7%
(APIServer pid=5020) INFO 05-19 10:30:33 [loggers.py:271] Engine 000: Avg prompt throughput: 4917.8 tokens/s, Avg generation throughput: 913.8 tokens/s, Running: 9 reqs, Waiting: 10 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.53, Accepted throughput: 553.18 tokens/s, Drafted throughput: 721.24 tokens/s, Accepted: 5533 tokens, Drafted: 7214 tokens, Per-position acceptance rate: 0.832, 0.702, Avg Draft acceptance rate: 76.7%
(APIServer pid=5020) INFO 05-19 10:30:43 [loggers.py:271] Engine 000: Avg prompt throughput: 3279.0 tokens/s, Avg generation throughput: 1004.4 tokens/s, Running: 8 reqs, Waiting: 7 reqs, GPU KV cache usage: 82.5%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.77, Accepted throughput: 641.77 tokens/s, Drafted throughput: 725.37 tokens/s, Accepted: 6418 tokens, Drafted: 7254 tokens, Per-position acceptance rate: 0.929, 0.840, Avg Draft acceptance rate: 88.5%
(APIServer pid=5020) INFO 05-19 10:30:53 [loggers.py:271] Engine 000: Avg prompt throughput: 3278.9 tokens/s, Avg generation throughput: 1024.8 tokens/s, Running: 9 reqs, Waiting: 6 reqs, GPU KV cache usage: 97.7%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:30:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.67, Accepted throughput: 640.46 tokens/s, Drafted throughput: 768.35 tokens/s, Accepted: 6405 tokens, Drafted: 7684 tokens, Per-position acceptance rate: 0.901, 0.767, Avg Draft acceptance rate: 83.4%
(APIServer pid=5020) INFO 05-19 10:31:03 [loggers.py:271] Engine 000: Avg prompt throughput: 6558.0 tokens/s, Avg generation throughput: 776.7 tokens/s, Running: 9 reqs, Waiting: 1 reqs, GPU KV cache usage: 90.6%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:03 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.57, Accepted throughput: 474.09 tokens/s, Drafted throughput: 605.39 tokens/s, Accepted: 4741 tokens, Drafted: 6054 tokens, Per-position acceptance rate: 0.863, 0.704, Avg Draft acceptance rate: 78.3%
(APIServer pid=5020) INFO 05-19 10:31:13 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.4 tokens/s, Avg generation throughput: 1046.5 tokens/s, Running: 9 reqs, Waiting: 1 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:13 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.54, Accepted throughput: 634.07 tokens/s, Drafted throughput: 824.76 tokens/s, Accepted: 6341 tokens, Drafted: 8248 tokens, Per-position acceptance rate: 0.842, 0.695, Avg Draft acceptance rate: 76.9%
(APIServer pid=5020) INFO 05-19 10:31:23 [loggers.py:271] Engine 000: Avg prompt throughput: 1639.3 tokens/s, Avg generation throughput: 1020.4 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 75.4%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:23 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.70, Accepted throughput: 642.56 tokens/s, Drafted throughput: 756.44 tokens/s, Accepted: 6427 tokens, Drafted: 7566 tokens, Per-position acceptance rate: 0.906, 0.792, Avg Draft acceptance rate: 84.9%
(APIServer pid=5020) INFO 05-19 10:31:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 788.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.9%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.84, Accepted throughput: 511.33 tokens/s, Drafted throughput: 555.53 tokens/s, Accepted: 5114 tokens, Drafted: 5556 tokens, Per-position acceptance rate: 0.971, 0.870, Avg Draft acceptance rate: 92.0%
(APIServer pid=5020) INFO 05-19 10:31:43 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 206.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.1%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:43 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.87, Accepted throughput: 134.69 tokens/s, Drafted throughput: 144.19 tokens/s, Accepted: 1347 tokens, Drafted: 1442 tokens, Per-position acceptance rate: 0.988, 0.881, Avg Draft acceptance rate: 93.4%
(APIServer pid=5020) INFO:     127.0.0.1:49326 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=5020) INFO 05-19 10:31:53 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=5020) INFO 05-19 10:31:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.92, Accepted throughput: 40.20 tokens/s, Drafted throughput: 41.80 tokens/s, Accepted: 402 tokens, Drafted: 418 tokens, Per-position acceptance rate: 1.000, 0.923, Avg Draft acceptance rate: 96.2%
(APIServer pid=5020) INFO 05-19 10:32:03 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Request rate configured (RPS):           20.00     
Benchmark duration (s):                  464.66    
Total input tokens:                      1639637   
Total generated tokens:                  409600    
Request throughput (req/s):              0.22      
Output token throughput (tok/s):         881.51    
Peak output token throughput (tok/s):    458.00    
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4410.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          213962.97 
Median TTFT (ms):                        216368.88 
P99 TTFT (ms):                           416589.59 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.38      
Median TPOT (ms):                        9.14      
P99 TPOT (ms):                           13.21     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.44     
Median ITL (ms):                         20.58     
P99 ITL (ms):                            114.47    
---------------Speculative Decoding---------------
Acceptance rate (%):                     80.31     
Acceptance length:                       2.61      
Drafts:                                  157158    
Draft tokens:                            314316    
Accepted tokens:                         252431    
Per-position acceptance (%):
  Position 0:                            87.04     
  Position 1:                            73.58     
==================================================
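
The speculative-decoding summary can be cross-checked by hand. The following is a minimal sketch, assuming the acceptance rate is accepted tokens divided by draft tokens, and that the acceptance length is one target-model token per verification step plus the accepted draft tokens per draft. Both assumptions are consistent with the numbers printed above but are not taken from the vLLM source. The helper names `acceptance_stats` and `parse_interval` are hypothetical.

```python
import re


def acceptance_stats(num_drafts: int, draft_tokens: int, accepted_tokens: int):
    """Return (acceptance rate in percent, mean acceptance length)."""
    rate_pct = 100.0 * accepted_tokens / draft_tokens
    # Assumption: each verification step yields the accepted draft tokens
    # plus one token produced by the target model itself.
    accept_len = 1.0 + accepted_tokens / num_drafts
    return rate_pct, accept_len


# Totals copied from the "Speculative Decoding" block of the benchmark result.
rate_pct, accept_len = acceptance_stats(
    num_drafts=157158, draft_tokens=314316, accepted_tokens=252431
)
print(f"{rate_pct:.2f}% {accept_len:.2f}")  # 80.31% 2.61

# Hypothetical helper: recompute the same figures from one server log line.
LOG_RE = re.compile(r"Accepted: (\d+) tokens, Drafted: (\d+) tokens")


def parse_interval(line: str, num_speculative_tokens: int = 2):
    match = LOG_RE.search(line)
    if match is None:
        return None
    accepted, drafted = int(match.group(1)), int(match.group(2))
    # Assumption: every draft proposes exactly num_speculative_tokens tokens.
    num_drafts = drafted // num_speculative_tokens
    return acceptance_stats(num_drafts, drafted, accepted)
```

Running `parse_interval` over each `SpecDecoding metrics` line gives a per-interval view of the same quantities, which helps confirm that acceptance stays high throughout the run and is unlikely to be the bottleneck.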

Analysis

In the benchmark results, Experiment 1 (no MTP) wins on every metric except mean TPOT:

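| Metric | Exp 1 (No MTP) | Exp 2 (MTP) | Winner |
| --- | --- | --- | --- |
| Benchmark duration | 393s | 464s | Exp 1 |
| Output tok/s | 1042 | 881 | Exp 1 |
| Total tok/s | 5214 | 4410 | Exp 1 |
| Mean TTFT | 153s | 213s | Exp 1 |
| Mean TPOT | 17.41ms | 9.38ms | Exp 2 |
| Peak output tok/s | 1659 | 458 | Exp 1 |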
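
To make the gap concrete, the sketch below computes the relative change of each metric from the no-MTP run to the MTP run, using only the values in the table above. It also adds a rough back-of-envelope estimate of aggregate decode throughput as running requests divided by mean TPOT. That estimate is an assumption-laden approximation, not a diagnosis: the MTP logs show about 9 running requests, and the figure of 20 for the no-MTP run is taken from the earlier log excerpt and may not hold for the whole run.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percent change from baseline (no MTP) to candidate (MTP)."""
    return 100.0 * (candidate - baseline) / baseline


# (no MTP, MTP) pairs copied from the table above.
metrics = {
    "Benchmark duration (s)": (393, 464),
    "Output tok/s": (1042, 881),
    "Total tok/s": (5214, 4410),
    "Mean TTFT (s)": (153, 213),
    "Mean TPOT (ms)": (17.41, 9.38),
    "Peak output tok/s": (1659, 458),
}

for name, (no_mtp, mtp) in metrics.items():
    print(f"{name:<24} {relative_change(no_mtp, mtp):+7.1f}%")


def estimated_decode_rate(running_reqs: int, tpot_ms: float) -> float:
    """Rough aggregate tokens/s if each running request emits one token per TPOT."""
    return running_reqs / (tpot_ms / 1000.0)


# Assumption: about 20 concurrent requests without MTP, about 9 with MTP.
no_mtp_rate = estimated_decode_rate(20, 17.41)
mtp_rate = estimated_decode_rate(9, 9.38)
print(f"estimated decode rate: no MTP {no_mtp_rate:.0f} tok/s, MTP {mtp_rate:.0f} tok/s")
```

Under these assumptions, the per-request speedup from MTP (lower TPOT) is roughly cancelled by the lower number of requests running concurrently, which would be consistent with the lower aggregate throughput and the higher TTFT reported above.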

Report of performance regression

Explained above

Misc discussion on performance

Explained above

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

