vllm - 💡(How to fix) Fix [Performance]: Significant TTFT Regression with Speculative Decoding (EAGLE3) [1 participants]

vllm-project/vllm#39790 (fetched 2026-04-16 06:36:35)


Proposal to improve performance

When enabling speculative decoding (tested with both EAGLE3 and DFlash methods), I observe a substantial increase in Time-To-First-Token (TTFT), particularly at the P99 percentile. While the expected improvement in Time-Per-Output-Token (TPOT) is confirmed, the TTFT degradation appears to be an inherent overhead of the speculative decoding process that might be more pronounced than previously understood.

Report of performance regression

I started a vLLM server with a target model and enabled a draft model via --speculative-config. Example for EAGLE3 on GPU:

vllm serve /path/to/Qwen3-8B \
    --served-model-name qwen3-8b \
    --port 8188 \
    --enforce-eager \
    --speculative-config '{"method": "eagle3", "model": "/path/to/Qwen3-8B-speculator.eagle3", "num_speculative_tokens": 15}'

Then I ran a benchmark using the vllm bench serve command with a custom dataset (MATH-500):

vllm bench serve \
    --model /path/to/Qwen3-8B \
    --base-url http://127.0.0.1:8188 \
    --served-model-name qwen3-8b \
    --dataset-name custom \
    --dataset-path /path/to/dataset.jsonl \
    --num-prompts 50 \
    --max-concurrency 1 \
    --temperature 0

Baseline (No Speculative Decoding):

  • Mean TPOT: ~28.20 ms
  • Mean TTFT: ~73.97 ms
  • P99 TTFT: ~249.90 ms

With Speculative Decoding (EAGLE3):

  • Mean TPOT: ~17.11 ms (a 39.3% improvement, which is great)
  • Mean TTFT: ~158.44 ms (a 114% increase)
  • P99 TTFT: ~1078.03 ms (a 331% increase).
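
As a quick check on the percentages quoted above, the relative changes can be recomputed from the reported values. This is only a sketch: the function name and dictionary layout are illustrative, and the numbers are copied from the two result lists in this report.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percent change from baseline to new (positive means an increase)."""
    return (new - baseline) / baseline * 100.0


# (baseline, with EAGLE3) pairs in milliseconds, as reported in the issue.
reported = {
    "Mean TPOT": (28.20, 17.11),
    "Mean TTFT": (73.97, 158.44),
    "P99 TTFT": (249.90, 1078.03),
}

changes = {name: relative_change(b, n) for name, (b, n) in reported.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

A negative value means the metric went down (an improvement for TPOT); the positive values for TTFT correspond to the increases quoted in the report.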

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

My environment: NVIDIA V100; vLLM 0.17.0; target model Qwen3-8B with the corresponding EAGLE-3 draft model.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Extended analysis

TL;DR

Adjusting the num_speculative_tokens parameter in the speculative config may help mitigate the increase in Time-To-First-Token (TTFT) when using speculative decoding with EAGLE3 or DFlash methods.

Guidance

  • num_speculative_tokens is set to 15 in the example. Try lower values and rerun the same benchmark for each one to see how TTFT and TPOT move.
  • After each change, compare Mean TTFT, P99 TTFT, and Mean TPOT against the baseline (no speculative decoding), so the trade-off is measured rather than assumed.
  • Compare speculative decoding methods under the same benchmark. The report says the TTFT increase was observed with both EAGLE3 and DFlash, so switching between those two may not remove it.
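
One way to run such a sweep is to generate one server command per candidate value and benchmark each in turn. The sketch below only builds the command strings; it reuses the flags and placeholder paths from the report, while the candidate values and the helper name serve_command are assumptions chosen for illustration, not recommended settings.

```python
import json
import shlex


def serve_command(num_speculative_tokens: int) -> str:
    """Build the `vllm serve` command from the report for one draft length."""
    spec = {
        "method": "eagle3",
        "model": "/path/to/Qwen3-8B-speculator.eagle3",
        "num_speculative_tokens": num_speculative_tokens,
    }
    args = [
        "vllm", "serve", "/path/to/Qwen3-8B",
        "--served-model-name", "qwen3-8b",
        "--port", "8188",
        "--enforce-eager",
        # json.dumps keeps the config valid JSON; shlex.quote handles shell quoting.
        "--speculative-config", json.dumps(spec),
    ]
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical candidates; 15 is the value used in the report.
CANDIDATES = [3, 5, 8, 15]
commands = {k: serve_command(k) for k in CANDIDATES}

for k, cmd in commands.items():
    print(cmd)
```

For each printed command, start the server, run the same vllm bench serve invocation as in the report, and record Mean TTFT, P99 TTFT, and Mean TPOT before moving to the next value.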

Example

The issue proposes no code change; the adjustment is made through the num_speculative_tokens field of --speculative-config.

Notes

The optimal value for num_speculative_tokens may depend on the specific use case, model, and hardware configuration. Further experimentation and benchmarking may be necessary to find the best balance between TTFT and TPOT.
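
To make the trade-off concrete, one rough approach is to model end-to-end latency for a response of n output tokens as TTFT + (n - 1) * TPOT and ask at what n the speculative run catches up with the baseline. This model is an assumption (it treats TPOT as constant and ignores queueing), the function name is hypothetical, and the second call mixes P99 TTFT with mean TPOT, so it is only a pessimistic estimate.

```python
def break_even_output_tokens(
    ttft_base: float, tpot_base: float, ttft_spec: float, tpot_spec: float
) -> float:
    """Output length n at which TTFT + (n - 1) * TPOT is equal for both runs."""
    extra_ttft = ttft_spec - ttft_base        # cost paid once, at the first token
    saved_per_token = tpot_base - tpot_spec   # gain on every later token
    if saved_per_token <= 0:
        raise ValueError("speculative run has no per-token gain; no break-even")
    return 1 + extra_ttft / saved_per_token


# Values in milliseconds, copied from the report.
mean_case = break_even_output_tokens(73.97, 28.20, 158.44, 17.11)
p99_case = break_even_output_tokens(249.90, 28.20, 1078.03, 17.11)

print(f"break-even with mean TTFT: {mean_case:.1f} output tokens")
print(f"break-even with P99 TTFT:  {p99_case:.1f} output tokens")
```

Under these assumptions the break-even lands at about 8.6 output tokens using mean TTFT and about 75.7 using P99 TTFT. Responses longer than that come out ahead in total latency despite the slower first token, while workloads dominated by short responses or strict first-token deadlines do not.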

Recommendation

Treat lowering num_speculative_tokens as a workaround for the TTFT increase. The reporter attributes the increase to overhead inherent in speculative decoding, but the issue does not identify a confirmed root cause.
