vllm - [Performance]: Significant TTFT Regression with Speculative Decoding (EAGLE3) [1 participant]

vllm · 2026-04-14 11:44:14

vllm-project/vllm#39790•Fetched 2026-04-16 06:36:35

Author

KlyzhenkoVadim

Proposal to improve performance

When enabling speculative decoding (tested with both EAGLE3 and DFlash methods), I observe a substantial increase in Time-To-First-Token (TTFT), particularly at the P99 percentile. While the expected improvement in Time-Per-Output-Token (TPOT) is confirmed, the TTFT degradation appears to be an inherent overhead of the speculative decoding process that might be more pronounced than previously understood.

Report of performance regression

I started a vLLM server with a target model and enabled a draft model via --speculative-config. Example for EAGLE3 on GPU:

vllm serve /path/to/Qwen3-8B \
    --served-model-name qwen3-8b \
    --port 8188 \
    --enforce-eager \
    --speculative-config '{"method": "eagle3", "model": "/path/to/Qwen3-8B-speculator.eagle3", "num_speculative_tokens": 15}'

Then I ran a benchmark using the vllm bench serve command with a custom dataset (MATH-500):

vllm bench serve \
    --model /path/to/Qwen3-8B \
    --base-url http://127.0.0.1:8188 \
    --served-model-name qwen3-8b \
    --dataset-name custom \
    --dataset-path /path/to/dataset.jsonl \
    --num-prompts 50 \
    --max-concurrency 1 \
    --temperature 0

Baseline (No Speculative Decoding):

Mean TPOT: ~28.20 ms
Mean TTFT: ~73.97 ms
P99 TTFT: ~249.90 ms

With Speculative Decoding (EAGLE3):

Mean TPOT: ~17.11 ms (a 39.3% improvement, which is great)
Mean TTFT: ~158.44 ms (a 114% increase)
P99 TTFT: ~1078.03 ms (a 331% increase).
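
The percentages quoted above are relative changes against the baseline, (new - baseline) / baseline. A minimal sketch for rechecking them with awk follows; the helper name pct_change is illustrative and not part of vLLM:

```shell
# Relative change in percent between a baseline and a new measurement.
# Negative means the metric went down (an improvement for latency metrics).
pct_change() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

pct_change 28.20 17.11     # Mean TPOT: baseline vs EAGLE3
pct_change 73.97 158.44    # Mean TTFT: baseline vs EAGLE3
pct_change 249.90 1078.03  # P99 TTFT:  baseline vs EAGLE3
```

The three results round to the -39.3%, +114%, and +331% figures reported in the issue.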

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

GPU: NVIDIA V100
vLLM: 0.17.0
Model: Qwen3-8B (target) with the corresponding EAGLE-3 draft model

extent analysis

TL;DR

Adjusting the num_speculative_tokens parameter in the speculative config may help mitigate the increase in Time-To-First-Token (TTFT) when using speculative decoding with EAGLE3 or DFlash methods.

Guidance

Start from the current value of num_speculative_tokens (15 in the example) and try smaller values; a longer draft means more work per step, so reducing it may lower TTFT at the cost of some TPOT gain.
Rerun the same vllm bench serve command for each value so that only num_speculative_tokens changes between runs.
Compare Mean TTFT, P99 TTFT, and Mean TPOT across runs and against the non-speculative baseline before settling on a value.
Do not expect a different method to remove the overhead by itself: the reporter saw the TTFT increase with both EAGLE3 and DFlash, so compare methods only at the same num_speculative_tokens.
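
The sweep described above can be sketched as a dry run that prints one server command per candidate value rather than launching anything. The flags and paths are copied from the issue; the helper name spec_config and the candidate values 1, 3, 5, and 15 are assumptions for illustration, not recommendations from the report:

```shell
# Build the --speculative-config JSON for a given number of speculative tokens.
spec_config() {
    printf '{"method": "eagle3", "model": "/path/to/Qwen3-8B-speculator.eagle3", "num_speculative_tokens": %d}' "$1"
}

# Dry run: print the server command for each candidate value.
# Start each one in turn, run the same vllm bench serve command against it,
# and record Mean TTFT, P99 TTFT, and Mean TPOT before moving to the next.
for k in 1 3 5 15; do
    echo "vllm serve /path/to/Qwen3-8B --served-model-name qwen3-8b --port 8188 --enforce-eager --speculative-config '$(spec_config "$k")'"
done
```

Printing the commands first makes it easy to confirm that the JSON is well-formed and that nothing except num_speculative_tokens differs between runs.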

Example

The change is to configuration rather than code: lower num_speculative_tokens in the --speculative-config JSON passed to vllm serve, then rerun the benchmark.

Notes

The optimal value for num_speculative_tokens may depend on the specific use case, model, and hardware configuration. Further experimentation and benchmarking may be necessary to find the best balance between TTFT and TPOT.
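
One way to reason about the balance is a break-even estimate. Assuming request latency is roughly TTFT + TPOT × (output tokens - 1), which is an approximation since mean metrics do not combine exactly, the speculative run pays back its higher TTFT once enough tokens have been generated. A rough sketch using the numbers from the issue (break_even_steps is a hypothetical helper, and it assumes the speculative TPOT is lower than the baseline TPOT):

```shell
# Smallest number of decode steps (output tokens after the first) at which the
# lower TPOT of the speculative run has paid back its higher TTFT.
# Arguments: baseline TTFT, baseline TPOT, speculative TTFT, speculative TPOT (all in ms).
break_even_steps() {
    awk -v t0="$1" -v p0="$2" -v t1="$3" -v p1="$4" \
        'BEGIN { x = (t1 - t0) / (p0 - p1); n = int(x); if (n < x) n++; print n }'
}

break_even_steps 73.97 28.20 158.44 17.11     # mean TTFT with mean TPOT
break_even_steps 249.90 28.20 1078.03 17.11   # P99 TTFT with mean TPOT, a rough worst case
```

With the mean figures the break-even is 8 decode steps; pairing the P99 TTFT with the mean TPOT gives 75. Responses shorter than that are where the TTFT regression is likely to dominate.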

Recommendation

Treat a lower num_speculative_tokens as a workaround rather than a fix. The reporter attributes the TTFT increase to overhead inherent in speculative decoding, but the issue does not confirm a root cause.
