vllm: [Feature]: Support Dynamic Pruning for Speculative Decoding Draft Trees in EAGLE-3

GitHub stats
vllm-project/vllm#41823, fetched 2026-05-07 03:32:41
Comments: 0
Participants: 1
Timeline: 1
Reactions: 0
Timeline (top): labeled ×1


🚀 The feature, motivation and pitch

I would like to request support for dynamic pruning of vLLM's speculative decoding draft tree (--speculative-token-tree) for EAGLE-3. Today, the tree topology is fixed at model load time, so the number of draft branches cannot change during inference. A pruning mechanism would let vLLM trim low-value branches at runtime based on token probabilities, reducing wasted draft computation without requiring users to hand-tune multiple static tree shapes.
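To make "static topology" concrete, here is a minimal sketch of how a tree given as a list of node paths could be reduced to a fixed number of children per level. The function name child_counts_per_level is hypothetical, and the sketch assumes a balanced tree in which every node on a level has the same number of children; it is an illustration, not vLLM's actual code.

```python
import ast


def child_counts_per_level(tree_spec: str) -> list[int]:
    """Return how many children each parent expands at every tree level.

    Assumption of this sketch: the tree is balanced, so the count is the
    same for every parent on a given level.
    """
    # Each node is a tuple of child indices describing its path from the root.
    paths = ast.literal_eval(tree_spec)
    depth = max(len(path) for path in paths)

    # Count how many nodes sit on each level (level 0 = children of the root).
    nodes_per_level = [0] * depth
    for path in paths:
        nodes_per_level[len(path) - 1] += 1

    # Children per parent = nodes on this level / nodes on the previous level.
    counts = [nodes_per_level[0]]
    for level in range(1, depth):
        counts.append(nodes_per_level[level] // nodes_per_level[level - 1])
    return counts


spec = "[(0,), (1,), (0,0), (0,1), (1,0), (1,1)]"
print(child_counts_per_level(spec))  # prints [2, 2]
```

Because these counts are computed once from the configured string, every decoding step expands the same number of branches regardless of how confident the draft model is.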

Alternatives

Motivation

Current behavior

The draft tree topology is configured via speculative_token_tree (for example, "[(0,), (1,), (0,0), (0,1), (1,0), (1,1)]") and is resolved to fixed per-level child counts stored in child_drafts_per_level (see vllm/v1/spec_decode/llm_base_proposer.py:282–302).

During the propose_tree loop, each level expands every parent node by a fixed number of children:

    # llm_base_proposer.py ~L1006
    num_children = self.child_drafts_per_level[level]
    if num_children == 1:
        draft_token_ids = logits.argmax(dim=-1).view(batch_size, -1)
    else:
        draft_token_ids = torch.topk(logits, num_children, dim=-1).indices.view(
            batch_size, -1
        )

This means:

  • A high-confidence branch can still spend compute expanding siblings that are unlikely to be accepted.
  • A low-confidence branch cannot request more budget dynamically.
  • The number of draft tokens remains constant per step, regardless of input difficulty.

Why this feature is useful

| Scenario | Static tree | With dynamic pruning |
| -- | -- | -- |
| Easy prefix (repetitive or formulaic text) | Expands all branches; many are unnecessary | Prunes low-value branches early and reduces compute |
| Hard / ambiguous text | Fixed branching budget | Can avoid wasting work on low-probability paths |
| Latency-sensitive serving | Constant overhead per step | Fewer draft tokens and lower average latency |
| Mixed-difficulty batch | Same tree shape for all requests | Branching can adapt per request |

Dynamic pruning has been explored in speculative decoding literature such as SpecInfer and EAGLE-2, where adaptive branch selection helps reduce wasted draft-token computation while preserving acceptance quality.

Requested Feature

Add an optional dynamic pruning pass during speculative draft-tree construction so that a node is not expanded when its probability is below a configurable threshold.

Two possible pruning criteria:

1. Per-node probability threshold

    Prune node v at level L if max_prob(logits[v]) < p_min

If the most likely next token from a node is already too unlikely, the branch can be stopped early.

2. Cumulative path probability threshold

    path_prob[v] = product of accepted top-1 probabilities along root → v
    Prune node v if path_prob[v] < p_budget

This keeps only branches whose joint probability justifies the compute cost.
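The two criteria can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: should_expand and softmax are hypothetical helpers, the path probability of a node is taken to be its parent's path probability times the node's own top-1 probability, and a real implementation would operate on batched tensors inside the proposer rather than on Python lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_expand(logits, parent_path_prob, p_min=0.0, p_budget=0.0):
    """Apply both pruning criteria to one node.

    Returns (expand, path_prob). With both thresholds at 0.0 every node
    is expanded, matching the "0.0 disables pruning" convention.
    """
    top1 = max(softmax(logits))            # criterion 1: per-node top-1 prob
    path_prob = parent_path_prob * top1    # criterion 2: cumulative path prob
    expand = top1 >= p_min and path_prob >= p_budget
    return expand, path_prob


# A peaked distribution survives, a flat one is pruned by p_min.
print(should_expand([2.0, 0.0, 0.0], 1.0, p_min=0.3, p_budget=0.05)[0])  # True
print(should_expand([0.0, 0.0, 0.0, 0.0], 1.0, p_min=0.3)[0])            # False
```

Tracking path_prob per node costs one multiplication per expansion, so the cumulative criterion adds little on top of the per-node check.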

Additional context

API Proposal

Add optional parameters to SpeculativeConfig:

@dataclass
class SpeculativeConfig:
    # existing fields ...
    speculative_tree_prune_min_prob: float = 0.0
    """Per-node pruning threshold. A node is not expanded if its
    argmax token probability is below this value. 0.0 disables pruning."""

    speculative_tree_prune_path_prob: float = 0.0
    """Path-probability pruning threshold. A node is not expanded if
    the product of top-1 probabilities along its root-to-node path
    falls below this value. 0.0 disables pruning."""
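Since both fields are probabilities, it may be worth validating them when the config is built. The sketch below uses a stand-in dataclass, TreePruneConfig, which is hypothetical and not part of vLLM; the pruning_enabled property is likewise an assumption added for illustration.

```python
from dataclasses import dataclass


@dataclass
class TreePruneConfig:
    """Hypothetical stand-in holding only the two proposed fields."""

    speculative_tree_prune_min_prob: float = 0.0
    speculative_tree_prune_path_prob: float = 0.0

    def __post_init__(self) -> None:
        # Both thresholds are probabilities, so reject anything outside [0, 1].
        for name in (
            "speculative_tree_prune_min_prob",
            "speculative_tree_prune_path_prob",
        ):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    @property
    def pruning_enabled(self) -> bool:
        """True when at least one threshold is above its 0.0 default."""
        return (
            self.speculative_tree_prune_min_prob > 0.0
            or self.speculative_tree_prune_path_prob > 0.0
        )


print(TreePruneConfig().pruning_enabled)  # False
print(TreePruneConfig(speculative_tree_prune_min_prob=0.1).pruning_enabled)  # True
```

A single pruning_enabled check would let the proposer skip the pruning pass entirely when both thresholds keep their defaults, leaving existing behavior unchanged.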

Suggested CLI flags:

--speculative-tree-prune-min-prob 0.1
--speculative-tree-prune-path-prob 0.05
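For illustration, the two flags could be parsed as follows with the standard argparse module. This is a standalone sketch, not vLLM's argument parser: build_parser and unit_interval are hypothetical names, and how the flags would actually be wired into vLLM's CLI is left open.

```python
import argparse


def unit_interval(text: str) -> float:
    """argparse type that accepts only probabilities in [0, 1]."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(
            f"expected a probability in [0, 1], got {text}"
        )
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Draft-tree pruning flags (sketch)")
    parser.add_argument(
        "--speculative-tree-prune-min-prob",
        type=unit_interval,
        default=0.0,
        help="Per-node pruning threshold; 0.0 disables pruning.",
    )
    parser.add_argument(
        "--speculative-tree-prune-path-prob",
        type=unit_interval,
        default=0.0,
        help="Path-probability pruning threshold; 0.0 disables pruning.",
    )
    return parser


args = build_parser().parse_args(
    ["--speculative-tree-prune-min-prob", "0.1",
     "--speculative-tree-prune-path-prob", "0.05"]
)
print(args.speculative_tree_prune_min_prob, args.speculative_tree_prune_path_prob)
# prints: 0.1 0.05
```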

A unified --speculative-tree-prune-threshold flag could also be considered if maintainers prefer a simpler interface.

Requested Next Steps

If this feature is acceptable, I would appreciate guidance on:

  • preferred API shape (p_min / p_budget vs. a single pruning threshold)
  • the right place to integrate pruning in the tree proposer
  • how to represent variable-length tree attention metadata
  • whether the verifier interface should be updated in the same change set
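On the question of variable-length tree attention metadata, one possible representation is to rebuild the ancestor mask from whichever node paths survive pruning. The sketch below is an assumption-laden illustration: prune_subtree and tree_attention_mask are hypothetical helpers, nodes are identified by their path tuples, the root (the last verified token) is left out of the mask, and a pruned node is kept as a leaf while its descendants are dropped.

```python
def prune_subtree(paths, pruned):
    """Drop every strict descendant of `pruned`; the node itself stays as a leaf."""
    k = len(pruned)
    return [p for p in paths if not (len(p) > k and p[:k] == pruned)]


def tree_attention_mask(paths):
    """Build mask[i][j], True when node i may attend to node j.

    A node attends to itself and to each of its ancestors in the tree.
    """
    index = {path: i for i, path in enumerate(paths)}
    n = len(paths)
    mask = [[False] * n for _ in range(n)]
    for path, i in index.items():
        # Every prefix of a path names an ancestor (the full path is the node).
        for depth in range(1, len(path) + 1):
            ancestor = path[:depth]
            if ancestor not in index:
                raise ValueError(f"node {path} is missing ancestor {ancestor}")
            mask[i][index[ancestor]] = True
    return mask


full_tree = [(0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]
kept = prune_subtree(full_tree, (1,))  # branch (1,) is not expanded
print(kept)                            # [(0,), (1,), (0, 0), (0, 1)]
for row in tree_attention_mask(kept):
    print(row)
```

The mask shrinks from 6x6 to 4x4 here, which is the variable-length aspect the verifier would need to handle, for example by padding to the static maximum or by passing the number of surviving nodes per request.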

I am also willing to contribute a prototype for the pruning logic and tree-metadata changes.

