vllm - ✅(Solved) Fix Speculative/MTP draft config appears to drop target --hf-overrides (breaks long-context YaRN/RoPE extension) [1 pull requests, 5 comments, 1 participants]

GitHub stats
vllm-project/vllm#37435 · Fetched 2026-04-08 00:58:41
Comments: 5 · Participants: 1 · Timeline events: 13 · Reactions: 0
Timeline (top): referenced ×6, commented ×5, cross-referenced ×2

Fix Action

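Fix / Workaround

The issue names three acceptable remedies. PR #37443 takes the first, limited to mapping-style overrides:

  1. target hf_overrides are merged into the draft ModelConfig
  2. a draft-side override field exists in SpeculativeConfig
  3. a CLI mechanism exists to pass speculative/draft HF overrides separately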

PR fix notes

PR #37443: [Bugfix][Core] Preserve target hf_overrides in MTP draft config

Description (problem / solution / changelog)

Summary

This PR fixes #37435 by preserving mapping-style target-model hf_overrides when vLLM rebuilds the speculative/MTP draft ModelConfig.

The updated implementation keeps the draft model derived from its own base config, then reapplies only mapping-style target hf_overrides before the internal speculative hf_config_override runs.

That preserves long-context / YaRN / RoPE overrides for the real user path while avoiding broader target-only config mutations from arbitrary callable overrides leaking into the derived draft model.
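The ordering described above can be pictured with the sketch below. It is not the PR's code: DraftOverride, apply_mapping, and internal_draft_override are hypothetical names, SimpleNamespace stands in for a Hugging Face config object, and the recursion rule (descend into nested config objects, replace everything else wholesale) is an assumption about the intended semantics.

```python
from collections.abc import Mapping
from types import SimpleNamespace
from typing import Any, Callable


def apply_mapping(config: Any, overrides: Mapping) -> Any:
    """Apply mapping-style overrides onto a config object in place."""
    for key, value in overrides.items():
        current = getattr(config, key, None)
        # Recurse only when the existing attribute is a nested config object;
        # otherwise the override value replaces the attribute wholesale.
        if isinstance(value, Mapping) and isinstance(current, SimpleNamespace):
            apply_mapping(current, value)
        else:
            setattr(config, key, value)
    return config


class DraftOverride:
    """Callable that applies the target mapping, then the internal draft override.

    A plain class instance is used instead of a closure so that it stays
    picklable when the class is defined at module level.
    """

    def __init__(self, target_overrides: Any, internal: Callable[[Any], Any]):
        # Callable target overrides are deliberately not propagated.
        if isinstance(target_overrides, Mapping):
            self.mapping = dict(target_overrides)
        else:
            self.mapping = {}
        self.internal = internal

    def __call__(self, config: Any) -> Any:
        return self.internal(apply_mapping(config, self.mapping))


def internal_draft_override(config: Any) -> Any:
    # Hypothetical stand-in for SpeculativeConfig.hf_config_override.
    config.is_draft = True
    return config


# Draft config starts from its own base values, then receives the target mapping.
draft_config = SimpleNamespace(
    text_config=SimpleNamespace(rope_parameters={"rope_type": "default"})
)
target_overrides = {
    "text_config": {"rope_parameters": {"rope_type": "yarn", "factor": 2.0}}
}
result = DraftOverride(target_overrides, internal_draft_override)(draft_config)
```

Passing a callable as target_overrides leaves the mapping empty, so only the internal override runs. That mirrors the PR's stated choice not to propagate callable overrides into the draft model.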

Test coverage

The focused regression coverage now includes:

  • nested dict YaRN/RoPE-style overrides
  • top-level dict-valued override replacement semantics, aligned with ModelConfig
  • mapping-like hf_overrides inputs, so they are not silently dropped
  • empty/default target hf_overrides, to ensure the draft hf_config_override still runs
  • pickling of mapping-derived draft overrides
  • scalar replacement for nested config attributes
  • callable target hf_overrides are intentionally not propagated into the draft model

Validation

  • python3 -m py_compile vllm/config/speculative.py tests/test_config.py
  • VLLM_TARGET_DEVICE=cpu PYTHONPATH=. .venv/bin/python -m pytest -q tests/test_config.py -k 'test_mtp_draft_model_config_preserves_target_hf_overrides or test_mtp_draft_model_config_does_not_propagate_callable_target_hf_overrides or test_mtp_draft_model_config_matches_top_level_dict_override_semantics or test_mtp_draft_model_config_preserves_mapping_target_hf_overrides or test_mtp_draft_model_config_keeps_default_hf_config_override or test_mtp_draft_model_config_mapping_hf_overrides_is_picklable or test_mtp_draft_model_config_allows_non_dict_nested_override_values'

Why this is not duplicating an existing PR

I checked for existing open PRs linked to #37435 and for likely duplicates in this area when opening the branch:

  • gh pr list --repo vllm-project/vllm --state open --search "37435 in:body"
  • gh pr list --repo vllm-project/vllm --state open --search "speculative hf_overrides mtp"

AI assistance

AI assistance was used to prepare this PR.

Changed files

  • tests/test_config.py (modified, +342/-0)
  • vllm/config/speculative.py (modified, +76/-1)
  • vllm/model_executor/models/config.py (modified, +2/-2)


Your current environment

  • vLLM version: main-ish / dev container image voipmonitor/vllm:dev-cu130
  • Serving stack: 4x Blackwell on a single node
  • Model: lukealonso/Qwen3.5-397B-A17B-NVFP4
  • Speculative decoding: native MTP (--speculative-config '{"method":"mtp","num_speculative_tokens":2}')
  • Context: 524288
  • RoPE/YaRN extension: applied via --hf-overrides

Describe the bug

I believe speculative/MTP draft config is not inheriting target-model --hf-overrides, which can break long-context speculation when RoPE/YaRN scaling is injected at runtime rather than baked into the checkpoint config.

In my setup, MTP works very well at shorter / ordinary contexts, but when requests get into very large prompts (for example large OpenClaw prompts well beyond the model's original context), the draft acceptance rate can collapse to ~0% while the target model still appears usable.

That pattern looks like a draft/target positional-config mismatch:

  • target model sees --hf-overrides with YaRN factor 2
  • draft/MTP side appears to rebuild its own ModelConfig
  • draft side does not appear to receive the same hf_overrides
  • beyond original context size, draft starts proposing nonsense / verifier rejects everything

Code path that looks suspicious

In vllm/config/speculative.py, draft config is rebuilt with a fresh ModelConfig(...) and hardcodes:

hf_overrides=SpeculativeConfig.hf_config_override,

instead of preserving / merging the target model's hf_overrides.

The relevant block currently looks like this:

self.draft_model_config = ModelConfig(
    model=self.model,
    runner="draft",
    tokenizer=self.target_model_config.tokenizer,
    tokenizer_mode=self.target_model_config.tokenizer_mode,
    trust_remote_code=self.target_model_config.trust_remote_code,
    allowed_local_media_path=self.target_model_config.allowed_local_media_path,
    allowed_media_domains=self.target_model_config.allowed_media_domains,
    dtype=self.target_model_config.dtype,
    seed=self.target_model_config.seed,
    revision=self.revision,
    code_revision=self.code_revision,
    tokenizer_revision=self.target_model_config.tokenizer_revision,
    spec_target_max_model_len=self.target_model_config.max_model_len,
    quantization=self.quantization,
    enforce_eager=self.target_model_config.enforce_eager,
    max_logprobs=self.target_model_config.max_logprobs,
    hf_overrides=SpeculativeConfig.hf_config_override,
    config_format=self.target_model_config.config_format,
)

Unless I am missing another propagation path, this seems to discard user-provided target-side HF overrides for the draft/MTP model.

Exact launch command

podman run --rm -it \
  --name vllm-qwen35-397b \
  --replace \
  --device nvidia.com/gpu=all \
  --ipc=host \
  -p 8000:8000 \
  -v /mnt/vault/llm/huggingface:/root/.cache/huggingface:Z \
  -v /mnt/vault/llm/vllm+lmcache/configs:/configs:ro,Z \
  -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e OMP_NUM_THREADS=4 \
  -e NCCL_IB_DISABLE=1 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e VLLM_TUNED_CONFIG_FOLDER=/configs \
  docker.io/voipmonitor/vllm:dev-cu130 \
    --host 0.0.0.0 \
    --port 8000 \
    --model lukealonso/Qwen3.5-397B-A17B-NVFP4 \
    --served-model-name qwen35-397b \
    --hf-overrides '{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":2.0,"original_max_position_embeddings":262144}}}' \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype auto \
    --enable-prefix-caching \
    --trust-remote-code \
    --max_num_seqs 32 \
    --max-num-batched-tokens 4096 \
    --max-model-len 524288 \
    --enable-auto-tool-choice \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

Observed behavior from logs

I can reproduce both healthy and catastrophic acceptance on the same running server.

Healthy examples:

SpecDecoding metrics: Mean acceptance length: 2.60 ... Avg Draft acceptance rate: 80.0%
SpecDecoding metrics: Mean acceptance length: 3.00 ... Avg Draft acceptance rate: 100.0%
SpecDecoding metrics: Mean acceptance length: 2.78 ... Avg Draft acceptance rate: 88.9%
SpecDecoding metrics: Mean acceptance length: 2.88 ... Avg Draft acceptance rate: 93.8%
SpecDecoding metrics: Mean acceptance length: 2.71 ... Avg Draft acceptance rate: 85.7%

Then, for larger prompts / long-context workloads, it can collapse hard:

SpecDecoding metrics: Mean acceptance length: 1.00 ... Accepted: 0 tokens, Drafted: 16 tokens ... Avg Draft acceptance rate: 0.0%
SpecDecoding metrics: Mean acceptance length: 1.00 ... Accepted: 0 tokens, Drafted: 676 tokens ... Avg Draft acceptance rate: 0.1%
SpecDecoding metrics: Mean acceptance length: 1.00 ... Accepted: 0 tokens, Drafted: 890 tokens ... Avg Draft acceptance rate: 0.0%
SpecDecoding metrics: Mean acceptance length: 1.01 ... Accepted: 2 tokens, Drafted: 422 tokens ... Avg Draft acceptance rate: 0.5%

This is not just "MTP is mediocre" — it flips between very good acceptance and nearly total rejection depending on prompt shape/length.
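To spot this flip in a long log without reading it line by line, a small parser along these lines could help. It is a minimal sketch: the regular expression is written against the log excerpts above (which are themselves elided with "..."), and parse_spec_metrics, collapsed_lines, and the 5% threshold are hypothetical choices, not part of vLLM.

```python
import re

# Matches the two fields that appear in every SpecDecoding metrics line above.
METRICS_RE = re.compile(
    r"Mean acceptance length: (?P<length>\d+(?:\.\d+)?)"
    r".*?Avg Draft acceptance rate: (?P<rate>\d+(?:\.\d+)?)%"
)


def parse_spec_metrics(line):
    """Return (mean_acceptance_length, acceptance_rate_percent) or None."""
    match = METRICS_RE.search(line)
    if match is None:
        return None
    return float(match["length"]), float(match["rate"])


def collapsed_lines(lines, rate_threshold=5.0):
    """Return the metrics lines whose acceptance rate is below the threshold."""
    flagged = []
    for line in lines:
        parsed = parse_spec_metrics(line)
        if parsed is not None and parsed[1] < rate_threshold:
            flagged.append(line)
    return flagged


sample_log = [
    "SpecDecoding metrics: Mean acceptance length: 2.60 ... Avg Draft acceptance rate: 80.0%",
    "SpecDecoding metrics: Mean acceptance length: 1.00 ... Accepted: 0 tokens, Drafted: 16 tokens ... Avg Draft acceptance rate: 0.0%",
    "SpecDecoding metrics: Mean acceptance length: 1.01 ... Accepted: 2 tokens, Drafted: 422 tokens ... Avg Draft acceptance rate: 0.5%",
]
flagged = collapsed_lines(sample_log)
```

Correlating the flagged lines with request prompt lengths would show whether the collapse tracks context length, which is what the hypothesis in this issue predicts.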

Why I suspect hf_overrides

My YaRN factor=2 is not baked into the checkpoint config. It is applied via --hf-overrides.

So if target and draft configs are resolved differently, the most likely failure mode is:

  • target path uses extended RoPE/YaRN params
  • draft/MTP path uses default/base config
  • once positions exceed original context, draft predictions become garbage
  • verifier rejects almost everything
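The numbers in the launch command make the suspected mismatch concrete. The sketch below assumes the usual reading of YaRN scaling, where the usable context is original_max_position_embeddings multiplied by factor; extended_context and draft_out_of_range are hypothetical helper names used only for illustration.

```python
import json

# The --hf-overrides value from the launch command above.
HF_OVERRIDES = (
    '{"text_config":{"rope_parameters":{"mrope_interleaved":true,'
    '"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,'
    '"partial_rotary_factor":0.25,"factor":2.0,'
    '"original_max_position_embeddings":262144}}}'
)


def rope_parameters(overrides_json):
    """Extract the nested RoPE parameters from the override JSON."""
    return json.loads(overrides_json)["text_config"]["rope_parameters"]


def extended_context(overrides_json):
    """Context length implied by the YaRN factor (assumed: original * factor)."""
    rope = rope_parameters(overrides_json)
    return int(rope["original_max_position_embeddings"] * rope["factor"])


def draft_out_of_range(position, overrides_json, draft_has_overrides):
    """True if a token position exceeds what the draft config was scaled for."""
    rope = rope_parameters(overrides_json)
    if draft_has_overrides:
        limit = extended_context(overrides_json)
    else:
        # Without the overrides the draft only knows the checkpoint's base range.
        limit = rope["original_max_position_embeddings"]
    return position >= limit


target_limit = extended_context(HF_OVERRIDES)
```

Under that assumption the target is scaled for 524288 positions, matching --max-model-len, while a draft that never received the overrides would be scaled only for 262144. Any position between the two is valid for the target but out of range for the draft.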

Expected behavior

For native MTP / speculative decoding using the same base model, the draft-side config should preserve the target model's HF config overrides, or at least provide a supported way to pass draft-side HF overrides explicitly.

At minimum, I would expect one of these:

  1. target hf_overrides are merged into the draft ModelConfig
  2. a draft-side override field exists in SpeculativeConfig
  3. a CLI mechanism exists to pass speculative/draft HF overrides separately

Additional note

I could not find a user-facing way to pass draft-side hf_overrides through --speculative-config. I also did not see hf_overrides, rope_scaling, or yarn as fields on SpeculativeConfig.

Question

Is this a real bug / missing propagation path, or am I missing some other place where target-side --hf-overrides are intentionally forwarded into the MTP draft config?

extent analysis

Fix Plan

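To resolve the issue, vllm/config/speculative.py has to preserve the target model's hf_overrides when building the draft model config, without dropping the internal SpeculativeConfig.hf_config_override that the draft path already relies on.

Here are the steps:

  • Modify the ModelConfig initialization in vllm/config/speculative.py so the draft config also receives the target model's hf_overrides. This is the part PR #37443 implements, for mapping-style overrides only.
  • Optionally, extend SpeculativeConfig to accept draft-side HF overrides explicitly. The issue reports that no such field exists; the draft_hf_overrides name used below is hypothetical.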

Code Changes

# In vllm/config/speculative.py
# Sketch only: draft_hf_overrides and combine_overrides are hypothetical names,
# not existing vLLM API.

class SpeculativeConfig:
    # ...
    # Hypothetical field: explicit draft-side HF overrides (None = inherit from target)
    draft_hf_overrides: dict | None = None

    # ... inside the method that builds the draft config:

        # Explicit draft-side overrides win; otherwise reuse the target's.
        user_overrides = self.draft_hf_overrides or self.target_model_config.hf_overrides

        self.draft_model_config = ModelConfig(
            # ...
            # Passing user_overrides alone would replace the internal
            # SpeculativeConfig.hf_config_override, so both are combined:
            # user overrides first, then the internal override.
            hf_overrides=combine_overrides(user_overrides, SpeculativeConfig.hf_config_override),
            # ...
        )

CLI Mechanism

--speculative-config already takes a JSON object (see the launch command above), so draft-side HF overrides would travel as one more key inside that object rather than as a new flag. The snippet is a standalone illustration, not vLLM's actual parser code, and the draft_hf_overrides key is hypothetical.

# Standalone illustration; draft_hf_overrides is a hypothetical key
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument(
    '--speculative-config',
    type=json.loads,
    default={},
    help='Speculative decoding configuration as a JSON object',
)

args = parser.parse_args()
speculative_kwargs = dict(args.speculative_config)

# Pull the hypothetical key out; None means "inherit the target's overrides"
draft_hf_overrides = speculative_kwargs.pop('draft_hf_overrides', None)

# Pass the draft_hf_overrides to the SpeculativeConfig
speculative_config = SpeculativeConfig(
    # ...
    draft_hf_overrides=draft_hf_overrides,
    # ...
)

Example Usage

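If such a field existed, draft-side HF overrides would be passed inside the --speculative-config option as shown here. vLLM as described in the issue has no draft_hf_overrides field, so this command is illustrative only; with PR #37443 applied, a mapping passed through --hf-overrides reaches the draft config without any extra argument.

podman run --rm -it \
  ... (container options as in the launch command above) ... \
  docker.io/voipmonitor/vllm:dev-cu130 \
    ... (other vLLM arguments as in the launch command above) ... \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2,"draft_hf_overrides":{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":2.0,"original_max_position_embeddings":262144}}}}'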

Verification

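To verify that the fix worked, run the same long-context workload with the updated code and watch the SpecDecoding metrics lines. The draft acceptance rate should no longer collapse to about 0% once prompts exceed the model's original 262144-token context, and should stay in the range seen for ordinary prompts.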
