vllm - ✅(Solved) Fix Speculative/MTP draft config appears to drop target --hf-overrides (breaks long-context YaRN/RoPE extension) [1 pull requests, 5 comments, 1 participants]

vllm • 2026-03-18 13:34:56

vllm-project/vllm#37435•Fetched 2026-04-08 00:58:41

Author

malaiwah

Fix Action

Fix / Workaround

The issue names three acceptable outcomes; PR #37443 implements the first:

target hf_overrides are merged into the draft ModelConfig
a draft-side override field exists in SpeculativeConfig
a CLI mechanism exists to pass speculative/draft HF overrides separately

PR fix notes

PR #37443: [Bugfix][Core] Preserve target hf_overrides in MTP draft config

Repository: vllm-project/vllm
Author: malaiwah
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37443

Description (problem / solution / changelog)

Summary

This PR fixes #37435 by preserving mapping-style target-model hf_overrides when vLLM rebuilds the speculative/MTP draft ModelConfig.

The updated implementation keeps the draft model derived from its own base config, then reapplies only mapping-style target hf_overrides before the internal speculative hf_config_override runs.

That preserves long-context / YaRN / RoPE overrides for the real user path while avoiding broader target-only config mutations from arbitrary callable overrides leaking into the derived draft model.
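The ordering described above (mapping-style target overrides first, then the internal speculative override) can be sketched in isolation. This is a minimal illustration, not the code from PR #37443: the helper names apply_mapping_overrides, compose_draft_override and fake_draft_hook are hypothetical, SimpleNamespace stands in for a Hugging Face config object, and the choice to replace dict-valued attributes wholesale is an assumption.

```python
from collections.abc import Mapping
from types import SimpleNamespace


def apply_mapping_overrides(config, overrides):
    """Copy mapping-style overrides onto a config object, recursing into sub-configs."""
    for key, value in overrides.items():
        current = getattr(config, key, None)
        if isinstance(value, Mapping) and isinstance(current, SimpleNamespace):
            # Nested sub-config (e.g. text_config): descend instead of replacing it.
            apply_mapping_overrides(current, value)
        elif isinstance(value, Mapping):
            # Dict-valued attribute (e.g. rope_parameters): replace it wholesale.
            setattr(config, key, dict(value))
        else:
            setattr(config, key, value)


def compose_draft_override(target_overrides, draft_hook):
    """Return a callable that applies target mapping overrides, then the draft hook."""
    def override(config):
        # Only mapping-style overrides are propagated; callables are skipped.
        if isinstance(target_overrides, Mapping):
            apply_mapping_overrides(config, target_overrides)
        return draft_hook(config)
    return override


def fake_draft_hook(config):
    """Stand-in for the internal speculative override (hypothetical behaviour)."""
    config.architectures = ["DraftMTP"]
    return config


def make_base_config():
    """Build a toy draft base config with a nested text_config."""
    return SimpleNamespace(
        architectures=["Base"],
        text_config=SimpleNamespace(
            hidden_size=8,
            rope_parameters={"rope_type": "default"},
        ),
    )


target_overrides = {
    "text_config": {"rope_parameters": {"rope_type": "yarn", "factor": 2.0}}
}
draft = compose_draft_override(target_overrides, fake_draft_hook)(make_base_config())
print(draft.text_config.rope_parameters)
print(draft.architectures)
```

A closure like the one returned here would not survive pickling, which matters when the config is sent to spawned workers; since the PR's tests cover pickling, the real implementation presumably builds the override differently.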

Test coverage

The focused regression coverage now includes:

nested dict YaRN/RoPE-style overrides
top-level dict-valued override replacement semantics, aligned with ModelConfig
mapping-like hf_overrides inputs, so they are not silently dropped
empty/default target hf_overrides, to ensure the draft hf_config_override still runs
pickling of mapping-derived draft overrides
scalar replacement for nested config attributes
callable target hf_overrides are intentionally not propagated into the draft model

Validation

python3 -m py_compile vllm/config/speculative.py tests/test_config.py
VLLM_TARGET_DEVICE=cpu PYTHONPATH=. .venv/bin/python -m pytest -q tests/test_config.py -k 'test_mtp_draft_model_config_preserves_target_hf_overrides or test_mtp_draft_model_config_does_not_propagate_callable_target_hf_overrides or test_mtp_draft_model_config_matches_top_level_dict_override_semantics or test_mtp_draft_model_config_preserves_mapping_target_hf_overrides or test_mtp_draft_model_config_keeps_default_hf_config_override or test_mtp_draft_model_config_mapping_hf_overrides_is_picklable or test_mtp_draft_model_config_allows_non_dict_nested_override_values'

Why this is not duplicating an existing PR

I checked for existing open PRs linked to #37435 and for likely duplicates in this area when opening the branch:

gh pr list --repo vllm-project/vllm --state open --search "37435 in:body"
gh pr list --repo vllm-project/vllm --state open --search "speculative hf_overrides mtp"

AI assistance

AI assistance was used to prepare this PR.

Changed files

tests/test_config.py (modified, +342/-0)
vllm/config/speculative.py (modified, +76/-1)
vllm/model_executor/models/config.py (modified, +2/-2)

Your current environment

vLLM version: main-ish / dev container image voipmonitor/vllm:dev-cu130
Serving stack: 4x Blackwell on a single node
Model: lukealonso/Qwen3.5-397B-A17B-NVFP4
Speculative decoding: native MTP (--speculative-config '{"method":"mtp","num_speculative_tokens":2}')
Context: 524288
RoPE/YaRN extension: applied via --hf-overrides

Describe the bug

I believe speculative/MTP draft config is not inheriting target-model --hf-overrides, which can break long-context speculation when RoPE/YaRN scaling is injected at runtime rather than baked into the checkpoint config.

In my setup, MTP works very well at shorter / ordinary contexts, but when requests get into very large prompts (for example large OpenClaw prompts well beyond the model's original context), the draft acceptance rate can collapse to ~0% while the target model still appears usable.

That pattern looks like a draft/target positional-config mismatch:

target model sees --hf-overrides with YaRN factor 2
draft/MTP side appears to rebuild its own ModelConfig
draft side does not appear to receive the same hf_overrides
beyond original context size, draft starts proposing nonsense / verifier rejects everything

Code path that looks suspicious

In vllm/config/speculative.py, draft config is rebuilt with a fresh ModelConfig(...) and hardcodes:

hf_overrides=SpeculativeConfig.hf_config_override,

instead of preserving / merging the target model's hf_overrides.

The relevant block currently looks like this:

self.draft_model_config = ModelConfig(
    model=self.model,
    runner="draft",
    tokenizer=self.target_model_config.tokenizer,
    tokenizer_mode=self.target_model_config.tokenizer_mode,
    trust_remote_code=self.target_model_config.trust_remote_code,
    allowed_local_media_path=self.target_model_config.allowed_local_media_path,
    allowed_media_domains=self.target_model_config.allowed_media_domains,
    dtype=self.target_model_config.dtype,
    seed=self.target_model_config.seed,
    revision=self.revision,
    code_revision=self.code_revision,
    tokenizer_revision=self.target_model_config.tokenizer_revision,
    spec_target_max_model_len=self.target_model_config.max_model_len,
    quantization=self.quantization,
    enforce_eager=self.target_model_config.enforce_eager,
    max_logprobs=self.target_model_config.max_logprobs,
    hf_overrides=SpeculativeConfig.hf_config_override,
    config_format=self.target_model_config.config_format,
)

Unless I am missing another propagation path, this seems to discard user-provided target-side HF overrides for the draft/MTP model.

Exact launch command

podman run --rm -it \
  --name vllm-qwen35-397b \
  --replace \
  --device nvidia.com/gpu=all \
  --ipc=host \
  -p 8000:8000 \
  -v /mnt/vault/llm/huggingface:/root/.cache/huggingface:Z \
  -v /mnt/vault/llm/vllm+lmcache/configs:/configs:ro,Z \
  -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e OMP_NUM_THREADS=4 \
  -e NCCL_IB_DISABLE=1 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e VLLM_TUNED_CONFIG_FOLDER=/configs \
  docker.io/voipmonitor/vllm:dev-cu130 \
    --host 0.0.0.0 \
    --port 8000 \
    --model lukealonso/Qwen3.5-397B-A17B-NVFP4 \
    --served-model-name qwen35-397b \
    --hf-overrides '{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":2.0,"original_max_position_embeddings":262144}}}' \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype auto \
    --enable-prefix-caching \
    --trust-remote-code \
    --max_num_seqs 32 \
    --max-num-batched-tokens 4096 \
    --max-model-len 524288 \
    --enable-auto-tool-choice \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

Observed behavior from logs

I can reproduce both healthy and catastrophic acceptance on the same running server.

Healthy examples:

SpecDecoding metrics: Mean acceptance length: 2.60 ... Avg Draft acceptance rate: 80.0%
SpecDecoding metrics: Mean acceptance length: 3.00 ... Avg Draft acceptance rate: 100.0%
SpecDecoding metrics: Mean acceptance length: 2.78 ... Avg Draft acceptance rate: 88.9%
SpecDecoding metrics: Mean acceptance length: 2.88 ... Avg Draft acceptance rate: 93.8%
SpecDecoding metrics: Mean acceptance length: 2.71 ... Avg Draft acceptance rate: 85.7%

Then, for larger prompts / long-context workloads, it can collapse hard:

SpecDecoding metrics: Mean acceptance length: 1.00 ... Accepted: 0 tokens, Drafted: 16 tokens ... Avg Draft acceptance rate: 0.0%
SpecDecoding metrics: Mean acceptance length: 1.00 ... Accepted: 0 tokens, Drafted: 676 tokens ... Avg Draft acceptance rate: 0.1%
SpecDecoding metrics: Mean acceptance length: 1.00 ... Accepted: 0 tokens, Drafted: 890 tokens ... Avg Draft acceptance rate: 0.0%
SpecDecoding metrics: Mean acceptance length: 1.01 ... Accepted: 2 tokens, Drafted: 422 tokens ... Avg Draft acceptance rate: 0.5%

This is not just "MTP is mediocre" — it flips between very good acceptance and nearly total rejection depending on prompt shape/length.
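The two metrics in these log lines appear to move together. Assuming the mean acceptance length counts one token from the target per step plus the accepted draft tokens (an assumption about how the metric is defined, not something stated in the issue), it should be roughly 1 + num_speculative_tokens × acceptance rate. The sketch below checks that relation against the values logged above, with num_speculative_tokens = 2 taken from the launch command.

```python
NUM_SPECULATIVE_TOKENS = 2  # from --speculative-config in the launch command

# (mean acceptance length, draft acceptance rate in percent), copied from the logs
LOG_SAMPLES = [
    (2.60, 80.0), (3.00, 100.0), (2.78, 88.9), (2.88, 93.8), (2.71, 85.7),
    (1.00, 0.0), (1.00, 0.1), (1.00, 0.0), (1.01, 0.5),
]


def expected_acceptance_length(rate_percent, k=NUM_SPECULATIVE_TOKENS):
    """Predicted mean acceptance length under the assumed metric definition."""
    return 1.0 + k * rate_percent / 100.0


for length, rate in LOG_SAMPLES:
    predicted = expected_acceptance_length(rate)
    print(f"rate={rate:5.1f}%  logged={length:.2f}  predicted={predicted:.2f}")
```

Under that assumption, a mean acceptance length of 1.00 means the draft contributes nothing and every step costs a wasted draft pass.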

Why I suspect `hf_overrides`

My YaRN factor=2 is not baked into the checkpoint config. It is applied via --hf-overrides.

So if target and draft configs are resolved differently, the most likely failure mode is:

target path uses extended RoPE/YaRN params
draft/MTP path uses default/base config
once positions exceed original context, draft predictions become garbage
verifier rejects almost everything
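The arithmetic behind this hypothesis can be made explicit. The sketch below uses a trimmed copy of the rope_parameters from the launch command (only the three keys needed here) and shows that the YaRN factor is what makes the configured --max-model-len reachable. The function name exceeds_base_context is hypothetical, and the idea that degradation starts exactly at the original context length is the issue's hypothesis, not a measured threshold.

```python
import json

# Trimmed copy of the --hf-overrides value from the launch command.
HF_OVERRIDES = (
    '{"text_config":{"rope_parameters":{"rope_type":"yarn","factor":2.0,'
    '"original_max_position_embeddings":262144}}}'
)
MAX_MODEL_LEN = 524288  # from --max-model-len

rope = json.loads(HF_OVERRIDES)["text_config"]["rope_parameters"]
original_len = rope["original_max_position_embeddings"]
extended_len = int(rope["factor"] * original_len)


def exceeds_base_context(prompt_len):
    """True when a prompt is longer than an unscaled (base) config covers."""
    return prompt_len > original_len


print(f"base context: {original_len}, extended context: {extended_len}")
for prompt_len in (32_000, 262_144, 300_000, MAX_MODEL_LEN):
    print(prompt_len, exceeds_base_context(prompt_len))
```

If the draft side really runs with the base config, prompts in the upper half of the configured window are where the mismatch would be expected to show up.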

Expected behavior

For native MTP / speculative decoding using the same base model, the draft-side config should preserve the target model's HF config overrides, or at least provide a supported way to pass draft-side HF overrides explicitly.

At minimum, I would expect one of these:

target hf_overrides are merged into the draft ModelConfig
a draft-side override field exists in SpeculativeConfig
a CLI mechanism exists to pass speculative/draft HF overrides separately

Additional note

I could not find a user-facing way to pass draft-side hf_overrides through --speculative-config. I also did not see hf_overrides, rope_scaling, or yarn as fields on SpeculativeConfig.

Question

Is this a real bug / missing propagation path, or am I missing some other place where target-side --hf-overrides are intentionally forwarded into the MTP draft config?

extent analysis

Fix Plan

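To resolve the issue, vllm/config/speculative.py must preserve the target model's hf_overrides when it builds the draft model config, without dropping the internal SpeculativeConfig.hf_config_override that the draft model still needs.

The steps are:

Modify the ModelConfig initialization in vllm/config/speculative.py so that the target model's hf_overrides are applied in addition to SpeculativeConfig.hf_config_override. This is the approach PR #37443 describes.
Optionally, extend SpeculativeConfig to accept draft-side HF overrides explicitly. This is a proposal from the issue; the description of PR #37443 does not mention such a field.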

Code Changes

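# In vllm/config/speculative.py
# Illustrative sketch only; the code in PR #37443 differs in detail.

# Passing target_model_config.hf_overrides directly would replace
# SpeculativeConfig.hf_config_override, which the draft model still needs.
# Apply the target's mapping-style overrides first, then run the hook.
self.draft_model_config = ModelConfig(
    # ...
    hf_overrides=merge_target_overrides(  # hypothetical helper, not a vLLM API
        self.target_model_config.hf_overrides,
        SpeculativeConfig.hf_config_override,
    ),
    # ...
)

# Optional: a new field on SpeculativeConfig for draft-side HF overrides
class SpeculativeConfig:
    # ...
    draft_hf_overrides: dict | None = None  # hypothetical field

# When building the draft config, prefer draft_hf_overrides if provided.
# This code runs inside SpeculativeConfig, so self is the speculative config.
self.draft_model_config = ModelConfig(
    # ...
    hf_overrides=merge_target_overrides(
        self.draft_hf_overrides or self.target_model_config.hf_overrides,
        SpeculativeConfig.hf_config_override,
    ),
    # ...
)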

CLI Mechanism

Because --speculative-config already takes a JSON object, draft-side HF overrides could be exposed as one more key in that object, provided SpeculativeConfig gains a matching field. The draft_hf_overrides key below is a proposal, not an existing vLLM option.

# Hypothetical standalone sketch; vLLM's own argument handling is organized differently.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument(
    '--speculative-config',
    type=json.loads,
    help='Speculative decoding configuration (JSON object)',
    default='{}',
)

args = parser.parse_args()
speculative_config = args.speculative_config

# Read the proposed key; it is None when the user did not supply it.
draft_hf_overrides = speculative_config.get('draft_hf_overrides')

# Pass draft_hf_overrides on once SpeculativeConfig accepts such a field.
speculative_config = SpeculativeConfig(
    # ...
    draft_hf_overrides=draft_hf_overrides,
    # ...
)

Example Usage

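If a draft_hf_overrides field were added (it is hypothetical; the issue notes that SpeculativeConfig has no such field), draft-side HF overrides could be passed through the --speculative-config option. The "# ..." lines stand for the unchanged flags of the original launch command: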

podman run --rm -it \
  # ...
  --speculative-config '{"method":"mtp","num_speculative_tokens":2,"draft_hf_overrides":{"text_config":{"rope_parameters":{"mrope_interleaved":true,"mrope_section":[11,11,10],"rope_type":"yarn","rope_theta":10000000,"partial_rotary_factor":0.25,"factor":2.0,"original_max_position_embeddings":262144}}}}' \
  # ...

Verification

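To verify that the fix worked, run the same long-context workload with the updated code and check the SpecDecoding metrics in the server log. The draft acceptance rate should no longer collapse to roughly 0% once prompts exceed the model's original context length (262144 tokens in the reported setup).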
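A small log check can make this verification repeatable. The sketch below extracts the "Avg Draft acceptance rate" values from SpecDecoding metrics lines in the format quoted in the issue and flags a collapse. The function names and the 5% threshold are assumptions chosen for illustration, not values from vLLM.

```python
import re

# Matches the rate field as it appears in the log lines quoted in the issue.
RATE_PATTERN = re.compile(r"Avg Draft acceptance rate:\s*([0-9.]+)%")


def acceptance_rates(log_text):
    """Return every draft acceptance rate (in percent) found in the log text."""
    return [float(value) for value in RATE_PATTERN.findall(log_text)]


def has_collapsed(log_text, threshold_percent=5.0):
    """True when any logged acceptance rate falls below the threshold."""
    rates = acceptance_rates(log_text)
    return bool(rates) and min(rates) < threshold_percent


HEALTHY_LOG = """\
SpecDecoding metrics: Mean acceptance length: 2.60 ... Avg Draft acceptance rate: 80.0%
SpecDecoding metrics: Mean acceptance length: 3.00 ... Avg Draft acceptance rate: 100.0%
"""

COLLAPSED_LOG = """\
SpecDecoding metrics: Mean acceptance length: 1.00 ... Accepted: 0 tokens, Drafted: 16 tokens ... Avg Draft acceptance rate: 0.0%
SpecDecoding metrics: Mean acceptance length: 1.01 ... Accepted: 2 tokens, Drafted: 422 tokens ... Avg Draft acceptance rate: 0.5%
"""

print("healthy collapsed?", has_collapsed(HEALTHY_LOG))
print("long-context collapsed?", has_collapsed(COLLAPSED_LOG))
```

Running this over logs captured before and after the change, with the same long prompts, gives a direct comparison.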
