vllm - ✅(Solved) Fix [Bug]: AssertionError: Encoder cache miss crashes engine with MTP + multimodal under high concurrency [4 pull requests, 9 comments, 4 participants]

vllm2026-03-30 13:52:05

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#38551•Fetched 2026-04-08 01:53:23

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Assignees

Timeline (top)

commented ×9mentioned ×7subscribed ×7referenced ×3

Error Message

AssertionError: Encoder cache miss for <mm_hash>

Fix Action

Workaround

We patched gpu_model_runner.py to replace the fatal assertion in _gather_mm_embeddings with a check-and-skip: if self.encoder_cache.get(mm_hash, None) returns None, we log a warning and continue past that embedding instead of asserting. The request completes normally — MTP just skips the missing embedding for that draft step. No impact on output quality.

PR fix notes

PR #38622: [Bug] Fix encoder cache miss assertion crash with MTP + multimodal

Repository: vllm-project/vllm
Author: esmeetu
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38622

Description (problem / solution / changelog)

Summary

Replaces the fatal assert in _gather_mm_embeddings (gpu_model_runner.py:2952) with a graceful warning + continue, so that MTP draft proposals survive encoder-cache eviction under memory pressure instead of crashing the entire engine.
Adds regression tests in tests/v1/worker/test_encoder_cache_miss.py (5 test cases covering normal path, MTP path, cache miss handling, warning logging, and partial eviction).

Closes #38551

Why this is not duplicating an existing PR

Issue #38551 is specific to MTP speculative decoding (not EAGLE3, which was fixed in #34220).
No open PRs address this specific assertion in the MTP code path.

Test commands run and results

$ python -m pytest tests/v1/worker/test_encoder_cache_miss.py -v --noconftest
5 passed in 0.05s

Additionally verified.

Deployed Qwen/Qwen3.5-9B with MTP on Slurm (GB200 TP=4)
Ran 600+ concurrent multimodal requests (30 concurrency × 200 per round × 3 rounds)
Server stayed healthy through all stress tests including reset_mm_cache attacks

AI assistance

This PR was developed with AI assistance (Claude). All changes were reviewed and tested end-to-end.

Test plan

Unit tests pass (test_encoder_cache_miss.py)
Live deployment with MTP + multimodal traffic — no crashes
Server stays healthy under concurrent reset_mm_cache + multimodal load

🤖 Generated with Claude Code

Changed files

tests/v1/worker/test_encoder_cache_miss.py (added, +218/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +18/-1)

PR #38907: Fix the order of _free_encoder_inputs

Repository: vllm-project/vllm
Author: gty111
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38907

Description (problem / solution / changelog)

Purpose

#38551

freeable can be triggered before fetch due to an inconsistency in how num_computed_tokens is updated.

After a request is scheduled, the scheduler immediately updates the token counter:

request.num_computed_tokens += num_scheduled_tokens

However, at this point the GPU has not actually finished computing those tokens yet. Some of these tokens are still in the scheduling pipeline and have not been executed.

Later, in _free_encoder_inputs, the logic uses the updated num_computed_tokens to determine whether the encoder inputs are freeable. Since the counter already includes tokens that are only scheduled (but not yet computed), the condition may incorrectly evaluate to true.

As a result, encoder inputs may be marked as freeable too early, even though the corresponding GPU computation has not completed.

Test Plan

Test Result

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

</details>

Changed files

vllm/v1/core/sched/scheduler.py (modified, +4/-8)

PR #39543: [Bugfix][V1] Defer encoder cache eviction until after MTP draft proposal

Repository: vllm-project/vllm
Author: kaiktl
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39543

Description (problem / solution / changelog)

Purpose

Fixes the intra-step encoder cache race in https://github.com/vllm-project/vllm/issues/38551.

_update_states pops encoder cache entries before MTP's _gather_mm_embeddings reads them in the same step. This defers the pop until after both target forward and MTP draft proposal complete. Applied to both old (gpu_model_runner.py) and new (gpu/model_runner.py) model runners.

Note: This fixes the intra-step race only. A separate cross-step race also exists where async scheduling frees entries based on speculative num_computed_tokens before the GPU has consumed them. Refer to https://github.com/vllm-project/vllm/issues/38551#issuecomment-4207097289 for details. Follow-up: https://github.com/vllm-project/vllm/pull/39544.

Related PRs:

https://github.com/vllm-project/vllm/pull/39544 — fixes the cross-step variant (model-runner-side eviction)
https://github.com/vllm-project/vllm/pull/38907 — reorders _free_encoder_inputs in the scheduler to fix speculative num_computed_tokens timing
https://github.com/vllm-project/vllm/pull/38622 — graceful skip in _gather_mm_embeddings instead of fatal assertion (based on the production patch highlighted in https://github.com/vllm-project/vllm/issues/38551)

Reproduction

Serve a multimodal model with MTP (e.g. Qwen/Qwen3.5-VL-27B-FP8 with --speculative-config.method mtp --speculative-config.num_speculative_tokens 1)
Shrink encoder cache to 2048 tokens (default is 16384) to accelerate evictions
Send 80+ concurrent 1024×1024 image requests
With diagnostic logging, 13 entries were about to be popped while MTP still needed them in the same step — each would cause AssertionError: Encoder cache miss for <mm_hash> in _gather_mm_embeddings

Test Plan

pytest tests/v1/worker/test_deferred_encoder_cache_free.py -v

Test Result

8 new tests pass.

Changed files

tests/v1/worker/test_deferred_encoder_cache_free.py (added, +172/-0)
vllm/v1/worker/gpu/model_runner.py (modified, +14/-2)
vllm/v1/worker/gpu_model_runner.py (modified, +14/-3)

PR #39544: [Bugfix][V1] Model-runner-side encoder cache eviction for async scheduling race

Repository: vllm-project/vllm
Author: kaiktl
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39544

Description (problem / solution / changelog)

Purpose

Fixes the cross-step encoder cache race in https://github.com/vllm-project/vllm/issues/38551. Stacked on https://github.com/vllm-project/vllm/pull/39543.

With async scheduling (auto-enabled for MTP), the scheduler speculatively advances num_computed_tokens and frees encoder cache entries before the GPU has consumed them. This moves eviction for multimodal models to the model runner, using actual GPU progress. Encoder-decoder models (e.g. Whisper) are unchanged.

Related PRs:

https://github.com/vllm-project/vllm/pull/39543 — fixes the intra-step variant (deferred eviction)
https://github.com/vllm-project/vllm/pull/38907 — reorders _free_encoder_inputs in the scheduler to fix speculative num_computed_tokens timing
https://github.com/vllm-project/vllm/pull/38622 — graceful skip in _gather_mm_embeddings instead of fatal assertion (based on the production patch highlighted in https://github.com/vllm-project/vllm/issues/38551)

Reproduction

The race condition across async scheduling requires:
- Scheduler speculatively inflates num_computed_tokens including unverified spec tokens
- Frees an encoder entry based on that inflated count
- Spec tokens get rejected, num_computed_tokens rolls back — but the entry is already gone
This is timing-dependent and rare (1 natural occurrence in 60K production requests on Qwen/Qwen3.5-VL-27B-FP8 with --speculative-config.method mtp --speculative-config.num_speculative_tokens 1)
To force it, apply a monkey-patch that randomly evicts an active encoder cache entry (one not in the scheduler's free list) with 30% probability per step:

_active_hashes = [h for h in self.encoder_cache
                  if h not in (scheduler_output.free_encoder_mm_hashes or [])]
if _active_hashes and _def_random.random() < 0.3:
    _victim = _def_random.choice(_active_hashes)
    self.encoder_cache.pop(_victim, None)

With this sabotage + shrunk encoder cache (2048 tokens), 2 entries were evicted while MTP still needed them, each causing AssertionError: Encoder cache miss for <mm_hash> in 80 concurrent multimodal requests

Test Plan

pytest tests/v1/core/test_encoder_cache_free.py tests/v1/worker/test_deferred_encoder_cache_free.py -v

Test Result

18 tests pass (10 new + 8 updated).

Changed files

tests/v1/core/test_encoder_cache_free.py (added, +301/-0)
tests/v1/worker/test_deferred_encoder_cache_free.py (added, +171/-0)
vllm/v1/core/sched/scheduler.py (modified, +25/-6)
vllm/v1/worker/gpu/model_runner.py (modified, +43/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +32/-3)

Code Example

AssertionError: Encoder cache miss for <mm_hash>

---

--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
--enable-chunked-prefill
--enable-prefix-caching
--enforce-eager
--gpu-memory-utilization 0.90
--max-model-len 16384
--limit-mm-per-prompt.image 8 --limit-mm-per-prompt.video 1

---

# Before (crashes the engine):
encoder_output = self.encoder_cache.get(mm_hash, None)
assert encoder_output is not None, f"Encoder cache miss for {mm_hash}."

# After (skips gracefully):
encoder_output = self.encoder_cache.get(mm_hash, None)
if encoder_output is None:
    logger.warning("Encoder cache miss for %s, skipping mm embedding", mm_hash)
    continue

RAW_BUFFERClick to expand / collapse

Your current environment

vLLM version: v0.18.0
Model: Qwen/Qwen3.5-27B-FP8
GPU: NVIDIA L40S (48GB)
OS: Linux (EKS, Ubuntu-based)

Bug description

When using MTP speculative decoding (not EAGLE3) with multimodal (vision) inputs under sustained high-concurrency production traffic, the engine crashes with:

AssertionError: Encoder cache miss for <mm_hash>

The crash originates in gpu_model_runner.py during propose_draft_token_ids → _gather_mm_embeddings. Under heavy load, the encoder cache evicts entries that MTP still needs for draft token proposal, causing a fatal assertion that kills the entire engine and drops all in-flight requests.

This is distinct from #32469 (fixed by #34220), which addressed EAGLE3 + disable_chunked_mm_input. In that case, encoder inputs were never scheduled. In our case, the entries are produced and cached correctly, but get evicted under memory pressure before MTP's draft proposer reads them.

How to reproduce

Flags:

--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
--enable-chunked-prefill
--enable-prefix-caching
--enforce-eager
--gpu-memory-utilization 0.90
--max-model-len 16384
--limit-mm-per-prompt.image 8 --limit-mm-per-prompt.video 1

Trigger: Sustained high-concurrency mixed text + multimodal traffic in production. The crash typically occurs after hours of traffic, not immediately — it requires enough encoder cache pressure for eviction to occur while MTP is still referencing entries.

Workaround

Suggested diff

# Before (crashes the engine):
encoder_output = self.encoder_cache.get(mm_hash, None)
assert encoder_output is not None, f"Encoder cache miss for {mm_hash}."

# After (skips gracefully):
encoder_output = self.encoder_cache.get(mm_hash, None)
if encoder_output is None:
    logger.warning("Encoder cache miss for %s, skipping mm embedding", mm_hash)
    continue

Validated under mixed text + multimodal traffic at 10 concurrency with 100% success rate. Happy to submit a PR if this approach is welcome.

Suggested fix

For a more robust long-term solution, the encoder cache should not evict entries still referenced by active MTP draft proposals, or the scheduler should gate MTP proposals based on encoder cache availability. See also #26777 for the same symptom with EAGLE3.

Before submitting a new issue...

I have searched existing issues and this is specific to MTP (not EAGLE3)
I am running the latest vLLM release (v0.18.0)

extent analysis

Fix Plan

To address the encoder cache eviction issue under high-concurrency production traffic, we will implement a more robust solution that prevents the encoder cache from evicting entries still referenced by active MTP draft proposals.

Here are the steps:

Modify the gpu_model_runner.py to track active MTP draft proposals and their corresponding encoder cache entries.
Update the encoder cache eviction policy to exclude entries referenced by active MTP draft proposals.
Implement a scheduler to gate MTP proposals based on encoder cache availability.

Example code snippet to track active MTP draft proposals and their corresponding encoder cache entries:

class GPUModelRunner:
    def __init__(self, ...):
        self.active_proposals = {}  # Track active MTP draft proposals

    def propose_draft_token_ids(self, ...):
        # ...
        mm_hash = ...
        self.active_proposals[mm_hash] = True  # Mark entry as active
        # ...

    def _gather_mm_embeddings(self, mm_hash):
        if mm_hash in self.active_proposals:
            # Do not evict this entry from the encoder cache
            encoder_output = self.encoder_cache.get(mm_hash, None)
            if encoder_output is None:
                logger.warning("Encoder cache miss for %s, skipping mm embedding", mm_hash)
                continue
            # ...
        else:
            # Entry is not active, proceed with normal eviction policy
            encoder_output = self.encoder_cache.get(mm_hash, None)
            if encoder_output is None:
                logger.warning("Encoder cache miss for %s, skipping mm embedding", mm_hash)
                continue
            # ...

Verification

To verify that the fix worked, test the system under sustained high-concurrency mixed text + multimodal traffic in production. Monitor the system for crashes and encoder cache misses. If the fix is successful, the system should no longer crash due to encoder cache eviction, and the number of encoder cache misses should be significantly reduced.

Extra Tips

Consider implementing a mechanism to limit the number of active MTP draft proposals to prevent excessive memory usage.
Monitor the system's memory usage and adjust the encoder cache size and eviction policy as needed to prevent memory pressure.
Refer to issue #26777 for similar symptoms with EAGLE3 and consider implementing a similar solution to prevent encoder cache eviction issues.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#pipeline error #runtime error #dependency conflict #environment setup #docker error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

vllm - ✅(Solved) Fix [Bug]: AssertionError: Encoder cache miss crashes engine with MTP + multimodal under high concurrency [4 pull requests, 9 comments, 4 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Workaround

PR fix notes

PR #38622: [Bug] Fix encoder cache miss assertion crash with MTP + multimodal

Description (problem / solution / changelog)

Summary

Why this is not duplicating an existing PR

Test commands run and results

AI assistance

Test plan

Changed files

PR #38907: Fix the order of _free_encoder_inputs

Description (problem / solution / changelog)

Purpose

Test Plan

Test Result

Changed files

PR #39543: [Bugfix][V1] Defer encoder cache eviction until after MTP draft proposal

Description (problem / solution / changelog)

Purpose

Reproduction

Test Plan

Test Result

Changed files

PR #39544: [Bugfix][V1] Model-runner-side encoder cache eviction for async scheduling race

Description (problem / solution / changelog)

Purpose

Reproduction

Test Plan

Test Result

Changed files

Code Example

Your current environment

Bug description

How to reproduce

Workaround

Suggested diff

Suggested fix

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING