vllm - ✅(Solved) Fix [Bug]: AssertionError: Encoder cache miss crashes engine with MTP + multimodal under high concurrency [4 pull requests, 9 comments, 4 participants]

GitHub stats
vllm-project/vllm#38551 · Fetched 2026-04-08 01:53:23
Comments: 9
Participants: 4
Timeline events: 29
Reactions: 0
Timeline (top): commented ×9, mentioned ×7, subscribed ×7, referenced ×3

Error Message

AssertionError: Encoder cache miss for <mm_hash>

Fix Action

Workaround

We patched gpu_model_runner.py to replace the fatal assertion in _gather_mm_embeddings with a check-and-skip: if self.encoder_cache.get(mm_hash, None) returns None, we log a warning and continue past that embedding instead of asserting. The request completes normally — MTP just skips the missing embedding for that draft step. No impact on output quality.

PR fix notes

PR #38622: [Bug] Fix encoder cache miss assertion crash with MTP + multimodal

Description (problem / solution / changelog)

Summary

  • Replaces the fatal assert in _gather_mm_embeddings (gpu_model_runner.py:2952) with a graceful warning + continue, so that MTP draft proposals survive encoder-cache eviction under memory pressure instead of crashing the entire engine.
  • Adds regression tests in tests/v1/worker/test_encoder_cache_miss.py (5 test cases covering normal path, MTP path, cache miss handling, warning logging, and partial eviction).

Closes #38551

Why this is not duplicating an existing PR

  • Issue #38551 is specific to MTP speculative decoding (not EAGLE3, which was fixed in #34220).
  • No open PRs address this specific assertion in the MTP code path.

Test commands run and results

$ python -m pytest tests/v1/worker/test_encoder_cache_miss.py -v --noconftest
5 passed in 0.05s

Additionally verified:

  • Deployed Qwen/Qwen3.5-9B with MTP on Slurm (GB200 TP=4)
  • Ran 600+ concurrent multimodal requests (30 concurrency × 200 per round × 3 rounds)
  • Server stayed healthy through all stress tests including reset_mm_cache attacks

AI assistance

This PR was developed with AI assistance (Claude). All changes were reviewed and tested end-to-end.

Test plan

  • Unit tests pass (test_encoder_cache_miss.py)
  • Live deployment with MTP + multimodal traffic — no crashes
  • Server stays healthy under concurrent reset_mm_cache + multimodal load

🤖 Generated with Claude Code

Changed files

  • tests/v1/worker/test_encoder_cache_miss.py (added, +218/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +18/-1)

PR #38907: Fix the order of _free_encoder_inputs

Description (problem / solution / changelog)

Purpose

#38551

[Screenshot: Clipboard_Screenshot_1775212606]

freeable can be triggered before fetch due to an inconsistency in how num_computed_tokens is updated.

After a request is scheduled, the scheduler immediately updates the token counter:

request.num_computed_tokens += num_scheduled_tokens

However, at this point the GPU has not actually finished computing those tokens yet. Some of these tokens are still in the scheduling pipeline and have not been executed.

Later, in _free_encoder_inputs, the logic uses the updated num_computed_tokens to determine whether the encoder inputs are freeable. Since the counter already includes tokens that are only scheduled (but not yet computed), the condition may incorrectly evaluate to true.

As a result, encoder inputs may be marked as freeable too early, even though the corresponding GPU computation has not completed.
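
The premature-free condition described above can be illustrated with a small model. This is a sketch under stated assumptions, not vLLM's scheduler code: `ToyRequest`, `encoder_end`, and `num_executed_tokens` are hypothetical names introduced for illustration; only `num_computed_tokens` and `num_scheduled_tokens` come from the PR description.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    # Position just past the last token that needs the encoder output (hypothetical).
    encoder_end: int
    # Advanced at schedule time, as the PR description states.
    num_computed_tokens: int = 0
    # Advanced only after the GPU step has run (hypothetical bookkeeping).
    num_executed_tokens: int = 0


def schedule(request: ToyRequest, num_scheduled_tokens: int) -> None:
    """Scheduler side: the counter moves before any GPU work happens."""
    request.num_computed_tokens += num_scheduled_tokens


def execute(request: ToyRequest, num_scheduled_tokens: int) -> None:
    """Worker side: the tokens have now actually been processed."""
    request.num_executed_tokens += num_scheduled_tokens


def freeable_by_scheduled(request: ToyRequest) -> bool:
    # Check based on the scheduled count: can fire too early.
    return request.num_computed_tokens >= request.encoder_end


def freeable_by_executed(request: ToyRequest) -> bool:
    # Check based on work that has really completed.
    return request.num_executed_tokens >= request.encoder_end


req = ToyRequest(encoder_end=1024)
schedule(req, 1024)
premature = freeable_by_scheduled(req)  # already looks freeable
safe = freeable_by_executed(req)        # GPU has not run yet
execute(req, 1024)
safe_after = freeable_by_executed(req)  # freeable only now
```

Between `schedule` and `execute`, the two checks disagree; that window is where an encoder input can be released while the worker still needs it.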

Test Plan

Test Result



Changed files

  • vllm/v1/core/sched/scheduler.py (modified, +4/-8)

PR #39543: [Bugfix][V1] Defer encoder cache eviction until after MTP draft proposal

Description (problem / solution / changelog)

Purpose

Fixes the intra-step encoder cache race in https://github.com/vllm-project/vllm/issues/38551.

_update_states pops encoder cache entries before MTP's _gather_mm_embeddings reads them in the same step. This defers the pop until after both target forward and MTP draft proposal complete. Applied to both old (gpu_model_runner.py) and new (gpu/model_runner.py) model runners.

Note: This fixes the intra-step race only. A separate cross-step race also exists where async scheduling frees entries based on speculative num_computed_tokens before the GPU has consumed them. Refer to https://github.com/vllm-project/vllm/issues/38551#issuecomment-4207097289 for details. Follow-up: https://github.com/vllm-project/vllm/pull/39544.
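
The deferral idea can be sketched as follows. This is a minimal illustration, assuming a dict-backed cache; `ToyRunner`, `_deferred_free`, and `step` are hypothetical names, not the structure of the actual patch. Only `encoder_cache` and `free_encoder_mm_hashes` appear in the PR text.

```python
class ToyRunner:
    """Toy model runner that delays encoder-cache pops to the end of a step."""

    def __init__(self) -> None:
        self.encoder_cache: dict[str, object] = {}
        self._deferred_free: list[str] = []

    def update_states(self, free_encoder_mm_hashes: list[str]) -> None:
        # Record what the scheduler wants freed, but do not pop yet.
        self._deferred_free.extend(free_encoder_mm_hashes)

    def gather_mm_embeddings(self, mm_hashes: list[str]) -> list[object]:
        # Raises KeyError on a miss, standing in for the fatal assertion.
        return [self.encoder_cache[h] for h in mm_hashes]

    def step(self, free_hashes, target_hashes, draft_hashes):
        self.update_states(free_hashes)
        target = self.gather_mm_embeddings(target_hashes)  # target forward
        draft = self.gather_mm_embeddings(draft_hashes)    # draft proposal
        # Both readers are done, so the deferred pops are now safe.
        for mm_hash in self._deferred_free:
            self.encoder_cache.pop(mm_hash, None)
        self._deferred_free.clear()
        return target, draft


runner = ToyRunner()
runner.encoder_cache["img0"] = "embedding0"
target_out, draft_out = runner.step(["img0"], ["img0"], ["img0"])
```

If `update_states` popped immediately, the second `gather_mm_embeddings` call in the same step would miss; deferring the pop keeps the entry alive until both reads finish.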

Related PRs:

Reproduction

  • Serve a multimodal model with MTP (e.g. Qwen/Qwen3.5-VL-27B-FP8 with --speculative-config.method mtp --speculative-config.num_speculative_tokens 1)
  • Shrink encoder cache to 2048 tokens (default is 16384) to accelerate evictions
  • Send 80+ concurrent 1024×1024 image requests
  • With diagnostic logging, 13 entries were about to be popped while MTP still needed them in the same step — each would cause AssertionError: Encoder cache miss for <mm_hash> in _gather_mm_embeddings

Test Plan

pytest tests/v1/worker/test_deferred_encoder_cache_free.py -v

Test Result

8 new tests pass.

Changed files

  • tests/v1/worker/test_deferred_encoder_cache_free.py (added, +172/-0)
  • vllm/v1/worker/gpu/model_runner.py (modified, +14/-2)
  • vllm/v1/worker/gpu_model_runner.py (modified, +14/-3)

PR #39544: [Bugfix][V1] Model-runner-side encoder cache eviction for async scheduling race

Description (problem / solution / changelog)

Purpose

Fixes the cross-step encoder cache race in https://github.com/vllm-project/vllm/issues/38551. Stacked on https://github.com/vllm-project/vllm/pull/39543.

With async scheduling (auto-enabled for MTP), the scheduler speculatively advances num_computed_tokens and frees encoder cache entries before the GPU has consumed them. This moves eviction for multimodal models to the model runner, using actual GPU progress. Encoder-decoder models (e.g. Whisper) are unchanged.
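
A rough sketch of eviction driven by confirmed progress instead of the scheduler's speculative count. Everything here is a hypothetical simplification for a single request: `RunnerSideEvictor`, `end_pos`, and `confirmed_tokens` are illustrative names, and the real change also touches the scheduler.

```python
class RunnerSideEvictor:
    """Toy cache that frees an entry only once confirmed progress has passed it."""

    def __init__(self) -> None:
        self.encoder_cache: dict[str, object] = {}
        self._end_pos: dict[str, int] = {}

    def add(self, mm_hash: str, output: object, end_pos: int) -> None:
        # end_pos: token position after which the entry is no longer needed.
        self.encoder_cache[mm_hash] = output
        self._end_pos[mm_hash] = end_pos

    def evict_consumed(self, confirmed_tokens: int) -> list[str]:
        # Only entries fully covered by confirmed tokens are released.
        done = [h for h, end in self._end_pos.items() if end <= confirmed_tokens]
        for mm_hash in done:
            del self.encoder_cache[mm_hash]
            del self._end_pos[mm_hash]
        return done


evictor = RunnerSideEvictor()
evictor.add("img", "embedding", end_pos=100)

# The scheduler's speculative view might say 100 tokens are done, but one
# speculative token was rejected, so only 99 are confirmed.
evicted_early = evictor.evict_consumed(confirmed_tokens=99)
still_cached = "img" in evictor.encoder_cache

# Once the confirmed count reaches the entry's end, it can go.
evicted_late = evictor.evict_consumed(confirmed_tokens=100)
```

Keying eviction on the confirmed count means a rollback after a rejected speculative token cannot leave the request pointing at an entry that has already been freed.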

Related PRs:

Reproduction

  • The race condition across async scheduling requires:
    • Scheduler speculatively inflates num_computed_tokens including unverified spec tokens
    • Frees an encoder entry based on that inflated count
    • Spec tokens get rejected, num_computed_tokens rolls back — but the entry is already gone
  • This is timing-dependent and rare (1 natural occurrence in 60K production requests on Qwen/Qwen3.5-VL-27B-FP8 with --speculative-config.method mtp --speculative-config.num_speculative_tokens 1)
  • To force it, apply a monkey-patch that randomly evicts an active encoder cache entry (one not in the scheduler's free list) with 30% probability per step:
_active_hashes = [h for h in self.encoder_cache
                  if h not in (scheduler_output.free_encoder_mm_hashes or [])]
if _active_hashes and _def_random.random() < 0.3:
    _victim = _def_random.choice(_active_hashes)
    self.encoder_cache.pop(_victim, None)
  • With this sabotage + shrunk encoder cache (2048 tokens), 2 entries were evicted while MTP still needed them, each causing AssertionError: Encoder cache miss for <mm_hash> in 80 concurrent multimodal requests

Test Plan

pytest tests/v1/core/test_encoder_cache_free.py tests/v1/worker/test_deferred_encoder_cache_free.py -v

Test Result

18 tests pass (10 new + 8 updated).

Changed files

  • tests/v1/core/test_encoder_cache_free.py (added, +301/-0)
  • tests/v1/worker/test_deferred_encoder_cache_free.py (added, +171/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +25/-6)
  • vllm/v1/worker/gpu/model_runner.py (modified, +43/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +32/-3)

Code Example

AssertionError: Encoder cache miss for <mm_hash>

---

--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
--enable-chunked-prefill
--enable-prefix-caching
--enforce-eager
--gpu-memory-utilization 0.90
--max-model-len 16384
--limit-mm-per-prompt.image 8 --limit-mm-per-prompt.video 1

---

# Before (crashes the engine):
encoder_output = self.encoder_cache.get(mm_hash, None)
assert encoder_output is not None, f"Encoder cache miss for {mm_hash}."

# After (skips gracefully):
encoder_output = self.encoder_cache.get(mm_hash, None)
if encoder_output is None:
    logger.warning("Encoder cache miss for %s, skipping mm embedding", mm_hash)
    continue

Your current environment

  • vLLM version: v0.18.0
  • Model: Qwen/Qwen3.5-27B-FP8
  • GPU: NVIDIA L40S (48GB)
  • OS: Linux (EKS, Ubuntu-based)

Bug description

When using MTP speculative decoding (not EAGLE3) with multimodal (vision) inputs under sustained high-concurrency production traffic, the engine crashes with:

AssertionError: Encoder cache miss for <mm_hash>

The crash originates in gpu_model_runner.py, in the propose_draft_token_ids → _gather_mm_embeddings call path. Under heavy load, the encoder cache evicts entries that MTP still needs for draft token proposal, causing a fatal assertion that kills the entire engine and drops all in-flight requests.

This is distinct from #32469 (fixed by #34220), which addressed EAGLE3 + disable_chunked_mm_input. In that case, encoder inputs were never scheduled. In our case, the entries are produced and cached correctly, but get evicted under memory pressure before MTP's draft proposer reads them.

How to reproduce

Flags:

--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
--enable-chunked-prefill
--enable-prefix-caching
--enforce-eager
--gpu-memory-utilization 0.90
--max-model-len 16384
--limit-mm-per-prompt.image 8 --limit-mm-per-prompt.video 1

Trigger: Sustained high-concurrency mixed text + multimodal traffic in production. The crash typically occurs after hours of traffic, not immediately — it requires enough encoder cache pressure for eviction to occur while MTP is still referencing entries.

Workaround

We patched gpu_model_runner.py to replace the fatal assertion in _gather_mm_embeddings with a check-and-skip: if self.encoder_cache.get(mm_hash, None) returns None, we log a warning and continue past that embedding instead of asserting. The request completes normally — MTP just skips the missing embedding for that draft step. No impact on output quality.

Suggested diff

# Before (crashes the engine):
encoder_output = self.encoder_cache.get(mm_hash, None)
assert encoder_output is not None, f"Encoder cache miss for {mm_hash}."

# After (skips gracefully):
encoder_output = self.encoder_cache.get(mm_hash, None)
if encoder_output is None:
    logger.warning("Encoder cache miss for %s, skipping mm embedding", mm_hash)
    continue

Validated under mixed text + multimodal traffic at 10 concurrency with 100% success rate. Happy to submit a PR if this approach is welcome.

Suggested fix

For a more robust long-term solution, the encoder cache should not evict entries still referenced by active MTP draft proposals, or the scheduler should gate MTP proposals based on encoder cache availability. See also #26777 for the same symptom with EAGLE3.
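
The second option, gating draft proposals on cache availability, might look roughly like this. It is a sketch only: `can_propose` and `split_by_availability` are hypothetical helpers, and the request-to-hash mapping is an assumed shape, not a vLLM data structure.

```python
def can_propose(required_mm_hashes: list[str], encoder_cache: dict) -> bool:
    """True if every encoder output the draft step needs is still cached."""
    return all(h in encoder_cache for h in required_mm_hashes)


def split_by_availability(
    requests: dict[str, list[str]], encoder_cache: dict
) -> tuple[list[str], list[str]]:
    """Partition request ids into (draft with MTP, skip drafting this step)."""
    propose, skip = [], []
    for req_id, mm_hashes in requests.items():
        (propose if can_propose(mm_hashes, encoder_cache) else skip).append(req_id)
    return propose, skip


cache = {"img_a": "emb_a"}
pending = {
    "req-1": ["img_a"],  # entry present: safe to draft
    "req-2": ["img_b"],  # entry evicted: skip drafting this step
    "req-3": [],         # text-only: nothing to check
}
to_propose, to_skip = split_by_availability(pending, cache)
```

A gated request would simply decode without a draft for that step, trading some speculative speedup for not touching a missing entry.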

Before submitting a new issue...

  • I have searched existing issues and this is specific to MTP (not EAGLE3)
  • I am running the latest vLLM release (v0.18.0)

extent analysis

Fix Plan

To address the encoder cache eviction issue under high-concurrency production traffic, we will implement a more robust solution that prevents the encoder cache from evicting entries still referenced by active MTP draft proposals.

Here are the steps:

  • Modify gpu_model_runner.py to track active MTP draft proposals and their corresponding encoder cache entries.
  • Update the encoder cache eviction policy to exclude entries referenced by active MTP draft proposals.
  • Have the scheduler gate MTP proposals on encoder cache availability.

Example code snippet to track active MTP draft proposals and their corresponding encoder cache entries. This is a simplified sketch: the method signatures and the evict helper are illustrative, not vLLM's actual API.

import logging

logger = logging.getLogger(__name__)


class GPUModelRunner:
    def __init__(self):
        self.encoder_cache = {}        # mm_hash -> encoder output
        self.active_proposals = set()  # mm_hashes pinned by an in-flight MTP draft step

    def propose_draft_token_ids(self, mm_hashes):
        # Pin every entry the draft step will read before gathering.
        self.active_proposals.update(mm_hashes)
        try:
            return self._gather_mm_embeddings(mm_hashes)
        finally:
            # Unpin once the draft step no longer needs the entries.
            self.active_proposals.difference_update(mm_hashes)

    def evict(self, mm_hash):
        # Eviction skips entries pinned by an active draft proposal.
        if mm_hash in self.active_proposals:
            return False
        return self.encoder_cache.pop(mm_hash, None) is not None

    def _gather_mm_embeddings(self, mm_hashes):
        embeddings = []
        for mm_hash in mm_hashes:
            encoder_output = self.encoder_cache.get(mm_hash)
            if encoder_output is None:
                # Fallback: warn and skip instead of asserting.
                logger.warning("Encoder cache miss for %s, skipping mm embedding", mm_hash)
                continue
            embeddings.append(encoder_output)
        return embeddings

Verification

To verify that the fix worked, test the system under sustained high-concurrency mixed text + multimodal traffic in production. Monitor the system for crashes and encoder cache misses. If the fix is successful, the system should no longer crash due to encoder cache eviction, and the number of encoder cache misses should be significantly reduced.
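
One way to monitor this is to count cache-miss lines in the server logs. The sketch below assumes the two message formats quoted in this issue (the fatal assertion and the workaround's warning) and that hashes contain no whitespace, commas, or periods; the sample lines are synthetic, and `summarize_misses` is a hypothetical helper.

```python
import re

# Matches both the assertion message and the workaround's warning text.
MISS_RE = re.compile(r"Encoder cache miss for ([^\s,.]+)")


def summarize_misses(lines):
    """Count fatal vs. skipped encoder cache misses in an iterable of log lines."""
    crashes = 0
    skipped = 0
    hashes = set()
    for line in lines:
        match = MISS_RE.search(line)
        if match is None:
            continue
        hashes.add(match.group(1))
        if "AssertionError" in line:
            crashes += 1  # engine-killing assertion
        else:
            skipped += 1  # warning emitted by the check-and-skip path
    return {"crashes": crashes, "skipped": skipped, "distinct_hashes": len(hashes)}


sample_log = [
    "WARNING Encoder cache miss for abc123, skipping mm embedding",
    "AssertionError: Encoder cache miss for def456.",
    "INFO request finished",
]
summary = summarize_misses(sample_log)
```

After a proper fix, both counters should stay at zero under load; a nonzero `skipped` count with zero `crashes` suggests only the check-and-skip workaround is in effect.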

Extra Tips

  • Consider implementing a mechanism to limit the number of active MTP draft proposals to prevent excessive memory usage.
  • Monitor the system's memory usage and adjust the encoder cache size and eviction policy as needed to prevent memory pressure.
  • Refer to issue #26777 for similar symptoms with EAGLE3 and consider implementing a similar solution to prevent encoder cache eviction issues.
