vllm - ✅ (Solved) Fix [Bug]: V1 Engine: EngineDeadError (AssertionError) on max_model_len overflow during realtime audio streaming [1 pull request, 1 comment, 2 participants]

GitHub stats

vllm-project/vllm#38428 (fetched 2026-04-08 01:45:43)
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

Error Message

(EngineCore pid=146) ERROR 03-28 10:55:03 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.27241379310344827, encoder_cache_usage=1.0...
...
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3213, in _bookkeeping_sync
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]     assert end_idx <= self.max_model_len, (
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101] AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 4097 > max_model_len: 4096
...
(APIServer pid=1) ERROR 03-28 10:55:03 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

PR fix notes

PR #38483: fix(v1): Handle max_model_len overflow gracefully instead of crashing

Description (problem / solution / changelog)

Fixes #38428

Problem

When using the V1 engine/scheduler with realtime audio models, reaching the max_model_len limit results in a hard crash (EngineDeadError) of the entire engine instead of a graceful termination of the stream. This happens because of a strict assertion in _bookkeeping_sync:

assert end_idx <= self.max_model_len, (
    "Sampled token IDs exceed the max model length. "
    f"Total number of tokens: {end_idx} > max_model_len: "
    f"{self.max_model_len}"
)

This brings down the entire EngineCore instead of simply stopping generation for that specific request with finish_reason="length".

Solution

Replace the hard assertion with graceful overflow handling:

  1. Mark exceeded requests as invalid - Add them to invalid_req_indices so they get properly filtered out
  2. Clear their sampled tokens - Prevent further processing of the overflow tokens
  3. Allow engine to continue - Other requests can continue to be served normally

The existing scheduler infrastructure already handles FINISHED_LENGTH_CAPPED status which maps to FinishReason.LENGTH, so this change integrates seamlessly with the current architecture.
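
The three steps above can be illustrated with a small standalone simulation. This is a sketch of the idea only, not the PR's diff: the function drop_overflowing_tokens, the BookkeepingResult container, and the num_tokens_so_far argument are hypothetical names chosen for this example, while invalid_req_indices follows the name used in the PR description.

```python
from dataclasses import dataclass, field


@dataclass
class BookkeepingResult:
    """Outcome of one bookkeeping pass (hypothetical container)."""
    valid_sampled_token_ids: list[list[int]]
    invalid_req_indices: list[int] = field(default_factory=list)


def drop_overflowing_tokens(
    sampled_token_ids: list[list[int]],
    num_tokens_so_far: list[int],
    max_model_len: int,
) -> BookkeepingResult:
    """Filter sampled tokens that would push a request past max_model_len.

    Instead of asserting, an overflowing request is recorded in
    invalid_req_indices and its sampled tokens are cleared, so the other
    requests in the batch are unaffected.
    """
    result = BookkeepingResult(
        valid_sampled_token_ids=[list(ids) for ids in sampled_token_ids]
    )
    for req_idx, sampled in enumerate(result.valid_sampled_token_ids):
        if not sampled:
            continue
        end_idx = num_tokens_so_far[req_idx] + len(sampled)
        if end_idx > max_model_len:
            # Overflow: mark the request invalid and drop its tokens.
            result.invalid_req_indices.append(req_idx)
            sampled.clear()
    return result


# Request 0 is already at the limit (4096 + 1 = 4097); request 1 is not.
result = drop_overflowing_tokens(
    sampled_token_ids=[[11], [22]],
    num_tokens_so_far=[4096, 100],
    max_model_len=4096,
)
print(result.invalid_req_indices)      # [0]
print(result.valid_sampled_token_ids)  # [[], [22]]
```

The numbers mirror the log above: 4097 tokens against a max_model_len of 4096. In this sketch only the overflowing request is affected, which is the behavior the PR describes for the real engine.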

Impact

  • Critical bug fix - Prevents entire engine crashes on token limit overflow
  • Maintains compatibility - Uses existing invalid_req_indices pattern and FINISHED_LENGTH_CAPPED status
  • Enables streaming use cases - Realtime audio and other continuous generation scenarios can handle length limits gracefully
  • No breaking changes - Existing behavior for valid requests is unchanged

Testing

  • Validated the fix logic with a simulation test
  • Confirmed proper handling of both async and sync scheduling modes
  • Syntax and import checks pass

Changed files

  • vllm/v1/worker/gpu_model_runner.py (modified, +14/-5)

Issue description

When using the V1 engine/scheduler with a realtime audio model (e.g., mistralai/Voxtral-Mini-4B-Realtime-2602), reaching the max_model_len limit results in a hard crash (EngineDeadError) of the entire engine instead of a graceful termination of the stream.

In a realtime audio streaming context, continuous token growth is expected. Currently, hitting the max token limit mid-stream triggers a strict token accounting assertion (AssertionError: Sampled token IDs exceed the max model length) in the V1 GPU model runner (gpu_model_runner.py). This brings down the engine core entirely, rather than simply stopping the generation for that specific request with finish_reason="length".

Steps to Reproduce:

  1. Start vLLM with the V1 engine and a realtime audio model.
  2. Stream audio continuously from a client without explicitly closing or resetting the connection.
  3. Once the context reaches max_model_len (in this case, 4096), the engine crashes with EngineDeadError.

Environment / Configuration:

  • Model: mistralai/Voxtral-Mini-4B-Realtime-2602

  • vLLM (v0.18.0 with V1 Engine) via Docker on a single RTX 5060 Ti 16GB (CUDA 13.1).

  • Command / Setup:

Dockerfile:

FROM vllm/vllm-openai:v0.18.0

RUN pip install "mistral-common[soundfile]" soundfile

Build:

docker build -t vllm-voxtral-audio .

Run:

docker run --rm --gpus all \
     --shm-size=4g \
     -p 8000:8000 \
     -v ~/.cache/huggingface:/hf \
     -e HF_HUB_OFFLINE=1 \
     -e VLLM_DISABLE_COMPILE_CACHE=1 \
     -e HF_HOME=/hf \
     vllm-voxtral-audio \
       mistralai/Voxtral-Mini-4B-Realtime-2602 \
       --tokenizer-mode mistral \
       --config-format mistral \
       --load-format mistral \
       --trust-remote-code \
       --enforce-eager \
       --tensor-parallel-size 1 \
       --max-model-len 4096 \
       --max-num-batched-tokens 4096 \
       --max-num-seqs 1 \
       --gpu-memory-utilization 0.90 \
       --host 0.0.0.0 --port 8000

Expected Behavior: The V1 engine should gracefully handle the token overflow in streaming contexts. Instead of hitting an AssertionError and killing the EngineCore, it should stop the generation for the overflowing request, return an appropriate finish reason (e.g., length), and allow the server to continue operating and accepting new requests.

extent analysis

Fix Plan

To stop the V1 engine from crashing when a request reaches max_model_len during realtime audio streaming, gpu_model_runner.py must handle the overflow explicitly. Wrapping the assertion in try/except is not a reliable fix, because assert statements are skipped entirely when Python runs with -O. Following PR #38483, replace the assertion with an explicit check:

  • In _bookkeeping_sync, compare end_idx with self.max_model_len; when it is larger, add the request to invalid_req_indices and clear its sampled tokens.
  • Rely on the existing FINISHED_LENGTH_CAPPED status, which maps to FinishReason.LENGTH, so the request ends with finish_reason="length".

Example code changes (a sketch; req_idx and valid_sampled_token_ids are assumed names, not taken from the PR diff):

if end_idx > self.max_model_len:
    # Overflow: mark the request invalid and drop its sampled tokens
    # instead of asserting, so the EngineCore keeps running.
    invalid_req_indices.append(req_idx)
    valid_sampled_token_ids[req_idx].clear()

Additionally, consider increasing max_model_len to accommodate longer audio streams. Because the value is fixed when the server starts (--max-model-len), this only delays the limit for a continuous stream; the client still needs to handle a request that ends with finish_reason="length".

Verification

To verify that the fix worked, restart the V1 engine with the modified gpu_model_runner.py and repeat the steps to reproduce the issue. The engine should now handle token overflow gracefully, stopping the generation for the overflowing request and returning a finish reason of length, without crashing.
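
One way to check that the server survived the overflow is to poll its health endpoint after the stream ends. The sketch below assumes the server from the reproduction command is reachable at http://localhost:8000 and exposes /health, as the vLLM OpenAI-compatible server does; the function name server_is_healthy and the injectable opener parameter are choices made for this example.

```python
import urllib.error
import urllib.request
from typing import Callable


def server_is_healthy(
    base_url: str = "http://localhost:8000",
    opener: Callable = urllib.request.urlopen,
    timeout: float = 5.0,
) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200.

    Connection failures and HTTP error statuses (raised by urlopen as
    URLError/HTTPError) are treated as "not healthy".
    """
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Calling server_is_healthy() once the overflowing request has finished should return True if the engine is still running, and False if the API server has gone down with the engine.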

Extra Tips

  • Consider a client-side retry: when a request ends with finish_reason="length", open a new request and continue streaming from there.
  • Monitor the engine and tune max_model_len to balance how long a single stream can run against the KV-cache memory each request needs.
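
The retry tip can be sketched as a client-side rollover loop. Everything here is hypothetical and independent of any vLLM client API: a "session" is modeled as a callable that takes one audio chunk and returns a finish reason (or None while the request is still running), and make_fake_session stands in for a real connection.

```python
from typing import Callable, Iterable, Optional

# Hypothetical interface: send one chunk, get back a finish reason or None.
Session = Callable[[bytes], Optional[str]]


def stream_with_rollover(
    chunks: Iterable[bytes],
    open_session: Callable[[], Session],
) -> int:
    """Send chunks, opening a fresh session whenever one ends with "length".

    Returns the number of sessions opened. The chunk that hit the limit is
    resent once on the new session; context from the old session is lost.
    """
    session = open_session()
    sessions_opened = 1
    for chunk in chunks:
        if session(chunk) == "length":
            session = open_session()
            sessions_opened += 1
            session(chunk)
    return sessions_opened


def make_fake_session(budget: int = 3) -> Session:
    """Fake session that reports "length" once more than `budget` chunks arrive."""
    used = 0

    def send(chunk: bytes) -> Optional[str]:
        nonlocal used
        used += 1
        return "length" if used > budget else None

    return send


print(stream_with_rollover([b"x"] * 7, make_fake_session))  # 3
```

With a budget of three chunks per session, seven chunks need three sessions. A real client would also have to decide how much earlier audio to replay into the new session, which this sketch leaves out.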
