vllm - ✅(Solved) Fix [Bug]: V1 Engine: EngineDeadError (AssertionError) on max_model_len overflow during realtime audio streaming [1 pull requests, 1 comments, 2 participants]

vllm · 2026-03-28 11:05:30


GitHub stats

vllm-project/vllm#38428•Fetched 2026-04-08 01:45:43


Author

sh1man

Participants

Saad-Mallebhari

sh1man

Timeline (top)

commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

Error Message

(EngineCore pid=146) ERROR 03-28 10:55:03 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.27241379310344827, encoder_cache_usage=1.0...
...
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3213, in _bookkeeping_sync
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]     assert end_idx <= self.max_model_len, (
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101] AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 4097 > max_model_len: 4096
...
(APIServer pid=1) ERROR 03-28 10:55:03 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Root Cause

The traceback ends in a strict assertion in _bookkeeping_sync (vllm/v1/worker/gpu_model_runner.py, line 3213). The request reached 4097 tokens against a max_model_len of 4096, the assertion failed, and the resulting AssertionError terminated EngineCore. The API server then reported the failure as EngineDeadError.

PR fix notes

PR #38483: fix(v1): Handle max_model_len overflow gracefully instead of crashing

Repository: vllm-project/vllm
Author: machov
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38483

Description (problem / solution / changelog)

Fixes #38428

Problem

When using the V1 engine/scheduler with realtime audio models, reaching the max_model_len limit results in a hard crash (EngineDeadError) of the entire engine instead of a graceful termination of the stream. This happens because of a strict assertion in _bookkeeping_sync:

assert end_idx <= self.max_model_len, (
    "Sampled token IDs exceed the max model length. "
    f"Total number of tokens: {end_idx} > max_model_len: "
    f"{self.max_model_len}"
)

This brings down the entire EngineCore instead of simply stopping generation for that specific request with finish_reason="length".

Solution

Replace the hard assertion with graceful overflow handling:

Mark exceeded requests as invalid - Add them to invalid_req_indices so they get properly filtered out
Clear their sampled tokens - Prevent further processing of the overflow tokens
Allow engine to continue - Other requests can continue to be served normally

The existing scheduler infrastructure already handles FINISHED_LENGTH_CAPPED status which maps to FinishReason.LENGTH, so this change integrates seamlessly with the current architecture.
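The three steps above can be illustrated with a small, self-contained simulation. This is a toy sketch and not vLLM code: the function bookkeeping_step, its arguments, and the plain lists standing in for the input batch are hypothetical. Only the names invalid_req_indices and max_model_len are borrowed from the PR description.

```python
def bookkeeping_step(num_tokens, sampled, max_model_len):
    """Toy stand-in for a per-request bookkeeping loop.

    num_tokens: tokens already stored for each request in the batch
    sampled:    newly sampled token IDs per request (mutated in place)
    Returns the indices of requests that would overflow max_model_len.
    """
    invalid_req_indices = []
    for req_idx, new_ids in enumerate(sampled):
        end_idx = num_tokens[req_idx] + len(new_ids)
        if end_idx > max_model_len:
            # Overflow: flag this request and drop its tokens instead of
            # raising, so the rest of the batch is still processed.
            invalid_req_indices.append(req_idx)
            new_ids.clear()
            continue
        num_tokens[req_idx] = end_idx
    return invalid_req_indices


# Request 0 is already at the limit; request 1 has plenty of room.
num_tokens = [4096, 10]
sampled = [[7], [8, 9]]
overflowed = bookkeeping_step(num_tokens, sampled, max_model_len=4096)
print(overflowed, num_tokens, sampled)
```

With an assert in place of the if branch, the first request would raise and the second would never be processed, which is the batch-wide failure the issue reports.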

Impact

Critical bug fix - Prevents entire engine crashes on token limit overflow
Maintains compatibility - Uses existing invalid_req_indices pattern and FINISHED_LENGTH_CAPPED status
Enables streaming use cases - Realtime audio and other continuous generation scenarios can handle length limits gracefully
No breaking changes - Existing behavior for valid requests is unchanged

Testing

Validated the fix logic with a simulation test
Confirmed proper handling of both async and sync scheduling modes
Syntax and import checks pass

Changed files

vllm/v1/worker/gpu_model_runner.py (modified, +14/-5)

Original issue

When using the V1 engine/scheduler with a realtime audio model (e.g., mistralai/Voxtral-Mini-4B-Realtime-2602), reaching the max_model_len limit results in a hard crash (EngineDeadError) of the entire engine instead of a graceful termination of the stream.

In a realtime audio streaming context, continuous token growth is expected. Currently, hitting the max token limit mid-stream triggers a strict token accounting assertion (AssertionError: Sampled token IDs exceed the max model length) in the V1 GPU model runner (gpu_model_runner.py). This brings down the engine core entirely, rather than simply stopping generation for that specific request with finish_reason="length".

Steps to Reproduce:

Start vLLM with the V1 engine and a realtime audio model.
Stream audio continuously from a client without explicitly closing or resetting the connection.
Once the context reaches max_model_len (in this case, 4096), the engine crashes with EngineDeadError.
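To plan a reproduction run, it can help to estimate how long a continuous stream lasts before the context fills up. The sketch below is plain arithmetic. The issue does not state a token rate, so tokens_per_second and initial_tokens are hypothetical inputs that would have to be measured on the actual setup.

```python
def seconds_until_limit(max_model_len, initial_tokens, tokens_per_second):
    """Estimate how long a stream can run before the context is full."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    remaining = max(max_model_len - initial_tokens, 0)
    return remaining / tokens_per_second


# Placeholder numbers for illustration only; measure the real rate yourself.
estimate = seconds_until_limit(
    max_model_len=4096, initial_tokens=96, tokens_per_second=20.0
)
print(f"Estimated time until the limit: {estimate:.0f} s")
```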

Environment / Configuration:

Model: mistralai/Voxtral-Mini-4B-Realtime-2602
vLLM (v0.18.0 with V1 Engine) via Docker on a single RTX 5060 Ti 16GB (CUDA 13.1).
Command / Setup:

docker build -t vllm-voxtral-audio .

Dockerfile:

FROM vllm/vllm-openai:v0.18.0
 
RUN pip install "mistral-common[soundfile]" soundfile

docker run --rm --gpus all \
     --shm-size=4g \
     -p 8000:8000 \
     -v ~/.cache/huggingface:/hf \
     -e HF_HUB_OFFLINE=1 \
     -e VLLM_DISABLE_COMPILE_CACHE=1 \
     -e HF_HOME=/hf \
     vllm-voxtral-audio \
       mistralai/Voxtral-Mini-4B-Realtime-2602 \
       --tokenizer-mode mistral \
       --config-format mistral \
       --load-format mistral \
       --trust-remote-code \
       --enforce-eager \
       --tensor-parallel-size 1 \
       --max-model-len 4096 \
       --max-num-batched-tokens 4096 \
       --max-num-seqs 1 \
       --gpu-memory-utilization 0.90 \
       --host 0.0.0.0 --port 8000

Relevant Logs:

(EngineCore pid=146) ERROR 03-28 10:55:03 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.27241379310344827, encoder_cache_usage=1.0...
...
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3213, in _bookkeeping_sync
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]     assert end_idx <= self.max_model_len, (
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=146) ERROR 03-28 10:55:03 [core.py:1101] AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 4097 > max_model_len: 4096
...
(APIServer pid=1) ERROR 03-28 10:55:03 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Expected Behavior: The V1 engine should gracefully handle the token overflow in streaming contexts. Instead of hitting an AssertionError and killing the EngineCore, it should stop the generation for the overflowing request, return an appropriate finish reason (e.g., length), and allow the server to continue operating and accepting new requests.

extent analysis

Fix Plan

To address the issue of the V1 engine crashing when reaching the max_model_len limit in a realtime audio streaming context, we need to modify the gpu_model_runner.py to handle token overflow gracefully. Here are the steps:

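Replace the assertion in the _bookkeeping_sync method of gpu_model_runner.py with an explicit bounds check. Wrapping the assert in try/except is not a reliable fix: Python strips assert statements when run with -O, and the handler would have no per-request way to stop generation.
For a request that would exceed max_model_len, record it in invalid_req_indices and clear its sampled tokens, as PR #38483 describes. According to the PR, the scheduler already handles FINISHED_LENGTH_CAPPED, which maps to a finish reason of length.

Example code changes:

# Sketch only, following the description of PR #38483; it replaces the assert
# inside the per-request loop of _bookkeeping_sync. req_idx and sampled_ids are
# assumed names for the loop index and that request's sampled tokens, and the
# PR's actual diff may differ.
if end_idx > self.max_model_len:
    # Flag the request so it is filtered out, and discard the overflow tokens.
    invalid_req_indices.append(req_idx)
    sampled_ids.clear()
    continue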

Additionally, consider raising --max-model-len as far as the model and GPU memory allow. For a continuous stream this only postpones the overflow, and the value is fixed at server startup, so it cannot be adjusted per stream.

Verification

To verify that the fix worked, restart the V1 engine with the modified gpu_model_runner.py and repeat the steps to reproduce the issue. The engine should now handle token overflow gracefully, stopping the generation for the overflowing request and returning a finish reason of length, without crashing.
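One way to confirm that the server survived the overflow is to poll its health endpoint after the stream ends. The sketch below assumes the OpenAI-compatible server is reachable at the host and port from the docker run command and exposes GET /health. The opener parameter is a hypothetical hook added here so the function can be exercised without a running server.

```python
import urllib.error
import urllib.request


def server_is_healthy(base_url, opener=urllib.request.urlopen, timeout=5.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling server_is_healthy("http://localhost:8000") after the overflowing stream has finished should return True if the engine kept running, and False if it went down the way the logs above show.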

Extra Tips

Consider handling the limit on the client side: when a stream ends with a finish reason of length, open a new session and continue streaming from there.
Monitor KV cache usage and adjust the max_model_len value as needed to balance session length against available GPU memory.
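A client-side rollover could look roughly like the sketch below. It is not tied to any real client library: run_session is a hypothetical callable that consumes audio chunks from an iterator and returns (text, finish_reason) when its session ends, and max_sessions is an illustrative safety cap.

```python
def stream_with_rollover(chunks, run_session, max_sessions=10):
    """Feed audio chunks to successive sessions, restarting on "length".

    run_session(chunk_iter) is a hypothetical callable that pulls chunks
    from the shared iterator and returns (text, finish_reason).
    """
    chunk_iter = iter(chunks)
    transcripts = []
    for _ in range(max_sessions):
        text, finish_reason = run_session(chunk_iter)
        transcripts.append(text)
        if finish_reason != "length":
            break  # the stream ended normally, no rollover needed
    return transcripts
```

Each new session starts with an empty context, so anything the model relied on from the previous session is lost at the rollover point.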
