vllm - 💡(How to fix) Fix [Bug]: Voxtral-Mini-4B-Realtime hangs/crashes on multiple sessions due to encoder_cache_usage saturation on 16GB GPU [4 comments, 2 participants]

sh1man · 2026-03-26T12:28:29Z

### Current environment

Hello! I am running the mistralai/Voxtral-Mini-4B-Realtime-2602 model using vLLM (v0.17.2rc0 with V1 Engine) via Docker on a single RTX 5060 Ti 16GB (CUDA 13.1).

I am testing the Realtime API endpoint (`/v1/realtime`) with audio streaming. The issue is that the first session works perfectly (audio is processed, and text tokens are returned in real-time). However, when I try to scale or when the context limit is reached, the server stops returning any recognized text, and the generation gets stuck or fatally crashes.

My Launch Command:

`docker build -t vllm-voxtral-audio .`

Dockerfile

```dockerfile
FROM vllm/vllm-openai:nightly

RUN pip install "mistral-common[soundfile]" soundfile
```

```bash
docker run --rm --gpus all \
     --shm-size=4g \
     -p 8000:8000 \
     -v ~/.cache/huggingface:/hf \
     -e HF_HUB_OFFLINE=1 \
     -e VLLM_DISABLE_COMPILE_CACHE=1 \
     -e HF_HOME=/hf \
     vllm-voxtral-audio \
       mistralai/Voxtral-Mini-4B-Realtime-2602 \
       --tokenizer-mode mistral \
       --config-format mistral \
       --load-format mistral \
       --trust-remote-code \
       --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
       --tensor-parallel-size 1 \
       --max-model-len 4352 \
       --max-num-batched-tokens 4352 \
       --max-num-seqs 2 \
       --gpu-memory-utilization 0.95 \
       --host 0.0.0.0 --port 8000
```

Observations:

1. The WebSocket connection for concurrent sessions opens successfully ("WebSocket /v1/realtime" [accepted]).
2. Looking at the vLLM metrics logger, I noticed that `GPU KV cache usage` is very low (~16%), but the scheduler stats dump shows `encoder_cache_usage=1.0` (100%) almost immediately. It feels like the audio encoder cache is completely saturated, ignoring new audio inputs for any subsequent requests.
3. If the audio stream continues and the token count exceeds `max_model_len` (4352) by even a single token, the EngineCore hard crashes instead of truncating: `AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 4353 > max_model_len: 4352` followed by `vllm.v1.engine.exceptions.EngineDeadError`.

### 🐛 Describe the bug

Is there a specific configuration required for Voxtral to handle the `encoder_cache` dynamically so it doesn't bottleneck concurrent audio streams on GPUs with 16GB VRAM? Also, is the hard EngineCore crash upon exceeding `max_model_len` expected behavior for real-time audio models in the V1 engine? Any advice would be greatly appreciated!

### Before submitting a new issue...

* [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.


GitHub stats: vllm-project/vllm#38233 • Fetched 2026-04-08 01:37:09

Author: sh1man

Participants: Saad-Mallebhari, sh1man

Timeline (top): commented ×4, labeled ×1, mentioned ×1, subscribed ×1


extent analysis

Fix Plan

To address the encoder cache saturation and the EngineCore crash, try the following steps:

* Increase the encoder cache size to allow for more concurrent audio streams:
  * Add the following flag to your launch command: `--encoder-cache-size 2048`
  * The issue thread does not show this flag in use, so confirm that your vLLM build accepts it (for example with `vllm serve --help`) before relying on it.
* Implement a token truncation mechanism to prevent the EngineCore crash when exceeding `max_model_len`:
  * Modify your client code to truncate the token sequence to at most `max_model_len` tokens before passing it to the model
  * Example code snippet:

```python
max_model_len = 4352
token_ids = list(range(4353))  # placeholder input, one token over the limit

if len(token_ids) > max_model_len:
    # Keep only the most recent max_model_len tokens.
    token_ids = token_ids[-max_model_len:]
```

* Consider increasing `gpu-memory-utilization` to allow for more memory allocation:
  * Change the existing flag in your launch command from `0.95` to `--gpu-memory-utilization 0.99`
  * The reported `GPU KV cache usage` is only ~16%, so this may have limited effect on the encoder cache.
* Ensure that `max-num-batched-tokens` and `max-num-seqs` are set to reasonable values to prevent overloading the GPU:
  * Review your launch command and adjust these values as needed
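For a streaming session, truncating after the fact may be too late, since the assertion quoted in the issue fires as soon as the total reaches 4353 tokens. A minimal sketch of a client-side budget guard follows, assuming the client can count the tokens a session has consumed. `TokenBudget`, `headroom`, and `should_rotate` are hypothetical names for illustration, not vLLM APIs, and the headroom value is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Tracks tokens used by one session against max_model_len (hypothetical helper)."""

    max_model_len: int = 4352  # value of --max-model-len in the launch command
    headroom: int = 64         # assumed safety margin, tune for your workload
    used: int = 0

    def add(self, n_tokens: int) -> None:
        """Record tokens consumed by input or generated output."""
        if n_tokens < 0:
            raise ValueError("n_tokens must be non-negative")
        self.used += n_tokens

    @property
    def remaining(self) -> int:
        """Tokens left before the limit, never negative."""
        return max(self.max_model_len - self.used, 0)

    def should_rotate(self) -> bool:
        """True once the session is within `headroom` tokens of the limit."""
        return self.remaining <= self.headroom


budget = TokenBudget()
budget.add(4200)
print(budget.should_rotate())  # False: 152 tokens remain
budget.add(100)
print(budget.should_rotate())  # True: 52 tokens remain
```

When `should_rotate()` returns true, the client could close the current session and open a new one instead of letting the stream run into the limit.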

### Verification
To verify that the fixes are working, monitor the `GPU KV cache usage` and `encoder_cache_usage` metrics and confirm that `encoder_cache_usage` no longer sits at 1.0. Then test the model with concurrent audio streams, and let one stream approach `max_model_len` to check that the EngineCore no longer crashes.
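The monitoring step can be partly automated by scanning the server logs. The sketch below assumes the scheduler stats dump contains `key=value` pairs such as the `encoder_cache_usage=1.0` quoted in the issue. The full line format is not shown there, so the sample line and the `kv_cache_usage` key are made up for illustration, and `parse_stats` and `is_saturated` are hypothetical helper names.

```python
import re

# Matches key=value pairs whose value is a plain number, e.g. "encoder_cache_usage=1.0".
_STAT_RE = re.compile(r"(\w+)=([0-9]*\.?[0-9]+)")


def parse_stats(line: str) -> dict[str, float]:
    """Extract every numeric key=value pair from one log line."""
    return {key: float(value) for key, value in _STAT_RE.findall(line)}


def is_saturated(line: str, key: str = "encoder_cache_usage",
                 threshold: float = 0.99) -> bool:
    """True if the line reports `key` at or above `threshold`."""
    stats = parse_stats(line)
    return key in stats and stats[key] >= threshold


# Made-up sample line; only encoder_cache_usage=1.0 is taken from the issue.
sample = "kv_cache_usage=0.16 encoder_cache_usage=1.0"
print(parse_stats(sample))
print(is_saturated(sample))
```

Piping the container logs through a check like this makes it easier to see whether saturation appears with the first session or only once a second one connects.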

### Extra Tips
* Regularly review and adjust your configuration settings to ensure optimal performance for your specific use case.
* Consider implementing a more robust token truncation mechanism, such as using a queue or buffer to handle excess tokens.
* Refer to the vLLM documentation and community resources for additional guidance on optimizing performance and troubleshooting issues.
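The "queue or buffer" tip can be sketched with a bounded `collections.deque`, which drops the oldest entries automatically once it is full. This is a client-side illustration only. `RollingTokenBuffer` is a hypothetical name, and nothing here changes how vLLM itself enforces `max_model_len`.

```python
from collections import deque
from typing import Iterable


class RollingTokenBuffer:
    """Keeps at most `max_len` of the most recent token ids (hypothetical helper)."""

    def __init__(self, max_len: int = 4352) -> None:
        if max_len <= 0:
            raise ValueError("max_len must be positive")
        # A deque with maxlen discards items from the left when it overflows.
        self._tokens: deque[int] = deque(maxlen=max_len)

    def extend(self, token_ids: Iterable[int]) -> None:
        """Append new token ids, evicting the oldest ones if needed."""
        self._tokens.extend(token_ids)

    def snapshot(self) -> list[int]:
        """Return the current window as a plain list."""
        return list(self._tokens)

    def __len__(self) -> int:
        return len(self._tokens)


buf = RollingTokenBuffer(max_len=5)
buf.extend(range(8))
print(buf.snapshot())  # [3, 4, 5, 6, 7]
```

Compared with slicing a list on every update, the deque keeps the window bounded without copying the whole sequence each time.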
