vllm - How to fix [Bug]: Voxtral-Mini-4B-Realtime hangs/crashes on multiple sessions due to encoder_cache_usage saturation on 16GB GPU [4 comments, 2 participants]

GitHub stats
vllm-project/vllm#38233 · Fetched 2026-04-08 01:37:09
Comments: 4 · Participants: 2 · Timeline events: 7 · Reactions: 0
Timeline (top): commented ×4, labeled ×1, mentioned ×1, subscribed ×1


Current environment

Hello! I am running the mistralai/Voxtral-Mini-4B-Realtime-2602 model using vLLM (v0.17.2rc0 with V1 Engine) via Docker on a single RTX 5060 Ti 16GB (CUDA 13.1).

I am testing the Realtime API endpoint (/v1/realtime) with audio streaming. The first session works perfectly: audio is processed and text tokens are returned in real time. However, when I open additional concurrent sessions, or when a session reaches the context limit, the server stops returning recognized text, and generation either hangs or crashes fatally.

My Launch Command:

Dockerfile:

FROM vllm/vllm-openai:nightly

RUN pip install "mistral-common[soundfile]" soundfile

Build and run:

docker build -t vllm-voxtral-audio .

docker run --rm --gpus all \
     --shm-size=4g \
     -p 8000:8000 \
     -v ~/.cache/huggingface:/hf \
     -e HF_HUB_OFFLINE=1 \
     -e VLLM_DISABLE_COMPILE_CACHE=1 \
     -e HF_HOME=/hf \
     vllm-voxtral-audio \
       mistralai/Voxtral-Mini-4B-Realtime-2602 \
       --tokenizer-mode mistral \
       --config-format mistral \
       --load-format mistral \
       --trust-remote-code \
       --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
       --tensor-parallel-size 1 \
       --max-model-len 4352 \
       --max-num-batched-tokens 4352 \
       --max-num-seqs 2 \
       --gpu-memory-utilization 0.95 \
       --host 0.0.0.0 --port 8000

Observations:

  1. The WebSocket connection for concurrent sessions opens successfully ("WebSocket /v1/realtime" [accepted]).
  2. Looking at the vLLM metrics logger, I noticed that GPU KV cache usage is very low (~16%), but the scheduler stats dump shows encoder_cache_usage=1.0 (100%) almost immediately. It feels like the audio encoder cache is completely saturated, ignoring new audio inputs for any subsequent requests.
  3. If the audio stream continues and the token count exceeds max_model_len (4352) by even a single token, the EngineCore hard crashes instead of truncating: AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 4353 > max_model_len: 4352 followed by vllm.v1.engine.exceptions.EngineDeadError.
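
To make observation 2 easier to track over time, the scheduler stats can be scanned for the encoder cache value. The following is a minimal sketch, assuming the stats dump prints the value as `encoder_cache_usage=<float>` somewhere in a log line (the issue only shows that fragment, so the rest of the line format is unknown). The helper names `encoder_cache_usages` and `is_saturated` are hypothetical and not part of vLLM.

```python
import re

# Assumption: the value appears as `encoder_cache_usage=<float>` in the log text.
_USAGE_RE = re.compile(r"encoder_cache_usage=([0-9]*\.?[0-9]+)")


def encoder_cache_usages(log_text: str) -> list[float]:
    """Return every encoder_cache_usage value found in the log text, in order."""
    return [float(value) for value in _USAGE_RE.findall(log_text)]


def is_saturated(log_text: str, threshold: float = 0.99, last_n: int = 3) -> bool:
    """True if the last `last_n` reported values are all at or above `threshold`."""
    values = encoder_cache_usages(log_text)
    if len(values) < last_n:
        # Not enough samples yet to call it sustained saturation.
        return False
    return all(value >= threshold for value in values[-last_n:])
```

Requiring several consecutive high readings avoids flagging a single momentary spike; the threshold and window are arbitrary defaults to tune.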

🐛 Describe the bug

Is there a specific configuration required for Voxtral to handle the encoder_cache dynamically so it doesn't bottleneck concurrent audio streams on GPUs with 16GB VRAM? Also, is the hard EngineCore crash upon exceeding max_model_len expected behavior for real-time audio models in the V1 engine? Any advice would be greatly appreciated!


extent analysis

Fix Plan

To address the issues with the encoder_cache saturation and the EngineCore crash, follow these steps:

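  • Increase the encoder cache size to allow for more concurrent audio streams:
    • The suggested flag is --encoder-cache-size 2048. Confirm that your vLLM build actually accepts this flag (check the server's --help output) before relying on it; if it is not listed, the encoder cache size may be derived from other settings such as --max-num-batched-tokens.
  • Guard against the EngineCore crash when exceeding max_model_len:
    • The assertion fires on sampled token IDs, so the total of input plus generated tokens must stay within max_model_len, not just the input.
    • If your client controls the token IDs passed to the model, truncate them before submission and leave room for generated tokens.
    • Example code snippet: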

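    max_model_len = 4352
    reserved_for_output = 256  # assumed headroom for generated tokens; tune for your workload
    input_budget = max_model_len - reserved_for_output

    # token_ids is the list of input token IDs your client is about to send.
    if len(token_ids) > input_budget:
        # Keep only the most recent tokens that still fit the budget.
        token_ids = token_ids[-input_budget:]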

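  • Increasing gpu-memory-utilization is an option, but treat it with caution:
    • The suggested flag is --gpu-memory-utilization 0.99. The launch command already uses 0.95 and the reported GPU KV cache usage is only about 16%, so KV cache memory does not appear to be the bottleneck. Values this close to 1.0 leave very little headroom and can lead to out-of-memory errors.
  • Ensure that max-num-batched-tokens and max-num-seqs are set to reasonable values to prevent overloading the GPU:
    • Review your launch command and adjust these values as needed. The current command sets --max-num-seqs 2, which limits the engine to two concurrently scheduled sequences.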
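
Because a realtime client streams audio rather than token IDs, one way to stay under max_model_len is to budget session length in seconds and start a fresh session before the limit is hit. The sketch below shows only the arithmetic. The audio token rate (`tokens_per_second`) is an assumed placeholder that should be measured for the model in use, and `max_audio_seconds` and `should_rotate` are hypothetical helper names, not vLLM APIs.

```python
def max_audio_seconds(max_model_len: int,
                      tokens_per_second: float,
                      reserved_tokens: int = 0) -> float:
    """Seconds of audio that fit in the context, given an assumed token rate.

    reserved_tokens is headroom kept free for anything else that occupies
    the context (for example generated text tokens).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    usable = max_model_len - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens leaves no usable context")
    return usable / tokens_per_second


def should_rotate(elapsed_seconds: float,
                  budget_seconds: float,
                  safety_margin: float = 0.9) -> bool:
    """True once the session has used `safety_margin` of its time budget."""
    return elapsed_seconds >= budget_seconds * safety_margin


# Example with the issue's max_model_len and a placeholder rate of 12.5 tokens/s.
budget = max_audio_seconds(4352, tokens_per_second=12.5, reserved_tokens=352)
```

With these placeholder numbers the budget comes out to 320 seconds, and `should_rotate` would trigger at 288 seconds. The real figures depend on the measured token rate.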

Verification

To verify that the fixes are working, monitor the GPU KV cache usage and encoder_cache_usage metrics to ensure they are not saturating. Additionally, test the model with concurrent audio streams and verify that the token truncation mechanism is preventing the EngineCore crash.
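
For the monitoring step, the server's Prometheus-style /metrics endpoint can be polled and filtered. The sketch below parses the text exposition format into a dictionary. It makes no network call, so the fetch (for example from http://localhost:8000/metrics, matching the port in the launch command) is left to the caller. The function names are hypothetical, and whether encoder_cache_usage is exported there at all, rather than only in the log output, is not confirmed by the issue.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {"name{labels}": value}."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            name_part, rest = line.rsplit("}", 1)
            name = name_part + "}"
            fields = rest.split()
            if not fields:
                continue
            value = fields[0]
        else:
            fields = line.split()
            if len(fields) < 2:
                continue
            name, value = fields[0], fields[1]
        samples[name] = float(value)
    return samples


def cache_metrics(samples: dict[str, float]) -> dict[str, float]:
    """Keep only samples whose name mentions 'cache'."""
    return {name: value for name, value in samples.items() if "cache" in name}
```

Filtering by substring instead of hard-coding metric names keeps the sketch independent of the exact names a given vLLM version exports.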

Extra Tips

  • Regularly review and adjust your configuration settings to ensure optimal performance for your specific use case.
  • Consider implementing a more robust token truncation mechanism, such as using a queue or buffer to handle excess tokens.
  • Refer to the vLLM documentation and community resources for additional guidance on optimizing performance and troubleshooting issues.
