vllm - ✅(Solved) Fix [Bug] Fatal AssertionError: Encoder KV cache fails to evict tokens, exceeding max_model_len in long-lived WebSocket sessions [1 pull request, 3 comments, 3 participants]

GitHub stats
vllm-project/vllm#39996 · Fetched 2026-04-17 08:27:52
Comments: 3 · Participants: 3 · Timeline events: 8 · Reactions: 0
Timeline (top): commented ×3, cross-referenced ×1, labeled ×1, mentioned ×1

I am developing an Always-On voice chatbot using Voxtral via the vLLM Realtime WebSocket API. In this architecture, the WebSocket stays open indefinitely, continuously streaming audio chunks (including silence/ambient noise) to the backend.

I have discovered that the Encoder KV Cache accumulates acoustic tokens indefinitely. When the total token count reaches max_model_len, instead of evicting old tokens (as a sliding window should), vLLM attempts to append a new token, causing a fatal AssertionError that kills the WebSocket session.

Error Message

The crash is not an Out-Of-Memory error. It is a hard AssertionError in the GPU model runner (gpu_model_runner.py), proving that token eviction is not happening for the encoder.

(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:72] Dumping input data for V1 LLM engine...
(EngineCore pid=2274342) ERROR 04-16 11:56:09 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1rc1.dev335+g10e49d263) with config: model='mistralai/Voxtral-Mini-4B-Realtime-2602', speculative_config=None, tokenizer='mistralai/Voxtral-Mini-4B-Realtime-2602', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantiza>
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=rt-ws-4839ef1a-dfbf-413e-9c2e-0f4d6ac43800-04db37dd-6bba-4060-9579-662ef4c863f8-bc76b474,prompt_token_ids_len=16384,prefill_token_ids_len=None,mm_features=[MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([ 0.0000, 0.0000, 0.0000, ..., -0.0022, -0.0028, -0.0030],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='ebc5aa795fa6fa6af6940bcc6a72f9b4ca3d25bdc3c450666448988b5d51f592', mm_position=PlaceholderRange(offset=0, length=39, is_embed=None), mm_hash='ebc5aa795fa6fa6af6940bcc6a72f9b4ca3d25bdc3c450666448988b5d51f592'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([-0.0007, -0.0003, 0.0003, ..., -0.0017, -0.0010, -0.0008],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='37528b36849fcbaf55176d4cfa0bab6be4031c7bbb762958275819a8b5ef2c8e', mm_position=PlaceholderRange(offset=39, length=2, is_embed=None), mm_hash='37528b36849fcbaf55176d4cfa0bab6be4031c7bbb762958275819a8b5ef2c8e'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([ 0.0014, 0.0015, 0.0021, ..., -0.0036, -0.0036, -0.0034],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='c3f981f213fef677abcd51ec9cd1272716b112a67c27f9b102d432b31aed42d3', mm_position=PlaceholderRange(offset=40, length=2, is_embed=None), mm_hash='c3f981f213fef677abcd51ec9cd1272716b112a67c27f9b102d432b31aed42d3'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([ 0.0029, 0.0031, 0.0039, ..., -0.0041, -0.0040, -0.0047],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='45c18fd6f9ea2cbf72e60674ba3c05299f1e1a4171c4e1153175c8a23fa3129c', mm_position=PlaceholderRange(offset=41, length=2, is_embed=None), mm_hash='45c18fd6f9ea2cbf72e60674ba3c05299f1e1a4171c4e1153175c8a23fa3129c'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([-3.0518e-05, -1.2817e-03, -1.0681e-03, ..., -1.3086e-01,
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] -2.0703e-01, -2.1777e-01], dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='4c2a604020e44dc9f8f836fe8e7d9fc0006fddb2aa44fe261de8b5fec57aa6fb', mm_position=PlaceholderRange(offset=42, length=2, is_embed=None), mm_hash='4c2a604020e44dc9f8f836fe8e7d9fc0006fddb2aa44fe261de8b5fec57aa6fb'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([0.0003, 0.0019, 0.0004, ..., 0.1748, 0.1279, 0.0542],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='bf120195c149ab4b4936072d7094b5640bbdd4fd1024c57a34fe757399a22418', mm_position=PlaceholderRange(offset=43, length=2, is_embed=None), mm_hash='bf120195c149ab4b4936072d7094b5640bbdd4fd1024c57a34fe757399a22418'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([0.2217, 0.1338, 0.0444, ..., 0.1138, 0.1289, 0.1367],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='d9374561a42c2ec466f3c8ec5157f08c1803eecc8aae01143628cb807cd7ee33', mm_position=PlaceholderRange(offset=44, length=2, is_embed=None), mm_hash='d9374561a42c2ec466f3c8ec5157f08c1803eecc8aae01143628cb807cd7ee33'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([-0.0251, -0.0287, -0.0243, ..., 0.3242, 0.2852, 0.2373],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134] EngineCore encountered a fatal error.
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134] Traceback (most recent call last):
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1125, in run_engine_core
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     engine_core.run_busy_loop()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1166, in run_busy_loop
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     self._process_engine_step()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1205, in _process_engine_step
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     outputs, model_executed = self.step_fn()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 523, in step_with_batch_queue
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     model_output = future.result()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return self.__get_result()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     raise self._exception
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return func(*args, **kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context

Root Cause

(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134] EngineCore encountered a fatal error.
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134] Traceback (most recent call last):
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1125, in run_engine_core
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     engine_core.run_busy_loop()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1166, in run_busy_loop
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     self._process_engine_step()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1205, in _process_engine_step
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     outputs, model_executed = self.step_fn()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 523, in step_with_batch_queue
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     model_output = future.result()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return self.__get_result()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     raise self._exception
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return func(*args, **kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return func(*args, **kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 748, in sample_tokens
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return func(*args, **kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4279, in sample_tokens
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     ) = self._bookkeeping_sync(
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3438, in _bookkeeping_sync
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     assert end_idx <= self.max_model_len, (
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134] AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 16385 > max_model_len: 16384
(EngineCore pid=2274342) Process EngineCore:
(EngineCore pid=2274342) Traceback (most recent call last):
(EngineCore pid=2274342)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=2274342)     self.run()
(EngineCore pid=2274342)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore pid=2274342)     self._target(*self._args, **self._kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1136, in run_engine_core
(EngineCore pid=2274342)     raise e
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1125, in run_engine_core
(EngineCore pid=2274342)     engine_core.run_busy_loop()
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1166, in run_busy_loop
(EngineCore pid=2274342)     self._process_engine_step()
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1205, in _process_engine_step
(EngineCore pid=2274342)     outputs, model_executed = self.step_fn()
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 523, in step_with_batch_queue
(EngineCore pid=2274342)     model_output = future.result()
(EngineCore pid=2274342)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(EngineCore pid=2274342)     return self.__get_result()
(EngineCore pid=2274342)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(EngineCore pid=2274342)     raise self._exception
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=2274342)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2274342)     return func(*args, **kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2274342)     return func(*args, **kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 748, in sample_tokens
(EngineCore pid=2274342)     return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2274342)     return func(*args, **kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4279, in sample_tokens
(EngineCore pid=2274342)     ) = self._bookkeeping_sync(
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3438, in _bookkeeping_sync
(EngineCore pid=2274342)     assert end_idx <= self.max_model_len, (
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701] AsyncLLM output_handler failed.
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701] Traceback (most recent call last):
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 657, in output_handler
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701]     outputs = await engine_core.get_output_async()
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701]     raise self._format_exception(outputs) from None
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=2274342) AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 16385 > max_model_len: 16384
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263] Error in generation: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263] Traceback (most recent call last):
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/entrypoints/openai/realtime/connection.py", line 231, in _run_generation
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     async for output in result_gen:
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 576, in generate
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     out = q.get_nowait() or await q.get()
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     raise output
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 657, in output_handler
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     outputs = await engine_core.get_output_async()
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     raise self._format_exception(outputs) from None
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W416 11:56:14.202811981 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2274232) INFO: Shutting down
(APIServer pid=2274232) INFO: connection closed
(APIServer pid=2274232) INFO: Waiting for application shutdown.
(APIServer pid=2274232) INFO: Application shutdown complete.
(APIServer pid=2274232) INFO: Finished server process [2274232]

Fix Action

Fix / Workaround


PR fix notes

PR #40072: fix: prevent streaming sessions from exceeding max_model_len

Description (problem / solution / changelog)

What's broken?

Long-lived WebSocket sessions (e.g., streaming speech-to-text with Voxtral via /v1/realtime) crash with a fatal AssertionError after ~15 minutes of continuous audio streaming. The engine dies and cannot recover.

Who is affected?

Any user running continuous audio streaming through the Realtime WebSocket API with encoder models (e.g., mistralai/Voxtral-Mini-4B-Realtime-2602). Short sessions are unaffected — the crash only occurs when accumulated encoder tokens exceed max_model_len.

When does it trigger?

The crash is deterministic: once total tokens (prompt + encoder + output) reach max_model_len + 1, the assertion fires. With max_model_len=16384, this takes ~15 minutes of continuous audio streaming.
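As a rough sanity check on the timing, the numbers in this report imply an average accumulation rate of about 18 tokens per second. The sketch below only does that arithmetic. The rate is inferred from the ~15 minute observation, not from any documented model constant, and both helper names are hypothetical.

```python
def implied_token_rate(max_model_len: int, seconds_to_crash: float) -> float:
    """Average tokens accumulated per second, inferred from an observed crash time."""
    return max_model_len / seconds_to_crash


def seconds_until_cap(max_model_len: int, tokens_per_second: float,
                      tokens_so_far: int = 0) -> float:
    """Estimated time left before the session reaches max_model_len."""
    remaining = max(max_model_len - tokens_so_far, 0)
    return remaining / tokens_per_second


# Observation from the report: max_model_len=16384 crashes after ~15 minutes.
rate = implied_token_rate(16384, 15 * 60)
print(round(rate, 1))                                 # 18.2 tokens/s

# At the same rate, max_model_len=8192 would be reached in about half the time.
print(round(seconds_until_cap(8192, rate) / 60, 1))   # 7.5 minutes
```

This matches the reporter's remark that the crash simply takes longer with a larger max_model_len: raising the limit delays the failure but does not prevent it.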

Where is the bug?

Three missing safeguards in vllm/v1/core/sched/scheduler.py:

  1. _update_request_as_session (lines 1028-1042): Extends prompt_token_ids and mm_features with each streaming update but never checks whether the total exceeds max_model_len.

  2. WAITING scheduling path (line 673): Computes num_new_tokens = request.num_tokens - num_computed_tokens without the max_model_len clamp that exists in the RUNNING path (lines 415-416).

  3. _handle_stopped_request / add_request: After _update_request_as_session updates the session, neither caller checks if the session now exceeds max_model_len.

The assertion that crashes is in gpu_model_runner.py:_bookkeeping_sync:

assert end_idx <= self.max_model_len  # end_idx = 16385 > 16384

Why does it happen?

The Realtime WebSocket API continuously streams audio chunks. Each chunk:

  1. Gets converted to encoder tokens (mm_features) and prompt tokens
  2. Is appended to the session via _update_request_as_session
  3. Previous output tokens are folded back into prompt_token_ids

Both prompt_token_ids and mm_features grow monotonically — there is no eviction, truncation, or max_model_len check. When the accumulated total exceeds max_model_len, the model runner's safety assertion catches the overflow and crashes instead of handling it gracefully.
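The failure mode can be reproduced in miniature. The following is a toy model, not vLLM code: `ToySession`, its methods, and the per-chunk token counts are all hypothetical, and only the assertion text mirrors the message in the log. It shows how a session that only ever appends will eventually trip a length assertion.

```python
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Toy stand-in for a streaming session that grows without eviction."""
    max_model_len: int
    prompt_token_ids: list = field(default_factory=list)

    def update(self, new_chunk_tokens, previous_output_tokens):
        # Fold earlier outputs back into the prompt, then append the new chunk.
        # There is no eviction, truncation, or length check here.
        self.prompt_token_ids.extend(previous_output_tokens)
        self.prompt_token_ids.extend(new_chunk_tokens)

    def bookkeeping_sync(self, num_sampled: int = 1) -> int:
        # Same shape as the assertion quoted in the traceback.
        end_idx = len(self.prompt_token_ids) + num_sampled
        assert end_idx <= self.max_model_len, (
            "Sampled token IDs exceed the max model length. "
            f"Total number of tokens: {end_idx} > max_model_len: {self.max_model_len}"
        )
        return end_idx


def steps_until_crash(max_model_len: int, chunk_tokens: int = 2, output_tokens: int = 1):
    """Stream chunks until the assertion fires; return (completed steps, prompt length)."""
    session = ToySession(max_model_len)
    previous_output = []
    steps = 0
    while True:
        session.update([0] * chunk_tokens, previous_output)
        try:
            session.bookkeeping_sync()
        except AssertionError:
            return steps, len(session.prompt_token_ids)
        previous_output = [0] * output_tokens
        steps += 1


print(steps_until_crash(16))  # (5, 17)
```

However small the per-chunk growth, the number of steps before the crash is finite, which is why the real failure is deterministic rather than load-dependent.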

How did we fix it?

Three layers of defense (all in scheduler.py):

  1. _handle_stopped_request: After calling _update_request_as_session, check if request.num_tokens >= max_model_len. If so, finish the request with FINISHED_LENGTH_CAPPED instead of re-enqueuing it.

  2. add_request: Same check when a WAITING_FOR_STREAMING_REQ session receives a new streaming chunk. Calls finish_requests() to cleanly remove it.

  3. WAITING scheduling path: Added max_model_len clamp (min(num_new_tokens, max_model_len - 1 - num_computed_tokens)) mirroring the existing RUNNING path guard. Changed assert num_new_tokens > 0 to a graceful break when the clamp yields ≤ 0.
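The guards described above can be illustrated with a small standalone sketch. This is not the code from the PR: the function names and the tuple-based request representation are hypothetical, and only the clamp expression and the `>= max_model_len` check are taken from the description.

```python
def clamp_new_tokens(num_tokens: int, num_computed_tokens: int, max_model_len: int) -> int:
    """Clamp the tokens to schedule so the request stays below max_model_len."""
    num_new_tokens = num_tokens - num_computed_tokens
    return min(num_new_tokens, max_model_len - 1 - num_computed_tokens)


def should_cap_session(num_tokens: int, max_model_len: int) -> bool:
    """True when a streaming session should be finished instead of re-enqueued."""
    return num_tokens >= max_model_len


def schedule_waiting(requests, max_model_len: int):
    """Toy WAITING loop over (num_tokens, num_computed_tokens) pairs."""
    scheduled = []
    for num_tokens, num_computed_tokens in requests:
        num_new_tokens = clamp_new_tokens(num_tokens, num_computed_tokens, max_model_len)
        if num_new_tokens <= 0:
            # Graceful stop where an assert used to be.
            break
        scheduled.append(num_new_tokens)
    return scheduled


print(clamp_new_tokens(16390, 16000, 16384))   # 383
print(should_cap_session(16384, 16384))        # True
print(schedule_waiting([(100, 0), (16384, 16383), (50, 0)], 16384))  # [100]
```

In the last call the second request has no room left under the limit, so the loop stops there rather than raising.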

Backward compatibility: Default behavior is unchanged — the checks only activate when tokens actually reach max_model_len, which doesn't happen in normal (non-streaming) usage.

How do we know it works?

Added two unit tests in tests/v1/streaming_input/test_scheduler_streaming.py:

  • test_streaming_session_max_model_len_cap_via_handle_stopped: Verifies _handle_stopped_request returns finished=True with FINISHED_LENGTH_CAPPED when a streaming update pushes past max_model_len.
  • test_streaming_session_max_model_len_cap_via_add_request: Verifies add_request finishes and removes the session when a new chunk exceeds the limit.

Fixes #39996

Changed files

  • tests/v1/streaming_input/test_scheduler_streaming.py (modified, +83/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +42/-1)

Your current environment


Steps to Reproduce

  1. Start vLLM on an isolated 24 GB GPU (an RTX 3090 according to the collect_env.py output): vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 --max-model-len 16384
  2. Open a WebSocket connection to /v1/realtime.
  3. Continuously stream base64 PCM audio chunks (16kHz, ~125 chunks/sec) representing room silence.
  4. Wait approximately 15 minutes (KV cache usage rises from 0% to 10% in about 1 minute, then by about 1% per minute, and holds at 19% for several minutes).
  5. Observe the GPU KV cache usage slowly plateau around 19-20%.
  6. The session abruptly crashes.
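For step 3, the silence payloads can be generated with the standard library alone. This is a minimal sketch under stated assumptions: 16-bit mono PCM is assumed (the report only says base64 PCM at 16 kHz), and the event shape (`"type": "input_audio_buffer.append"` with an `"audio"` field) is an assumption modelled on OpenAI-style realtime events, so check it against the server's actual protocol before use. Sending the events over a WebSocket is left out.

```python
import base64
import json

SAMPLE_RATE_HZ = 16_000          # from the report
CHUNKS_PER_SECOND = 125          # from the report (approximate)
SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ // CHUNKS_PER_SECOND  # 128 samples per chunk
BYTES_PER_SAMPLE = 2             # ASSUMPTION: 16-bit mono PCM


def silence_chunk_b64() -> str:
    """Return one chunk of digital silence as base64-encoded PCM."""
    pcm = bytes(SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE)  # all-zero samples
    return base64.b64encode(pcm).decode("ascii")


def build_append_event(audio_b64: str) -> str:
    """Wrap a chunk in a JSON event. The event shape is an ASSUMPTION."""
    return json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64})


event = build_append_event(silence_chunk_b64())
print(SAMPLES_PER_CHUNK)                                  # 128
print(len(base64.b64decode(json.loads(event)["audio"])))  # 256
```

Room silence in the original report contains low-level noise rather than exact zeros, so a faithful reproduction may need recorded ambient audio instead of all-zero samples.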

Actual Behavior (The Smoking Gun)

The crash is not an Out-Of-Memory error. It is a hard AssertionError in the GPU model runner (gpu_model_runner.py), proving that token eviction is not happening for the encoder.

Log snippet right before the crash:

(APIServer pid=2274232) INFO 04-16 11:56:03 [loggers.py:271] ... GPU KV cache usage: 19.2%, Prefix cache hit rate: 0.0%
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:72] Dumping input data for V1 LLM engine...
...
(EngineCore pid=2274342)   File ".../vllm/v1/worker/gpu_model_runner.py", line 3438, in _bookkeeping_sync
(EngineCore pid=2274342)     assert end_idx <= self.max_model_len, (
...
AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 16385 > max_model_len: 16384
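When scanning a long server log for this failure, the token counts can be pulled out of the assertion message with a regular expression. The helper below is a hypothetical convenience, written against the exact message text shown above.

```python
import re

# Matches the tail of the assertion message quoted in the log.
_OVERFLOW_RE = re.compile(r"Total number of tokens: (\d+) > max_model_len: (\d+)")


def parse_overflow(line: str):
    """Return (total_tokens, max_model_len) if the line reports the overflow, else None."""
    match = _OVERFLOW_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "AssertionError: Sampled token IDs exceed the max model length. "
    "Total number of tokens: 16385 > max_model_len: 16384"
)
print(parse_overflow(line))  # (16385, 16384)
```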

### Environment
vLLM Version: 0.19.1rc1.dev335+g10e49d263 (also tested on 0.15.2rc1)
Model: mistralai/Voxtral-Mini-4B-Realtime-2602
GPU: Isolated 24GB VRAM (Nvidia)
max_model_len: Tested with 8192 and 16384 (same behavior; the crash just takes longer with 16384).
OS: Ubuntu 22.04
Python 3.10.12 (main, Mar  3 2026, 11:56:32) [GCC 11.4.0] on linux

### Launch Command:
#!/bin/bash
# Activate the vllm/voxtral venv
source /home/roberto/Voxtral-STT/venv_voxtral/bin/activate
# GPU 1
export CUDA_VISIBLE_DEVICES=1
# Start the vLLM server to serve the multimodal Voxtral model
VLLM_DISABLE_COMPILE_CACHE=1 \
vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --compilation_config '{"cudagraph_mode": "PIECEWISE"}' \
  --port 8000 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 16384 \
  --host 0.0.0.0 2>&1 | tee vllm_log.txt

### Output of the required collect_env.py run:
(venv_voxtral) roberto@I9:~/Voxtral-STT$ wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py
--2026-04-16 12:38:06--  https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35090 (34K) [text/plain]
Saving to: ‘collect_env.py’

collect_env.py                                                      100%[==================================================================================================================================================================>]  34,27K  --.-KB/s    in 0,008s  

2026-04-16 12:38:06 (4,06 MB/s) - ‘collect_env.py’ saved [35090/35090]

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : Could not collect
Clang version                : Could not collect
CMake version                : version 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.10.12 (main, Mar  3 2026, 11:56:32) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-107-generic-x86_64-with-glibc2.35
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.4.131
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version        : 550.144.03
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           39 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
CPU family:                              6
Model:                                   158
Thread(s) per core:                      2
Core(s) per socket:                      8
Socket(s):                               1
Stepping:                                12
CPU max MHz:                             5000,0000
CPU min MHz:                             800,0000
BogoMIPS:                                7200.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities ibpb_exit_to_user
Virtualization:                          VT-x
L1d cache:                               256 KiB (8 instances)
L1i cache:                               256 KiB (8 instances)
L2 cache:                                2 MiB (8 instances)
L3 cache:                                16 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Mitigation; Microcode
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             KVM: Mitigation: VMX disabled
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Mitigation; IBRS
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Mitigation; Microcode
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Mitigation; TSX disabled
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.17.1.4
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.5.0.dev0
[pip3] nvidia-cutlass-dsl-libs-base==4.5.0.dev0
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.28.9
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu129
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu129
[pip3] torchvision==0.26.0+cu129
[pip3] transformers==5.5.4
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1rc1.dev335+g10e49d263 (git sha: 10e49d263)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
  	GPU0	GPU1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV4	0-15	0		N/A
GPU1	NV4	 X 	0-15	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/lib/nvidia-cuda-toolkit/lib64:
CUDA_HOME=/usr/lib/nvidia-cuda-toolkit
CUDA_HOME=/usr/lib/nvidia-cuda-toolkit
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_roberto


### 🐛 Describe the bug

(APIServer pid=2274232) INFO 04-16 11:56:03 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.2 tokens/s, Running: 0 reqs, Waiting: 1 reqs, Deferred: 1 reqs, GPU KV cache usage: 19.2%, Prefix cache hit rate: 0.0%
(EngineCore pid=2274342) ERROR 04-16 11:56:09 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1rc1.dev335+g10e49d263) with config: model='mistralai/Voxtral-Mini-4B-Realtime-2602', speculative_config=None, tokenizer='mistralai/Voxtral-Mini-4B-Realtime-2602', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantiza>
(APIServer pid=2274232) INFO 04-16 11:56:13 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.9 tokens/s, Running: 0 reqs, Waiting: 1 reqs, Deferred: 1 reqs, GPU KV cache usage: 19.2%, Prefix cache hit rate: 0.0%
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=rt-ws-4839ef1a-dfbf-413e-9c2e-0f4d6ac43800-04db37dd-6bba-4060-9579-662ef4c863f8-bc76b474,prompt_token_ids_len=16384,prefill_token_ids_len=None,mm_features=[MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([ 0.0000,  0.0000,  0.0000,  ..., -0.0022, -0.0028, -0.0030],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79]        dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='ebc5aa795fa6fa6af6940bcc6a72f9b4ca3d25bdc3c450666448988b5d51f592', mm_position=PlaceholderRange(offset=0, length=39, is_embed=None), mm_hash='ebc5aa795fa6fa6af6940bcc6a72f9b4ca3d25bdc3c450666448988b5d51f592'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([-0.0007, -0.0003,  0.0003,  ..., -0.0017, -0.0010, -0.0008],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79]        dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='37528b36849fcbaf55176d4cfa0bab6be4031c7bbb762958275819a8b5ef2c8e', mm_position=PlaceholderRange(offset=39, length=2, is_embed=None), mm_hash='37528b36849fcbaf55176d4cfa0bab6be4031c7bbb762958275819a8b5ef2c8e'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([ 0.0014,  0.0015,  0.0021,  ..., -0.0036, -0.0036, -0.0034],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79]        dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='c3f981f213fef677abcd51ec9cd1272716b112a67c27f9b102d432b31aed42d3', mm_position=PlaceholderRange(offset=40, length=2, is_embed=None), mm_hash='c3f981f213fef677abcd51ec9cd1272716b112a67c27f9b102d432b31aed42d3'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([ 0.0029,  0.0031,  0.0039,  ..., -0.0041, -0.0040, -0.0047],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79]        dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='45c18fd6f9ea2cbf72e60674ba3c05299f1e1a4171c4e1153175c8a23fa3129c', mm_position=PlaceholderRange(offset=41, length=2, is_embed=None), mm_hash='45c18fd6f9ea2cbf72e60674ba3c05299f1e1a4171c4e1153175c8a23fa3129c'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([-3.0518e-05, -1.2817e-03, -1.0681e-03,  ..., -1.3086e-01,
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79]         -2.0703e-01, -2.1777e-01], dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='4c2a604020e44dc9f8f836fe8e7d9fc0006fddb2aa44fe261de8b5fec57aa6fb', mm_position=PlaceholderRange(offset=42, length=2, is_embed=None), mm_hash='4c2a604020e44dc9f8f836fe8e7d9fc0006fddb2aa44fe261de8b5fec57aa6fb'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([0.0003, 0.0019, 0.0004,  ..., 0.1748, 0.1279, 0.0542],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79]        dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='bf120195c149ab4b4936072d7094b5640bbdd4fd1024c57a34fe757399a22418', mm_position=PlaceholderRange(offset=43, length=2, is_embed=None), mm_hash='bf120195c149ab4b4936072d7094b5640bbdd4fd1024c57a34fe757399a22418'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([0.2217, 0.1338, 0.0444,  ..., 0.1138, 0.1289, 0.1367],
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [dump_input.py:79]        dtype=torch.bfloat16), field=MultiModalBatchedField(keep_on_cpu=False))}, modality='audio', identifier='d9374561a42c2ec466f3c8ec5157f08c1803eecc8aae01143628cb807cd7ee33', mm_position=PlaceholderRange(offset=44, length=2, is_embed=None), mm_hash='d9374561a42c2ec466f3c8ec5157f08c1803eecc8aae01143628cb807cd7ee33'), MultiModalFeatureSpec(data={'audio_arrays': MultiModalFieldElem(data=tensor([-0.0251, -0.0287, -0.0243,  ...,  0.3242,  0.2852,  0.2373],

...

(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134] EngineCore encountered a fatal error.
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134] Traceback (most recent call last):
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1125, in run_engine_core
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     engine_core.run_busy_loop()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1166, in run_busy_loop
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     self._process_engine_step()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1205, in _process_engine_step
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     outputs, model_executed = self.step_fn()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 523, in step_with_batch_queue
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     model_output = future.result()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return self.__get_result()
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     raise self._exception
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return func(*args, **kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return func(*args, **kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 748, in sample_tokens
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     return func(*args, **kwargs)
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4279, in sample_tokens
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     ) = self._bookkeeping_sync(
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3438, in _bookkeeping_sync
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134]     assert end_idx <= self.max_model_len, (
(EngineCore pid=2274342) ERROR 04-16 11:56:13 [core.py:1134] AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 16385 > max_model_len: 16384
(EngineCore pid=2274342) Process EngineCore:
(EngineCore pid=2274342) Traceback (most recent call last):
(EngineCore pid=2274342)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=2274342)     self.run()
(EngineCore pid=2274342)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore pid=2274342)     self._target(*self._args, **self._kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1136, in run_engine_core
(EngineCore pid=2274342)     raise e
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1125, in run_engine_core
(EngineCore pid=2274342)     engine_core.run_busy_loop()
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1166, in run_busy_loop
(EngineCore pid=2274342)     self._process_engine_step()
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1205, in _process_engine_step
(EngineCore pid=2274342)     outputs, model_executed = self.step_fn()
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 523, in step_with_batch_queue
(EngineCore pid=2274342)     model_output = future.result()
(EngineCore pid=2274342)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(EngineCore pid=2274342)     return self.__get_result()
(EngineCore pid=2274342)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(EngineCore pid=2274342)     raise self._exception
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 84, in collective_rpc
(EngineCore pid=2274342)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2274342)     return func(*args, **kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2274342)     return func(*args, **kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 748, in sample_tokens
(EngineCore pid=2274342)     return self.model_runner.sample_tokens(grammar_output)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=2274342)     return func(*args, **kwargs)
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4279, in sample_tokens
(EngineCore pid=2274342)     ) = self._bookkeeping_sync(
(EngineCore pid=2274342)   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3438, in _bookkeeping_sync
(EngineCore pid=2274342)     assert end_idx <= self.max_model_len, (
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701] AsyncLLM output_handler failed.
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701] Traceback (most recent call last):
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 657, in output_handler
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701]     outputs = await engine_core.get_output_async()
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701]     raise self._format_exception(outputs) from None
(APIServer pid=2274232) ERROR 04-16 11:56:13 [async_llm.py:701] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=2274342) AssertionError: Sampled token IDs exceed the max model length. Total number of tokens: 16385 > max_model_len: 16384
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263] Error in generation: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263] Traceback (most recent call last):
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/entrypoints/openai/realtime/connection.py", line 231, in _run_generation
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     async for output in result_gen:
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 576, in generate
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     out = q.get_nowait() or await q.get()
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/output_processor.py", line 85, in get
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     raise output
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 657, in output_handler
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     outputs = await engine_core.get_output_async()
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]   File "/home/roberto/Voxtral-STT/venv_voxtral/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263]     raise self._format_exception(outputs) from None
(APIServer pid=2274232) ERROR 04-16 11:56:13 [connection.py:263] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W416 11:56:14.202811981 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2274232) INFO:     Shutting down
(APIServer pid=2274232) INFO:     connection closed
(APIServer pid=2274232) INFO:     Waiting for application shutdown.
(APIServer pid=2274232) INFO:     Application shutdown complete.
(APIServer pid=2274232) INFO:     Finished server process [2274232]


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

extent analysis

TL;DR

The most likely fix is to evict old tokens from the encoder KV cache once a session's total token count reaches max_model_len (16384 in this report), so that the next sampled token no longer trips the AssertionError in _bookkeeping_sync that kills the engine and the WebSocket session.

Guidance

  1. Review gpu_model_runner.py: the traceback ends at `assert end_idx <= self.max_model_len` in _bookkeeping_sync. Trace how the request grew to 16384 prompt tokens to find where eviction was expected but did not happen.
  2. Implement token eviction: remove the oldest tokens from the encoder KV cache as the total approaches max_model_len, so the cache never exceeds the limit and the assertion is not triggered.
  3. Verify cache management: after adding eviction, stream audio long enough to pass max_model_len tokens and confirm that the session survives and the AssertionError no longer appears.
  4. Monitor system performance: run under varied loads to check that eviction does not introduce latency regressions or other problems.
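
To make step 1 concrete, the check that fails in the traceback can be reproduced in isolation. This is a minimal sketch, not vLLM's code: `check_sampled_len` is a hypothetical helper that mirrors only the `assert end_idx <= self.max_model_len` line and the message from the log, and it assumes `end_idx` is the number of tokens already held plus the newly sampled ones.

```python
MAX_MODEL_LEN = 16384  # max_seq_len from the engine config in the log


def check_sampled_len(num_existing: int, num_sampled: int,
                      max_model_len: int = MAX_MODEL_LEN) -> int:
    """Hypothetical stand-in for the length check seen in the traceback."""
    # Assumption: end_idx is the existing token count plus the sampled tokens.
    end_idx = num_existing + num_sampled
    assert end_idx <= max_model_len, (
        "Sampled token IDs exceed the max model length. "
        f"Total number of tokens: {end_idx} > max_model_len: {max_model_len}"
    )
    return end_idx


# A request with one slot left passes the check.
print(check_sampled_len(num_existing=16383, num_sampled=1))

# The failing case from the report: prompt_token_ids_len=16384 plus one sampled token.
try:
    check_sampled_len(num_existing=16384, num_sampled=1)
except AssertionError as exc:
    print(exc)
```

The second call fails with the same numbers as the log (16385 > 16384), which suggests the request was already at the limit before sampling, so any eviction would have to happen before this check is reached.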

Example

# Runnable sketch of token eviction. The class and its names are
# illustrative and are not part of vLLM's GPUModelRunner.
class SlidingTokenWindow:
    def __init__(self, max_model_len):
        self.max_model_len = max_model_len
        self.tokens = []

    def append(self, new_tokens):
        self.tokens.extend(new_tokens)
        # Evict the oldest tokens so the total never exceeds max_model_len.
        overflow = len(self.tokens) - self.max_model_len
        if overflow > 0:
            del self.tokens[:overflow]

Notes

  • The provided code snippet is a simplified example and may require modifications to fit the actual implementation.
  • The assertion fires in gpu_model_runner.py, but that may only be where the overflow is detected. Further investigation is needed to determine where tokens should have been evicted and why they were not.
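
When investigating further, it can help to pull the overflow numbers out of the engine log automatically. The sketch below is an assumption-laden helper, not part of vLLM: `parse_overflow` and `OVERFLOW_RE` are hypothetical names, and the pattern matches only the message format seen in this report's log.

```python
import re

# Pattern for the AssertionError message format seen in this report's log.
OVERFLOW_RE = re.compile(
    r"Total number of tokens: (\d+) > max_model_len: (\d+)"
)


def parse_overflow(line: str):
    """Return (total_tokens, max_model_len) if the line reports the overflow, else None."""
    match = OVERFLOW_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "(EngineCore pid=2274342) AssertionError: Sampled token IDs exceed the "
    "max model length. Total number of tokens: 16385 > max_model_len: 16384"
)
print(parse_overflow(sample))
```

Lines that do not contain the message return `None`, so the helper can be run over a whole log file to flag sessions that hit the limit.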

Recommendation

Implement token eviction for the encoder KV cache as described under Guidance. If eviction keeps the total at or below max_model_len, the assertion should no longer fire and long-lived sessions should keep running.
