vllm - ✅(Solved) Fix [RFC] Streaming Video Input for Real-Time Video Understanding [1 pull requests, 2 comments, 1 participants]

vllm · 2026-03-25 20:12:49


vllm-project/vllm#38141•Fetched 2026-04-08 01:32:05


Author

lishunyang12


Extend vLLM's existing realtime streaming infrastructure to accept video frames as input, enabling continuous video understanding with incremental KV cache management.


Fix Action

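Proposed, not merged

Proposed in PR: [Realtime] Add streaming video input support with Qwen3-Omni (https://github.com/vllm-project/vllm/pull/38142). The PR state listed below is closed with merged: False, so this PR did not land in vLLM.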

PR fix notes

PR #38142: [Realtime] Add streaming video input support with Qwen3-Omni

Repository: vllm-project/vllm
Author: lishunyang12
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/38142

Description (problem / solution / changelog)

Purpose

Extends the existing /v1/realtime WebSocket API to accept streaming video frames alongside audio, with Qwen3-Omni as the first supported model. This is Phase 1 of the streaming video input RFC (#38141).

Currently the realtime API only supports audio streaming (for ASR models like Voxtral, Qwen3-ASR). This PR adds the infrastructure for video frame streaming, following the same patterns:

New protocol events (input_video_frame.append/commit, video_chat.delta/done)
SupportsRealtimeVideo model interface (mirrors SupportsRealtime for audio)
Qwen3OmniRealtimeVideoGeneration model class that buffers frames and builds Qwen3-Omni prompts

Related RFC: #38141

Changes

File	Change
`realtime/protocol.py`	+4 event types for video streaming
`realtime/connection.py`	Video queue, PIL frame decode, `start_video_generation()`
`realtime/serving.py`	`video_model_cls`, `understand_video_realtime()`
`realtime/api_router.py`	Updated docstring with video protocol
`models/interfaces.py`	`SupportsRealtimeVideo` protocol + helper
`models/qwen3_omni_realtime_video.py`	New — Qwen3-Omni video streaming adapter
`models/registry.py`	Register `Qwen3OmniRealtimeVideoGeneration`
`gpu_model_runner.py`, `model_states/default.py`	Detect `supports_realtime_video` for task routing
`test_realtime_video.py`	New — unit tests (protocol, frame decode, queue)
`realtime_video_client.py`	New — example client (webcam/file/dir streaming)

Video Protocol

Client                              Server
  │                                    │
  ├─ session.update {model} ──────────►│
  │                                    ├─ session.created
  │                                    │
  ├─ input_video_frame.append ────────►│  (base64 JPEG/PNG + timestamp)
  ├─ input_video_frame.append ────────►│
  ├─ ...                               │
  ├─ input_video_frame.commit ────────►│  {query: "What do you see?"}
  │                                    │
  │◄── video_chat.delta ──────────────┤  (incremental text)
  │◄── video_chat.delta ──────────────┤
  │◄── video_chat.done ───────────────┤  (final text + usage)
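
The exchange above can be sketched as JSON messages. The PR text names the event types but does not show their schema, so the field names `frame`, `timestamp`, `query`, and `final` below are hypothetical placeholders, as are the helper functions themselves. The sketch only illustrates the base64 round trip of an encoded frame.

```python
import base64
import json


def append_event(image_bytes: bytes, timestamp: float) -> str:
    """Build an input_video_frame.append message (field names are assumed)."""
    return json.dumps({
        "type": "input_video_frame.append",
        "frame": base64.b64encode(image_bytes).decode("ascii"),  # hypothetical field
        "timestamp": timestamp,  # hypothetical field
    })


def commit_event(query: str, final: bool = True) -> str:
    """Build an input_video_frame.commit message (field names are assumed)."""
    return json.dumps({
        "type": "input_video_frame.commit",
        "query": query,
        "final": final,
    })


def decode_frame(message: str) -> bytes:
    """Server-side inverse of append_event: recover the encoded image bytes."""
    event = json.loads(message)
    if event["type"] != "input_video_frame.append":
        raise ValueError(f"unexpected event type: {event['type']}")
    return base64.b64decode(event["frame"])
```

A client would send one `append_event(...)` string per frame over the WebSocket and finish with `commit_event("What do you see?")`.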

Test Plan

Unit tests for protocol serialization (JPEG/PNG roundtrip, event fields)
Unit tests for frame decode (various resolutions, formats)
Unit tests for async queue mechanics (put/get/sentinel)
Integration test with Qwen3-Omni (requires GPU, to be added in follow-up)

pytest tests/entrypoints/openai/realtime/test_realtime_video.py -v

Test Result

Unit tests pass locally (no GPU required).

AI-assisted: Infrastructure code and model adapter were AI-assisted. All changes reviewed and understood by submitter.
Not duplicating existing work — extends #25066 (audio-only) to video modality.

Changed files

examples/online_serving/realtime_video_client.py (added, +250/-0)
tests/entrypoints/openai/realtime/test_realtime_video.py (added, +210/-0)
vllm/entrypoints/openai/realtime/api_router.py (modified, +13/-8)
vllm/entrypoints/openai/realtime/connection.py (modified, +163/-2)
vllm/entrypoints/openai/realtime/protocol.py (modified, +31/-0)
vllm/entrypoints/openai/realtime/serving.py (modified, +45/-1)
vllm/model_executor/models/interfaces.py (modified, +46/-0)
vllm/model_executor/models/qwen3_omni_realtime_video.py (added, +155/-0)
vllm/model_executor/models/registry.py (modified, +4/-0)
vllm/v1/worker/gpu/model_states/default.py (modified, +2/-1)
vllm/v1/worker/gpu_model_runner.py (modified, +2/-1)



Motivation

Real-time video understanding — where a model continuously processes a live camera or video stream and responds to user queries about what it sees — is rapidly becoming a core capability of frontier AI platforms:

Google Gemini Live: camera streaming on Android/iOS
ByteDance Doubao: real-time video call with visual reasoning
Apple Visual Intelligence: on-device camera understanding
NVIDIA Live VLM WebUI: WebRTC webcam → VLM backend

The market demand is massive: AI video analytics ($5-21B in 2025, 22-33% CAGR), manufacturing visual inspection ($30B → $90B by 2033), video surveillance ($6B → $49B by 2035), and robotics/embodied AI (NVIDIA Jetson Thor shipping with onboard VLM inference).

vLLM already has 90% of the infrastructure needed. The v1 engine ships StreamingInput, resumable requests, a WebSocket /v1/realtime endpoint, EVS frame pruning, encoder caching, chunked prefill, prefix caching, and disaggregated encoder support. However, all of this is currently wired only for audio. Extending it to video frames is a natural next step with high impact and relatively low implementation cost.

Use Cases

Use Case	FPS Needed	Latency Target
Robotics / embodied AI	2-8	<500ms
Autonomous driving copilot	4-16	<200ms
Manufacturing QC	1-2	<1s
Security / surveillance	0.5-2	<2s
Accessibility (scene narration)	1-2	<1s
Sports live commentary	8-16	<500ms
Interactive video call (Doubao-style)	1-4	<1s

Benchmarks & SOTA

StreamingBench (human = 91.66%): best model scores 82.80%, showing massive room for improvement. StreamingVLM (MIT+NVIDIA) achieves 8 FPS on a single H100 with bounded memory for 3+ hours. ProVideLLM achieves 10-25 FPS with only 2GB GPU memory at 1B scale.

Proposed Change

Overview

Extend vLLM's existing realtime streaming infrastructure to accept video frames as input, enabling continuous video understanding with incremental KV cache management.

Architecture

Video Frames (WebSocket)
    │
    ▼
RealtimeConnection (extended)
    │  InputVideoFrameAppend events
    │  base64 JPEG/PNG frames
    ▼
OpenAIServingRealtime (extended)
    │  buffer_realtime_video()
    ▼
AsyncGenerator[StreamingInput]
    │  frames → vision encoder → embeddings
    ▼
AsyncLLM._add_streaming_input_request(resumable=True)
    │  chunked prefill for new frames
    │  encoder cache for duplicate frames
    │  prefix caching for shared KV blocks
    ▼
Decode (streaming output)
    │  text / audio response
    ▼
VideoChatDelta events
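
The step from buffered frames to an AsyncGenerator can be illustrated with a queue that is drained until a sentinel arrives. This is a minimal sketch using only `asyncio`. The names `frame_stream` and `_DONE` are hypothetical, and the string payloads stand in for vLLM's `StreamingInput` objects, whose construction the RFC does not show.

```python
import asyncio
from collections.abc import AsyncGenerator

_DONE = object()  # hypothetical sentinel marking the end of the stream


async def frame_stream(queue: asyncio.Queue) -> AsyncGenerator[str, None]:
    """Yield queued items in order until the sentinel is seen."""
    while True:
        item = await queue.get()
        if item is _DONE:
            return
        yield item


async def demo() -> list[str]:
    """Fill a queue with placeholder frames and drain it through frame_stream."""
    queue: asyncio.Queue = asyncio.Queue()
    for name in ("frame-0", "frame-1", "frame-2"):
        queue.put_nowait(name)
    queue.put_nowait(_DONE)
    return [item async for item in frame_stream(queue)]
```

Running `asyncio.run(demo())` returns the three placeholder frames in arrival order. In a real connection, the WebSocket handler would be the producer and the commit event would enqueue the sentinel.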

Key Components

1. Protocol Extension — New WebSocket events:

Client→Server: input_video_frame.append (base64 JPEG/PNG + timestamp), input_video_frame.commit (query + final flag)
Server→Client: video_chat.delta, video_chat.done

2. SupportsRealtimeVideo Interface — Model protocol with buffer_realtime_video() classmethod, following the existing SupportsRealtime pattern for audio.

3. Video Frame Buffer — asyncio.Queue for incoming frames, PIL decode to numpy, size validation, frame rate limiting.
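
The interface in item 2 might look roughly like the following. The RFC gives only the method name `buffer_realtime_video()` and the flag `supports_realtime_video`; the argument list, return type, and the names `SupportsRealtimeVideoSketch`, `is_realtime_video_model`, and `EchoVideoModel` are assumptions made for illustration, not the PR's code.

```python
import asyncio
from collections.abc import AsyncGenerator
from typing import Any, ClassVar, Protocol


class SupportsRealtimeVideoSketch(Protocol):
    """Hypothetical stand-in for the SupportsRealtimeVideo interface."""

    supports_realtime_video: ClassVar[bool]

    @classmethod
    def buffer_realtime_video(
        cls, frames: AsyncGenerator[Any, None], query: str
    ) -> AsyncGenerator[Any, None]:
        """Turn a stream of frames plus a query into model inputs (assumed signature)."""
        ...


def is_realtime_video_model(model: object) -> bool:
    """Flag-based detection; a missing flag means the model is not supported."""
    return bool(getattr(model, "supports_realtime_video", False))


class EchoVideoModel:
    """Toy model that satisfies the sketch by pairing each frame with the query."""

    supports_realtime_video = True

    @classmethod
    async def buffer_realtime_video(cls, frames, query):
        async for frame in frames:
            yield {"frame": frame, "query": query}


async def demo() -> list[dict]:
    async def frames():
        for name in ("a", "b"):
            yield name

    return [x async for x in EchoVideoModel.buffer_realtime_video(frames(), "q")]
```

A flag checked with `getattr` keeps task routing cheap, since the runner only needs to know whether the capability exists before dispatching.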

Implementation Phases

Phase	Description	Target
1 (this PR)	WebSocket video frame ingestion + Qwen3-Omni adapter	End-to-end webcam→text demo
2	EVS frame pruning + encoder cache for streaming	Reduce redundant computation
3	KV cache eviction for long streams (sliding window → attention sink)	1+ hour streaming on A100
4	Bidirectional video+audio streaming	"Doubao experience"
5	Proactive response trigger + disaggregated encoder	Autonomous commentary

Models to Support

Priority	Model	Why
P0	Qwen3-Omni	Native video+audio+speech, 3-stage pipeline, SOTA on 32/36 AV benchmarks
P0	Qwen2.5-VL / Qwen3-VL	StreamingVLM foundation, EVS support
P1	MiniCPM-o 2.6	StreamingBench SOTA, full-duplex
P1	InternVL 2.5/3	Strong accuracy, wide size range

Feedback Period

2 weeks from posting.

CC List

cc @DarkLight1337 @ywang96 (streaming multimodal input owners per #25066)

Related Work

vLLM Issue #25066: Streaming multi-modal input/output (completed for audio)
StreamingVLM (MIT+NVIDIA): https://arxiv.org/abs/2510.09608
LiveVLM: https://arxiv.org/abs/2505.15269
StreamingBench: https://streamingbench.github.io/
NVIDIA Live VLM WebUI: https://github.com/NVIDIA-AI-IOT/live-vlm-webui

Memory Budget Estimates

Qwen2.5-VL-7B on A100 80GB (~60GB available after weights):

1 FPS × 256 tok/frame: ~4-5 hours before KV cache fills
2 FPS: ~1-1.5 hours
With FP8 KV on H100: doubles capacity
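
A back-of-envelope model makes the dependencies in these estimates explicit. The RFC does not state the per-token KV size it assumed, so the layer count, KV head count, and head dimension below are assumed values for a 7B-class model with grouped-query attention, and the function names are hypothetical. Under these inputs the sketch gives roughly 1.1 hours at 1 FPS, which is lower than the figure quoted above; the RFC's numbers presumably rest on assumptions it does not spell out, such as fewer effective tokens per frame.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def hours_until_full(budget_bytes: float, bytes_per_token: int, fps: float, tokens_per_frame: int) -> float:
    """Hours of streaming before the KV budget is exhausted, with no eviction."""
    tokens_per_second = fps * tokens_per_frame
    return budget_bytes / bytes_per_token / tokens_per_second / 3600


# Assumed model shape; not taken from the RFC.
fp16_bytes = kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128, bytes_per_elem=2)
fp8_bytes = kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128, bytes_per_elem=1)

budget = 60e9  # ~60GB left after weights, as stated above
hours_1fps = hours_until_full(budget, fp16_bytes, fps=1, tokens_per_frame=256)
hours_2fps = hours_until_full(budget, fp16_bytes, fps=2, tokens_per_frame=256)
hours_1fps_fp8 = hours_until_full(budget, fp8_bytes, fps=1, tokens_per_frame=256)
```

In this linear model, doubling the frame rate halves the duration and halving the bytes per element doubles it, which matches the FP8 line above.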

Extended analysis

Fix Plan

To extend vLLM's existing realtime streaming infrastructure to accept video frames as input, follow these steps:

Protocol Extension: Implement new WebSocket events for video frame ingestion and processing.
SupportsRealtimeVideo Interface: Create a model protocol with buffer_realtime_video() classmethod.
Video Frame Buffer: Implement an asyncio.Queue for incoming frames with PIL decode to numpy, size validation, and frame rate limiting.

Example code for Video Frame Buffer:

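import asyncio
import base64
import io
import time

import numpy as np
from PIL import Image

class VideoFrameBuffer:
    def __init__(self, max_size, frame_rate_limit, max_pixels=1920 * 1080):
        self.queue = asyncio.Queue(max_size)
        self.min_interval = 1.0 / frame_rate_limit  # seconds between accepted frames
        self.max_pixels = max_pixels  # assumed size cap, not specified in the RFC
        self._last_put = float("-inf")

    async def put(self, frame_b64):
        # Rate limit: drop frames that arrive faster than frame_rate_limit per second.
        now = time.monotonic()
        if now - self._last_put < self.min_interval:
            return False
        # Decode the base64 JPEG/PNG payload and validate its size.
        img = Image.open(io.BytesIO(base64.b64decode(frame_b64))).convert("RGB")
        if img.width * img.height > self.max_pixels:
            raise ValueError("frame exceeds max_pixels")
        # When the queue is full, discard the oldest frame so the newest is kept.
        if self.queue.full():
            self.queue.get_nowait()
        self.queue.put_nowait(np.asarray(img))
        self._last_put = now
        return True

    async def get(self):
        return await self.queue.get()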

Verification

To verify the fix, test the end-to-end webcam→text demo with the following steps:

Send video frames to the WebSocket endpoint using the input_video_frame.append event.
Verify that the frames are processed correctly and the text response is generated.
Check the latency and frame rate to ensure they meet the target requirements.

Extra Tips

Use a sliding window approach for KV cache eviction to support long streams.
Implement bidirectional video+audio streaming for a "Doubao experience".
Consider using FP8 KV on H100 to double the capacity.
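
The first tip can be illustrated with a small sketch of a sliding window combined with attention sinks: the first few tokens and the most recent ones are kept, and everything in between is evicted. The function `retained_positions` and its parameters `sink` and `window` are hypothetical. The sketch shows only which token positions survive, whereas a real implementation would operate on KV cache blocks.

```python
def retained_positions(total_tokens: int, sink: int, window: int) -> list[int]:
    """Positions kept when the first `sink` and last `window` tokens are retained."""
    if total_tokens <= sink + window:
        # Nothing to evict yet: the whole sequence still fits.
        return list(range(total_tokens))
    recent_start = total_tokens - window
    return list(range(sink)) + list(range(recent_start, total_tokens))
```

For example, `retained_positions(10, sink=2, window=3)` keeps positions 0, 1, 7, 8, and 9, so cache size stays bounded at `sink + window` entries regardless of stream length.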
