vllm - ✅(Solved) Fix [RFC] Streaming Video Input for Real-Time Video Understanding [1 pull requests, 2 comments, 1 participants]

GitHub stats

vllm-project/vllm#38141 · Fetched 2026-04-08 01:32:05
Comments: 2 · Participants: 1 · Timeline events: 9 · Reactions: 0
Timeline (top): commented ×2, cross-referenced ×2, mentioned ×2, subscribed ×2

Extend vLLM's existing realtime streaming infrastructure to accept video frames as input, enabling continuous video understanding with incremental KV cache management.

Fix Action

Fixed

PR fix notes

PR #38142: [Realtime] Add streaming video input support with Qwen3-Omni

Description (problem / solution / changelog)

Purpose

Extends the existing /v1/realtime WebSocket API to accept streaming video frames alongside audio, with Qwen3-Omni as the first supported model. This is Phase 1 of the streaming video input RFC (#38141).

Currently the realtime API only supports audio streaming (for ASR models like Voxtral, Qwen3-ASR). This PR adds the infrastructure for video frame streaming, following the same patterns:

  • New protocol events (input_video_frame.append/commit, video_chat.delta/done)
  • SupportsRealtimeVideo model interface (mirrors SupportsRealtime for audio)
  • Qwen3OmniRealtimeVideoGeneration model class that buffers frames and builds Qwen3-Omni prompts

Related RFC: #38141

Changes

File | Change
realtime/protocol.py | +4 event types for video streaming
realtime/connection.py | Video queue, PIL frame decode, start_video_generation()
realtime/serving.py | video_model_cls, understand_video_realtime()
realtime/api_router.py | Updated docstring with video protocol
models/interfaces.py | SupportsRealtimeVideo protocol + helper
models/qwen3_omni_realtime_video.py | New: Qwen3-Omni video streaming adapter
models/registry.py | Register Qwen3OmniRealtimeVideoGeneration
gpu_model_runner.py, model_states/default.py | Detect supports_realtime_video for task routing
test_realtime_video.py | New: unit tests (protocol, frame decode, queue)
realtime_video_client.py | New: example client (webcam/file/dir streaming)

Video Protocol

Client                              Server
  │                                    │
  ├─ session.update {model} ──────────►│
  │                                    ├─ session.created
  │                                    │
  ├─ input_video_frame.append ────────►│  (base64 JPEG/PNG + timestamp)
  ├─ input_video_frame.append ────────►│
  ├─ ...                               │
  ├─ input_video_frame.commit ────────►│  {query: "What do you see?"}
  │                                    │
  │◄── video_chat.delta ──────────────┤  (incremental text)
  │◄── video_chat.delta ──────────────┤
  │◄── video_chat.done ───────────────┤  (final text + usage)
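
The exchange above can be sketched from the client side as plain JSON messages. This is a minimal illustration, not the PR's client: the event type names and the query field come from the diagram, while the "frame", "timestamp" and "delta" field names and all three helper functions are assumptions made for the example.

```python
import base64
import json


def make_append_event(image_bytes: bytes, timestamp: float) -> str:
    """Wrap one encoded JPEG/PNG frame in an input_video_frame.append event."""
    return json.dumps({
        "type": "input_video_frame.append",
        # Field names "frame" and "timestamp" are assumptions for this sketch.
        "frame": base64.b64encode(image_bytes).decode("ascii"),
        "timestamp": timestamp,
    })


def make_commit_event(query: str) -> str:
    """Ask the server to answer `query` about the frames sent so far."""
    return json.dumps({"type": "input_video_frame.commit", "query": query})


def collect_text(server_messages) -> str:
    """Concatenate video_chat.delta payloads until video_chat.done arrives."""
    parts = []
    for raw in server_messages:
        event = json.loads(raw)
        if event["type"] == "video_chat.delta":
            parts.append(event.get("delta", ""))  # "delta" key is assumed
        elif event["type"] == "video_chat.done":
            break
    return "".join(parts)
```

In a real client each string returned by the two builders would be sent over the /v1/realtime WebSocket, and collect_text would consume the messages received in reply.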

Test Plan

  • Unit tests for protocol serialization (JPEG/PNG roundtrip, event fields)
  • Unit tests for frame decode (various resolutions, formats)
  • Unit tests for async queue mechanics (put/get/sentinel)
  • Integration test with Qwen3-Omni (requires GPU, to be added in follow-up)
pytest tests/entrypoints/openai/realtime/test_realtime_video.py -v
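
The put/get/sentinel mechanics named in the test plan can be illustrated with a small standalone sketch. It assumes that None is used as the end-of-stream sentinel; the PR's actual sentinel value and the names frame_stream and demo are not taken from the source.

```python
import asyncio

_SENTINEL = None  # assumed end-of-stream marker for this sketch


async def frame_stream(queue: asyncio.Queue):
    """Yield frames from the queue until the sentinel is received."""
    while True:
        item = await queue.get()
        if item is _SENTINEL:
            return
        yield item


async def demo():
    """Producer side: enqueue three frames, then the sentinel, then drain."""
    queue = asyncio.Queue()
    for frame in ("frame-0", "frame-1", "frame-2"):
        await queue.put(frame)
    await queue.put(_SENTINEL)
    return [frame async for frame in frame_stream(queue)]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The consumer never needs to know how many frames will arrive; it stops as soon as the sentinel shows up, which is the property a queue unit test would typically assert.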

Test Result

Unit tests pass locally (no GPU required).


  • AI-assisted: Infrastructure code and model adapter were AI-assisted. All changes reviewed and understood by submitter.
  • Not duplicating existing work — extends #25066 (audio-only) to video modality.

Changed files

  • examples/online_serving/realtime_video_client.py (added, +250/-0)
  • tests/entrypoints/openai/realtime/test_realtime_video.py (added, +210/-0)
  • vllm/entrypoints/openai/realtime/api_router.py (modified, +13/-8)
  • vllm/entrypoints/openai/realtime/connection.py (modified, +163/-2)
  • vllm/entrypoints/openai/realtime/protocol.py (modified, +31/-0)
  • vllm/entrypoints/openai/realtime/serving.py (modified, +45/-1)
  • vllm/model_executor/models/interfaces.py (modified, +46/-0)
  • vllm/model_executor/models/qwen3_omni_realtime_video.py (added, +155/-0)
  • vllm/model_executor/models/registry.py (modified, +4/-0)
  • vllm/v1/worker/gpu/model_states/default.py (modified, +2/-1)
  • vllm/v1/worker/gpu_model_runner.py (modified, +2/-1)

Motivation

Real-time video understanding — where a model continuously processes a live camera or video stream and responds to user queries about what it sees — is rapidly becoming a core capability of frontier AI platforms:

  • Google Gemini Live: camera streaming on Android/iOS
  • ByteDance Doubao: real-time video call with visual reasoning
  • Apple Visual Intelligence: on-device camera understanding
  • NVIDIA Live VLM WebUI: WebRTC webcam → VLM backend

The market demand is massive: AI video analytics ($5-21B in 2025, 22-33% CAGR), manufacturing visual inspection ($30B → $90B by 2033), video surveillance ($6B → $49B by 2035), and robotics/embodied AI (NVIDIA Jetson Thor shipping with onboard VLM inference).

vLLM already has 90% of the infrastructure needed. The v1 engine ships StreamingInput, resumable requests, a WebSocket /v1/realtime endpoint, EVS frame pruning, encoder caching, chunked prefill, prefix caching, and disaggregated encoder support. However, all of this is currently wired only for audio. Extending it to video frames is a natural next step with high impact and relatively low implementation cost.

Use Cases

Use Case | FPS Needed | Latency Target
Robotics / embodied AI | 2-8 | <500 ms
Autonomous driving copilot | 4-16 | <200 ms
Manufacturing QC | 1-2 | <1 s
Security / surveillance | 0.5-2 | <2 s
Accessibility (scene narration) | 1-2 | <1 s
Sports live commentary | 8-16 | <500 ms
Interactive video call (Doubao-style) | 1-4 | <1 s

Benchmarks & SOTA

StreamingBench (human = 91.66%): best model scores 82.80%, showing massive room for improvement. StreamingVLM (MIT+NVIDIA) achieves 8 FPS on a single H100 with bounded memory for 3+ hours. ProVideLLM achieves 10-25 FPS with only 2GB GPU memory at 1B scale.

Proposed Change

Overview

Extend vLLM's existing realtime streaming infrastructure to accept video frames as input, enabling continuous video understanding with incremental KV cache management.

Architecture

Video Frames (WebSocket)
RealtimeConnection (extended)
    │  InputVideoFrameAppend events
    │  base64 JPEG/PNG frames
OpenAIServingRealtime (extended)
    │  buffer_realtime_video()
AsyncGenerator[StreamingInput]
    │  frames → vision encoder → embeddings
AsyncLLM._add_streaming_input_request(resumable=True)
    │  chunked prefill for new frames
    │  encoder cache for duplicate frames
    │  prefix caching for shared KV blocks
Decode (streaming output)
    │  text / audio response
VideoChatDelta events

Key Components

1. Protocol Extension — New WebSocket events:

  • Client→Server: input_video_frame.append (base64 JPEG/PNG + timestamp), input_video_frame.commit (query + final flag)
  • Server→Client: video_chat.delta, video_chat.done

2. SupportsRealtimeVideo Interface — Model protocol with buffer_realtime_video() classmethod, following the existing SupportsRealtime pattern for audio.

3. Video Frame Buffer: asyncio.Queue for incoming frames, PIL decode to numpy, size validation, frame rate limiting.

4. Qwen3-Omni Integration — First target model. Qwen3OmniRealtimeVideoGeneration subclass that buffers frames and builds prompts with <|vision_start|><|video_pad|><|vision_end|> template.
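
To make the interface idea in component 2 concrete, here is a rough sketch of what a model protocol with a buffer_realtime_video() classmethod could look like. Only the names SupportsRealtimeVideo, buffer_realtime_video and supports_realtime_video appear in the source; the method signature, the chunking behaviour, the helper is_realtime_video_model and the class ToyVideoModel are assumptions and do not reproduce vLLM's code.

```python
import asyncio
from typing import Any, AsyncGenerator, ClassVar, Protocol


class SupportsRealtimeVideo(Protocol):
    """Assumed shape of the interface; the real signature may differ."""

    supports_realtime_video: ClassVar[bool]

    @classmethod
    def buffer_realtime_video(
        cls, frames: AsyncGenerator[Any, None]
    ) -> AsyncGenerator[Any, None]: ...


def is_realtime_video_model(model_cls: type) -> bool:
    """Hypothetical helper: detect the capability flag for task routing."""
    return bool(getattr(model_cls, "supports_realtime_video", False))


class ToyVideoModel:
    """Stand-in model that groups incoming frames into fixed-size chunks."""

    supports_realtime_video = True

    @classmethod
    async def buffer_realtime_video(cls, frames, frames_per_chunk=2):
        chunk = []
        async for frame in frames:
            chunk.append(frame)
            if len(chunk) == frames_per_chunk:
                yield list(chunk)
                chunk.clear()
        if chunk:  # flush a final partial chunk
            yield chunk


async def _frames(n):
    for i in range(n):
        yield f"frame-{i}"


async def demo():
    return [c async for c in ToyVideoModel.buffer_realtime_video(_frames(5))]
```

A flag check via getattr is used here instead of issubclass because runtime protocol checks do not support issubclass when the protocol declares non-method members.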

Implementation Phases

Phase | Description | Target
1 (this PR) | WebSocket video frame ingestion + Qwen3-Omni adapter | End-to-end webcam→text demo
2 | EVS frame pruning + encoder cache for streaming | Reduce redundant computation
3 | KV cache eviction for long streams (sliding window → attention sink) | 1+ hour streaming on A100
4 | Bidirectional video+audio streaming | "Doubao experience"
5 | Proactive response trigger + disaggregated encoder | Autonomous commentary
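
The Phase 3 eviction idea (keep a few initial "sink" tokens plus a sliding window of recent tokens) can be illustrated with simple index bookkeeping. This is a sketch of the general technique only; the function name and parameters are hypothetical, and a real implementation in vLLM would operate on KV cache blocks rather than on Python lists.

```python
def retained_positions(num_tokens: int, num_sink: int, window: int) -> list[int]:
    """Token positions kept when the cache holds sink tokens plus a recent window."""
    if num_tokens <= num_sink + window:
        # Nothing needs to be evicted yet.
        return list(range(num_tokens))
    sinks = list(range(num_sink))                            # earliest tokens
    recent = list(range(num_tokens - window, num_tokens))    # newest tokens
    return sinks + recent
```

For example, with 10 tokens, 2 sink tokens and a window of 3, positions 0, 1, 7, 8 and 9 are kept and everything in between is evicted, so memory stays bounded at num_sink + window tokens regardless of stream length.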

Models to Support

Priority | Model | Why
P0 | Qwen3-Omni | Native video+audio+speech, 3-stage pipeline, SOTA on 32/36 AV benchmarks
P0 | Qwen2.5-VL / Qwen3-VL | StreamingVLM foundation, EVS support
P1 | MiniCPM-o 2.6 | StreamingBench SOTA, full-duplex
P1 | InternVL 2.5/3 | Strong accuracy, wide size range

Feedback Period

2 weeks from posting.

CC List

cc @DarkLight1337 @ywang96 (streaming multimodal input owners per #25066)

Related Work

Memory Budget Estimates

Qwen2.5-VL-7B on A100 80GB (~60GB available after weights):

  • 1 FPS × 256 tok/frame: ~4-5 hours before KV cache fills
  • 2 FPS: ~1-1.5 hours
  • With FP8 KV on H100: doubles capacity
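
Estimates of this kind follow from a short calculation: KV bytes per token, divided into the memory budget, divided by the token arrival rate. The sketch below shows that arithmetic in general form. The function names are hypothetical, and the model dimensions in the usage lines are placeholder values chosen for illustration, not the configuration of Qwen2.5-VL-7B, so the result is not expected to reproduce the figures above.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV cache footprint of a single token across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def streaming_hours(budget_bytes, bytes_per_token, fps, tokens_per_frame):
    """Hours of video until the KV cache budget is exhausted (no eviction)."""
    tokens_per_second = fps * tokens_per_frame
    return budget_bytes / bytes_per_token / tokens_per_second / 3600


# Placeholder dimensions (assumptions): 32 layers, 8 KV heads, head_dim 128.
fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
fp8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)
ratio = (streaming_hours(60e9, fp8, fps=1, tokens_per_frame=256)
         / streaming_hours(60e9, fp16, fps=1, tokens_per_frame=256))
```

Halving bytes_per_elem halves the per-token footprint, so ratio comes out as 2, which matches the statement that FP8 KV doubles capacity.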

extent analysis

Fix Plan

To extend vLLM's existing realtime streaming infrastructure to accept video frames as input, follow these steps:

  • Protocol Extension: Implement new WebSocket events for video frame ingestion and processing.
  • SupportsRealtimeVideo Interface: Create a model protocol with buffer_realtime_video() classmethod.
  • Video Frame Buffer: Implement an asyncio.Queue for incoming frames with PIL decode to numpy, size validation, and frame rate limiting.

Example code for Video Frame Buffer:

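import asyncio
import base64
import io
import time

import numpy as np
from PIL import Image

class VideoFrameBuffer:
    def __init__(self, max_size, frame_rate_limit, max_bytes=10_000_000):
        self.queue = asyncio.Queue(maxsize=max_size)
        self.min_interval = 1.0 / frame_rate_limit  # seconds between frames
        self.max_bytes = max_bytes  # illustrative cap on encoded frame size
        self._last_accept = float("-inf")

    async def put(self, frame_b64):
        # Enforce the frame rate limit: drop frames that arrive too soon.
        now = time.monotonic()
        if now - self._last_accept < self.min_interval:
            return False
        # Frames arrive as base64 JPEG/PNG, so decode rather than assume raw RGB.
        data = base64.b64decode(frame_b64)
        if len(data) > self.max_bytes:
            raise ValueError("encoded frame exceeds max_bytes")
        img = Image.open(io.BytesIO(data)).convert("RGB")
        # Drop the oldest frame when full so the newest one is kept.
        if self.queue.full():
            self.queue.get_nowait()
        self.queue.put_nowait(np.asarray(img))
        self._last_accept = now
        return True

    async def get(self):
        return await self.queue.get()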

Verification

To verify the fix, test the end-to-end webcam→text demo with the following steps:

  1. Send video frames to the WebSocket endpoint using the input_video_frame.append event.
  2. Verify that the frames are processed correctly and the text response is generated.
  3. Check the latency and frame rate to ensure they meet the target requirements.
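
Step 3 can be made measurable with a few lines of client-side bookkeeping. The sketch below assumes the client records a monotonic timestamp for each frame it sends, for the commit event, and for the first video_chat.delta it receives; the function names and the choice of time-to-first-delta as the latency metric are assumptions for illustration.

```python
def effective_fps(send_times):
    """Average frames per second over the span of the recorded send times."""
    if len(send_times) < 2:
        return 0.0
    span = send_times[-1] - send_times[0]
    return (len(send_times) - 1) / span if span > 0 else 0.0


def first_delta_latency(commit_time, first_delta_time):
    """Seconds between sending the commit and receiving the first delta."""
    return first_delta_time - commit_time


def meets_target(send_times, commit_time, first_delta_time,
                 min_fps, max_latency_s):
    """True if both the frame rate and the latency target are satisfied."""
    return (effective_fps(send_times) >= min_fps
            and first_delta_latency(commit_time, first_delta_time) <= max_latency_s)
```

For instance, the robotics row of the use case table (at least 2 FPS, under 500 ms) would be checked with min_fps=2 and max_latency_s=0.5.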

Extra Tips

  • Use a sliding window approach for KV cache eviction to support long streams.
  • Implement bidirectional video+audio streaming for a "Doubao experience".
  • Consider using FP8 KV on H100 to double the capacity.
