vllm - 💡(How to fix) Fix [RFC]: Add DeepStream as a video loader backend for GPU-accelerated Video decode [1 participants]

ViranjanPagar · 2026-05-06T17:00:42Z

[vllm] Motivation. Summary: Add a new VideoBackend implementation, deepstream, that decodes video on NVIDIA NVDEC hardware via GStreamer, keeping decoded frame… ### Motivation. Summary: Add a new VideoBackend implementation, deepstream, that decodes video on NVIDIA NVDEC hardware via GStreamer, keeping decoded frames GPU-resident through to the VLM preprocessor. Activated via VLLM_VIDEO_LOADER_BACKEND=deepstream, opt-in, sibling to the existing opencv backend. Measured 1.88× request throughput, 4.4× lower mean TPOT and ~6× lower CPU usage vs the current OpenCV backend on a Qwen2-VL-2B file-decode workload. Motivation: vLLM serves video-language requests through a backend abstraction (VideoBackend) with current implementation opencv which: 1. Decode on CPU — under concurrent serving, video decode contends with the rest of the FastAPI/inference stack for cores. Empirically, the CPU becomes the bottleneck before the GPU does. 2. Materialize frames on CPU as numpy arrays — every frame then crosses CPU→GPU on its way to the model preprocessor, doubling memory bandwidth pressure and burning PCIe lanes. 3. Saturate at high concurrency — the request handler thread stalls with OpenCV backend, so request throughput plateaus regardless of how much GPU you throw at the problem. For H.264 / H.265 / containerized formats (which dominate real-world video workloads), every NVIDIA GPU since Maxwell ships dedicated NVDEC hardware that decodes orders of magnitude faster than CPU and runs independently of the SMs and the CPU, so it doesn't contend with either inference or the request handler. ### Proposed Change. **We propose a deepstream backend that:** - Decodes via NVDEC through a GStreamer pipeline (filesrc → parsebin → nvv4l2decoder → nvvideoconvert → capsfilter → fakesink). - Keeps the decoded surface on GPU; the buffer probe copies into a torch.cuda.Tensor on a dedicated CUDA stream — no D2H, no H2D, no IPC handle, no PCIe round-trip per frame. - Frees the CPU entirely from decode work — the FastAPI worker thread does ~zero decode-related work between accept and engine handoff. **Proposed Changes in below files** **workload** **File Decode Benchmark** Full command line # Environment export HF_HUB_OFFLINE=1 export TRANSFORMERS_OFFLINE=1 export CUDA_MODULE_LOADING=LAZY export VLLM_MULTIMODAL_TENSOR_IPC=1 export VLLM_MEDIA_LOADING_THREAD_COUNT=16 export VLLM_VIDEO_LOADER_BACKEND=deepstream # or "opencv" for the baseline # Server vllm serve /work/deepstream_9.0_vllm/Qwen2-VL-2B-Instruct \ --host 0.0.0.0 \ --port 8000 \ --served-model-name bench-model \ --dtype bfloat16 \ --gpu-memory-utilization 0.7 \ --max-model-len 20000 \ --max-num-seqs 16 \ --enforce-eager \ --trust-remote-code \ --skip-mm-profiling \ --allowed-local-media-path /data \ --limit-mm-per-prompt '{"video": 1}' \ --media-io-kwargs '{"video": {"num_frames": 8}}' **Live-stream capability (not in scope for this RFC, but enabled by the same backend)** The same decode pool also exposes a stream_uri() generator that handles rtsp:// / rtsps:// / rtmp:// URLs, yielding decoded segments at a configurable cadence. We use this downstream to build a live captioning service on top of vLLM. We are not proposing any new vLLM HTTP route in this RFC; the stream_uri capability simply demonstrates that the same backend supports the full range of containerized + live H.264/H.265 sources NVDEC handles. Continuous-feed serving belongs in application-layer code (examples/online_serving/), not in vLLM core. A short downstream demo of segment-by-segment RTSP captioning (output truncated): [Segment 0] PTS 0–10s "A white van is driving down a busy street ..." [Segment 1] PTS 10–20s "The video shows a busy street scene with various vehicles ..." [Segment 2] PTS 20–30s "There are multiple lanes of traffic, including cars and a bus ..." [Segment 3] PTS 30–40s "A mix of pedestrians and vehicles ..." ### Feedback Period. _No response_ ### CC List. _No response_ ### Any Other Things. _No response_ ### Before submitting a new issue... - [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

vllm2026-05-06 17:00:42

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#41843•Fetched 2026-05-07 03:32:32

View on GitHub

Comments

Participants

Timeline

Reactions

Author

ViranjanPagar

Participants

ViranjanPagar

Timeline (top)

labeled ×1

RAW_BUFFERClick to expand / collapse

Motivation.

Summary: Add a new VideoBackend implementation, deepstream, that decodes video on NVIDIA NVDEC hardware via GStreamer, keeping decoded frames GPU-resident through to the VLM preprocessor. Activated via VLLM_VIDEO_LOADER_BACKEND=deepstream, opt-in, sibling to the existing opencv backend. Measured 1.88× request throughput, 4.4× lower mean TPOT and ~6× lower CPU usage vs the current OpenCV backend on a Qwen2-VL-2B file-decode workload.

Motivation: vLLM serves video-language requests through a backend abstraction (VideoBackend) with current implementation opencv which:

Decode on CPU — under concurrent serving, video decode contends with the rest of the FastAPI/inference stack for cores. Empirically, the CPU becomes the bottleneck before the GPU does.
Materialize frames on CPU as numpy arrays — every frame then crosses CPU→GPU on its way to the model preprocessor, doubling memory bandwidth pressure and burning PCIe lanes.
Saturate at high concurrency — the request handler thread stalls with OpenCV backend, so request throughput plateaus regardless of how much GPU you throw at the problem.

For H.264 / H.265 / containerized formats (which dominate real-world video workloads), every NVIDIA GPU since Maxwell ships dedicated NVDEC hardware that decodes orders of magnitude faster than CPU and runs independently of the SMs and the CPU, so it doesn't contend with either inference or the request handler.

Proposed Change.

We propose a deepstream backend that:

Decodes via NVDEC through a GStreamer pipeline (filesrc → parsebin → nvv4l2decoder → nvvideoconvert → capsfilter → fakesink).
Keeps the decoded surface on GPU; the buffer probe copies into a torch.cuda.Tensor on a dedicated CUDA stream — no D2H, no H2D, no IPC handle, no PCIe round-trip per frame.
Frees the CPU entirely from decode work — the FastAPI worker thread does ~zero decode-related work between accept and engine handoff.

Proposed Changes in below files
<img width="1719" height="219" alt="Image" src="https://github.com/user-attachments/assets/2c04b5af-7591-43c8-94d0-d58d897f1563" />

workload <img width="1238" height="378" alt="Image" src="https://github.com/user-attachments/assets/e90a7ac2-a6d3-4ffc-ab31-f4433b904c7e" />

File Decode Benchmark <img width="730" height="579" alt="Image" src="https://github.com/user-attachments/assets/3720670d-8060-4281-b902-62e3c3bf384e" />

Full command line

Environment

export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1 export CUDA_MODULE_LOADING=LAZY
export VLLM_MULTIMODAL_TENSOR_IPC=1 export VLLM_MEDIA_LOADING_THREAD_COUNT=16
export VLLM_VIDEO_LOADER_BACKEND=deepstream # or "opencv" for the baseline

Server

vllm serve /work/deepstream_9.0_vllm/Qwen2-VL-2B-Instruct \
--host 0.0.0.0
--port 8000 \
--served-model-name bench-model
--dtype bfloat16 \
--gpu-memory-utilization 0.7
--max-model-len 20000 \
--max-num-seqs 16
--enforce-eager \
--trust-remote-code
--skip-mm-profiling \
--allowed-local-media-path /data
--limit-mm-per-prompt '{"video": 1}' \
--media-io-kwargs '{"video": {"num_frames": 8}}'

Live-stream capability (not in scope for this RFC, but enabled by the same backend)
The same decode pool also exposes a stream_uri() generator that handles rtsp:// / rtsps:// / rtmp:// URLs, yielding decoded segments at a configurable cadence. We use this downstream to build a live captioning service on top of vLLM. We are not proposing any new vLLM HTTP route
in this RFC; the stream_uri capability simply demonstrates that the same backend supports the full range of containerized + live H.264/H.265 sources NVDEC handles. Continuous-feed serving belongs in application-layer code (examples/online_serving/), not in vLLM core.

A short downstream demo of segment-by-segment RTSP captioning (output truncated):

[Segment 0] PTS 0–10s "A white van is driving down a busy street ..."
[Segment 1] PTS 10–20s "The video shows a busy street scene with various vehicles ..."
[Segment 2] PTS 20–30s "There are multiple lanes of traffic, including cars and a bus ..."
[Segment 3] PTS 30–40s "A mix of pedestrians and vehicles ..."

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #inference speed #output truncation #response parsing #generation error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [RFC]: Add DeepStream as a video loader backend for GPU-accelerated Video decode [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Motivation.

Proposed Change.

Environment

Server

Feedback Period.

CC List.

Any Other Things.

Before submitting a new issue...

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [RFC]: Add DeepStream as a video loader backend for GPU-accelerated Video decode [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Motivation.

Proposed Change.

Environment

Server

Feedback Period.

CC List.

Any Other Things.

Before submitting a new issue...

Still need to ship something?

RELATED_DISCOVERY

TRENDING