vllm - 💡(How to fix) Fix [RFC]: Add DeepStream as a video loader backend for GPU-accelerated Video decode [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#41843Fetched 2026-05-07 03:32:32
View on GitHub
Comments
0
Participants
1
Timeline
1
Reactions
0
Participants
Timeline (top)
labeled ×1
RAW_BUFFERClick to expand / collapse

Motivation.

Summary: Add a new VideoBackend implementation, deepstream, that decodes video on NVIDIA NVDEC hardware via GStreamer, keeping decoded frames GPU-resident through to the VLM preprocessor. Activated via VLLM_VIDEO_LOADER_BACKEND=deepstream, opt-in, sibling to the existing opencv backend. Measured 1.88× request throughput, 4.4× lower mean TPOT and ~6× lower CPU usage vs the current OpenCV backend on a Qwen2-VL-2B file-decode workload.

Motivation: vLLM serves video-language requests through a backend abstraction (VideoBackend) with current implementation opencv which:

  1. Decode on CPU — under concurrent serving, video decode contends with the rest of the FastAPI/inference stack for cores. Empirically, the CPU becomes the bottleneck before the GPU does.
  2. Materialize frames on CPU as numpy arrays — every frame then crosses CPU→GPU on its way to the model preprocessor, doubling memory bandwidth pressure and burning PCIe lanes.
  3. Saturate at high concurrency — the request handler thread stalls with OpenCV backend, so request throughput plateaus regardless of how much GPU you throw at the problem.

For H.264 / H.265 / containerized formats (which dominate real-world video workloads), every NVIDIA GPU since Maxwell ships dedicated NVDEC hardware that decodes orders of magnitude faster than CPU and runs independently of the SMs and the CPU, so it doesn't contend with either inference or the request handler.

Proposed Change.

We propose a deepstream backend that:

  • Decodes via NVDEC through a GStreamer pipeline (filesrc → parsebin → nvv4l2decoder → nvvideoconvert → capsfilter → fakesink).
  • Keeps the decoded surface on GPU; the buffer probe copies into a torch.cuda.Tensor on a dedicated CUDA stream — no D2H, no H2D, no IPC handle, no PCIe round-trip per frame.
  • Frees the CPU entirely from decode work — the FastAPI worker thread does ~zero decode-related work between accept and engine handoff.

Proposed Changes in below files
<img width="1719" height="219" alt="Image" src="https://github.com/user-attachments/assets/2c04b5af-7591-43c8-94d0-d58d897f1563" />

workload <img width="1238" height="378" alt="Image" src="https://github.com/user-attachments/assets/e90a7ac2-a6d3-4ffc-ab31-f4433b904c7e" />

File Decode Benchmark <img width="730" height="579" alt="Image" src="https://github.com/user-attachments/assets/3720670d-8060-4281-b902-62e3c3bf384e" />

Full command line

Environment

export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1 export CUDA_MODULE_LOADING=LAZY
export VLLM_MULTIMODAL_TENSOR_IPC=1 export VLLM_MEDIA_LOADING_THREAD_COUNT=16
export VLLM_VIDEO_LOADER_BACKEND=deepstream # or "opencv" for the baseline

Server

vllm serve /work/deepstream_9.0_vllm/Qwen2-VL-2B-Instruct \
--host 0.0.0.0
--port 8000 \
--served-model-name bench-model
--dtype bfloat16 \
--gpu-memory-utilization 0.7
--max-model-len 20000 \
--max-num-seqs 16
--enforce-eager \
--trust-remote-code
--skip-mm-profiling \
--allowed-local-media-path /data
--limit-mm-per-prompt '{"video": 1}' \
--media-io-kwargs '{"video": {"num_frames": 8}}'

<img width="1618" height="542" alt="Image" src="https://github.com/user-attachments/assets/f8293757-f52a-464d-9c33-6b979f66d6c2" />

Live-stream capability (not in scope for this RFC, but enabled by the same backend)
The same decode pool also exposes a stream_uri() generator that handles rtsp:// / rtsps:// / rtmp:// URLs, yielding decoded segments at a configurable cadence. We use this downstream to build a live captioning service on top of vLLM. We are not proposing any new vLLM HTTP route
in this RFC; the stream_uri capability simply demonstrates that the same backend supports the full range of containerized + live H.264/H.265 sources NVDEC handles. Continuous-feed serving belongs in application-layer code (examples/online_serving/), not in vLLM core.

A short downstream demo of segment-by-segment RTSP captioning (output truncated):

[Segment 0] PTS 0–10s "A white van is driving down a busy street ..."
[Segment 1] PTS 10–20s "The video shows a busy street scene with various vehicles ..."
[Segment 2] PTS 20–30s "There are multiple lanes of traffic, including cars and a bus ..."
[Segment 3] PTS 30–40s "A mix of pedestrians and vehicles ..."

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [RFC]: Add DeepStream as a video loader backend for GPU-accelerated Video decode [1 participants]