vllm - ✅(Solved) Fix [RFC]: Support ViT Full CUDA Graph (Tracker) [5 pull requests, 1 participant]

shen-shanshan · 2026-03-26T02:22:01Z



vllm-project/vllm#38175•Fetched 2026-04-08 01:31:51


Fix Action

Fix / Workaround

Multimodal large language models (e.g., Qwen3-VL, Qwen3.5, GLM-V, Kimi K2.5) rely on a Vision Transformer (ViT) encoder to process visual inputs before feeding them into the language model backbone. In production serving scenarios, the ViT forward pass involves launching a large number of small CUDA kernels — including patch embedding, layer normalization, multi-head self-attention, and MLP projections — each of which incurs non-trivial kernel launch overhead on the host side.

PR fix notes

PR #35963: [Feature] ViT Full CUDA Graph

Repository: vllm-project/vllm
Author: b-mu
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/35963

Description (problem / solution / changelog)

Purpose

Add full CUDA graph for the ViT to reduce kernel launch overheads.

Features:

Budget-based graphs with a maximum batch size:
- Capture CUDA graphs at configurable token budgets (e.g., [256, 512, 1024, 2048, 4096]).
- Pad sequence metadata (e.g. cu_seqlen) so that the same budget-based graph can be reused for varying numbers of images during replays.
Greedy bin-packing:
- Sort images in a batch in ascending order to reduce the number of graphs.
Data-parallel (DP) support:
- When mm_encoder_tp_mode=data, each TP rank runs the ViT independently via data parallelism.
FlashInfer cuDNN attention support:
- Override FlashInfer buckets in the CUDA graph path.
Model-agnostic protocol:
- SupportsEncoderCudaGraph protocol in interfaces.py — models opt in by implementing 9 protocol methods for input handling, metadata computation, and forward dispatch.
- EncoderCudaGraphManager is fully model-agnostic; all model-specific logic (grid config, dummy inputs, embedding computation) lives in the model class.
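
The budget selection and greedy packing described above can be sketched roughly as follows. This is a minimal illustration and not the code from `EncoderCudaGraphManager`: the function names `pick_budget` and `pack_images` are hypothetical, and the assumption is that a batch is closed once adding the next image would exceed the largest budget or the per-batch image limit.

```python
from bisect import bisect_left


def pick_budget(num_tokens, budgets):
    """Return the smallest captured budget that fits num_tokens, or None."""
    idx = bisect_left(budgets, num_tokens)
    return budgets[idx] if idx < len(budgets) else None


def pack_images(token_counts, budgets, max_images_per_batch):
    """Greedily group images, visited in ascending token order, into batches.

    Returns a list of (budget, [image indices]). The budget is None when the
    batch does not fit any captured graph (hypothetical eager fallback).
    """
    budgets = sorted(budgets)
    max_budget = budgets[-1]
    order = sorted(range(len(token_counts)), key=lambda i: token_counts[i])

    batches, current, used = [], [], 0
    for i in order:
        n = token_counts[i]
        batch_full = len(current) >= max_images_per_batch
        if current and (batch_full or used + n > max_budget):
            # Close the current batch and map it to the smallest fitting graph.
            batches.append((pick_budget(used, budgets), current))
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append((pick_budget(used, budgets), current))
    return batches


if __name__ == "__main__":
    plan = pack_images([900, 100, 300, 5000, 200],
                       [256, 512, 1024, 2048, 4096], 8)
    for budget, indices in plan:
        print(budget, indices)
```

Visiting the images in ascending order lets small images share one graph replay, while an image larger than every budget ends up alone in a batch with no matching graph.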

New config flags (via --compilation-config):

cudagraph_mm_encoder: true — enable encoder CUDA graph
encoder_cudagraph_token_budgets: [...] — list of token budget sizes to capture
encoder_cudagraph_max_images_per_batch: N — max images per graph replay
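
As a small illustration, the value passed to `--compilation-config` is a JSON object, so it can be generated from a Python dict. The flag names below are the ones listed above; the budget values are only an example, and the variable names are placeholders.

```python
import json

# Example values only; the keys are the flags introduced by this PR.
compilation_config = {
    "cudagraph_mm_encoder": True,
    "encoder_cudagraph_token_budgets": [512, 1024, 2048, 4096],
    "encoder_cudagraph_max_images_per_batch": 8,
}

# json.dumps turns Python's True into JSON's lowercase true.
cli_value = json.dumps(compilation_config)
print(f"--compilation-config '{cli_value}'")
```

Generating the string this way avoids hand-written quoting mistakes in the shell command.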

Files changed:

vllm/config/compilation.py — new config flags
vllm/model_executor/models/interfaces.py — SupportsEncoderCudaGraph protocol and supports_encoder_cudagraph() type guard
vllm/model_executor/models/qwen3_vl.py — implement SupportsEncoderCudaGraph on Qwen3VLForConditionalGeneration
vllm/v1/worker/gpu/mm/encoder_cudagraph_defs.py — EncoderCudaGraphConfig, EncoderCudaGraphCaptureInputs, EncoderCudaGraphReplayBuffers dataclasses
vllm/v1/worker/gpu/mm/encoder_cudagraph.py — EncoderCudaGraphManager (capture, replay, packing, DP)
vllm/v1/worker/gpu_model_runner.py — integration into V1 model runner
tests/v1/cudagraph/test_encoder_cudagraph.py — unit and GPU tests

cc @maxyanghu @wangshangsam @Anerudhan

Test Plan

Unit Tests:

pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v

End-to-End Tests:

Single GPU: Qwen3-VL-30B-A3B-Instruct, VisionArena-Chat dataset, 3000 prompts + 300 warmup

vllm bench mm-processor \
  --model Qwen/Qwen3-VL-30B-A3B-Instruct \
  --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 3000 --num-warmups 300 \
  --max-model-len 32768 --seed 42 \
  --mm-encoder-attn-backend FLASHINFER \
  --compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4864], "encoder_cudagraph_max_images_per_batch": 8}'

Multi GPU: Qwen3-VL-32B-Instruct, 4×GB200 TP=4 + ViT DP=4, random-mm dataset (20 imgs/req, 336×336), 1000 prompts + 200 warmup

vllm bench mm-processor \
  --model Qwen/Qwen3-VL-32B-Instruct \
  --dataset-name random-mm \
  --random-mm-base-items-per-request 20 \
  --random-mm-num-mm-items-range-ratio 0.0 \
  --random-mm-bucket-config '{"(336,336,1)": 1.0}' \
  --num-prompts 1000 --num-warmups 200 \
  --max-model-len 8192 --seed 42 \
  --mm-encoder-attn-backend FLASHINFER \
  --tensor-parallel-size 4 --mm-encoder-tp-mode data \
  --compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4864], "encoder_cudagraph_max_images_per_batch": 8}'

Test Result

Single GPU (Qwen3-VL-30B, 1×GB200, VisionArena-Chat, 3000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +11.8% (5.13→4.52ms) | +31.6% (9.16→6.26ms) |
| FLASH_ATTN | +19.6% (5.42→4.36ms) | +40.3% (10.87→6.49ms) |

Multi GPU (Qwen3-VL-32B, 4×GB200 TP=4 DP=4, random-mm 20img/req, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +18.4% (28.39→23.16ms) | +14.0% (238.78→205.28ms) |
| FLASHINFER | +44.4% (23.24→12.91ms) | +84.9% (172.41→26.05ms) |

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

</details>

Changed files

tests/v1/cudagraph/test_encoder_cudagraph.py (added, +451/-0)
vllm/config/compilation.py (modified, +32/-0)
vllm/model_executor/models/interfaces.py (modified, +141/-0)
vllm/model_executor/models/qwen3_vl.py (modified, +270/-30)
vllm/v1/worker/gpu/mm/encoder_cudagraph.py (added, +576/-0)
vllm/v1/worker/gpu/mm/encoder_cudagraph_defs.py (added, +66/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +48/-1)

PR #38061: [MM][Perf][CG] Support ViT full CUDA graph for Qwen3-VL video inference

Repository: vllm-project/vllm
Author: shen-shanshan
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38061

Description (problem / solution / changelog)

Purpose

Following https://github.com/vllm-project/vllm/pull/35963 (which only supports image inference), this PR extends the feature to support video inference for Qwen3-VL.

TODO:

Unit test.
E2E functional test.
Benchmark in some scenarios:
- no DP ViT + eager vs. no DP ViT + CUDA graph.
- DP ViT + eager vs. DP ViT + CUDA graph.
Update "Vision Encoder (ViT) CUDA Graphs" docs.

🤖 AI Summary

Following #35963 (ViT full CUDA graph support for image inference), this PR extends the encoder CUDA graph framework to support video inference for Qwen3-VL. Previously, the CUDA graph capture/replay path only handled image inputs (pixel_values + image_grid_thw). Video inputs use different keys (pixel_values_videos + video_grid_thw) and require larger cu_seqlens buffers because each video item contributes multiple frames (T attention sequences). This PR generalizes the protocol and manager to handle both modalities through a single shared graph manager.
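
To make the buffer sizing concrete, here is a rough sketch of how a padded cu_seqlens array could be built when each frame counts as one attention sequence. The helper `padded_cu_seqlens` is hypothetical, and padding by repeating the last offset (so extra entries are zero-length sequences) is an assumption, not necessarily what the PR does.

```python
from itertools import accumulate


def padded_cu_seqlens(grid_thw, max_frames_per_batch):
    """Build cumulative sequence lengths, padded to a fixed size.

    grid_thw: list of (t, h, w) patch-grid shapes, one per image or video.
    Each of the t frames is treated as one sequence of h * w tokens.
    """
    seq_lens = [h * w for t, h, w in grid_thw for _ in range(t)]
    if len(seq_lens) > max_frames_per_batch:
        raise ValueError("batch has more frames than the captured graph allows")
    cu = [0] + list(accumulate(seq_lens))
    # Repeat the last offset so the padded entries describe empty sequences.
    cu += [cu[-1]] * (max_frames_per_batch + 1 - len(cu))
    return cu


if __name__ == "__main__":
    # One image (t=1) and one 2-frame video, padded to room for 5 frames.
    print(padded_cu_seqlens([(1, 4, 4), (2, 2, 2)], 5))
```

Because the padded array always has `max_frames_per_batch + 1` entries, its shape stays fixed across replays regardless of how many frames a request actually contains.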

Note: Video CUDA graphs are automatically disabled when EVS (Efficient Video Sampling) pruning is enabled, since EVS makes the token count data-dependent and incompatible with CUDA graph capture.

Key Changes:

EncoderCudaGraphConfig (vllm/v1/worker/encoder_cudagraph_defs.py): Replace single input_key field with input_key_by_modality dict (e.g., {"image": "pixel_values", "video": "pixel_values_videos"}) to support per-modality input tensor routing.
SupportsEncoderCudaGraph protocol (vllm/model_executor/models/interfaces.py): Add get_input_modality(mm_kwargs) method to determine whether inputs are image or video. Add max_frames_per_batch parameter to prepare_encoder_cudagraph_capture_inputs() and prepare_encoder_cudagraph_replay_buffers().
Qwen3VLForConditionalGeneration (vllm/model_executor/models/qwen3_vl.py):
- Implement get_input_modality() to route based on mm_kwargs keys.
- Add _get_pixel_values_by_modality() and _get_grid_thw_by_modality() helpers to abstract modality-specific key access across all protocol methods.
- Update prepare_encoder_cudagraph_capture_inputs() to build video-format grid configs (T>1 per item) when max_frames_per_batch exceeds max_batch_size, sizing cu_seqlens buffer for video replays.
- Add replay buffer caching (_replay_buffer_cache) keyed by (modality, grid_thw) to avoid redundant CPU-side NumPy computation for repeated grid shapes.
- Update prepare_encoder_metadata() to accept max_frames_per_batch for cu_seqlens padding, allowing video frames to exceed max_batch_size.
EncoderCudaGraphManager (vllm/v1/worker/encoder_cudagraph.py):
- Add max_frames_per_batch field to BudgetGraphMetadata and manager initialization.
- Rename encoder_cudagraph_max_images_per_batch → encoder_cudagraph_max_mm_items_per_batch for generality.
- Route input_key lookup through get_input_modality() during replay instead of using a fixed key.
CompilationConfig (vllm/config/compilation.py): Add encoder_cudagraph_max_frames_per_batch config option (0 = auto-infer). Rename encoder_cudagraph_max_images_per_batch → encoder_cudagraph_max_mm_items_per_batch.
Tests (tests/v1/cudagraph/test_encoder_cudagraph.py): Add SimpleMockViTVideoModel with dual-modality support, TestGetInputModality (no GPU), and TestEncoderCudaGraphVideoReplay (GPU) covering video capture/replay, fallback, counters, chunking, and mixed image+video through a shared manager. (+316 lines)
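
The modality routing and replay-buffer caching listed above might look roughly like the sketch below. Only the key names (`pixel_values`, `pixel_values_videos`) and the cache key `(modality, grid_thw)` come from the PR description; the function bodies, the `build_fn` callback, and `get_replay_buffers` are hypothetical stand-ins.

```python
# Mapping taken from the PR description's example.
INPUT_KEY_BY_MODALITY = {"image": "pixel_values", "video": "pixel_values_videos"}


def get_input_modality(mm_kwargs):
    """Infer the modality from which pixel-value key is present."""
    for modality, key in INPUT_KEY_BY_MODALITY.items():
        if key in mm_kwargs:
            return modality
    raise ValueError("mm_kwargs contains neither image nor video inputs")


_replay_buffer_cache = {}


def get_replay_buffers(mm_kwargs, grid_thw, build_fn):
    """Return cached buffers for (modality, grid_thw), building them once.

    build_fn is a hypothetical callback standing in for the CPU-side
    computation that the cache is meant to avoid repeating.
    """
    modality = get_input_modality(mm_kwargs)
    cache_key = (modality, tuple(tuple(g) for g in grid_thw))
    if cache_key not in _replay_buffer_cache:
        _replay_buffer_cache[cache_key] = build_fn(grid_thw)
    return _replay_buffer_cache[cache_key]


if __name__ == "__main__":
    calls = []
    build = lambda grid: calls.append(grid) or {"num_items": len(grid)}
    get_replay_buffers({"pixel_values_videos": None}, [[8, 16, 16]], build)
    get_replay_buffers({"pixel_values_videos": None}, [[8, 16, 16]], build)
    print(len(calls))  # the second lookup is served from the cache
```

Converting `grid_thw` to nested tuples makes it hashable, which is what allows repeated grid shapes to hit the cache.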

Test Plan

Unit test:

pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v

Functional test:

# Pass compilation_config to EngineArgs in run_qwen3_vl()
# compilation_config={
#     "cudagraph_mm_encoder": True,
#     "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048],
#     "encoder_cudagraph_max_mm_items_per_batch": 4,
#     "encoder_cudagraph_max_frames_per_batch": 32,
# }
python examples/offline_inference/vision_language.py -m qwen3_vl --modality "video"

Benchmark:

# Single GPU:
vllm bench mm-processor \
--model /shared/models/modelscope/models/Qwen/Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 1}' \
--num-prompts 100 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096], "encoder_cudagraph_max_mm_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model /shared/models/modelscope/models/Qwen/Qwen3-VL-32B-Instruct \
--max-model-len 8192 \
--dataset-name random-mm \
--random-mm-base-items-per-request 4 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 4}' \
--num-prompts 100 \
--seed 42 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASHINFER \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096], "encoder_cudagraph_max_mm_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

Test Result

✅ Unit test:

tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_exact_powers_of_2 PASSED
tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_max_not_power_of_2 PASSED
tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_min_equals_max PASSED
...
36 passed, 3 warnings in 10.04s

✅ Functional test:

--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby’s serious expression and focused demeanor while pretending to read, combined with the fact that they are so young and unable to actually read, creates a humorous contrast. The baby’s movements
--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby's serious expression and focused posture, combined with the fact that they are clearly not reading in the traditional sense, create a humorous contrast. The baby's attempts to turn the pages
--------------------------------------------------
The video is funny because it captures a toddler wearing glasses and pretending to read a book, which is an adorable and endearing sight. The child's focused expression and the way they turn the pages with their hands, as if they are truly engrossed in the book, adds to the humor. The fact that the
--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby's serious demeanor and focused expression while holding the book add to the humor, as it creates a comical contrast between the baby's innocent actions and the adult-like behavior of reading
--------------------------------------------------

✅ Benchmark:

Single GPU (Qwen3-VL-8B-Instruct, 1xA100, random-mm, 100 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | -24.52% (3.67ms -> 4.57ms) | +61.66% (17.03ms -> 6.53ms) |
| FLASHINFER | +21.84% (8.38ms -> 6.55ms) | +87.60% (58.62ms -> 7.27ms) |

Multi GPU (Qwen3-VL-32B-Instruct, 4xA100, random-mm, 100 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +13.44% (5.43ms -> 4.70ms) | +83.22% (51.25ms -> 8.60ms) |
| FLASHINFER | +21.37% (8.75ms -> 6.88ms) | +82.85% (77.77ms -> 13.34ms) |


Changed files

docs/design/cuda_graphs_multimodal.md (modified, +66/-7)
tests/v1/cudagraph/test_encoder_cudagraph.py (modified, +312/-1)
vllm/config/compilation.py (modified, +20/-4)
vllm/model_executor/models/interfaces.py (modified, +10/-1)
vllm/model_executor/models/qwen3_vl.py (modified, +138/-42)
vllm/v1/worker/encoder_cudagraph.py (modified, +33/-11)
vllm/v1/worker/encoder_cudagraph_defs.py (modified, +4/-2)

PR #38040: [Fix] Invariant Check for Auto-Inferred Budgets/Max Batch Size in ViT CUDA Graph Manager

Repository: vllm-project/vllm
Author: b-mu
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38040

Description (problem / solution / changelog)

Purpose

Previously, the auto-inferred max_batch_size = max_budget // min_budget could exceed min_budget. For small budgets, prepare_encoder_cudagraph_capture_inputs then computed per_image_output = token_budget // max_batch_size = 0, which led to a reshape crash on empty tensors in Qwen3_VisionPatchEmbed.forward. The fix caps the value to min(max_budget // min_budget, min_budget) when both the budgets and the max batch size are auto-inferred. When the user provides either the budgets or the max batch size, the other (auto-inferred) value is adjusted to satisfy the invariant max_batch_size <= min_budget.
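
A small worked example of the invariant, using made-up budgets chosen so that the uncapped value overshoots. The helper `infer_max_batch_size` is a hypothetical illustration of the arithmetic in the description, not the function from the PR.

```python
def infer_max_batch_size(budgets):
    """Return (uncapped, capped) auto-inferred max batch sizes."""
    min_budget, max_budget = min(budgets), max(budgets)
    uncapped = max_budget // min_budget      # behaviour before the fix
    capped = min(uncapped, min_budget)       # behaviour after the fix
    return uncapped, capped


if __name__ == "__main__":
    budgets = [8, 4096]  # hypothetical budgets, picked to trigger the bug
    uncapped, capped = infer_max_batch_size(budgets)
    smallest = min(budgets)
    # Tokens available per image in the smallest budget's capture inputs.
    print("before fix:", smallest // uncapped)  # 8 // 512
    print("after fix: ", smallest // capped)    # 8 // 8
```

With these numbers the uncapped batch size is 512, so the smallest budget leaves zero tokens per image; after capping to 8, each image gets one token and the capture inputs are no longer empty.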

Dependency: merge after PR #37914

Test Plan

4×GB200 NVLink (TP=4, ViT DP=4), Qwen3-VL-32B-Instruct, random-mm dataset (synthetic)

Eager (baseline):

  vllm bench mm-processor \
    --model Qwen/Qwen3-VL-32B-Instruct \
    --dataset-name random-mm \
    --random-mm-base-items-per-request 20 \
    --random-mm-num-mm-items-range-ratio 0.5 \
    --random-mm-bucket-config '{"(224,224,1)": 0.2, "(336,336,1)": 0.3, "(448,448,1)": 0.2, "(672,672,1)": 0.2, "(1008,1008,1)": 0.1}' \
    --num-prompts 1000 --num-warmups 200 \
    --max-model-len 16384 --dtype bfloat16 --seed 42 \
    --mm-encoder-attn-backend FLASH_ATTN \
    --tensor-parallel-size 4 --mm-encoder-tp-mode data

ViT full CUDA Graph:

  vllm bench mm-processor \
    --model Qwen/Qwen3-VL-32B-Instruct \
    --dataset-name random-mm \
    --random-mm-base-items-per-request 20 \
    --random-mm-num-mm-items-range-ratio 0.5 \
    --random-mm-bucket-config '{"(224,224,1)": 0.2, "(336,336,1)": 0.3, "(448,448,1)": 0.2, "(672,672,1)": 0.2, "(1008,1008,1)": 0.1}' \
    --num-prompts 1000 --num-warmups 200 \
    --max-model-len 16384 --dtype bfloat16 --seed 42 \
    --mm-encoder-attn-backend FLASH_ATTN \
    --tensor-parallel-size 4 --mm-encoder-tp-mode data \
    --compilation-config '{"cudagraph_mm_encoder": true}'

Test Result

Encoder Forward Latency (mean):

| Config | FLASH_ATTN | FLASHINFER |
|---|---|---|
| Eager (baseline) | 70.5ms | 57.5ms |
| ViT full CUDA Graph | 43.0ms (+39.1%) | 37.4ms (+35.0%) |

Encoder Forward Latency (median):

| Config | FLASH_ATTN | FLASHINFER |
|---|---|---|
| Eager (baseline) | 55.7ms | 45.7ms |
| ViT full CUDA Graph | 33.2ms (+40.3%) | 27.7ms (+39.5%) |
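
For reference, the percentages in these tables appear to be relative latency reductions versus the eager baseline (an assumption; the PR does not state the formula). Recomputing them from the rounded latencies shown gives values that differ slightly from the reported ones, presumably because the PR used unrounded measurements.

```python
def improvement_pct(baseline_ms, new_ms):
    """Relative latency reduction versus the baseline, in percent."""
    return (baseline_ms - new_ms) / baseline_ms * 100.0


if __name__ == "__main__":
    # Mean latencies from the table above (eager vs. ViT full CUDA graph).
    print(round(improvement_pct(70.5, 43.0), 1))  # FLASH_ATTN
    print(round(improvement_pct(57.5, 37.4), 1))  # FLASHINFER
```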


Changed files

tests/v1/cudagraph/test_encoder_cudagraph.py (modified, +166/-0)
vllm/config/compilation.py (modified, +8/-0)
vllm/v1/worker/encoder_cudagraph.py (modified, +52/-10)

PR #38116: Relocate Encoder CUDA graph manager

Repository: vllm-project/vllm
Author: WoosukKwon
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38116

Description (problem / solution / changelog)

@Isotr0py v1/worker/gpu/ is reserved for model runner v2, so the encoder cuda graph manager (used in v1) should not belong there. Sorry for the confusing name 😓

Changed files

tests/v1/cudagraph/test_encoder_cudagraph.py (modified, +2/-2)
vllm/v1/worker/encoder_cudagraph.py (renamed, +0/-0)
vllm/v1/worker/encoder_cudagraph_defs.py (renamed, +0/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +2/-4)

PR #37914: [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc

Repository: vllm-project/vllm
Author: b-mu
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/37914

Description (problem / solution / changelog)

Summary

Add a new "Encoder (ViT) CUDA Graphs" section to docs/design/cuda_graphs.md, documenting the encoder CUDA graph feature from #35963
Covers motivation, budget-based capture/replay design, greedy bin-packing algorithm, data-parallel support, SupportsEncoderCudaGraph protocol, configuration options, and usage examples (CLI + Python)
Add table of contents entry linking to the new section

cc @maxyanghu @wangshangsam @Isotr0py @ywang96

Test plan

Built docs locally with mkdocs serve and verified the new section renders correctly (headings, code blocks, admonitions, table of contents links)
No existing content modified other than adding the table of contents entry


Changed files

docs/design/cuda_graphs.md (modified, +1/-0)
docs/design/cuda_graphs_multimodal.md (added, +169/-0)


Motivation.

Currently, vLLM supports CUDA graph capture for the decoder (LLM) portion of the model, which has proven effective at reducing kernel launch costs and improving throughput. However, the ViT encoder is still executed eagerly, meaning every forward pass re-launches all kernels from scratch. Extending full CUDA graph support to the ViT encoder would allow the entire encoder forward pass to be captured and replayed as a single graph, eliminating per-kernel launch overhead and enabling more consistent, low-latency inference for multimodal models.

Proposed Change.

Implementation:

Bug Fixes / Improvements:

https://github.com/vllm-project/vllm/pull/38040

Documentation:

https://github.com/vllm-project/vllm/pull/37914

Feedback Period.

No response

CC List.

@ywang96 @Isotr0py @wangshangsam

Any Other Things.

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To extend full CUDA graph support to the ViT encoder, follow these steps:

Capture the ViT encoder forward pass into a CUDA graph
Replay the captured graph for each inference request
Update the model serving code to use the captured graph

Example code snippet:

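import torch
import torch.nn as nn

# Stand-in for the ViT encoder (assumption); requires a CUDA-capable GPU.
vit_encoder = nn.Sequential(nn.Linear(1024, 1024), nn.GELU()).cuda().eval()

# A CUDA graph replays fixed memory addresses, so inputs go through a static buffer.
static_input = torch.zeros(256, 1024, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        vit_encoder(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture the ViT encoder forward pass into a CUDA graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = vit_encoder(static_input)

# Replay the captured graph for each inference request
def inference(input_tensor):
    static_input.copy_(input_tensor)  # shape must match the captured input
    graph.replay()
    return static_output.clone()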

Verification

To verify that the fix worked, measure the inference latency and throughput before and after applying the fix. You can use tools like torch.profiler to profile the model and measure the performance improvements.
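
One simple way to compare runs is to collect per-request encoder latencies and summarize them with the same statistics the PRs report (mean and P99). The sketch below assumes the samples have already been collected; `summarize_latencies` is a hypothetical helper that uses a nearest-rank P99.

```python
import statistics


def summarize_latencies(samples_ms):
    """Return the mean and nearest-rank P99 of latency samples (in ms)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integers.
    rank = (99 * len(ordered) + 99) // 100
    return {"mean": statistics.fmean(ordered), "p99": ordered[rank - 1]}


if __name__ == "__main__":
    eager = [5.0, 5.2, 5.1, 9.0, 5.3]         # made-up sample data
    cuda_graph = [4.4, 4.5, 4.6, 6.2, 4.5]    # made-up sample data
    print("eager:     ", summarize_latencies(eager))
    print("CUDA graph:", summarize_latencies(cuda_graph))
```

Reporting P99 alongside the mean matters here because several of the PR benchmarks show much larger gains in tail latency than in the average.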

Extra Tips

Make sure to update the model serving code to use the captured graph for inference requests
Test the fix with different input sizes and batch sizes to ensure that it works correctly in all scenarios
Consider adding error handling and logging to ensure that any issues with the captured graph are properly handled and reported.
