vllm - ✅(Solved) Fix [RFC]: Support ViT Full CUDA Graph (Tracker) [5 pull requests, 1 participants]

vllm-project/vllm#38175 (fetched 2026-04-08 01:31:51)

Fix Action

Fix / Workaround

Multimodal large language models (e.g., Qwen3-VL, Qwen3.5, GLM-V, Kimi K2.5) rely on a Vision Transformer (ViT) encoder to process visual inputs before feeding them into the language model backbone. In production serving scenarios, the ViT forward pass involves launching a large number of small CUDA kernels — including patch embedding, layer normalization, multi-head self-attention, and MLP projections — each of which incurs non-trivial kernel launch overhead on the host side.

PR fix notes

PR #35963: [Feature] ViT Full CUDA Graph

Description (problem / solution / changelog)

Purpose

Add full CUDA graph for the ViT to reduce kernel launch overheads.

Features:

  • Budget-based graphs with a maximum batch size:
    • Capture CUDA graphs at configurable token budgets (e.g., [256, 512, 1024, 2048, 4096]).
    • Pad sequence metadata (e.g. cu_seqlens) so that the same budget-based graph can be replayed for varying numbers of images.
  • Greedy bin-packing:
    • Sort images in a batch in ascending order to reduce the number of graphs.
  • Data-parallel (DP) support:
    • When mm_encoder_tp_mode=data, each TP rank runs the ViT independently via data parallelism.
  • FlashInfer cuDNN attention support:
    • Override FlashInfer buckets in the CUDA graph path.
  • Model-agnostic protocol:
    • SupportsEncoderCudaGraph protocol in interfaces.py — models opt in by implementing 9 protocol methods for input handling, metadata computation, and forward dispatch.
    • EncoderCudaGraphManager is fully model-agnostic; all model-specific logic (grid config, dummy inputs, embedding computation) lives in the model class.
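
The budget selection and greedy packing described above can be sketched roughly as follows. This is a minimal illustration, not the EncoderCudaGraphManager implementation: the function names `smallest_budget` and `pack_images` are hypothetical, and the exact packing rule (fill a batch until the largest budget or the image cap is reached, then round up to the smallest captured budget) is an assumption based on the feature list.

```python
from bisect import bisect_left


def smallest_budget(budgets, tokens):
    """Return the smallest budget >= tokens, or None if no budget fits."""
    i = bisect_left(budgets, tokens)
    return budgets[i] if i < len(budgets) else None


def pack_images(token_counts, budgets, max_images_per_batch):
    """Greedily pack images, visited in ascending token order, into batches.

    Returns a list of (budget, image_indices) pairs. A budget of None marks
    an image larger than every budget, which would need an eager fallback.
    """
    budgets = sorted(budgets)
    max_budget = budgets[-1]
    order = sorted(range(len(token_counts)), key=lambda i: token_counts[i])

    batches, current, used = [], [], 0

    def flush():
        nonlocal current, used
        if current:
            batches.append((smallest_budget(budgets, used), current))
        current, used = [], 0

    for idx in order:
        n = token_counts[idx]
        if n > max_budget:
            # Assumption: oversized images bypass the graphs entirely.
            flush()
            batches.append((None, [idx]))
            continue
        if used + n > max_budget or len(current) == max_images_per_batch:
            flush()
        current.append(idx)
        used += n
    flush()
    return batches


print(pack_images([300, 100, 2000, 150, 5000], [256, 512, 1024, 2048, 4096], 2))
```

Sorting first means small images share the small budgets, so fewer distinct graphs need to be replayed per batch.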

New config flags (via --compilation-config):

  • cudagraph_mm_encoder: true — enable encoder CUDA graph
  • encoder_cudagraph_token_budgets: [...] — list of token budget sizes to capture
  • encoder_cudagraph_max_images_per_batch: N — max images per graph replay
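
As a small convenience sketch, the three flags listed above can be assembled in Python and serialized to the JSON string that `--compilation-config` expects. The budget values and the batch cap are example values only. Note that a later PR in this tracker (#38061) renames `encoder_cudagraph_max_images_per_batch` to `encoder_cudagraph_max_mm_items_per_batch`, so the key to use depends on the vLLM version.

```python
import json

# Flag names as introduced in PR #35963; the values are examples only.
compilation_config = {
    "cudagraph_mm_encoder": True,
    "encoder_cudagraph_token_budgets": [256, 512, 1024, 2048, 4096],
    "encoder_cudagraph_max_images_per_batch": 8,
}

# json.dumps emits lowercase true/false, which is what the CLI JSON needs.
cli_value = json.dumps(compilation_config)
print(f"--compilation-config '{cli_value}'")
```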

Files changed:

  • vllm/config/compilation.py — new config flags
  • vllm/model_executor/models/interfaces.py: SupportsEncoderCudaGraph protocol and supports_encoder_cudagraph() type guard
  • vllm/model_executor/models/qwen3_vl.py — implement SupportsEncoderCudaGraph on Qwen3VLForConditionalGeneration
  • vllm/v1/worker/gpu/mm/encoder_cudagraph_defs.py: EncoderCudaGraphConfig, EncoderCudaGraphCaptureInputs, EncoderCudaGraphReplayBuffers dataclasses
  • vllm/v1/worker/gpu/mm/encoder_cudagraph.py: EncoderCudaGraphManager (capture, replay, packing, DP)
  • vllm/v1/worker/gpu_model_runner.py — integration into V1 model runner
  • tests/v1/cudagraph/test_encoder_cudagraph.py — unit and GPU tests

cc @maxyanghu @wangshangsam @Anerudhan

Test Plan

Unit Tests:

pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v

End-to-End Tests:

  • Single GPU: Qwen3-VL-30B-A3B-Instruct, VisionArena-Chat dataset, 3000 prompts + 300 warmup
vllm bench mm-processor \
  --model Qwen/Qwen3-VL-30B-A3B-Instruct \
  --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 3000 --num-warmups 300 \
  --max-model-len 32768 --seed 42 \
  --mm-encoder-attn-backend FLASHINFER \
  --compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4864], "encoder_cudagraph_max_images_per_batch": 8}'
  • Multi GPU: Qwen3-VL-32B-Instruct, 4×GB200 TP=4 + ViT DP=4, random-mm dataset (20 imgs/req, 336×336), 1000 prompts + 200 warmup
vllm bench mm-processor \
  --model Qwen/Qwen3-VL-32B-Instruct \
  --dataset-name random-mm \
  --random-mm-base-items-per-request 20 \
  --random-mm-num-mm-items-range-ratio 0.0 \
  --random-mm-bucket-config '{"(336,336,1)": 1.0}' \
  --num-prompts 1000 --num-warmups 200 \
  --max-model-len 8192 --seed 42 \
  --mm-encoder-attn-backend FLASHINFER \
  --tensor-parallel-size 4 --mm-encoder-tp-mode data \
  --compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4864], "encoder_cudagraph_max_images_per_batch": 8}'

Test Result

Single GPU (Qwen3-VL-30B, 1×GB200, VisionArena-Chat, 3000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +11.8% (5.13→4.52ms) | +31.6% (9.16→6.26ms) |
| FLASH_ATTN | +19.6% (5.42→4.36ms) | +40.3% (10.87→6.49ms) |

Multi GPU (Qwen3-VL-32B, 4×GB200 TP=4 DP=4, random-mm 20img/req, 1000 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +18.4% (28.39→23.16ms) | +14.0% (238.78→205.28ms) |
| FLASHINFER | +44.4% (23.24→12.91ms) | +84.9% (172.41→26.05ms) |

Changed files

  • tests/v1/cudagraph/test_encoder_cudagraph.py (added, +451/-0)
  • vllm/config/compilation.py (modified, +32/-0)
  • vllm/model_executor/models/interfaces.py (modified, +141/-0)
  • vllm/model_executor/models/qwen3_vl.py (modified, +270/-30)
  • vllm/v1/worker/gpu/mm/encoder_cudagraph.py (added, +576/-0)
  • vllm/v1/worker/gpu/mm/encoder_cudagraph_defs.py (added, +66/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +48/-1)

PR #38061: [MM][Perf][CG] Support ViT full CUDA graph for Qwen3-VL video inference

Description (problem / solution / changelog)

Purpose

Following https://github.com/vllm-project/vllm/pull/35963, which only supports image inference, this PR adds video inference support for Qwen3-VL.

TODO:

  • Unit test.
  • E2E functional test.
  • Benchmark in some scenarios:
    • ViT without DP: eager vs. CUDA graph.
    • ViT with DP: eager vs. CUDA graph.
  • Update "Vision Encoder (ViT) CUDA Graphs" docs.

🤖 AI Summary

Following #35963 (ViT full CUDA graph support for image inference), this PR extends the encoder CUDA graph framework to support video inference for Qwen3-VL. Previously, the CUDA graph capture/replay path only handled image inputs (pixel_values + image_grid_thw). Video inputs use different keys (pixel_values_videos + video_grid_thw) and require larger cu_seqlens buffers because each video item contributes multiple frames (T attention sequences). This PR generalizes the protocol and manager to handle both modalities through a single shared graph manager.
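
To make the buffer-size point concrete, here is a minimal sketch of how a fixed-length cu_seqlens array could be built when every frame of every item is its own attention sequence. The helper name `build_padded_cu_seqlens` is hypothetical, and the padding scheme (repeating the last offset so that trailing sequences have length zero) is an assumption for illustration; the PR's actual buffer layout may differ.

```python
from itertools import accumulate


def build_padded_cu_seqlens(grid_thw, max_frames_per_batch):
    """Build cumulative sequence offsets padded to max_frames_per_batch + 1.

    grid_thw holds one (t, h, w) tuple per item; an image has t == 1 and a
    video has t > 1, contributing t sequences of h * w patches each.
    """
    seq_lens = []
    for t, h, w in grid_thw:
        seq_lens.extend([h * w] * t)
    if len(seq_lens) > max_frames_per_batch:
        raise ValueError("more frames than the captured graph can hold")

    cu_seqlens = [0] + list(accumulate(seq_lens))
    # Assumption: pad with the final offset, i.e. zero-length sequences.
    cu_seqlens += [cu_seqlens[-1]] * (max_frames_per_batch + 1 - len(cu_seqlens))
    return cu_seqlens


# One image (1 frame of 4x4 patches) and one video (3 frames of 2x2 patches).
print(build_padded_cu_seqlens([(1, 4, 4), (3, 2, 2)], max_frames_per_batch=6))
```

A batch of four images needs at most five offsets, while four 8-frame videos need 33, which is why the frame cap is tracked separately from the item cap.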

Note: Video CUDA graphs are automatically disabled when EVS (Efficient Video Sampling) pruning is enabled, since EVS makes the token count data-dependent and incompatible with CUDA graph capture.

Key Changes:

  • EncoderCudaGraphConfig (vllm/v1/worker/encoder_cudagraph_defs.py): Replace single input_key field with input_key_by_modality dict (e.g., {"image": "pixel_values", "video": "pixel_values_videos"}) to support per-modality input tensor routing.
  • SupportsEncoderCudaGraph protocol (vllm/model_executor/models/interfaces.py): Add get_input_modality(mm_kwargs) method to determine whether inputs are image or video. Add max_frames_per_batch parameter to prepare_encoder_cudagraph_capture_inputs() and prepare_encoder_cudagraph_replay_buffers().
  • Qwen3VLForConditionalGeneration (vllm/model_executor/models/qwen3_vl.py):
    • Implement get_input_modality() to route based on mm_kwargs keys.
    • Add _get_pixel_values_by_modality() and _get_grid_thw_by_modality() helpers to abstract modality-specific key access across all protocol methods.
    • Update prepare_encoder_cudagraph_capture_inputs() to build video-format grid configs (T>1 per item) when max_frames_per_batch exceeds max_batch_size, sizing cu_seqlens buffer for video replays.
    • Add replay buffer caching (_replay_buffer_cache) keyed by (modality, grid_thw) to avoid redundant CPU-side NumPy computation for repeated grid shapes.
    • Update prepare_encoder_metadata() to accept max_frames_per_batch for cu_seqlens padding, allowing video frames to exceed max_batch_size.
  • EncoderCudaGraphManager (vllm/v1/worker/encoder_cudagraph.py):
    • Add max_frames_per_batch field to BudgetGraphMetadata and manager initialization.
    • Rename encoder_cudagraph_max_images_per_batch to encoder_cudagraph_max_mm_items_per_batch for generality.
    • Route input_key lookup through get_input_modality() during replay instead of using a fixed key.
  • CompilationConfig (vllm/config/compilation.py): Add encoder_cudagraph_max_frames_per_batch config option (0 = auto-infer). Rename encoder_cudagraph_max_images_per_batch to encoder_cudagraph_max_mm_items_per_batch.
  • Tests (tests/v1/cudagraph/test_encoder_cudagraph.py): Add SimpleMockViTVideoModel with dual-modality support, TestGetInputModality (no GPU), and TestEncoderCudaGraphVideoReplay (GPU) covering video capture/replay, fallback, counters, chunking, and mixed image+video through a shared manager. (+316 lines)
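
The modality routing and replay-buffer caching described in the key changes can be sketched as below. This is a standalone illustration, not the Qwen3VLForConditionalGeneration code: the module-level dictionaries, the free function `get_input_modality`, and the `ReplayBufferCache` class are hypothetical stand-ins, and only the tensor key names and the (modality, grid_thw) cache key come from the PR description.

```python
INPUT_KEY_BY_MODALITY = {"image": "pixel_values", "video": "pixel_values_videos"}
GRID_KEY_BY_MODALITY = {"image": "image_grid_thw", "video": "video_grid_thw"}


def get_input_modality(mm_kwargs):
    """Decide whether a batch of multimodal kwargs holds image or video input."""
    for modality, key in INPUT_KEY_BY_MODALITY.items():
        if key in mm_kwargs:
            return modality
    raise ValueError("no image or video pixel values found")


class ReplayBufferCache:
    """Memoize CPU-side buffer computation per (modality, grid shape)."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cache = {}
        self.misses = 0

    def get(self, modality, grid_thw):
        # Lists are unhashable, so freeze the grid into nested tuples.
        key = (modality, tuple(tuple(g) for g in grid_thw))
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compute_fn(modality, grid_thw)
        return self._cache[key]


# Toy compute function: total number of patches across all items.
cache = ReplayBufferCache(lambda m, grid: sum(t * h * w for t, h, w in grid))
mm_kwargs = {"pixel_values_videos": object(), "video_grid_thw": [[8, 16, 16]]}
modality = get_input_modality(mm_kwargs)
grid = mm_kwargs[GRID_KEY_BY_MODALITY[modality]]
print(modality, cache.get(modality, grid), cache.get(modality, grid), cache.misses)
```

Repeated grid shapes are common in benchmark-style workloads with fixed resolutions, which is where such a cache would pay off.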

Test Plan

Unit test:

pytest tests/v1/cudagraph/test_encoder_cudagraph.py -v

Functional test:

# Pass compilation_config to EngineArgs in run_qwen3_vl()
# compilation_config={
#     "cudagraph_mm_encoder": True,
#     "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048],
#     "encoder_cudagraph_max_mm_items_per_batch": 4,
#     "encoder_cudagraph_max_frames_per_batch": 32,
# }
python examples/offline_inference/vision_language.py -m qwen3_vl --modality "video"

Benchmark:

# Single GPU:
vllm bench mm-processor \
--model /shared/models/modelscope/models/Qwen/Qwen3-VL-8B-Instruct \
--max-model-len 16384 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 1}' \
--num-prompts 100 \
--seed 42 \
--mm-encoder-attn-backend FLASH_ATTN \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096], "encoder_cudagraph_max_mm_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

# Multi GPU:
vllm bench mm-processor \
--model /shared/models/modelscope/models/Qwen/Qwen3-VL-32B-Instruct \
--max-model-len 8192 \
--dataset-name random-mm \
--random-mm-base-items-per-request 4 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(224, 224, 8): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 0, "video": 4}' \
--num-prompts 100 \
--seed 42 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--mm-encoder-attn-backend FLASHINFER \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096], "encoder_cudagraph_max_mm_items_per_batch": 4, "encoder_cudagraph_max_frames_per_batch": 32}'

Test Result

✅ Unit test:

tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_exact_powers_of_2 PASSED
tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_max_not_power_of_2 PASSED
tests/v1/cudagraph/test_encoder_cudagraph.py::TestGenerateBudgets::test_min_equals_max PASSED
...
36 passed, 3 warnings in 10.04s

✅ Functional test:

--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby’s serious expression and focused demeanor while pretending to read, combined with the fact that they are so young and unable to actually read, creates a humorous contrast. The baby’s movements
--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby's serious expression and focused posture, combined with the fact that they are clearly not reading in the traditional sense, create a humorous contrast. The baby's attempts to turn the pages
--------------------------------------------------
The video is funny because it captures a toddler wearing glasses and pretending to read a book, which is an adorable and endearing sight. The child's focused expression and the way they turn the pages with their hands, as if they are truly engrossed in the book, adds to the humor. The fact that the
--------------------------------------------------
The video is funny because it captures a baby wearing glasses and pretending to read a book, which is an adorable and endearing sight. The baby's serious demeanor and focused expression while holding the book add to the humor, as it creates a comical contrast between the baby's innocent actions and the adult-like behavior of reading
--------------------------------------------------

✅ Benchmark:

Single GPU (Qwen3-VL-8B-Instruct, 1xA100, random-mm, 100 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | -24.52% (3.67ms -> 4.57ms) | +61.66% (17.03ms -> 6.53ms) |
| FLASHINFER | +21.84% (8.38ms -> 6.55ms) | +87.60% (58.62ms -> 7.27ms) |

Multi GPU (Qwen3-VL-32B-Instruct, 4xA100, random-mm, 100 prompts):

| Backend | Mean | P99 |
|---|---|---|
| FLASH_ATTN | +13.44% (5.43ms -> 4.70ms) | +83.22% (51.25ms -> 8.60ms) |
| FLASHINFER | +21.37% (8.75ms -> 6.88ms) | +82.85% (77.77ms -> 13.34ms) |
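
The percentages in these benchmark results are consistent with a relative latency reduction, (before - after) / before, where a negative value means the CUDA graph path was slower. The helper below is a hypothetical sketch that recomputes two single-GPU entries under that assumed formula.

```python
def improvement_pct(before_ms, after_ms):
    """Relative latency reduction in percent; negative means a regression."""
    return (before_ms - after_ms) / before_ms * 100.0


# (label, eager latency, CUDA graph latency) from the single-GPU results.
rows = [
    ("FLASH_ATTN mean", 3.67, 4.57),
    ("FLASH_ATTN P99", 17.03, 6.53),
]
for label, before, after in rows:
    print(f"{label}: {improvement_pct(before, after):+.2f}%")
```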

Changed files

  • docs/design/cuda_graphs_multimodal.md (modified, +66/-7)
  • tests/v1/cudagraph/test_encoder_cudagraph.py (modified, +312/-1)
  • vllm/config/compilation.py (modified, +20/-4)
  • vllm/model_executor/models/interfaces.py (modified, +10/-1)
  • vllm/model_executor/models/qwen3_vl.py (modified, +138/-42)
  • vllm/v1/worker/encoder_cudagraph.py (modified, +33/-11)
  • vllm/v1/worker/encoder_cudagraph_defs.py (modified, +4/-2)

PR #38040: [Fix] Invariant Check for Auto-Inferred Budgets/Max Batch Size in ViT CUDA Graph Manager

Description (problem / solution / changelog)

Purpose

Previously, the auto-inferred max_batch_size = max_budget // min_budget could exceed min_budget. In that case prepare_encoder_cudagraph_capture_inputs computed per_image_output = token_budget // max_batch_size = 0 for small budgets, which led to a reshape crash on empty tensors in Qwen3_VisionPatchEmbed.forward. The fix caps the value to min(max_budget // min_budget, min_budget) when both the budgets and the max batch size are auto-inferred. When either the budgets or the max batch size is provided by the user, the other (auto-inferred) value is adjusted to satisfy the invariant max_batch_size <= min_budget.
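
A short worked example of the failure and the cap, using made-up budgets. The function name `infer_max_batch_size` is hypothetical; only the two formulas come from the description above.

```python
def infer_max_batch_size(budgets):
    """Auto-infer the max batch size, capped so it never exceeds min_budget."""
    min_budget, max_budget = min(budgets), max(budgets)
    return min(max_budget // min_budget, min_budget)


budgets = [4, 64]  # example values chosen to trigger the old bug
uncapped = max(budgets) // min(budgets)   # old behavior: 64 // 4 = 16
capped = infer_max_batch_size(budgets)    # new behavior: min(16, 4) = 4

# Per-image token share for the smallest budget before and after the fix.
print(min(budgets) // uncapped)  # 0 tokens per image: empty tensor at capture
print(min(budgets) // capped)    # 1 token per image: capture can proceed
```

With max_batch_size <= min_budget, token_budget // max_batch_size is at least 1 for every budget, so the dummy capture inputs are never empty.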

  • Dependency: merge after PR #37914

Test Plan

4×GB200 NVLink (TP=4, ViT DP=4), Qwen3-VL-32B-Instruct, random-mm dataset (synthetic)

  • Eager (baseline):
  vllm bench mm-processor \
    --model Qwen/Qwen3-VL-32B-Instruct \
    --dataset-name random-mm \
    --random-mm-base-items-per-request 20 \
    --random-mm-num-mm-items-range-ratio 0.5 \
    --random-mm-bucket-config '{"(224,224,1)": 0.2, "(336,336,1)": 0.3, "(448,448,1)": 0.2, "(672,672,1)": 0.2, "(1008,1008,1)": 0.1}' \
    --num-prompts 1000 --num-warmups 200 \
    --max-model-len 16384 --dtype bfloat16 --seed 42 \
    --mm-encoder-attn-backend FLASH_ATTN \
    --tensor-parallel-size 4 --mm-encoder-tp-mode data
  • ViT full CUDA Graph:
  vllm bench mm-processor \
    --model Qwen/Qwen3-VL-32B-Instruct \
    --dataset-name random-mm \
    --random-mm-base-items-per-request 20 \
    --random-mm-num-mm-items-range-ratio 0.5 \
    --random-mm-bucket-config '{"(224,224,1)": 0.2, "(336,336,1)": 0.3, "(448,448,1)": 0.2, "(672,672,1)": 0.2, "(1008,1008,1)": 0.1}' \
    --num-prompts 1000 --num-warmups 200 \
    --max-model-len 16384 --dtype bfloat16 --seed 42 \
    --mm-encoder-attn-backend FLASH_ATTN \
    --tensor-parallel-size 4 --mm-encoder-tp-mode data \
    --compilation-config '{"cudagraph_mm_encoder": true}'

Test Result

Encoder Forward Latency (mean):

| Config | FLASH_ATTN | FLASHINFER |
|---|---|---|
| Eager (baseline) | 70.5ms | 57.5ms |
| ViT full CUDA Graph | 43.0ms (+39.1%) | 37.4ms (+35.0%) |

Encoder Forward Latency (median):

| Config | FLASH_ATTN | FLASHINFER |
|---|---|---|
| Eager (baseline) | 55.7ms | 45.7ms |
| ViT full CUDA Graph | 33.2ms (+40.3%) | 27.7ms (+39.5%) |

Changed files

  • tests/v1/cudagraph/test_encoder_cudagraph.py (modified, +166/-0)
  • vllm/config/compilation.py (modified, +8/-0)
  • vllm/v1/worker/encoder_cudagraph.py (modified, +52/-10)

PR #38116: Relocate Encoder CUDA graph manager

Description (problem / solution / changelog)

@Isotr0py v1/worker/gpu/ is reserved for model runner v2, so the encoder cuda graph manager (used in v1) should not belong there. Sorry for the confusing name 😓

Changed files

  • tests/v1/cudagraph/test_encoder_cudagraph.py (modified, +2/-2)
  • vllm/v1/worker/encoder_cudagraph.py (renamed, +0/-0)
  • vllm/v1/worker/encoder_cudagraph_defs.py (renamed, +0/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +2/-4)

PR #37914: [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc

Description (problem / solution / changelog)

Summary

  • Add a new "Encoder (ViT) CUDA Graphs" section to docs/design/cuda_graphs.md, documenting the encoder CUDA graph feature from #35963
  • Covers motivation, budget-based capture/replay design, greedy bin-packing algorithm, data-parallel support, SupportsEncoderCudaGraph protocol, configuration options, and usage examples (CLI + Python)
  • Add table of contents entry linking to the new section

cc @maxyanghu @wangshangsam @Isotr0py @ywang96

Test plan

  • Built docs locally with mkdocs serve and verified the new section renders correctly (headings, code blocks, admonitions, table of contents links)
  • No existing content modified other than adding the table of contents entry

Changed files

  • docs/design/cuda_graphs.md (modified, +1/-0)
  • docs/design/cuda_graphs_multimodal.md (added, +169/-0)

Motivation.

Multimodal large language models (e.g., Qwen3-VL, Qwen3.5, GLM-V, Kimi K2.5) rely on a Vision Transformer (ViT) encoder to process visual inputs before feeding them into the language model backbone. In production serving scenarios, the ViT forward pass involves launching a large number of small CUDA kernels — including patch embedding, layer normalization, multi-head self-attention, and MLP projections — each of which incurs non-trivial kernel launch overhead on the host side.

Currently, vLLM supports CUDA graph capture for the decoder (LLM) portion of the model, which has proven effective at reducing kernel launch costs and improving throughput. However, the ViT encoder is still executed eagerly, meaning every forward pass re-launches all kernels from scratch. Extending full CUDA graph support to the ViT encoder would allow the entire encoder forward pass to be captured and replayed as a single graph, eliminating per-kernel launch overhead and enabling more consistent, low-latency inference for multimodal models.

Proposed Change.

Implementation:

Bug Fixes / Improvements:

Documentation:

Feedback Period.

No response

CC List.

@ywang96 @Isotr0py @wangshangsam

extent analysis

Fix Plan

To extend full CUDA graph support to the ViT encoder, follow these steps:

  • Capture the ViT encoder forward pass into a CUDA graph
  • Replay the captured graph for each inference request
  • Update the model serving code to use the captured graph

Example code snippet:

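import torch

# Placeholder encoder; substitute the real ViT module (a torch.nn.Module on GPU).
vit_encoder = ViTEncoder().cuda().eval()

# A CUDA graph replays fixed memory addresses, so inputs must live in a static buffer.
# input_tensor is an example input with the shape that will be served.
static_input = torch.zeros_like(input_tensor, device="cuda")

# Warm up on a side stream before capture, as the PyTorch CUDA graphs notes recommend.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        vit_encoder(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture the ViT encoder forward pass into a CUDA graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = vit_encoder(static_input)

# Replay the captured graph for each inference request
def inference(input_tensor):
    static_input.copy_(input_tensor)  # shape must match the captured shape
    graph.replay()
    return static_output.clone()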

Verification

To verify that the fix worked, measure the inference latency and throughput before and after applying the fix. You can use tools like torch.profiler to profile the model and measure the performance improvements.

Extra Tips

  • Make sure to update the model serving code to use the captured graph for inference requests
  • Test the fix with different input sizes and batch sizes to ensure that it works correctly in all scenarios
  • Consider adding error handling and logging to ensure that any issues with the captured graph are properly handled and reported.
