PR fix notes

PR #37447: [CI/Build] enable Intel XPU test flow with prebuilt image

wendyliu235 · 2026-03-17T14:49:04Z


Repository: vllm-project/vllm
Author: wendyliu235
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37447

Description (problem / solution / changelog)

This PR enables a standalone Intel CI pipeline.

Purpose

Add an XPU image build and CI pipeline.

Design

[Image: design diagram, https://github.com/user-attachments/assets/1fae149e-5280-45c0-bbf6-fbef769c8570]

Test Plan

Run the CI flow 5 times to confirm it is stable.

Test Result

Depends on

ci-infra PR: https://github.com/vllm-project/ci-infra/pull/306/

Changed files

.buildkite/ci_config_intel.yaml (added, +23/-0)
.buildkite/image_build/image_build_xpu.sh (added, +34/-0)
.buildkite/intel_jobs/test-intel.yaml (added, +63/-0)
.buildkite/scripts/hardware_ci/run-intel-test.sh (added, +276/-0)

PR #306: Add intel ci in case generator

Repository: vllm-project/ci-infra
Author: wendyliu235
State: open | merged: False
Link: https://github.com/vllm-project/ci-infra/pull/306

Description (problem / solution / changelog)

(No description)

Changed files

buildkite/bootstrap-intel.sh (added, +310/-0)
buildkite/pipeline_generator/buildkite_step.py (modified, +13/-0)
buildkite/pipeline_generator/step.py (modified, +2/-0)

Motivation.

1. Summary

This RFC proposes enabling a dedicated Intel XPU CI pipeline for vLLM.
The goal is to ensure that updates to vLLM maintain correctness and performance on Intel XPU devices, while improving test efficiency, parallelism, and scalability of CI.

2. Motivation / Background

Currently, the vLLM CI on Intel XPU is limited:

A single simple script triggers both build and sanity tests.
Build and tests execute on the same machine, leading to low device utilization.
Tests are not executed in parallel, reducing efficiency.
Test case management and expansion are inefficient.
The current workflow does not follow ci-infra’s design standards.

With increasing contributions targeting Intel XPU in vLLM, a dedicated Intel CI pipeline is necessary to:

Guarantee correctness and performance on Intel XPU.
Improve test parallelism and device utilization.
Enable scalable, maintainable test case management.

3. Problem Statement

Inefficient device usage: build and tests share the same Intel XPU machine sequentially.
Non-parallel test execution: limits throughput and increases CI runtime.
Limited test case management: adding or enabling new cases is not efficient.
Non-standard CI workflow: current CI does not follow the ci-infra design pattern.

Proposed Change.

4. Proposal / Design

We propose a staged approach to enable Intel XPU CI:

Stage 1: Stable Intel CI Implementation (~40% UT enable)

Follow ci-infra pipeline design with build/test separation.
Build and tests run on separate machines.
Parallel test execution across multiple XPU devices to improve utilization.
Adjust the number of machines based on observed test runtime.
Goal: all enabled tests run stably on Intel XPU, with ~40% of test cases enabled.
Status: Intel CI pipeline and script are ready.
Trigger: adding the intel-gpu label to a PR triggers the pipeline, and each subsequent commit triggers it again.
vLLM PR merged: https://github.com/vllm-project/vllm/pull/37447
ci-infra PR merged: https://github.com/vllm-project/ci-infra/pull/306
Add autolabel intel-gpu PR: https://github.com/vllm-project/vllm/pull/38320
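
The trigger rule in the Stage 1 status note can be read as a small predicate. The following is a minimal sketch, not the logic from the merged PRs: the event names `labeled` and `synchronize` are borrowed from GitHub's pull_request webhook actions, and the function name and the choice to ignore all other events are assumptions made for illustration.

```python
from typing import Optional, Set

INTEL_LABEL = "intel-gpu"  # label name taken from the status note


def should_trigger_intel_ci(
    action: str, labels: Set[str], added_label: Optional[str] = None
) -> bool:
    """Return True when a PR event should start the Intel XPU pipeline."""
    if action == "labeled":
        # Run once when the intel-gpu label itself is added.
        return added_label == INTEL_LABEL
    if action == "synchronize":
        # New commits re-run the pipeline only while the label is present.
        return INTEL_LABEL in labels
    # Assumption: every other event type is ignored.
    return False
```

For example, `should_trigger_intel_ci("synchronize", {"intel-gpu"})` is true, while adding an unrelated label to the same PR does not start a run.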

Stage 2: Gradual Test Case Expansion (~60% UT enable)

Incrementally enable additional unit tests (UT) for Intel XPU.
Adjust machine allocation to ensure that a full CI run completes within ~1 hour.
Monitor stability and runtime metrics to balance load.
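
The ~1 hour target implies a lower bound on the number of machines. A rough sketch of that arithmetic, assuming per-job runtimes in minutes are already known; the `overhead_minutes` allowance for image pull and setup and the sample numbers are hypothetical, not measurements from the RFC.

```python
import math
from typing import Sequence


def machines_needed(
    job_minutes: Sequence[float],
    target_minutes: float = 60,
    overhead_minutes: float = 10,
) -> int:
    """Lower bound on machines so the total work fits the time budget."""
    budget = target_minutes - overhead_minutes
    if budget <= 0:
        raise ValueError("overhead leaves no time for tests")
    if not job_minutes:
        return 0
    if max(job_minutes) > budget:
        # A single job longer than the budget cannot be fixed by adding machines.
        raise ValueError("longest job exceeds the per-machine budget")
    return math.ceil(sum(job_minutes) / budget)


# Hypothetical runtimes: 230 minutes of work, 50 usable minutes per machine.
print(machines_needed([30, 45, 50, 20, 40, 45]))
```

This is only a lower bound: an actual schedule may need an extra machine when jobs do not pack evenly into the budget.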

Stage 3: Expanded Test Coverage (~85% UT enable)

Continue enabling more test cases, focusing on high-priority or high-risk features.
Maintain test parallelism and optimize machine allocation.
Goal: 85% of test cases enabled on Intel XPU.
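
One common way to keep parallel shards even as more tests are enabled is a greedy longest-job-first assignment. The RFC does not say how tests are distributed across devices, so the sketch below is an illustration of that general technique under assumed inputs; the test names and durations are made up.

```python
import heapq
from typing import Dict, List


def shard_tests(durations: Dict[str, float], num_devices: int) -> List[List[str]]:
    """Assign each test to the currently least-loaded device, longest first."""
    if num_devices < 1:
        raise ValueError("need at least one device")
    # Heap entries are (accumulated_minutes, device_index).
    heap = [(0.0, i) for i in range(num_devices)]
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(num_devices)]
    ordered = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, minutes in ordered:
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + minutes, idx))
    return shards


# Hypothetical durations in minutes.
print(shard_tests({"a": 30, "b": 20, "c": 20, "d": 10}, 2))
```

With these sample durations both devices end up with 40 minutes of work.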

Stage 4: Full Test Coverage (~95% UT enable) & Mirror GPU CI

Enable ~95% of total test cases on Intel XPU.
Begin integrating mirror GPU tests to support gating CI workflows.
Achieve a fully functional, maintainable, and scalable Intel XPU CI pipeline.

5. Detailed Design Considerations

CI Infrastructure
- Use Buildkite agents or GitHub Actions runners with Intel XPU support.
- Separate build and test stages with dedicated agents.
- Ensure proper device isolation and environment setup (Docker / oneAPI / IPEX).
Test Execution
- Use existing pytest framework with proper -k filtering for XPU-relevant tests.
- Enable parallel execution where possible (pytest-xdist or similar).
- Track and log XPU-specific failures separately for easier triage.
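
The test execution points can be combined into a single pytest invocation. A minimal sketch, assuming pytest-xdist is installed for `-n`; the helper name, the `xpu` keyword expression, and the report path are hypothetical and do not come from `run-intel-test.sh`.

```python
from typing import List, Optional


def build_pytest_command(
    test_path: str,
    keyword_expr: Optional[str] = None,
    workers: Optional[int] = None,
    junit_xml: Optional[str] = None,
) -> List[str]:
    """Build an argument list for pytest with optional filtering and parallelism."""
    cmd = ["pytest", test_path]
    if keyword_expr:
        cmd += ["-k", keyword_expr]  # select tests by keyword expression
    if workers:
        cmd += ["-n", str(workers)]  # pytest-xdist worker count
    if junit_xml:
        # A separate report file makes XPU failures easier to triage.
        cmd.append(f"--junitxml={junit_xml}")
    return cmd


print(build_pytest_command("tests/", "xpu", 4, "xpu-results.xml"))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems in the keyword expression.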
Test Case Management
- Maintain a case enablement matrix to track which tests are active on XPU.
- Stage-wise increase in enabled tests to manage CI runtime.
Metrics & Monitoring
- Track CI runtime, machine utilization, and device load.
- Adjust machine allocation dynamically based on runtime statistics.
Integration with ci-infra
- Follow existing ci-infra patterns for pipeline structure, logging, and artifact management.
- Ensure the new Intel XPU CI can be maintained alongside existing pipelines (CPU/NVIDIA).

6. Impact

Improved device utilization: separate build/test stages and parallel execution.
Scalable test case management: incremental enabling of UTs.
Faster feedback for Intel XPU contributors: reduced CI runtime and higher reliability.
CI maintenance cost: additional runners and monitoring required, but staged approach mitigates risk.

7. Open Questions / Discussion Points

Should Stage 2 initially enable only critical tests or a broader selection?
How many machines are optimal for Stage 1 and Stage 2 to ensure <1 hour CI runtime?
Should mirror GPU CI be run in parallel with Intel XPU CI or sequentially?
Any additional metrics or monitoring requirements for XPU-specific tests?

extent analysis

Fix Plan

To address the issues with the current Intel XPU CI pipeline, we will implement a staged approach:

Stage 1: Stable Intel CI Implementation
- Separate build and test stages using Buildkite agents or GitHub Actions runners.
- Implement parallel test execution using pytest-xdist.
- Create a case enablement matrix to track enabled tests.
Stage 2: Gradual Test Case Expansion
- Incrementally enable additional unit tests for Intel XPU.
- Adjust machine allocation to ensure CI runtime is within 1 hour.
- Monitor stability and runtime metrics.
Stage 3: Expanded Test Coverage
- Continue enabling test cases, focusing on high-priority features.
- Maintain test parallelism and optimize machine allocation.
Stage 4: Full Test Coverage & Mirror GPU CI
- Enable ~95% of total test cases on Intel XPU.
- Integrate mirror GPU tests to support gating CI workflows.

Example Code

To implement parallel test execution using pytest-xdist:

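# pytest.ini
[pytest]
# Run tests across 4 pytest-xdist workers.
# Keep this comment on its own line: text after the value on the addopts line may be passed to pytest as extra arguments.
addopts = -n 4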

To create a case enablement matrix:

# test_enablement_matrix.py
import pandas as pd

# Define test cases and their status
test_cases = [
    {"test_name": "test1", "enabled": True},
    {"test_name": "test2", "enabled": False},
]

# Create a DataFrame
df = pd.DataFrame(test_cases)

# Save to CSV
df.to_csv("test_enablement_matrix.csv", index=False)
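
A matrix like this can then be compared with the per-stage enablement targets. The sketch below assumes the CSV layout written above (`test_name`, `enabled` columns with `True`/`False` values) and treats the approximate stage percentages from the proposal as hard thresholds, which is a simplification.

```python
import csv
import io
from typing import List, TextIO

# Approximate targets from the staged proposal, in percent.
STAGE_TARGETS = {"Stage 1": 40, "Stage 2": 60, "Stage 3": 85, "Stage 4": 95}


def enabled_percent(csv_file: TextIO) -> float:
    """Percentage of rows whose 'enabled' column is true."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return 0.0
    enabled = sum(1 for r in rows if r["enabled"].strip().lower() == "true")
    return 100.0 * enabled / len(rows)


def stages_met(percent: float) -> List[str]:
    """Names of the stages whose target the given percentage reaches."""
    return [name for name, target in STAGE_TARGETS.items() if percent >= target]


sample = io.StringIO("test_name,enabled\ntest1,True\ntest2,False\n")
pct = enabled_percent(sample)
print(pct, stages_met(pct))
```

In practice the file handle would come from `open("test_enablement_matrix.csv", newline="")` rather than an in-memory string.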

To adjust machine allocation based on CI runtime:

# ci_runtime_monitor.py
import time

# CI runtime threshold in seconds (1 hour)
threshold = 3600

# Time the CI pipeline run
start_time = time.monotonic()
# Run CI pipeline here (placeholder)
end_time = time.monotonic()

# Calculate CI runtime
ci_runtime = end_time - start_time

# A run that exceeds the threshold needs more machines, not fewer
if ci_runtime > threshold:
    print("Increasing machine allocation")

Verification

To verify the fix, monitor CI runtime, machine utilization, and device load. Check that:

CI runtime is within 1 hour
Machine utilization is optimized
Device load is balanced
Test cases are enabled and running in parallel
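
The runtime and load-balance checks in this list could be automated along the following lines. This is a sketch under stated assumptions: the inputs (total runtime and per-device busy time in seconds) are presumed to be collected elsewhere, and the `max_imbalance` ratio of 1.25 is an arbitrary illustrative threshold, not a value from the RFC.

```python
from typing import List, Sequence


def verify_ci_run(
    runtime_seconds: float,
    device_busy_seconds: Sequence[float],
    max_runtime_seconds: float = 3600,
    max_imbalance: float = 1.25,
) -> List[str]:
    """Return a list of problems; an empty list means the run passed."""
    problems = []
    if runtime_seconds > max_runtime_seconds:
        problems.append("CI runtime exceeds the 1 hour target")
    if device_busy_seconds:
        mean = sum(device_busy_seconds) / len(device_busy_seconds)
        # Compare the busiest device with the average load.
        if mean > 0 and max(device_busy_seconds) / mean > max_imbalance:
            problems.append("device load is unbalanced")
    return problems


print(verify_ci_run(3000, [900, 1000, 950]))
```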

Extra Tips

Use existing ci-infra patterns for pipeline structure, logging, and artifact management.
Ensure the new Intel XPU CI can be maintained alongside existing pipelines (CPU/NVIDIA).
Continuously monitor and adjust machine allocation based on runtime statistics.

vllm - ✅(Solved) Fix [RFC][XPU]: Enable Intel XPU CI for vLLM [2 pull requests, 1 participants]
