[Tracking][NUMA] Replace hard-coded Granite Rapids PCT detection with a generic, root-free path

PR #43270 adds opt-in NUMA pinning to Intel Priority Core Turbo (PCT) "priority" cores on Xeon 6776P / 6774P / 6962P, gated by a hard-coded SKU table in vllm/utils/numa_utils.py::_PCT_CAPABLE_SKUS. The workaround delivered a measurable end-to-end win on DGX B300 (Qwen3.5-397B-A17B-NVFP4, TP=8, 32K/2 prompts, MC=128: total token throughput 46.1k -> 75.8k tok/s, +64.4 %), but it is intentionally narrow:

New PCT-capable SKUs require a vLLM patch that adds them to the table.
The cpu_id % stride in (0, 1) priority-core filter is empirical, not derived from the kernel.
The CPPC highest_perf value is hard-coded per SKU (4.6 GHz -> 46, 4.4 GHz -> 44).

This issue tracks removing that workaround once we have a way to discover PCT priority cores without root and without per-SKU enumeration.
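To make the second limitation concrete, the empirical filter from the list above can be sketched as below. This is an illustration only: the function name `priority_cpus` is hypothetical, and `stride` is left as a free parameter because the per-SKU values used in `numa_utils.py` are not reproduced here.

```python
from typing import Iterable


def priority_cpus(cpu_ids: Iterable[int], stride: int) -> list[int]:
    """Keep CPUs whose ID modulo `stride` is 0 or 1.

    Mirrors the empirical `cpu_id % stride in (0, 1)` rule. Nothing here
    is read from the kernel, which is exactly the limitation noted above.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    return [cpu for cpu in cpu_ids if cpu % stride in (0, 1)]
```

With an assumed stride of 4, `priority_cpus(range(8), stride=4)` selects CPUs 0, 1, 4 and 5. A different core layout on a future SKU would silently select the wrong CPUs, which is why a kernel-derived source is preferable.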

Why the workaround exists today

Quoting #43270 (comment by @vadiklyutiy):

1. Kernel doesn't help

In a perfect world the Linux scheduler would prefer PCT priority cores when there are fewer hot threads than priority cores, and vLLM wouldn't need to care. Two recent kernels were tested:

6.8.0-90-generic (Ubuntu, kernel from Mar 2024)
6.14.0-37-generic (Ubuntu, kernel from Mar 2025)

Neither preferentially schedules work on PCT priority cores out of the box.

2. PCT discovery is root-only

The only kernel interface that reports PCT / CLOS membership is /dev/isst_interface, used by intel-speed-select:

$ ls -l /dev/isst_interface
crw------- 1 root root 10, 118 May 22 11:00 /dev/isst_interface

There is no unprivileged sysfs path that exposes per-CPU PCT membership. Intel's own guidance is to use intel-speed-select, which has the same root requirement. Production environments (shared clusters, managed cloud, prebuilt containers) typically can't grant root or rely on intel-speed-select being installed.
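A process can check up front whether the device node is usable before attempting any PCT discovery through it. The sketch below assumes that a caller would need both read and write access to the node; the helper name `isst_interface_usable` is hypothetical and is not part of vLLM.

```python
import os


def isst_interface_usable(path: str = "/dev/isst_interface") -> bool:
    """Return True only if the current process can read and write `path`.

    With the `crw------- root root` permissions shown above, this is
    False for any non-root user, and also False when the node is absent.
    """
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)
```

On the hosts described here, an unprivileged vLLM process would get `False` and would have to fall back to another detection path.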

3. The two stop-gap alternatives are insufficient

"Document the manual --numa-bind-cpus recipe" — the users who need PCT pinning the most are also the ones least likely to read NUMA/PCT docs and act on them.
"Add a --numa-bind=pct flag that shells out to intel-speed-select" — still root-only, still requires the tool in the image, and leaves the default path slow.

See PR #33222 (intel-ai-tce) for the second variant explored as standalone scripts plus docker-compose. It is complementary to #43270 (manual workflow vs. zero-config auto-detection) and has the same root-only limitation.

Definition of done

The workaround can be removed when PCT priority cores can be discovered:

without root, and
without enumerating individual SKUs in vLLM,

so that _PCT_CAPABLE_SKUS and _pct_sku_config() can be deleted and replaced by a generic detector. Ideally the detector also supports hosts where PCT is reconfigured dynamically, since the priority-core set can be changed in the BIOS or at runtime through Intel Speed Select Technology.

Candidate paths forward

In rough order of preference (each independent):

A. Kernel-side: PCT-aware scheduler

If the upstream Linux scheduler learns to bias work toward PCT priority cores when CPU contention is low, the entire vLLM-side mechanism becomes unnecessary. Track Intel and kernel mailing-list patches; once a stable kernel ships PCT-aware scheduling, drop the PCT logic behind --numa-bind on those kernels.

B. Kernel-side: unprivileged sysfs for PCT membership

A simpler change than scheduler patches: expose per-CPU PCT membership via /sys/devices/system/cpu/cpuN/... (similar to cpufreq/scaling_* or topology/core_id). vLLM could then read it from user space, with no root access and no dependency on intel-speed-select. This is worth raising upstream as a feature request even if PCT-aware scheduling is far off.
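If such an attribute existed, the vLLM side could be as small as the sketch below. The attribute name `pct_priority` and its `0`/`1` encoding are purely hypothetical: no current kernel exposes this file, and the real name and format would be decided upstream. The function `read_pct_membership` is likewise illustrative.

```python
import re
from pathlib import Path
from typing import Optional

# HYPOTHETICAL attribute name; no such sysfs file exists today.
HYPOTHETICAL_ATTR = "pct_priority"
_CPU_DIR = re.compile(r"cpu(\d+)")


def read_pct_membership(cpu_root: Path) -> Optional[dict[int, bool]]:
    """Map CPU id -> priority flag, or None if no CPU has the attribute."""
    membership: dict[int, bool] = {}
    for cpu_dir in cpu_root.iterdir():
        match = _CPU_DIR.fullmatch(cpu_dir.name)
        if match is None:
            continue  # skip non-CPU entries such as "cpufreq" or "cpuidle"
        attr = cpu_dir / HYPOTHETICAL_ATTR
        if attr.is_file():
            membership[int(match.group(1))] = attr.read_text().strip() == "1"
    return membership or None
```

Returning `None` when the attribute is missing lets a caller distinguish "kernel does not support this" from "no priority cores", so it can fall through to another detection path.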

C. Runtime probe via `cpufreq`

When PCT is active, priority cores reach a noticeably higher scaling_max_freq and a higher observed turbo frequency than the remaining cores. A short calibration loop at startup could classify cores on that basis. The approach is fragile, because observed frequency depends on the workload, but it is root-free and might be useful as a fallback when other paths are unavailable.
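The static half of this probe can be sketched as follows, assuming the standard `cpufreq/scaling_max_freq` sysfs attribute (values in kHz) really does differ between priority and non-priority cores when PCT is active, as described above. The function name `cores_at_top_max_freq` is hypothetical, and the sketch does not implement the calibration loop for observed turbo frequency.

```python
import re
from pathlib import Path

_CPU_DIR = re.compile(r"cpu(\d+)")


def cores_at_top_max_freq(
    cpu_root: Path, attr: str = "cpufreq/scaling_max_freq"
) -> list[int]:
    """Return CPUs whose `attr` value equals the highest value seen.

    Returns an empty list when all readable CPUs report the same value,
    since no core then stands out as a candidate priority core.
    """
    freqs: dict[int, int] = {}
    for cpu_dir in cpu_root.iterdir():
        match = _CPU_DIR.fullmatch(cpu_dir.name)
        if match is None:
            continue  # skip non-CPU entries such as "cpufreq"
        try:
            freqs[int(match.group(1))] = int((cpu_dir / attr).read_text())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable for this CPU
    if not freqs or len(set(freqs.values())) == 1:
        return []
    top = max(freqs.values())
    return sorted(cpu for cpu, khz in freqs.items() if khz == top)
```

On a real host this would be called with `Path("/sys/devices/system/cpu")`. Taking the root directory as a parameter keeps the function testable against a fake tree, and treating a uniform result as "nothing detected" avoids pinning to every core on hosts without PCT.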

Acceptance criteria

PCT priority cores can be detected on a fresh Granite Rapids host without root.
No hard-coded SKU list is needed (or, if a list is unavoidable, it lives in a kernel/OS-provided file rather than vLLM source).
vllm/utils/numa_utils.py::_PCT_CAPABLE_SKUS, _PctSku, and _pct_sku_config can be deleted, and the existing tests in tests/utils_/test_numa_utils.py::test_pct_binding_* can be replaced with the new detector's tests.
DGX B300 + Xeon 6776P E2E throughput remains within noise of the +64.4 % observed in PR #43270.

References

PR #43270 — the workaround being tracked here.
- #43270 issuecomment-4523383323 — kernel / root / alternatives analysis (quoted above).
- #43270 issuecomment-4542497871 — the request to file this tracking issue.
PR #33222 — alternative approach via intel-speed-select and standalone scripts.
PR #38635 — the original --numa-bind plumbing the workaround sits on top of.
Intel ARK SKU pages (used to derive the table values in PR #43270):
intel-speed-select(8) — the root-only canonical interface today.
