vllm: [RFC]: Unified Device Capability Abstraction for Cross-Platform Feature Detection [1 comment, 2 participants]

jikunshang · 2026-04-22T13:33:57Z



vllm-project/vllm#40620•Fetched 2026-04-23 07:23:52




Motivation.

This is an AI-generated proposal and may contain errors. I would appreciate it if you could point them out.

1. Problem Summary

As raised by @tjtanaa in #39158, the current has_device_capability(int) / is_device_capability_family(int) API is fundamentally CUDA-centric and does not translate correctly to ROCm or XPU.

1.1 Problem A: `device_capability` is inherently a CUDA concept

torch.cuda.get_device_capability() returns (major, minor) tied to NVIDIA's SM (Streaming Multiprocessor) versioning. ROCm and XPU have completely different hardware models — any mapping to CUDA-style numbers is artificial and lossy:

Platform	Capability Model	How it works
CUDA	SM version `(major, minor)` e.g. `(8,9)`, `(9,0)`, `(10,0)`	Native `torch.cuda.get_device_capability()`
ROCm	GCN arch string (e.g. `gfx942`) → artificially mapped to `(major, minor)`	Semantic mismatch: `gfx90a` maps to `(9,0)` but has NO FP8, while CUDA's `(9,0)` = Hopper = has FP8
XPU	No capability model → always returns `None` → all checks = `False`	All feature gates broken
CPU/TPU	No capability model → returns `None`	N/A
OOT	No capability model → returns `None`	N/A

1.2 Problem B: Same CUDA generation, different capability numbers across SKU tiers

Even within NVIDIA's own ecosystem, the same architecture generation has different compute capability numbers depending on the product tier (server / workstation / client):

Platform	Device	Compute Capability	Data Types Supported
Server	B200 / B300	`(10, 0)` / `(10, 3)`	bf16 / fp8 / fp4
Workstation	RTX PRO 6000 Blackwell	`(12, 0)`	bf16 / fp8 / fp4
Client	RTX 5090	`(12, 0)`	bf16 only

This is particularly problematic because:

1. Same generation, different numbers: B200 (10,0) and RTX PRO 6000 (12,0) are both Blackwell, both support FP8/FP4, but have completely different capability numbers. Code using is_device_capability_family(100) to gate Blackwell features will miss RTX PRO 6000.
2. Same number, different features: RTX PRO 6000 and RTX 5090 both report (12,0), but RTX PRO supports FP8/FP4 while consumer RTX 5090 does not. Code using has_device_capability(120) to gate FP8 would be wrong on RTX 5090.
3. Maintenance burden: Every new GPU SKU tier requires auditing all numeric capability checks. The current codebase already handles is_device_capability_family(100) and is_device_capability_family(120) separately (e.g., in cutlass_moe.py), and this will only get worse.
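
The failure modes listed above can be reproduced with a small self-contained sketch. The helper functions here are simplified stand-ins written for illustration (they are not vLLM's actual implementation), and the device rows simply encode the SKU table from this section, including its FP8 claims.

```python
from typing import NamedTuple


class Device(NamedTuple):
    name: str
    capability: tuple[int, int]  # (major, minor), as listed in the table above
    fp8: bool                    # FP8 support as claimed in the table above


# Rows copied from the SKU table in this section.
DEVICES = [
    Device("B200", (10, 0), True),
    Device("RTX PRO 6000 Blackwell", (12, 0), True),
    Device("RTX 5090", (12, 0), False),
]


def to_int(capability: tuple[int, int]) -> int:
    """Encode (major, minor) as major * 10 + minor, e.g. (8, 9) -> 89."""
    major, minor = capability
    return major * 10 + minor


def has_capability(dev: Device, minimum: int) -> bool:
    """Illustrative stand-in for has_device_capability(int): a >= check."""
    return to_int(dev.capability) >= minimum


def is_family(dev: Device, family: int) -> bool:
    """Illustrative stand-in for is_device_capability_family(int): same major."""
    return to_int(dev.capability) // 10 == family // 10


for dev in DEVICES:
    print(f"{dev.name:24s} family(100)={is_family(dev, 100)!s:5s} "
          f">=89={has_capability(dev, 89)!s:5s} table_fp8={dev.fp8}")
```

Under these assumptions, the family check matches only B200, while the `>= 89` check passes for all three devices, including the one the table lists without FP8.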

1.3 Problem C: Cross-platform semantic mismatch

The same numeric value means completely different things on different platforms:

Capability Value	CUDA Meaning	ROCm Meaning
`(9, 0)`	Hopper (H100/H200) — FP8 ✅, TMA ✅	MI200 (gfx90a) — FP8 ❌, BF16 only
`(9, 4)`	N/A	MI300 (gfx942) — FP8 ✅ (FNUZ)
`(9, 5)`	N/A	MI355 (gfx950) — FP8 ✅, FP4 ✅
`(12, 0)`	Blackwell consumer (RTX 5090) — BF16 only	RDNA4 (gfx1201) — FP8 ✅ (per Section 6.2)

1.4 Scope observation: where capability checks are used

Most usage of has_device_capability / get_device_capability falls into:

Location	Usage Pattern	Notes
tests/ and benchmarks/	Test skip / gating	~60% of call sites. vLLM CI always uses NVIDIA server GPUs, so numeric checks "happen to work" but are not correct for community contributors on consumer/workstation GPUs
Kernel selection code (`model_executor/layers/`)	Dispatch to optimal kernel implementation	~30%. Inherently platform-specific (CUTLASS, Marlin, etc.)
Config / runtime (`vllm/config/`, `vllm/v1/`)	Feature availability checks	~10%. Should be cross-platform

This distribution suggests the migration can be prioritized: fix tests/config first (highest cross-platform impact), then gradually migrate kernel dispatch code.

2. Analysis of Current Codebase

2.1 How capability is actually used (282 call sites)

Analyzing all has_device_capability / is_device_capability* call sites across the codebase, they fall into three distinct categories:

Category A: Testing for a hardware feature (~60%)

# "Does this device support FP8?"
current_platform.has_device_capability(89)   # Actually means: supports_fp8()

# "Does this device support BF16?"
current_platform.has_device_capability(80)   # Actually means: supports_bf16()

# "Does this device support TMA / warp-group MMA?" 
current_platform.has_device_capability(90)   # Actually means: supports_hopper_features()

# "Does this device support Blackwell features (FP4, TMA v2)?"
current_platform.is_device_capability_family(100)  # Actually means: supports_blackwell_features()

These should be feature queries, not numeric comparisons.

Category B: Selecting a kernel implementation (~30%)

# "Use CUTLASS 3.x path for SM90+"
if current_platform.has_device_capability(90): use_cutlass3()

# "Use Blackwell-specific GEMM kernel"
if current_platform.is_device_capability_family(100): use_blackwell_gemm()

# "Use CUTLASS MoE for Blackwell server OR workstation"
if p.is_device_capability_family(100) or p.is_device_capability_family(120): ...

Category C: FP8 weight-only quantization fallback (~10%)

# fbgemm_fp8.py — Marlin WOQ fallback for non-FP8 hardware
self.use_marlin = not current_platform.has_device_capability(89)

# marlin.py — FP8 Marlin works on SM ≥ 7.5 (Turing+)
def is_fp8_marlin_supported():
    return current_platform.has_device_capability(75)

This pattern reveals that FP8 "support" is not binary — there are two distinct levels:

Native FP8 compute: Hardware tensor cores that natively compute in FP8 (SM ≥ 89 on CUDA, gfx942/gfx950 on ROCm)
FP8 weight-only quantization (WOQ): Kernels like Marlin that can load FP8-quantized weights and dequantize them to FP16/BF16 for compute (works on SM ≥ 75 on CUDA, i.e., Turing+)

The current supports_fp8() doesn't distinguish between these, which can lead to wrong dispatch decisions. For example, FBGEMMFp8Config uses has_device_capability(89) to choose between native FP8 compute and the Marlin fallback. This is unreliable on RTX 5090 (12,0): has_device_capability(89) returns True there, but the device may not actually have native FP8 tensor cores for all operations.
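
One way to picture the two levels is as a three-valued result instead of a boolean. The sketch below is only an illustration: `Fp8Support` and `classify_cuda_fp8` are hypothetical names, and the thresholds are the CUDA numbers quoted above, so the function inherits the SM 12.0 caveat this RFC describes.

```python
from enum import Enum


class Fp8Support(Enum):
    """Hypothetical three-level view of FP8 support."""
    NONE = "none"      # no FP8 path at all
    WOQ = "woq"        # FP8 weights, dequantized to FP16/BF16 for compute
    NATIVE = "native"  # FP8 matrix multiplication in hardware


def classify_cuda_fp8(capability_int: int) -> Fp8Support:
    """Map a CUDA capability integer to an FP8 level.

    Uses the thresholds quoted in the text (SM >= 89 native, SM >= 75 WOQ).
    This is exactly the kind of numeric rule that breaks on SM 12.0, so treat
    it as a picture of today's implicit logic, not as a recommendation.
    """
    if capability_int >= 89:
        return Fp8Support.NATIVE
    if capability_int >= 75:
        return Fp8Support.WOQ
    return Fp8Support.NONE


for cap in (70, 75, 80, 89, 90):
    print(cap, classify_cuda_fp8(cap).value)
```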

2.2 Existing feature-level APIs (the right pattern)

The codebase already has the correct abstraction for some features:

# vllm/platforms/interface.py — base class
Platform.supports_fp8()    → bool   # ✅ Cross-platform
Platform.supports_mx()     → bool   # ✅ Cross-platform
Platform.support_deep_gemm() → bool # ✅ Cross-platform
Platform.fp8_dtype()       → dtype  # ✅ Cross-platform
Platform.is_fp8_fnuz()     → bool   # ✅ Cross-platform

Each platform overrides these with the correct implementation:

CUDA: supports_fp8() → has_device_capability(89)
ROCm: supports_fp8() → "gfx94" in arch or "gfx95" in arch or "gfx12" in arch
XPU: inherits False (or could override when XPU gains FP8)

This is the pattern that should be expanded — but with finer granularity (see Section 3).
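
The override pattern described above can be sketched in a few lines. The classes below are stand-ins invented for this example (`FakeCuda`, `FakeRocm`, `FakeXpu` and their attributes are hypothetical); only the decision rules are taken from the list above.

```python
class Platform:
    """Minimal stand-in for a platform base class."""

    @classmethod
    def supports_fp8(cls) -> bool:
        # Conservative default: a platform must opt in explicitly.
        return False


class FakeCuda(Platform):
    capability = 90  # hypothetical device: SM 9.0

    @classmethod
    def supports_fp8(cls) -> bool:
        # Rule quoted above for CUDA: numeric capability >= 89.
        return cls.capability >= 89


class FakeRocm(Platform):
    arch = "gfx90a"  # hypothetical device: MI200

    @classmethod
    def supports_fp8(cls) -> bool:
        # Rule quoted above for ROCm: substring match on the arch name.
        return any(tag in cls.arch for tag in ("gfx94", "gfx95", "gfx12"))


class FakeXpu(Platform):
    pass  # inherits the conservative default


for platform in (FakeCuda, FakeRocm, FakeXpu):
    print(platform.__name__, platform.supports_fp8())
```

Callers ask the same question on every platform and never see a capability number, which is why the ROCm `gfx90a` case cannot be confused with CUDA's `(9, 0)`.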

2.3 Missing feature-level APIs

Feature	How it's checked today	Missing `Platform` method
BF16 support	`has_device_capability(80)` (CUDA-only)	`supports_bf16()`
FP8 native compute vs WOQ	Implicit: `has_device_capability(89)` vs `has_device_capability(75)`	`supports_fp8_native()` / `supports_fp8_woq()`
FP4 / NVFP4	C++ `cutlass_scaled_mm_supports_fp4()` (CUDA-only)	`supports_fp4()`
Hopper features (TMA, wgmma)	`has_device_capability(90)`	`supports_tma()` or architecture family check
Blackwell features	`is_device_capability_family(100)` — misses SM 12.0 workstation	`supports_blackwell_features()`
Flash attention FP8	`has_device_capability(89)`	Covered by `supports_fp8_native()`
CUTLASS GEMM dispatch	`has_device_capability(75/80/89/90/100)`	Platform-specific, keep as-is

Proposed Change.

3. Proposal: Two-Layer Capability Model

Layer 1: Feature-Based Queries (Cross-Platform) — PRIMARY

Expand the existing Platform base class with semantic feature queries that each platform implements correctly:

# vllm/platforms/interface.py

class Platform:
    # === Existing (keep) ===
    @classmethod
    def supports_fp8(cls) -> bool: ...      # keep for backward compat, alias to supports_fp8_native()
    @classmethod
    def supports_mx(cls) -> bool: ...
    @classmethod
    def support_deep_gemm(cls) -> bool: ...
    
    # === New feature queries ===
    
    @classmethod
    def supports_bf16(cls) -> bool:
        """Returns whether the current platform supports BF16 compute."""
        return False
    
    @classmethod
    def supports_fp8_native(cls) -> bool:
        """Returns whether the current platform has native FP8 tensor core compute.
        
        This means the hardware can natively perform matrix multiplication in FP8
        (e.g., NVIDIA SM ≥ 89 Ada/Hopper/Blackwell server, AMD gfx942/gfx950).
        Used for: native FP8 GEMM, FP8 KV cache, FP8 attention.
        """
        return False

    @classmethod
    def supports_fp8_woq(cls) -> bool:
        """Returns whether the platform supports FP8 weight-only quantization.
        
        This means kernels like Marlin can load FP8-quantized weights and 
        dequantize them to FP16/BF16 for compute. Works on significantly older
        hardware than native FP8 (e.g., NVIDIA SM ≥ 75 Turing+).
        Used for: Marlin FP8, fbgemm FP8 fallback path.
        """
        return False

    @classmethod
    def supports_fp4(cls) -> bool:
        """Returns whether the current platform supports FP4 quantization 
        (native compute or equivalent)."""
        return False

    @classmethod
    def supports_tma(cls) -> bool:
        """Returns whether the current platform supports 
        Tensor Memory Accelerator (or equivalent async copy engine)."""
        return False

    @classmethod
    def supports_fp8_kv_cache(cls) -> bool:
        """Returns whether the current platform supports FP8 KV cache."""
        return cls.supports_fp8_native()

    @classmethod
    def get_architecture_family(cls) -> str:
        """Returns human-readable architecture family name.
        
        Examples: 'hopper', 'blackwell', 'blackwell_consumer','cdna3', 
                  'rdna4', 'ponte_vecchio', 'unknown'
        """
        return "unknown"

Why split FP8 into native vs WOQ?

The current supports_fp8() is ambiguous. In the codebase today:

# fbgemm_fp8.py — decides whether to use native FP8 or Marlin fallback
self.use_marlin = not current_platform.has_device_capability(89)

# marlin_utils_fp8.py — Marlin FP8 works on much older hardware
def is_fp8_marlin_supported():
    return current_platform.has_device_capability(75)  # Turing+!

With the split:

# fbgemm_fp8.py — clear intent
self.use_marlin = not current_platform.supports_fp8_native()

# marlin_utils_fp8.py — explicit WOQ check
def is_fp8_marlin_supported():
    return current_platform.supports_fp8_woq()

This distinction matters for:

Testing: A test for native FP8 GEMM should use @requires_feature("fp8_native"), while a test for Marlin FP8 WOQ should use @requires_feature("fp8_woq")
Consumer GPUs: RTX 5090 (SM 12.0) may support FP8 WOQ via Marlin but not native FP8 tensor core compute
ROCm: MI200 (gfx90a) has no FP8 at all, while MI300 (gfx942) has native FP8 (FNUZ); Marlin support also differs between them

Layer 2: Numeric Capability (Platform-Specific) — KEEP BUT DEPRECATE for cross-platform use

Keep has_device_capability() / is_device_capability() / is_device_capability_family() as-is for platform-specific kernel dispatch, but:

Document that these are CUDA/ROCm-specific and must always be guarded by is_cuda() / is_rocm().
Emit a deprecation warning in tests when these are used without a platform guard (enforced via a lint rule).
XPU/CPU/TPU continue to return None / False — this is correct behavior.
For kernel dispatch: These remain the correct API when selecting between CUDA-specific kernel implementations (e.g., CUTLASS 2.x vs 3.x). Such code is inherently platform-specific and doesn't need cross-platform abstraction.
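
As an illustration of the guard rule, the sketch below wraps a numeric check so that an unguarded call on a non-CUDA/ROCm platform produces a warning instead of a silent False. `StubPlatform` and `checked_capability` are hypothetical names made up for this example, and the RFC does not prescribe this particular mechanism.

```python
import warnings


class StubPlatform:
    """Tiny stand-in for a platform object (not the real vLLM class)."""

    def __init__(self, name, capability):
        self.name = name
        self.capability = capability  # int such as 89, or None if unavailable

    def is_cuda(self):
        return self.name == "cuda"

    def is_rocm(self):
        return self.name == "rocm"

    def has_device_capability(self, minimum):
        return self.capability is not None and self.capability >= minimum


def checked_capability(platform, minimum):
    """Numeric capability check that complains when used off CUDA/ROCm."""
    if not (platform.is_cuda() or platform.is_rocm()):
        warnings.warn(
            "numeric capability checks are CUDA/ROCm-specific; "
            "use a feature query instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return False
    return platform.has_device_capability(minimum)


print(checked_capability(StubPlatform("cuda", 90), 89))
```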

5. Migration Plan

Phase 0: Add feature methods to Platform (this RFC)

Task	Effort
Add `supports_bf16()`, `supports_fp8_native()`, `supports_fp8_woq()`, `supports_fp4()`, `supports_tma()`, `supports_wgmma()`, `get_architecture_family()`, `get_device_tier()` to base `Platform`	Small
Implement in `CudaPlatformBase`, `RocmPlatform`, `XPUPlatform`	Small
Keep `supports_fp8()` as backward-compat alias → `supports_fp8_native()`	Trivial
Add `requires_feature()` to `tests/utils.py`	Small

Phase 1: Convert feature-gated capability checks in vllm/ source

Convert ~60% of has_device_capability calls that are actually feature checks:

# Before
if current_platform.has_device_capability(89):
    ...  # FP8 native path
elif current_platform.has_device_capability(75):
    ...  # FP8 Marlin WOQ fallback

# After
if current_platform.supports_fp8_native():
    ...  # FP8 native path
elif current_platform.supports_fp8_woq():
    ...  # FP8 Marlin WOQ fallback

Priority conversion targets (most impactful):

has_device_capability(89) → supports_fp8_native() (~15 sites in vllm/)
has_device_capability(75) for Marlin → supports_fp8_woq() (~3 sites)
has_device_capability(80) → supports_bf16() (~8 sites)
is_device_capability_family(100) → supports_fp4() or architecture check (~20 sites)
not has_device_capability(89) for use_marlin → not supports_fp8_native() and supports_fp8_woq() (~5 sites)

Phase 2: Convert test skip patterns

Integrate with the test skip RFC (#39158) to migrate test files:

# Before (CUDA-only, broken on ROCm, wrong on consumer GPUs)
@pytest.mark.skipif(not current_platform.has_device_capability(89), reason="need fp8")

# After (cross-platform, correct on all SKUs)
@requires_feature("fp8_native")

Note: Since vLLM CI always uses NVIDIA server GPUs, the migration can be incremental — existing tests won't break, but new tests should use the feature-based API.
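
The RFC does not show how `requires_feature()` would be implemented, so the following is only one possible sketch. It assumes a naming convention in which the feature string maps to a `supports_<feature>()` method, uses a stub in place of the real `current_platform`, and builds on the standard library's `unittest.skipUnless` to stay dependency-free; a real version in `tests/utils.py` would more likely wrap `pytest.mark.skipif`.

```python
import unittest


class StubPlatform:
    """Stand-in for current_platform; the return values are made up."""

    @classmethod
    def supports_fp8_native(cls) -> bool:
        return False

    @classmethod
    def supports_fp8_woq(cls) -> bool:
        return True


current_platform = StubPlatform


def requires_feature(feature: str):
    """Skip the decorated test unless supports_<feature>() returns True.

    Unknown feature names are treated as unsupported, so a typo skips the
    test instead of running it on the wrong hardware.
    """
    query = getattr(current_platform, f"supports_{feature}", None)
    supported = bool(query()) if callable(query) else False
    return unittest.skipUnless(supported, f"requires platform feature: {feature}")


@requires_feature("fp8_native")
def test_native_fp8_gemm():
    return "ran"


@requires_feature("fp8_woq")
def test_marlin_fp8():
    return "ran"
```

With this stub, `test_marlin_fp8` runs normally while calling `test_native_fp8_gemm` raises `unittest.SkipTest`.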

Phase 3: Lint enforcement

Add a pre-commit hook that warns on new has_device_capability usage in test files without an is_cuda() / is_rocm() guard.
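
A rough sketch of what such a check could look like is shown below. It is a line-based heuristic rather than a real parser: the `WINDOW` size, the regular expressions, and the function names are assumptions made for this example, and an actual hook would probably need AST-based analysis to avoid false positives.

```python
import re

# Numeric capability calls that should be guarded.
CALL = re.compile(
    r"\b(?:has_device_capability|is_device_capability(?:_family)?)\s*\("
)
# Platform guards that make a numeric check acceptable.
GUARD = re.compile(r"\bis_(?:cuda|rocm)\s*\(")
WINDOW = 5  # hypothetical: number of preceding lines searched for a guard


def find_unguarded(source: str) -> list[int]:
    """Return 1-based line numbers of capability calls with no nearby guard."""
    lines = source.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if not CALL.search(line):
            continue
        context = lines[max(0, index - WINDOW): index + 1]
        if not any(GUARD.search(candidate) for candidate in context):
            hits.append(index + 1)
    return hits


def main(paths: list[str]) -> int:
    """Print one warning per finding; return 1 if anything was found."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for lineno in find_unguarded(handle.read()):
                print(f"{path}:{lineno}: numeric capability check without "
                      "is_cuda()/is_rocm() guard")
                status = 1
    return status
```

A hook entry point would call `main()` with the file names passed in by pre-commit and use the return value as the exit status.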

Feedback Period.

1-2 weeks.

CC List.

@tjtanaa

Any Other Things.

6. Comprehensive Device Capability Reference

6.1 NVIDIA CUDA — `torch.cuda.get_device_capability()`

Server / Data Center GPUs

Device	Compute Capability	Architecture	BF16	FP8 Native	FP4	TMA
GB300 / B300	`(10, 3)`	Blackwell	✅	✅	✅	✅
GB200 / B200	`(10, 0)`	Blackwell	✅	✅	✅	✅
H100 / H200	`(9, 0)`	Hopper	✅	✅	❌	✅
L4 / L40 / L40S	`(8, 9)`	Ada Lovelace	✅	✅	❌	❌
A40 / A10 / A16 / A2	`(8, 6)`	Ampere	✅	❌	❌	❌
A100	`(8, 0)`	Ampere	✅	❌	❌	❌
T4	`(7, 5)`	Turing	❌	❌	❌	❌
V100	`(7, 0)`	Volta	❌	❌	❌	❌

Workstation / Pro GPUs

Device	Compute Capability	Architecture	BF16	FP8 Native	FP4	TMA
RTX PRO 6000 Blackwell	`(12, 0)`	Blackwell (consumer SM)	✅	✅	✅	✅*
RTX 6000 Ada	`(8, 9)`	Ada Lovelace	✅	✅	❌	❌
RTX A6000 / A5000 / A4000	`(8, 6)`	Ampere	✅	❌	❌	❌
Quadro RTX	`(7, 5)`	Turing	❌	❌	❌	❌

Consumer / GeForce GPUs

Device	Compute Capability	Architecture	BF16	FP8 Native	FP4	TMA
RTX 5090/5080/5070/5060/5050	`(12, 0)`	Blackwell (consumer SM)	✅	❌	❌	❌*
RTX 4090/4080/4070/4060	`(8, 9)`	Ada Lovelace	✅	✅	❌	❌
RTX 3090/3080/3070/3060/3050	`(8, 6)`	Ampere	✅	❌	❌	❌
RTX 2080/2070/2060, Titan RTX	`(7, 5)`	Turing	❌	❌	❌	❌

Key insight: RTX PRO 6000 and RTX 5090 both report (12, 0), but RTX PRO supports FP8/FP4 while consumer RTX 5090 does not. Capability number alone is insufficient for feature detection on SM 12.0.

6.2 AMD ROCm — GCN Architecture → Mapped Capability

Device	GCN Arch	Mapped Capability	Architecture	BF16	FP8 Native	FP4
MI4xx (future)	gfx1250	TBD	CDNA next	✅	✅	✅
MI355	gfx950	`(9, 5)`	CDNA4	✅	✅ (OCP+FNUZ)	✅
MI300/MI325	gfx942	`(9, 4)`	CDNA3	✅	✅ (FNUZ)	❌
MI200	gfx90a	`(9, 0)`	CDNA2	✅	❌	❌
Radeon (RDNA4)	gfx12xx	`(12, 0)`	RDNA4	✅	✅	❌
Radeon (RDNA3)	gfx11xx	`(11, x)`	RDNA3	✅	❌	❌

Key insight: ROCm's (9, 0) = MI200 (NO FP8) vs CUDA's (9, 0) = Hopper (HAS FP8). The same number has opposite meanings.
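
Keying feature lookups on the architecture name, not on the mapped capability tuple, avoids this collision. The sketch below encodes three rows of the ROCm table above in a dictionary; `ROCM_FEATURES` and `rocm_supports` are hypothetical names, and stripping a `:`-separated suffix reflects the fact that ROCm arch strings can carry feature flags (for example `gfx942:sramecc+:xnack-`).

```python
# Feature flags copied from the ROCm table above (data-center parts only).
ROCM_FEATURES = {
    "gfx950": {"bf16": True, "fp8_native": True, "fp4": True},
    "gfx942": {"bf16": True, "fp8_native": True, "fp4": False},
    "gfx90a": {"bf16": True, "fp8_native": False, "fp4": False},
}


def rocm_supports(arch: str, feature: str) -> bool:
    """Look up a feature by arch name; unknown archs or features give False."""
    base_arch = arch.split(":")[0]  # drop suffixes such as ":sramecc+:xnack-"
    return ROCM_FEATURES.get(base_arch, {}).get(feature, False)


for arch in ("gfx90a", "gfx942:sramecc+:xnack-", "gfx950"):
    print(arch, rocm_supports(arch, "fp8_native"), rocm_supports(arch, "fp4"))
```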

6.3 Feature to Capability Mapping — Why Numeric Checks Fail

Feature	CUDA Numeric Gate	Why It Fails
FP8 Native	`has_device_capability(89)`	❌ RTX 5090 is `(12,0)` ≥ `(8,9)` but has no native FP8. ROCm MI200 maps to `(9,0)` ≥ `(8,9)` but has no FP8.
FP4	`is_device_capability_family(100)`	❌ Misses RTX PRO 6000 at `(12,0)`.
BF16	`has_device_capability(80)`	⚠️ Works for CUDA server/workstation, but evaluates to `False` on XPU (which does support BF16) because the capability there is `None`.
Blackwell features	`is_device_capability_family(100)`	❌ Misses workstation Blackwell at `(12,0)`. May also match non-Blackwell devices if a future family reuses the 10.x range.


extent analysis

TL;DR

The most likely fix involves replacing numeric device capability checks with semantic feature queries to ensure cross-platform compatibility and accuracy.

Guidance

Introduce feature-based queries: Expand the Platform base class with methods like supports_fp8_native(), supports_fp8_woq(), supports_bf16(), and supports_fp4() to provide a clear and cross-platform way to check for specific hardware features.
Deprecate numeric capability checks for cross-platform use: While keeping has_device_capability() and similar methods for platform-specific kernel dispatch, deprecate their use for cross-platform feature checks and encourage the use of feature-based queries instead.
Migrate existing code: Gradually convert existing has_device_capability() calls to use the new feature-based queries, prioritizing tests and config/runtime code for the highest cross-platform impact.
Enforce lint rules: Implement pre-commit hooks to warn against new usage of has_device_capability() without proper platform guards, ensuring that future code adheres to the new guidelines.

Example

# Before
if current_platform.has_device_capability(89):
    ...  # FP8 native path

# After
if current_platform.supports_fp8_native():
    ...  # FP8 native path

Notes

The migration should be incremental, starting with the most impactful areas such as tests and config/runtime code.
The introduction of feature-based queries does not immediately render numeric capability checks obsolete for all use cases, especially within platform-specific kernel dispatch code.
Documentation and clear guidelines are crucial for a smooth transition to the new feature-based API.

Recommendation

Apply the workaround by introducing and gradually migrating to the feature-based queries, ensuring a more robust and cross-platform compatible codebase. This approach allows for clearer intent in code, reduces maintenance burdens due to changing hardware capabilities, and improves the overall reliability of feature checks across different platforms.
