pytorch - ✅(Solved) Fix torch.multinomial: add flag to skip input validation (66% of GPU time) [1 pull requests, 7 comments, 3 participants]

tomasruizt · 2026-03-11T11:07:47Z

GitHub stats

pytorch/pytorch#177127 • Fetched 2026-04-08 00:22:04

Author: tomasruizt
Participants: ellingtontrevor642-art, ngimel, tomasruizt
Timeline (top): subscribed ×12, mentioned ×10, commented ×7, labeled ×6

PR fix notes

PR #180444: Add `validate` parameter to `torch.multinomial` to skip input validation

Repository: pytorch/pytorch
Author: Nik-Reddy
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/180444

Description (problem / solution / changelog)

Summary

Adds a validate keyword argument (default True) to torch.multinomial. When validate=False, the 10 GPU validation kernels (aminmax, sum, assert_async, etc.) on the fast path (!with_replacement || n_sample == 1) are skipped entirely.

Fixes #177127

Motivation

As profiled in the issue, torch.multinomial spends ~66% of GPU time (~107 µs out of ~161 µs on an RTX 3090 with V=128,000) on input validation: checking for negative values, NaN, Inf, and zero-sum distributions. This validation is unnecessary when the caller knows the input is valid, e.g. when probabilities come directly from softmax() over finite logits, which yields values in [0, 1], no NaN/Inf, and a sum of ~1.0.

This matters for LLM inference, where torch.multinomial is called on every decode step with softmax output. Skipping validation yields an estimated ~3x speedup for this hot-path operation (from ~161 µs to the ~54 µs spent in the sampling kernels, per the issue's profile).

Changes

File	Change
`native_functions.yaml`	Add `bool validate=True` as keyword-only arg to both `multinomial` and `multinomial.out`
`Distributions.cpp`	Guard the fast-path validation block with `if (validate)`
`Distributions.mm` (MPS)	Same guard for the MPS backend
`derivatives.yaml`	Update signature
`_meta_registrations.py`	Accept new kwarg in meta function
`overrides.py`	Update lambda signature
`_torch_docs.py`	Document the `validate` kwarg with usage guidance
`test_torch.py`	Add `test_multinomial_validate_false` covering 1D/2D, replacement/no-replacement, method variant, and softmax input
`common_methods_invocations.py`	Add `validate=False` sample inputs to OpInfo

Usage

# Default behavior unchanged (validation on):
torch.multinomial(probs, num_samples=1)

# Skip validation when input is known-valid (e.g. from softmax):
torch.multinomial(probs, num_samples=1, validate=False)

# Method variant also supported:
probs.multinomial(num_samples=1, validate=False)
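
Because PR #180444 is still open, the validate keyword may be missing from an installed build. The following is a minimal sketch of a call site that tolerates both cases, assuming an unsupported keyword surfaces as a TypeError. The helper name sample_next_token is hypothetical and not part of the PR.

```python
import torch

def sample_next_token(probs: torch.Tensor) -> torch.Tensor:
    """Draw one sample per row, skipping validation when the build supports it.

    Hypothetical helper; `validate` exists only in builds that include PR #180444.
    """
    try:
        return torch.multinomial(probs, num_samples=1, validate=False)
    except TypeError:
        # Builds without the PR reject the unknown keyword; use the validated call.
        return torch.multinomial(probs, num_samples=1)

# Softmax over finite logits, so the input is known to be valid.
probs = torch.randn(4, 16).softmax(dim=-1)
token = sample_next_token(probs)
```

In a decode loop it would likely be cheaper to probe for the keyword once and cache the result rather than catching the exception on every step.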

Design Decisions

Keyword-only: validate is keyword-only to prevent positional ambiguity with existing args
Default True: Full backward compatibility — existing code is unaffected
Name choice: validate follows the pattern of optional checks in other PyTorch APIs (e.g. torch.nn.utils.clip_grad_norm_ has the optional error_if_nonfinite check, and torch.distributions classes accept validate_args)
Both CPU and GPU: Validation is guarded on both Distributions.cpp (CPU/CUDA) and Distributions.mm (MPS)

Testing

Added test_multinomial_validate_false with coverage for all code paths (1D, 2D, with/without replacement, method variant, softmax input)
Added validate=False sample inputs to OpInfo for broader test matrix coverage
Existing tests pass unchanged (validate=True is the default)

cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia

Changed files

aten/src/ATen/VmapModeRegistrations.cpp (modified, +2/-2)
aten/src/ATen/functorch/BatchRulesRandomness.cpp (modified, +3/-3)
aten/src/ATen/native/Distributions.cpp (modified, +15/-11)
aten/src/ATen/native/mps/operations/Distributions.mm (modified, +14/-11)
aten/src/ATen/native/native_functions.yaml (modified, +2/-2)
test/test_torch.py (modified, +39/-0)
tools/autograd/derivatives.yaml (modified, +1/-1)
torch/_meta_registrations.py (modified, +1/-1)
torch/_torch_docs.py (modified, +6/-1)
torch/overrides.py (modified, +1/-1)
torch/testing/_internal/common_methods_invocations.py (modified, +2/-0)

🚀 Feature Request

Motivation

torch.multinomial on CUDA launches 10 validation kernels before the actual 3 sampling kernels. NCU profiling shows these checks consume 66% of the total GPU time (~107 us out of ~161 us for V=128,000 on an RTX 3090).

This validation is unnecessary when the caller knows the input is valid, e.g. when probabilities come directly from softmax() over finite logits, which guarantees: values in [0, 1], no NaN/Inf, and sum = 1.0 up to floating-point rounding.

This matters for LLM inference, where torch.multinomial is called on every decode step with softmax output.
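
The conditions being checked can be approximated in plain PyTorch to see why softmax output passes them. This is a rough sketch, assuming the three conditions described in this issue (finite, non-negative, non-zero sum). The function name is hypothetical, and the code is not the ATen implementation.

```python
import torch

def passes_multinomial_checks(probs: torch.Tensor) -> bool:
    """Approximate the conditions the validation kernels assert (hypothetical helper)."""
    finite = torch.isfinite(probs).all()
    non_negative = (probs >= 0).all()
    positive_sum = (probs.sum(dim=-1) > 0).all()
    return bool(finite & non_negative & positive_sum)

logits = torch.randn(128_000)
ok = passes_multinomial_checks(logits.softmax(dim=0))

# A NaN logit propagates through softmax, so the guarantee holds only for finite logits.
bad_logits = logits.clone()
bad_logits[0] = float("nan")
bad = passes_multinomial_checks(bad_logits.softmax(dim=0))
```

Here ok is True and bad is False, which is the case a caller takes responsibility for when passing validate=False.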

NCU profile

Reproduction script:

import torch, nvtx

vocab = 128_000
logits = torch.randn(size=[vocab], dtype=torch.float32, device="cuda")
probs = logits.softmax(dim=0)
with nvtx.annotate("multinomial"):
    sample = torch.multinomial(probs, num_samples=1)

ncu --nvtx --nvtx-include "multinomial/" --metrics gpu__time_duration.sum python slow-validation.py

Validation kernels (66%, ~107 us):

#	Kernel	Time (us)	Purpose
1	`reduce_kernel` (MinNanFunctor)	29.28	Check all values >= 0
2	`vectorized_elementwise_kernel` (compare_scalar)	2.72	Compare min result
3	`reduce_kernel` (MaxNanFunctor)	28.80	Check no NaN/Inf
4	`vectorized_elementwise_kernel` (compare_scalar)	2.72	Compare max result
5	`vectorized_elementwise_kernel` (BitwiseAnd)	2.62	Combine checks
6	`_assert_async_cuda_kernel`	3.94	Assert on GPU
7	`reduce_kernel` (sum)	28.64	Check sum > 0
8	`vectorized_elementwise_kernel` (CompareEq)	2.37	Compare sum result
9	`vectorized_elementwise_kernel` (bitwise_not)	2.56	Negate for assert
10	`_assert_async_cuda_kernel`	3.81	Assert on GPU

Actual sampling kernels (34%, ~54 us):

#	Kernel	Time (us)	Purpose
11	`distribution_elementwise_grid_stride_kernel`	5.63	Generate Gumbel noise
12	`vectorized_elementwise_kernel` (Div)	3.87	Divide by probs
13	`reduce_kernel` (ArgMax)	44.29	Argmax to get sample
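
The percentages and the ~3x figure follow directly from the per-kernel times in the two tables above. A short check of the arithmetic, using only the numbers listed there:

```python
# Per-kernel GPU times in microseconds, copied from the two tables above.
validation_us = [29.28, 2.72, 28.80, 2.72, 2.62, 3.94, 28.64, 2.37, 2.56, 3.81]
sampling_us = [5.63, 3.87, 44.29]

validation_total = sum(validation_us)
sampling_total = sum(sampling_us)
total = validation_total + sampling_total

# Share of GPU time spent validating, and the speedup if validation is skipped.
validation_share = validation_total / total
speedup_if_skipped = total / sampling_total

print(f"validation: {validation_total:.2f} us ({validation_share:.1%})")
print(f"sampling:   {sampling_total:.2f} us")
print(f"speedup if validation is skipped: {speedup_if_skipped:.2f}x")
```

This counts GPU kernel time only. Kernel launch overhead and host-side work are not included in the profile, so end-to-end gains may differ.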

Proposal

Add a validate keyword argument (default True) to torch.multinomial:

# Current behavior (validation on):
torch.multinomial(probs, num_samples=1)

# Skip validation when caller guarantees valid input:
torch.multinomial(probs, num_samples=1, validate=False)

This would skip the 10 validation kernels in MultinomialKernel.cu when validate=False, giving a ~3x speedup for this op.

Alternatives considered

torch.distributions.Multinomial has validate_args, but that only controls Python-level constraint checks, not the CUDA kernel-level validation.
Fused sampling kernels (e.g. flashinfer) avoid this entirely, but for users who want to stay with PyTorch's API, a flag would help.

cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @pytorch/cpu-kernels

extent analysis

Fix Plan

Add `validate` keyword argument to `torch.multinomial`

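We will modify the torch.multinomial function to accept a validate keyword argument, which defaults to True. When validate=False, the 10 validation kernels on the fast path are not launched. Per PR #180444, the guard is added in ATen (Distributions.cpp, and Distributions.mm for MPS) rather than in Python, so the code below only illustrates the control flow.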

Code Changes

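import torch

def multinomial(probs, num_samples, *, validate=True):
    # Python illustration only: PR #180444 adds the real guard in ATen
    # (Distributions.cpp / Distributions.mm), not in Python.
    if validate:
        # Mirror the checks the validation kernels perform
        assert bool((torch.isfinite(probs) & (probs >= 0)).all()), "invalid probabilities"
        assert bool((probs.sum(dim=-1) > 0).all()), "zero-sum distribution"
    # Sampling steps from the profile, sketched without replacement:
    # exponential noise, divide, top-k (argmax when num_samples == 1)
    q = torch.empty_like(probs).exponential_(1)
    return (probs / q).topk(num_samples, dim=-1).indices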

Example Usage

import torch

vocab = 128_000
logits = torch.randn(size=[vocab], dtype=torch.float32, device="cuda")
probs = logits.softmax(dim=0)

# Validation on (default behavior):
sample = torch.multinomial(probs, num_samples=1)

# Skip validation when caller guarantees valid input
# (requires a build that includes PR #180444, which is still open):
sample = torch.multinomial(probs, num_samples=1, validate=False)

Verification

To verify that the fix worked, profile torch.multinomial with ncu with and without validation and compare the reported GPU time.

# Run with validation (script as given in the issue)
ncu --nvtx --nvtx-include "multinomial/" --metrics gpu__time_duration.sum python slow-validation.py

# Run without validation: the script takes no command-line flags, so first edit
# the torch.multinomial call in it to pass validate=False, then rerun
ncu --nvtx --nvtx-include "multinomial/" --metrics gpu__time_duration.sum python slow-validation.py

With validate=False, the 10 validation kernels should no longer appear in the profile, leaving only the 3 sampling kernels (~54 us in the issue's measurement).
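
When ncu is not at hand, a coarser wall-clock comparison can serve as a sanity check. The sketch below assumes that a build without PR #180444 raises TypeError for the unknown keyword. The helper name mean_call_time_us is hypothetical, and the script falls back to CPU when no CUDA device is present.

```python
import time
import torch

def mean_call_time_us(probs: torch.Tensor, iters: int = 100, **kwargs) -> float:
    """Mean wall-clock time per torch.multinomial call, in microseconds (hypothetical helper)."""
    on_cuda = probs.is_cuda
    torch.multinomial(probs, num_samples=1, **kwargs)  # warm-up call
    if on_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.multinomial(probs, num_samples=1, **kwargs)
    if on_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters * 1e6

device = "cuda" if torch.cuda.is_available() else "cpu"
probs = torch.randn(128_000, device=device).softmax(dim=0)

baseline_us = mean_call_time_us(probs)
try:
    skipped_us = mean_call_time_us(probs, validate=False)
except TypeError:
    skipped_us = None  # this build does not include the `validate` keyword
```

Wall-clock numbers include launch overhead and host-side work, so they will not match the per-kernel ncu figures; only the relative difference between the two runs is meaningful.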
