vllm: DeepGEMM SiLU/mul FP8 quant Triton kernel overflows int32 addresses for large DPEP warmup shapes



Problem

The Triton kernel vllm.model_executor.layers.quantization.utils.fp8_utils._silu_mul_per_token_group_quant_fp8_colmajor computes row-based pointer offsets in int32. For large DeepGEMM MoE warmup/workspace shapes these offsets overflow, and the Triton launch fails with a CUDA illegal memory access.

The failure is easy to hit for GLM-style MoE inference with DPEP=16 and a maximum of 36k tokens per rank. In that configuration the DeepGEMM activation-quant workspace for the SiLU/mul path has this shape:

M = round_up(tokens_per_rank * dpep_size * top_k + local_experts * 127, 128)
  = round_up(36000 * 16 * 8 + 16 * 127, 128)
  = 4,610,048
N = 4096  # 2 * GLM MoE intermediate size
max input element offset = M * N - 1 = 18,882,756,607
int32 max = 2,147,483,647
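
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: round_up and workspace_rows are hypothetical helper names that mirror the formula as written in this report, not functions taken from vLLM.

```python
INT32_MAX = 2**31 - 1


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple (hypothetical helper)."""
    return -(-value // multiple) * multiple


def workspace_rows(tokens_per_rank: int, dpep_size: int, top_k: int,
                   local_experts: int, align: int = 128) -> int:
    """Row count M of the activation-quant workspace, per the formula above."""
    padded = tokens_per_rank * dpep_size * top_k + local_experts * (align - 1)
    return round_up(padded, align)


M = workspace_rows(tokens_per_rank=36_000, dpep_size=16, top_k=8, local_experts=16)
N = 4096  # 2 * GLM MoE intermediate size, as stated in the report
max_offset = M * N - 1

print(f"M={M:,} N={N} max_offset={max_offset:,}")
print(f"exceeds int32 max: {max_offset > INT32_MAX}")
```

The largest element offset is roughly 8.8 times the int32 maximum, so any offset computed in 32 bits wraps long before the last row is reached.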

The kernel is called from DeepGemmExperts._act_mul_quant for the Hopper/non-E8M0 SiLU path.

Environment

Validated on 2 nodes x 8 NVIDIA H200 GPUs:

vllm 0.19.0
torch 2.10.0+cu128
triton 3.6.0
cuda 12.8
deep_gemm 2.5.0+891d57b
deep_ep 1.2.1+29d31c0

Minimal reproducer

Save as repro_vllm_deepgemm_silu_int32.py:

import os
import socket

import torch
import torch.distributed as dist
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
    silu_mul_per_token_group_quant_fp8_colmajor,
)

os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")
rank = int(os.environ.get("RANK", os.environ.get("SLURM_PROCID", "0")))
world = int(os.environ.get("WORLD_SIZE", os.environ.get("SLURM_NTASKS", "1")))
local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0")))
torch.cuda.set_device(local_rank)

if world > 1:
    dist.init_process_group("nccl", rank=rank, world_size=world)
    dist.barrier()

M = 4_610_048
N = 4096
print(f"rank={rank}/{world} host={socket.gethostname()} M={M} N={N} max_offset={M * N - 1}", flush=True)

x = torch.empty((M, N), device="cuda", dtype=torch.bfloat16)
torch.cuda.synchronize()
y, scales = silu_mul_per_token_group_quant_fp8_colmajor(x, use_ue8m0=False)
torch.cuda.synchronize()
print(f"rank={rank} ok {y.shape} {scales.shape}", flush=True)
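
Before running the reproducer it may help to estimate how much device memory it needs. The sketch below uses the output shapes reported in the validation logs of this issue, (M, 2048) and (M, 16). The element sizes are assumptions: 2 bytes for bfloat16 input, 1 byte for FP8 output, and 4 bytes per scale (float32 scales are assumed here, not confirmed by the source).

```python
GIB = 2**30

M = 4_610_048
N = 4096

# Assumed element sizes in bytes; the scale dtype in particular is a guess.
BF16_BYTES = 2
FP8_BYTES = 1
SCALE_BYTES = 4

input_bytes = M * N * BF16_BYTES        # x: (M, N) bfloat16
output_bytes = M * (N // 2) * FP8_BYTES  # y: (M, 2048) per the logs
scales_bytes = M * 16 * SCALE_BYTES      # scales: (M, 16) per the logs

total_bytes = input_bytes + output_bytes + scales_bytes
for name, size in [("input", input_bytes), ("output", output_bytes),
                   ("scales", scales_bytes), ("total", total_bytes)]:
    print(f"{name:>6}: {size / GIB:.2f} GiB")
```

Under these assumptions the input tensor dominates, so the multi-node shape needs a GPU with several tens of GiB free per rank.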

Launch as 16 ranks over two 8-GPU nodes:

srun -N2 --ntasks-per-node=8 --gres=gpu:8 --exclusive bash -lc '
  export RANK=$SLURM_PROCID
  export WORLD_SIZE=$SLURM_NTASKS
  export LOCAL_RANK=$SLURM_LOCALID
  export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
  export MASTER_PORT=29577
  export VLLM_USE_DEEP_GEMM=1
  export VLLM_MOE_USE_DEEP_GEMM=1
  python repro_vllm_deepgemm_silu_int32.py
'

Observed failure:

rank=0/16 host=... M=4610048 N=4096 max_offset=18882756607
...
File ".../vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 785,
  in silu_mul_per_token_group_quant_fp8_colmajor
    _silu_mul_per_token_group_quant_fp8_colmajor[grid](...)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
NCCL WARN Cuda failure 700 'an illegal memory access was encountered'

The same kernel also fails on a single GPU with the smallest 128-aligned shape whose maximum element offset exceeds the int32 maximum:

M = 524_416
N = 4096
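
The single-GPU threshold can be derived rather than found by trial. The sketch below assumes the same 128-row alignment used in the workspace formula; first_overflowing_rows is a hypothetical helper name.

```python
INT32_MAX = 2**31 - 1


def first_overflowing_rows(n_cols: int, align: int = 128) -> int:
    """Smallest aligned row count M such that M * n_cols - 1 > INT32_MAX."""
    # M * n_cols - 1 > INT32_MAX is equivalent to M * n_cols > 2**31.
    min_rows = (INT32_MAX + 1) // n_cols + 1
    # Round up to the assumed row alignment.
    return -(-min_rows // align) * align


N = 4096
M = first_overflowing_rows(N)
previous = M - 128

print(f"first overflowing M = {M:,} (max offset {M * N - 1:,})")
print(f"previous aligned M  = {previous:,} (max offset {previous * N - 1:,})")
```

With N = 4096 the previous aligned row count, 524,288, has a maximum offset of exactly 2,147,483,647, which still fits in int32. That is why 524,416 is the first shape that fails.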

Proposed fix

Upcast row/column offsets used for pointer arithmetic to tl.int64 inside the Triton kernel before computing load/store addresses. In particular:

m_offset = (pid_m * BLOCK_M).to(tl.int64)
n_offset = (pid_n * BLOCK_N).to(tl.int64)
offs_n = tl.arange(0, BLOCK_N).to(tl.int64)
offs_m = tl.arange(0, BLOCK_M).to(tl.int64)

base_y_ptr = y_ptr + m_offset * N + n_offset
base_y_q_ptr = y_q_ptr + m_offset * N_2 + n_offset
base_y_s_ptr = y_s_ptr + group_id * y_s_col_stride + m_offset
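
To illustrate why the upcast matters, the following pure-Python sketch emulates the row offset m_offset * N in 32-bit and 64-bit signed arithmetic. It is not Triton code: ctypes is used only to reproduce two's-complement wraparound, and the function names are hypothetical.

```python
import ctypes


def row_offset_int32(row: int, n_cols: int) -> int:
    """Row base offset as signed 32-bit arithmetic would produce it (wraps)."""
    return ctypes.c_int32(row * n_cols).value


def row_offset_int64(row: int, n_cols: int) -> int:
    """Row base offset in signed 64-bit arithmetic (no wrap for these sizes)."""
    return ctypes.c_int64(row * n_cols).value


N = 4096
for row in (1_000, 524_287, 524_288, 4_610_047):
    exact = row * N
    wrapped = row_offset_int32(row, N)
    wide = row_offset_int64(row, N)
    status = "ok" if wrapped == exact else "WRAPPED"
    print(f"row={row:>9,} exact={exact:>14,} int32={wrapped:>14,} "
          f"int64={wide:>14,} {status}")
```

Row 524,288 is the first row whose base offset equals 2**31, so the 32-bit result turns negative there. Later rows can wrap to positive but wrong offsets, which is consistent with the illegal memory access reported above.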

I validated this exact int64-addressing change with the same 2-node / 16-rank launch and M=4,610,048, N=4096; all ranks completed:

rank 0/16 quant ok output_shape=(4610048, 2048) scales_shape=(4610048, 16)
...
rank 15/16 quant ok output_shape=(4610048, 2048) scales_shape=(4610048, 16)
