[BUG] When using multiprocessing backend, import torch_npu._inductor fails on torch/torch_npu 2.9.0

GitHub stats
pytorch/pytorch#178535
Fetched 2026-04-08 01:35:43
View on GitHub
Comments
0
Participants
1
Timeline
79
Reactions
0
Timeline (top)
mentioned ×30, subscribed ×30, unsubscribed ×10, labeled ×7

Error Message

terminate called after throwing an instance of 'c10::Error'
  what():  pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64, please report a bug to PyTorch. Invalid thread pool!
Exception raised from set_num_threads at /pytorch/aten/src/ATen/ParallelOpenMP.cpp:64 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffb0e9c700 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffb0e3a860 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x48 (0xffffb0e99098 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x13c9548 (0xffffb1ee9548 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x234 (0xffffb600b0f4 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xa8c184 (0xffffbb8fc184 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xda294 (0xffffb0cda294 in /home/k00914150/miniconda3/envs/verl/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7d5b8 (0xffffbd43d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xe5edc (0xffffbd4a5edc in /lib/aarch64-linux-gnu/libc.so.6)


🐛 Describe the bug

Environment

<details >
Collecting environment information...
PyTorch version: 2.9.0+cpu
Is debug build: False

OS: Ubuntu 22.04 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.2.3
Libc version: glibc-2.35

Python version: 3.11.15 (main, Mar 11 2026, 17:11:19) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-25-generic-aarch64-with-glibc2.35

CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: HiSilicon
Model name: Kunpeng-920
Model: 0
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 16 MiB (256 instances)
L1i cache: 16 MiB (256 instances)
L2 cache: 128 MiB (256 instances)
L3 cache: 256 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
NUMA node4 CPU(s): 128-159
NUMA node5 CPU(s): 160-191
NUMA node6 CPU(s): 192-223
NUMA node7 CPU(s): 224-255
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] gpytorch==1.15.2
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torch_npu==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchdata==0.11.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton-ascend==3.2.0
[conda] gpytorch 1.15.2 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] pyzmq 27.1.0 pypi_0 pypi
[conda] torch 2.9.0 pypi_0 pypi
[conda] torch-c-dlpack-ext 0.1.5 pypi_0 pypi
[conda] torch-npu 2.9.0 pypi_0 pypi
[conda] torchaudio 2.9.0 pypi_0 pypi
[conda] torchdata 0.11.0 pypi_0 pypi
[conda] torchvision 0.24.0 pypi_0 pypi
[conda] transformers 4.57.6 pypi_0 pypi
[conda] triton-ascend 3.2.0 pypi_0 pypi
vLLM Version: 0.18.1rc1.dev40+g35141a7ee (git sha: 35141a7ee)
vLLM Ascend Version: 0.17.0rc2.dev144+g55c680073 (git sha: 55c680073)

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/cann-8.5.0
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/cann-8.5.0/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/cann-8.5.0/lib64:/usr/local/Ascend/cann-8.5.0/lib64/plugin/opskernel:/usr/local/Ascend/cann-8.5.0/lib64/plugin/nnengine:/usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/cann-8.5.0/tools/aml/lib64:/usr/local/Ascend/cann-8.5.0/tools/aml/lib64/plugin:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver
ASCEND_AICPU_PATH=/usr/local/Ascend/cann-8.5.0
VLLM_LOGGING_LEVEL=DEBUG
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/cann-8.5.0
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 25.2.3.1                 Version: 25.2.3.1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 105.3       45                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3430 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 93.7        44                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3409 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 95.8        43                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3405 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 96.8        44                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3404 / 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 93.4        49                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          3406 / 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 94.3        49                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          3404 / 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 98.7        50                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3404 / 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 95.1        46                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          3404 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.5.0
innerversion=V100R001C25SPC001B232
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=aarch64
os=linux
path=/usr/local/Ascend/cann-8.5.0

</details>

Description:

vLLM Version: 0.18.1rc1.dev40+g35141a7ee (git sha: 35141a7ee)
vLLM Ascend Version: 0.17.0rc2.dev144+g55c680073 (git sha: 55c680073)

I tried to run a simple example with tensor_parallel_size > 1:

from vllm import LLM, SamplingParams
prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=4)
llm = LLM(
    model="/data/Qwen3-8B",
    tensor_parallel_size=2,
    enforce_eager=True,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

With tensor_parallel_size = 1, everything works.

With tensor_parallel_size > 1, the failure occurs inside NPUWorker._init_device():

<img width="539" height="305" alt="Image" src="https://github.com/user-attachments/assets/58a9a39a-5209-43fc-b2b6-c0581148538f" />

Inside torch_npu._triton, it fails in the function register_fa_pass(). This is the error from torch:

terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
  what():  pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64, please report a bug to PyTorch. Invalid thread pool!
Exception raised from set_num_threads at /pytorch/aten/src/ATen/ParallelOpenMP.cpp:64 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffb0e9c700 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffb0e3a860 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x48 (0xffffb0e99098 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x13c9548 (0xffffb1ee9548 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x234 (0xffffb600b0f4 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xa8c184 (0xffffbb8fc184 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xda294 (0xffffb0cda294 in /home/k00914150/miniconda3/envs/verl/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7d5b8 (0xffffbd43d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xe5edc (0xffffbd4a5edc in /lib/aarch64-linux-gnu/libc.so.6)

  what():  pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64, please report a bug to PyTorch. Invalid thread pool!
Exception raised from set_num_threads at /pytorch/aten/src/ATen/ParallelOpenMP.cpp:64 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffb0e9c700 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffb0e3a860 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x48 (0xffffb0e99098 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x13c9548 (0xffffb1ee9548 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x234 (0xffffb600b0f4 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xa8c184 (0xffffbb8fc184 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xda294 (0xffffb0cda294 in /home/k00914150/miniconda3/envs/verl/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7d5b8 (0xffffbd43d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xe5edc (0xffffbd4a5edc in /lib/aarch64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error'
  what():  pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64, please report a bug to PyTorch. Invalid thread pool!
Exception raised from set_num_threads at /pytorch/aten/src/ATen/ParallelOpenMP.cpp:64 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffb0e9c700 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffb0e3a860 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x48 (0xffffb0e99098 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x13c9548 (0xffffb1ee9548 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x234 (0xffffb600b0f4 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xa8c184 (0xffffbb8fc184 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xda294 (0xffffb0cda294 in /home/k00914150/miniconda3/envs/verl/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7d5b8 (0xffffbd43d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xe5edc (0xffffbd4a5edc in /lib/aarch64-linux-gnu/libc.so.6)

Versions

.

cc @VitalyFedyunin @albanD @pragupta @ppwwyyxx @ezyang @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93

extent analysis

Fix Plan

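The crash depends on tensor_parallel_size: the reporter's run works with a value of 1 and aborts with values greater than 1, which is when vLLM runs the model across several worker processes. The workers terminate with "Invalid thread pool!", an internal assertion in set_num_threads (ATen/ParallelOpenMP.cpp, line 64) that the stack trace shows being reached from torch::autograd::Engine::thread_init. The report does not identify a root cause, so the steps here are workarounds to try rather than a confirmed fix.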
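One way to narrow this down is to check whether the import named in the issue title behaves differently in a forked child than in a fresh interpreter. The sketch below is a generic diagnostic under that assumption; the helper names `import_in_fresh_interpreter` and `import_in_forked_child` are hypothetical, and the report does not confirm that the process start method is the trigger.

```python
import importlib
import multiprocessing as mp
import subprocess
import sys


def import_in_fresh_interpreter(module: str) -> int:
    """Import `module` in a brand-new Python process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
    )
    return result.returncode


def import_in_forked_child(module: str) -> int:
    """Import `module` in a fork()ed child of the current process.

    The child inherits whatever the parent has already imported and
    initialized. The "fork" start method is not available on Windows.
    """
    ctx = mp.get_context("fork")
    child = ctx.Process(target=importlib.import_module, args=(module,))
    child.start()
    child.join()
    # 0 means success, 1 means a Python exception, and a negative value
    # means the child was killed by a signal (for example -6 for SIGABRT).
    return child.exitcode
```

To approximate the reported setup, one would `import torch` in the parent first and then call both helpers with `"torch_npu._inductor"`. A clean exit in the fresh interpreter combined with a negative exit code in the forked child would point at state inherited across fork; identical results would point elsewhere.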

To fix this issue, you can try the following steps:

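  • As a workaround, set tensor_parallel_size to 1. The reporter confirms this configuration works, at the cost of not sharding the model across NPUs: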
llm = LLM(
    model="/data/Qwen3-8B",
    tensor_parallel_size=1,
    enforce_eager=True,
)
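  • Alternatively, try setting the OMP_NUM_THREADS environment variable to a small positive integer (for example 1 or 2) before torch is imported. This is an unverified guess: the report does not show that the thread count is the cause, and the value does not need to be a power of 2: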
import os
os.environ['OMP_NUM_THREADS'] = '2'
  • If neither step helps, try a newer matching pair of torch and torch_npu releases. The report is against torch 2.9.0 and torch_npu 2.9.0.
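The environment-based workarounds can be collected in one place and applied before any heavy import. The sketch below is an assumption-laden example: `configure_worker_env` is a hypothetical helper, and while `VLLM_WORKER_MULTIPROC_METHOD` is a vLLM environment variable that selects `fork` or `spawn` for worker processes, nothing in the report confirms that switching it avoids this crash on the vLLM Ascend build.

```python
import os
from typing import MutableMapping


def configure_worker_env(
    env: MutableMapping[str, str],
    omp_threads: int = 1,
    start_method: str = "spawn",
) -> MutableMapping[str, str]:
    """Write thread-count and worker start-method settings into `env`."""
    if omp_threads < 1:
        raise ValueError("omp_threads must be a positive integer")
    if start_method not in ("fork", "spawn"):
        raise ValueError("start_method must be 'fork' or 'spawn'")
    # The OpenMP runtime reads OMP_NUM_THREADS when it initializes.
    env["OMP_NUM_THREADS"] = str(omp_threads)
    # Assumption: vLLM reads this variable when it creates its workers.
    env["VLLM_WORKER_MULTIPROC_METHOD"] = start_method
    return env


if __name__ == "__main__":
    # Call this before `import torch` or `from vllm import LLM`.
    configure_worker_env(os.environ)
```

Taking the mapping as a parameter keeps the helper easy to check with a plain dict before it is pointed at `os.environ`.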

Verification

To verify a workaround, rerun the reproduction script with the changed setting. A successful run prints one line of the form Prompt: ..., Generated text: ... per prompt, exits normally, and shows no c10::Error or "Invalid thread pool!" message on stderr.
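The check described above can be automated with a small wrapper that runs a command and looks for the crash signature. This is a minimal sketch; the function name `run_and_check` and the script name `repro.py` mentioned afterwards are hypothetical, and the marker strings are taken from the error output in the report.

```python
import subprocess
import sys
from typing import Optional, Sequence

# Substrings copied from the stderr output shown in the issue.
CRASH_MARKERS = ("c10::Error", "Invalid thread pool!")


def run_and_check(cmd: Sequence[str], timeout: Optional[float] = None) -> bool:
    """Return True if `cmd` exits with code 0 and stderr has no crash marker."""
    result = subprocess.run(
        list(cmd),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    crashed = any(marker in result.stderr for marker in CRASH_MARKERS)
    return result.returncode == 0 and not crashed
```

For example, `run_and_check([sys.executable, "repro.py"])` would return `True` only when the reproduction script, saved as `repro.py`, finishes cleanly.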

Extra Tips

  • Check the vLLM documentation for the tensor_parallel_size parameter of the LLM class to confirm the expected behavior and any limitations on your hardware.
  • On a machine with many cores (256 in this report), an OpenMP runtime typically defaults to one thread per core when OMP_NUM_THREADS is unset. Setting it explicitly keeps the thread count of each worker process predictable.
  • If the problem persists, gdb can give more context on the native abort raised by c10::Error, and pdb can be used to step through the Python side up to register_fa_pass().
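Before reaching for gdb, Python's built-in `faulthandler` module can show what the Python threads were doing when the process aborted. The sketch below demonstrates the mechanism on a deliberately aborting child process; the helper name `stderr_of_aborting_child` is hypothetical. In the reported crash the abort comes from a native autograd thread, so the dump would show the Python-side context rather than the C++ frames.

```python
import subprocess
import sys


def stderr_of_aborting_child() -> str:
    """Run a child that enables faulthandler, then aborts; return its stderr."""
    code = (
        "import faulthandler, os\n"
        "faulthandler.enable(all_threads=True)\n"
        "os.abort()\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return result.stderr


if __name__ == "__main__":
    print(stderr_of_aborting_child())
```

In the reproduction script the equivalent is to call `faulthandler.enable(all_threads=True)` before importing vllm, or to start Python with `-X faulthandler`.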
