pytorch: [BUG] When using multiprocessing backend, import torch_npu._inductor fails on torch/torch_npu 2.9.0

pytorch · 2026-03-26 20:22:10


pytorch/pytorch#178535•Fetched 2026-04-08 01:35:43


Author

KlyzhenkoVadim


Error Message

terminate called after throwing an instance of 'c10::Error'
  what():  pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64, please report a bug to PyTorch. Invalid thread pool!
Exception raised from set_num_threads at /pytorch/aten/src/ATen/ParallelOpenMP.cpp:64 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffb0e9c700 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffb0e3a860 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x48 (0xffffb0e99098 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x13c9548 (0xffffb1ee9548 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x234 (0xffffb600b0f4 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xa8c184 (0xffffbb8fc184 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xda294 (0xffffb0cda294 in /home/k00914150/miniconda3/envs/verl/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7d5b8 (0xffffbd43d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xe5edc (0xffffbd4a5edc in /lib/aarch64-linux-gnu/libc.so.6)

CPU (reporter's environment)

Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: HiSilicon
Model name: Kunpeng-920
Model: 0
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 16 MiB (256 instances)
L1i cache: 16 MiB (256 instances)
L2 cache: 128 MiB (256 instances)
L3 cache: 256 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
NUMA node4 CPU(s): 128-159
NUMA node5 CPU(s): 160-191
NUMA node6 CPU(s): 192-223
NUMA node7 CPU(s): 224-255
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected


🐛 Describe the bug

Environment

<details >
Collecting environment information...
PyTorch version: 2.9.0+cpu
Is debug build: False

OS: Ubuntu 22.04 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.2.3
Libc version: glibc-2.35

Python version: 3.11.15 (main, Mar 11 2026, 17:11:19) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-25-generic-aarch64-with-glibc2.35

Versions of relevant libraries:
[pip3] gpytorch==1.15.2
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torch_npu==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchdata==0.11.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton-ascend==3.2.0
[conda] gpytorch 1.15.2 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] pyzmq 27.1.0 pypi_0 pypi
[conda] torch 2.9.0 pypi_0 pypi
[conda] torch-c-dlpack-ext 0.1.5 pypi_0 pypi
[conda] torch-npu 2.9.0 pypi_0 pypi
[conda] torchaudio 2.9.0 pypi_0 pypi
[conda] torchdata 0.11.0 pypi_0 pypi
[conda] torchvision 0.24.0 pypi_0 pypi
[conda] transformers 4.57.6 pypi_0 pypi
[conda] triton-ascend 3.2.0 pypi_0 pypi
vLLM Version: 0.18.1rc1.dev40+g35141a7ee (git sha: 35141a7ee)
vLLM Ascend Version: 0.17.0rc2.dev144+g55c680073 (git sha: 55c680073)

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/cann-8.5.0
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/cann-8.5.0/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/cann-8.5.0/lib64:/usr/local/Ascend/cann-8.5.0/lib64/plugin/opskernel:/usr/local/Ascend/cann-8.5.0/lib64/plugin/nnengine:/usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/cann-8.5.0/tools/aml/lib64:/usr/local/Ascend/cann-8.5.0/tools/aml/lib64/plugin:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver
ASCEND_AICPU_PATH=/usr/local/Ascend/cann-8.5.0
VLLM_LOGGING_LEVEL=DEBUG
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/cann-8.5.0
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 25.2.3.1                 Version: 25.2.3.1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 105.3       45                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3430 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 93.7        44                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3409 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 95.8        43                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3405 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 96.8        44                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3404 / 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 93.4        49                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          3406 / 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 94.3        49                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          3404 / 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 98.7        50                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3404 / 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 95.1        46                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          3404 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.5.0
innerversion=V100R001C25SPC001B232
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=aarch64
os=linux
path=/usr/local/Ascend/cann-8.5.0

</details>
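
When comparing this environment against another machine, it can help to collect the relevant package versions programmatically. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the distribution names `vllm` and `vllm-ascend` are assumptions (they do not appear in the pip listing above); the others are taken from that listing.

```python
from importlib import metadata

# "torch", "torch_npu" and "triton-ascend" come from the pip listing above;
# "vllm" and "vllm-ascend" are assumed distribution names.
PACKAGES = ("torch", "torch_npu", "triton-ascend", "vllm", "vllm-ascend")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def same_release(a, b):
    """True when two version strings match after dropping local tags such as '+cpu'."""
    if a is None or b is None:
        return False
    return a.split("+")[0] == b.split("+")[0]
```

For the environment in this report, `same_release("2.9.0+cpu", "2.9.0")` is true, so torch and torch_npu are on the same release.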

Description:

vLLM Version: 0.18.1rc1.dev40+g35141a7ee (git sha: 35141a7ee) vLLM Ascend Version: 0.17.0rc2.dev144+g55c680073 (git sha: 55c680073)

I tried to run a simple example with tensor_parallel_size > 1:

from vllm import LLM, SamplingParams
prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=4)
llm = LLM(
    model="/data/Qwen3-8B",
    tensor_parallel_size=2,
    enforce_eager=True,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

With tensor_parallel_size=1, everything works.

With tensor_parallel_size > 1, the failure occurs inside NPUWorker._init_device(): within torch_npu._triton, the call to `register_fa_pass()` fails, and torch reports this error:

terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
  what():  pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64, please report a bug to PyTorch. Invalid thread pool!
Exception raised from set_num_threads at /pytorch/aten/src/ATen/ParallelOpenMP.cpp:64 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffb0e9c700 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffb0e3a860 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x48 (0xffffb0e99098 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x13c9548 (0xffffb1ee9548 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x234 (0xffffb600b0f4 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xa8c184 (0xffffbb8fc184 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xda294 (0xffffb0cda294 in /home/k00914150/miniconda3/envs/verl/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7d5b8 (0xffffbd43d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xe5edc (0xffffbd4a5edc in /lib/aarch64-linux-gnu/libc.so.6)

  what():  pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64, please report a bug to PyTorch. Invalid thread pool!
Exception raised from set_num_threads at /pytorch/aten/src/ATen/ParallelOpenMP.cpp:64 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffb0e9c700 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffb0e3a860 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x48 (0xffffb0e99098 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x13c9548 (0xffffb1ee9548 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x234 (0xffffb600b0f4 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xa8c184 (0xffffbb8fc184 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xda294 (0xffffb0cda294 in /home/k00914150/miniconda3/envs/verl/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7d5b8 (0xffffbd43d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xe5edc (0xffffbd4a5edc in /lib/aarch64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error'
  what():  pool INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/ParallelOpenMP.cpp":64, please report a bug to PyTorch. Invalid thread pool!
Exception raised from set_num_threads at /pytorch/aten/src/ATen/ParallelOpenMP.cpp:64 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffb0e9c700 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffb0e3a860 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x48 (0xffffb0e99098 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x13c9548 (0xffffb1ee9548 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x234 (0xffffb600b0f4 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xa8c184 (0xffffbb8fc184 in /home/k00914150/miniconda3/envs/verl/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xda294 (0xffffb0cda294 in /home/k00914150/miniconda3/envs/verl/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7d5b8 (0xffffbd43d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0xe5edc (0xffffbd4a5edc in /lib/aarch64-linux-gnu/libc.so.6)
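
For triage, the most useful parts of a trace like the one above are the raise site and the few frames that carry symbol names. The following is a minimal sketch, assuming the trace has been saved as text in the format shown; `summarize_c10_trace` is a hypothetical helper, not part of PyTorch.

```python
import re

# Matches: Exception raised from <function> at <file>:<line>
RAISED_RE = re.compile(r"Exception raised from (\S+) at (\S+):(\d+)")
# Matches: frame #<n>: <symbol> + 0x<offset> (0x<address> in <library>)
FRAME_RE = re.compile(r"frame #(\d+): (.+?) \+ 0x[0-9a-f]+ \(0x[0-9a-f]+ in (.+?)\)")


def summarize_c10_trace(text):
    """Return the raise site and the named frames of a c10::Error trace."""
    raised = RAISED_RE.search(text)
    site = None
    if raised:
        site = {
            "function": raised.group(1),
            "file": raised.group(2),
            "line": int(raised.group(3)),
        }
    frames = []
    seen = set()
    for match in FRAME_RE.finditer(text):
        index, symbol, library = int(match.group(1)), match.group(2), match.group(3)
        # Repeated copies of the trace repeat frames; keep one copy of each,
        # and skip frames without symbol information.
        if symbol == "<unknown function>" or (index, symbol) in seen:
            continue
        seen.add((index, symbol))
        # Keep the qualified name without its argument list, and the library file name.
        frames.append((index, symbol.split("(")[0], library.rsplit("/", 1)[-1]))
    return {"site": site, "frames": frames}
```

Applied to the trace in this report, the raise site is `set_num_threads` at `ParallelOpenMP.cpp:64`, and the highest named frame is `torch::autograd::Engine::thread_init` in `libtorch_cpu.so`.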

Versions

cc @VitalyFedyunin @albanD @pragupta @ppwwyyxx @ezyang @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93

extent analysis

Fix Plan

The crash appears only when tensor_parallel_size is greater than 1. In that configuration an internal assertion ("Invalid thread pool!") fires in set_num_threads at ParallelOpenMP.cpp:64, reached from torch::autograd::Engine::thread_init, and the process terminates with an uncaught c10::Error.

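No fix is confirmed for this issue. The following workarounds are worth trying:

Set tensor_parallel_size to 1. The reporter confirms this configuration works, but it disables tensor parallelism, so it is a workaround rather than a fix: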

llm = LLM(
    model="/data/Qwen3-8B",
    tensor_parallel_size=1,
    enforce_eager=True,
)

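Alternatively, try setting the OMP_NUM_THREADS environment variable explicitly to a positive integer before torch is imported. The value does not need to be a power of 2, and whether this avoids the assertion is untested:

import os
# Must run before `import torch` (or `from vllm import ...`) so the OpenMP runtime sees the value.
os.environ['OMP_NUM_THREADS'] = '2'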

If neither step helps, try updating torch and torch_npu to the latest releases, keeping the two on the same version as in this report (both 2.9.0).
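
Environment-based workarounds only take effect if the variables are set before torch and vLLM are imported. The sketch below collects two candidate settings in one place. `candidate_env` and `apply_before_imports` are hypothetical helpers, and `VLLM_WORKER_MULTIPROC_METHOD` is a vLLM environment variable that the original report does not mention; treating it as relevant is an assumption based on the issue title, and neither setting is confirmed to avoid this crash.

```python
import os


def candidate_env(base=None, omp_threads=1, start_method="spawn"):
    """Return a copy of `base` with candidate workaround settings applied.

    Both settings are things to try, not confirmed fixes for this issue.
    """
    if omp_threads < 1:
        raise ValueError("omp_threads must be a positive integer")
    if start_method not in ("spawn", "fork"):
        raise ValueError("start_method must be 'spawn' or 'fork'")
    env = dict(os.environ if base is None else base)
    # OpenMP accepts any positive integer here.
    env["OMP_NUM_THREADS"] = str(omp_threads)
    # Assumption: vLLM reads this variable to decide how worker processes start.
    env["VLLM_WORKER_MULTIPROC_METHOD"] = start_method
    return env


def apply_before_imports(env):
    """Export the settings; call this before importing torch or vllm."""
    os.environ.update(env)
```

In a reproduction script, `apply_before_imports(candidate_env())` would go at the very top, above `from vllm import LLM, SamplingParams`.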

Verification

To verify a workaround, rerun the reproduction script with the changed parameter or environment variable. The run passes if generation completes, the prompts and generated text are printed, and the output contains no `Invalid thread pool!` assertion.
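
The check can be automated by running the reproduction in a child process and looking for the assertion text in its error output. This is a minimal sketch using only the standard library; the helper names and the script name `repro.py` in the usage note are hypothetical.

```python
import subprocess
import sys

# Text printed by the failing assertion in the trace from this report.
MARKER = "Invalid thread pool!"


def classify_run(returncode, stderr):
    """Classify one run of the reproduction script."""
    if MARKER in stderr:
        return "invalid_thread_pool"
    if returncode != 0:
        return "other_failure"
    return "ok"


def run_and_classify(argv, env=None, timeout=None):
    """Run `argv` as a child process and classify its outcome."""
    proc = subprocess.run(argv, env=env, capture_output=True, text=True, timeout=timeout)
    return classify_run(proc.returncode, proc.stderr)
```

With the reproduction saved as `repro.py`, `run_and_classify([sys.executable, "repro.py"])` returns `"ok"` only when the script exits cleanly without printing the assertion.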

Extra Tips

Check the vLLM documentation for the tensor_parallel_size parameter to understand its expected behavior and limitations.
On a many-core host such as the 256-CPU machine in this report, set OMP_NUM_THREADS explicitly rather than relying on the default.
If the issue persists, use gdb for the native frames or pdb for the Python frames to find which call triggers the error.
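
Before reaching for gdb, one inexpensive step is to run the suspect import alone in a fresh interpreter with Python's `faulthandler` enabled, which prints the Python stack when the process aborts. This is a minimal sketch with hypothetical helper names; whether the import fails outside vLLM's worker processes is not established by the report.

```python
import subprocess
import sys


def run_isolated(code, timeout=None):
    """Run `code` in a fresh interpreter with faulthandler enabled.

    `-X faulthandler` makes CPython dump the Python stack of every thread
    when the process receives a fatal signal such as SIGABRT.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def import_isolated(module_name, timeout=None):
    """Import one module in a fresh interpreter and return the finished process."""
    return run_isolated(f"import {module_name}", timeout=timeout)
```

For this issue, `import_isolated("torch_npu._inductor")` would show whether the import named in the title fails on its own; if it succeeds there but fails under tensor_parallel_size > 1, the difference lies in how the worker processes are started.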
