Ascend PyTorch pitfalls.
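This note covers installing PyTorch and torch_npu on an Ascend 910B. Because vLLM will be installed afterwards, a special build of torch_npu is used (the 2.5.1.dev20250320 wheel below).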
pip install torch==2.5.1 --extra-index /
pip install numpy==1.26.4
mkdir pta
cd pta
wget .5.1/20250320.3/pytorch_v2.5.1_py310.tar.gz
tar -xvf pytorch_v2.5.1_py310.tar.gz
pip install ./torch_npu-2.5.1.dev20250320-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
After the installation finishes, run a small test script, example_npu.py, with the following contents:
import torch
import torch_npu
x = torch.randn(2, 2).npu()
y = torch.randn(2, 2).npu()
z = x.mm(y)
print(z)
However, running python example_npu.py fails with the following error:
/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch_npu/utils/path_manager.py:82: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch_npu/utils/path_manager.py:82: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2/x86_64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
[W NPUCachingAllocator.cpp:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
Traceback (most recent call last):
File "./pta/example_npu.py", line 4, in <module>
x = torch.randn(2, 2).npu()
File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch/utils/backend_registration.py", line 153, in wrap_tensor_to
device_idx = _normalization_device(custom_backend_name, device)
File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch/utils/backend_registration.py", line 109, in _normalization_device
return _get_current_device_index()
File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch/utils/backend_registration.py", line 103, in _get_current_device_index
return getattr(getattr(torch, custom_backend_name), _get_device_index)()
File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch_npu/npu/utils.py", line 59, in current_device
torch_npu.npu._lazy_init()
File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch_npu/npu/__init__.py", line 214, in _lazy_init
torch_npu._C._npu_init()
RuntimeError: Initialize:torch_npu/csrc/core/npu/sys_ctrl/npu_sys_ctrl.cpp:217 NPU function error: at_npu::native::AclSetCompileopt(aclCompileOpt::ACL_PRECISION_MODE, precision_mode), error code is 500001
[ERROR] 2025-04-23-11:06:03 (PID:4586, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[Error]: The internal ACL of the system is incorrect.
Rectify the fault based on the error information in the ascend log.
EC0010: 2025-04-23-11:06:03.331.980 Failed to import Python module [ModuleNotFoundError: No module named 'scipy'.].
Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
TraceBack (most recent call last):
AOE Failed to call InitCannKB
[GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeInner][FILE:tbe_op_store_adapter][LINE:1719]
[SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager][LINE:79]
[SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager][LINE:120]
[FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager][LINE:117]
PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager][LINE:82]
OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib][LINE:234]
GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib][LINE:162]
GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api][LINE:334]
[Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
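At first glance the output is overwhelming and gives no obvious hint about what went wrong. Read carefully, though, and one line names the actual cause: "Failed to import Python module [ModuleNotFoundError: No module named 'scipy'.]". The scipy package is missing from the current Python environment, so the fix is to install it with pip.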
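When the log is this long, the name of the missing module can also be pulled out mechanically instead of by eye. The following is a minimal sketch, assuming the error text has been saved to a string or file; the function name find_missing_modules is hypothetical, and it relies only on the standard "No module named '...'" wording that Python itself produces.

```python
import re

# Matches the Python-side message embedded in the error output,
# e.g. "ModuleNotFoundError: No module named 'scipy'."
MISSING_MODULE_RE = re.compile(r"No module named '([^']+)'")


def find_missing_modules(log_text):
    """Return the distinct module names reported as missing, in order of appearance."""
    seen = []
    for name in MISSING_MODULE_RE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    # The line below is copied from the error output shown above.
    sample = (
        "EC0010: 2025-04-23-11:06:03.331.980 Failed to import Python module "
        "[ModuleNotFoundError: No module named 'scipy'.]."
    )
    print(find_missing_modules(sample))
```

To use it on a real run, redirect the script's stderr to a file (for example python example_npu.py 2> err.log) and pass the file contents to find_missing_modules.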
pip install scipy
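Running the script again produced a similar error about another missing dependency. After a few rounds of installing the reported module and rerunning, the script finally ran normally: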
Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[ 1.0745, 1.2646],
[-1.1924, -2.2859]], device='npu:0')
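The install-and-rerun loop described above can be shortened by checking for the required modules before starting the NPU script. The sketch below is one possible way to do that, not part of torch_npu or CANN: the helper name missing_modules is hypothetical, and the candidate list contains only the two packages this post actually mentions (numpy and scipy). Extend it with whatever module names the log reports in your environment.

```python
import importlib.util


def missing_modules(candidates):
    """Return the names in `candidates` that cannot be found in this environment.

    Pass top-level import names only (e.g. "scipy", not "scipy.linalg").
    """
    return [name for name in candidates if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: this list is only a starting point taken from this post;
    # add the modules that the error log names on your machine.
    candidates = ["numpy", "scipy"]

    missing = missing_modules(candidates)
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All listed modules are importable.")
```

Note that the import name and the pip package name are not always identical, so the suggested pip command may need adjusting for some packages.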