Preface

This guide deploys a GPU compute server for private AI model training, using Ubuntu as the example system.

Name           Version            Architecture
Ubuntu         22.04              x86_64
NVIDIA driver  570.124.06         x86_64
CUDA           11.8.0_520.61.05   x86_64
cuDNN          9.8.0.87           x86_64

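⚠️ Note: Before configuring anything, check that these versions are compatible with one another; mismatched versions cause a variety of errors when the training environment is deployed.

Downloads: NVIDIA GPU driver downloads · CUDA Toolkit version download list · cuDNN library version download list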
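
As a rough illustration of the compatibility check recommended above, the sketch below compares a driver version against the minimum that a CUDA release needs. The `version_ge` helper and the `MIN_DRIVER_FOR_CUDA_11_8` variable are hypothetical names. The 520.61.05 threshold is an assumption taken from the CUDA 11.8 runfile name used later in this guide, so confirm the real minimum in NVIDIA's release notes.

```shell
#!/bin/sh
# Assumed minimum Linux driver for CUDA 11.8 (taken from the runfile name
# cuda_11.8.0_520.61.05_linux.run; verify against NVIDIA's release notes).
MIN_DRIVER_FOR_CUDA_11_8="520.61.05"

# version_ge A B: succeed when dotted version A >= B (hypothetical helper).
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the driver version listed in the table above.
if version_ge "570.124.06" "$MIN_DRIVER_FOR_CUDA_11_8"; then
    echo "driver 570.124.06 is new enough for CUDA 11.8"
else
    echo "driver is too old for CUDA 11.8"
fi
```

This only covers the driver and toolkit pair; the cuDNN build still has to match the CUDA major version separately.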

Preparing Ubuntu for the NVIDIA GPU driver

2.1 Install the basic system dependencies

Terminal window
koevn@localhost:~$ sudo apt install -y build-essential dracut-core linux-headers-$(uname -r)

2.2 Check that Linux detects the NVIDIA GPU

Terminal window
koevn@localhost:~$ sudo lspci | grep -i nvidia
03:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
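
For unattended provisioning, the same check can be scripted. This is a minimal sketch, with `count_nvidia_devices` as a hypothetical helper name. It only counts `lspci` lines that mention NVIDIA and does not distinguish GPU models.

```shell
#!/bin/sh
# count_nvidia_devices: read `lspci` output on stdin and print how many
# lines mention NVIDIA (hypothetical helper; prints 0 when none match).
count_nvidia_devices() {
    grep -ci 'nvidia' || true
}

# Only query the real bus when lspci is available on this machine.
if command -v lspci >/dev/null 2>&1; then
    found=$(lspci | count_nvidia_devices)
    echo "NVIDIA devices detected: $found"
fi
```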

2.3 Check whether the nouveau driver is disabled

Terminal window
koevn@localhost:~$ sudo lsmod | grep nouveau
nouveau 2306048 0
mxm_wmi 16384 1 nouveau
i2c_algo_bit 16384 1 nouveau
drm_ttm_helper 16384 1 nouveau
ttm 86016 3 vmwgfx,drm_ttm_helper,nouveau
drm_kms_helper 311296 2 vmwgfx,nouveau
video 65536 1 nouveau
wmi 32768 2 mxm_wmi,nouveau
drm 622592 7 vmwgfx,drm_kms_helper,drm_ttm_helper,ttm,nouveau

If output like the above appears, nouveau is currently loaded. Disable it as follows:

Terminal window
koevn@localhost:~$ sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null << EOF
blacklist nouveau
options nouveau modeset=0
EOF
koevn@localhost:~$ sudo dracut --force
koevn@localhost:~$ sudo reboot

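Why disable nouveau? The driver installed here is NVIDIA's official proprietary driver, while nouveau is the open-source driver that Linux loads by default. If nouveau stays enabled, the two drivers conflict and cause hard-to-diagnose problems.

After the reboot, run sudo lsmod | grep nouveau again. No output means nouveau is disabled.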
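
The post-reboot check can also be written so that a script can act on the result. In this sketch, `nouveau_loaded` is a hypothetical helper. It matches only the first column of `lsmod`, so a module that merely lists nouveau as a dependent does not count as a hit.

```shell
#!/bin/sh
# nouveau_loaded: read `lsmod` output on stdin and succeed only if a module
# named exactly "nouveau" appears in the first column (hypothetical helper).
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Only inspect the live module list when lsmod exists on this machine.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_loaded; then
        echo "nouveau is still loaded; re-check the blacklist file and reboot"
    else
        echo "nouveau is not loaded"
    fi
fi
```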

Installing the NVIDIA driver

Upload the downloaded NVIDIA driver package to the Linux host, then run the installer:

Terminal window
koevn@localhost:~$ cd /tmp
koevn@localhost:/tmp$ sudo chmod +x NVIDIA-Linux-x86_64-570.124.06.run
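koevn@localhost:/tmp$ sudo ./NVIDIA-Linux-x86_64-570.124.06.run --no-opengl-files --no-nouveau-check
  • --no-opengl-files: do not install the OpenGL libraries shipped with the NVIDIA driver, since this system has no graphical desktop;
  • --no-nouveau-check: skip the installer's nouveau check.

Verify that the NVIDIA driver installed successfully
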
Terminal window
koevn@localhost:~$ sudo nvidia-smi
Tue Apr 8 16:12:06 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:03:00.0 Off | 0 |
| N/A 50C P0 25W / 70W | 1MiB / 15360MiB | 9% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
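
The table above is meant for people. For scripts, `nvidia-smi` can also emit CSV through its `--query-gpu` and `--format` options. The sketch below assumes the field order `name,driver_version,memory.total`, and `driver_from_csv` is a hypothetical helper name.

```shell
#!/bin/sh
# driver_from_csv: read "name, driver_version, memory.total" CSV lines on
# stdin and print the driver version of each GPU (hypothetical helper).
driver_from_csv() {
    awk -F', ' '{ print $2 }'
}

# Only query the GPU when the driver tools are actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version,memory.total \
               --format=csv,noheader | driver_from_csv
fi
```

On the machine shown above, this should print the same driver version that appears in the table header.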

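Installing CUDA

From the CUDA Toolkit version download list, select your OS version and architecture and choose the runfile (local) installer type. Then get the package onto the Linux host and install it.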

Terminal window
koevn@localhost:/tmp$ wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
koevn@localhost:/tmp$ sudo chmod +x cuda_11.8.0_520.61.05_linux.run
koevn@localhost:/tmp$ sudo ./cuda_11.8.0_520.61.05_linux.run --no-opengl-libs --toolkit

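Installation process

⚠️ Note: Because the NVIDIA GPU driver is already installed, press the space bar at this step to deselect the driver, then choose Install.

When the installation finishes, configure the system environment variables as prompted: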

Terminal window
koevn@localhost:~$ sudo tee /etc/profile.d/cuda.sh > /dev/null << 'EOF'
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH
EOF
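
One detail matters when a profile script is written through a heredoc: whether the delimiter is quoted decides when `$PATH` is expanded. The sketch below writes both variants into a temporary directory so the difference can be inspected safely. The file names and the `demo_dir` variable are illustrative only.

```shell
#!/bin/sh
demo_dir=$(mktemp -d)

# Unquoted delimiter: the shell expands $PATH while writing the file, so the
# current value of PATH is frozen into the script.
cat > "$demo_dir/expanded.sh" << EOF
export PATH=/usr/local/cuda-11.8/bin:$PATH
EOF

# Quoted delimiter: the text is written verbatim, so $PATH is resolved each
# time a login shell sources the file.
cat > "$demo_dir/literal.sh" << 'EOF'
export PATH=/usr/local/cuda-11.8/bin:$PATH
EOF

echo "expanded: $(cat "$demo_dir/expanded.sh")"
echo "literal:  $(cat "$demo_dir/literal.sh")"
# Remove "$demo_dir" when done inspecting the two files.
```

For a file under /etc/profile.d, the verbatim form is usually the one you want, because every user's own PATH is then extended at login.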

Verify that CUDA installed successfully

Terminal window
koevn@localhost:~$ source /etc/profile.d/cuda.sh
koevn@localhost:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
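
To check the toolkit version from a script rather than by eye, the release number can be extracted from the `nvcc -V` output. This is a minimal sketch, with `cuda_release` as a hypothetical helper name. It assumes the "release X.Y" wording shown in the output above.

```shell
#!/bin/sh
# cuda_release: read `nvcc -V` output on stdin and print the release number,
# for example 11.8 (hypothetical helper).
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Only call nvcc when it is on PATH, i.e. after cuda.sh has been sourced.
if command -v nvcc >/dev/null 2>&1; then
    echo "CUDA toolkit release: $(nvcc -V | cuda_release)"
fi
```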

Adding cuDNN

Download the cuDNN version that matches your CUDA version, upload it to the Linux host, and run the following:

Terminal window
koevn@localhost:/tmp$ tar -xvf cudnn-linux-x86_64-9.8.0.87_cuda11-archive.tar.xz
koevn@localhost:/tmp$ mv cudnn-linux-x86_64-9.8.0.87_cuda11-archive cudnn
koevn@localhost:/tmp$ cd cudnn
koevn@localhost:/tmp/cudnn$ sudo cp -P lib/* /usr/local/cuda-11.8/lib64/
koevn@localhost:/tmp/cudnn$ sudo cp include/* /usr/local/cuda-11.8/include/
koevn@localhost:/tmp/cudnn$ sudo chmod a+r /usr/local/cuda-11.8/lib64/*
koevn@localhost:/tmp/cudnn$ sudo chmod a+r /usr/local/cuda-11.8/include/*

Verify the cuDNN version

Terminal window
koevn@localhost:/tmp/cudnn$ sudo cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 9
#define CUDNN_MINOR 8
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
/* cannot use constexpr here since this is a C-only file */
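
The three defines can be combined into a single version string for use in scripts. In this sketch, `cudnn_version` is a hypothetical helper, and the header path is the one used in the command above.

```shell
#!/bin/sh
# cudnn_version FILE: print MAJOR.MINOR.PATCHLEVEL taken from the #define
# lines of a cudnn_version.h file (hypothetical helper).
cudnn_version() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$1"
}

# Only read the header when it exists on this machine.
header=/usr/local/cuda/include/cudnn_version.h
if [ -r "$header" ]; then
    echo "cuDNN version: $(cudnn_version "$header")"
fi
```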

That completes the setup!