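The NVIDIA GPU driver was already installed on this Ubuntu Server, and nvidia-smi reported a healthy status. After I then installed CUDA, I ran nvidia-smi again to check, and got this error instead: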

root@localhost:~# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

My first guess was that the system no longer recognized the GPU, so I checked the PCI devices:

root@localhost:~# lspci | grep -i nvidia
0b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

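The GPU is still visible on the PCI bus, so the problem lies with the driver rather than the hardware. The fix I used was to rebuild and install the NVIDIA kernel modules with dkms.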
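
Before rebuilding anything, one quick way to confirm this diagnosis is to check whether the nvidia kernel module is loaded at all. The sketch below is not part of the original steps; the function name nvidia_module_loaded is made up for illustration, and it only assumes the standard /proc/modules format, where each line starts with the module name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a module named "nvidia" is listed
# in a /proc/modules-style file, non-zero otherwise.
# The file path is a parameter only so the check can be tried on sample data.
nvidia_module_loaded() {
    local modules_file="${1:-/proc/modules}"
    # Each line of /proc/modules begins with the module name followed by a space,
    # so '^nvidia ' matches the core module but not nvidia_uvm or nvidia_drm.
    grep -q '^nvidia ' "$modules_file"
}

# Typical use on the affected machine:
#   if nvidia_module_loaded; then echo "nvidia module loaded"; else echo "nvidia module NOT loaded"; fi
```

If the module is not loaded while lspci still shows the card, that is consistent with the kernel module being missing or not built for the running kernel, which is the case dkms addresses.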

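DKMS (Dynamic Kernel Module Support) is a framework for building Linux kernel modules whose source code generally lives outside the Linux kernel source tree. Drivers managed by DKMS are rebuilt automatically when a new kernel is installed. DKMS works in two directions: it can recompile all registered modules automatically when a new kernel version is installed, or it can install a new module (driver) on the existing system, with no manual compilation or precompiled packages required.

(Quoted from Wikipedia)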

Install dkms

root@localhost:~# apt-get install dkms

Find the NVIDIA driver version from its source directory under /usr/src

root@localhost:~# ls /usr/src | grep nvidia
nvidia-550.25.65
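
The version string that dkms needs is simply the part of that directory name after "nvidia-". As a small sketch (the function name nvidia_src_versions is hypothetical, and it assumes the driver source directories follow the nvidia-<version> naming seen above), the version can be extracted instead of retyped:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version part of every nvidia-<version>
# directory under a source root (default: /usr/src), one per line.
nvidia_src_versions() {
    local src_root="${1:-/usr/src}"
    local dir
    for dir in "$src_root"/nvidia-*/; do
        [ -d "$dir" ] || continue   # skip the unexpanded pattern when nothing matches
        dir="${dir%/}"              # drop the trailing slash
        dir="${dir##*/}"            # keep only the directory name
        printf '%s\n' "${dir#nvidia-}"
    done
}

# Possible use (run as root, after checking the printed version by eye):
#   version="$(nvidia_src_versions | sort -V | tail -n 1)"
#   dkms install -m nvidia -v "$version"
```

If several driver source directories are present, sort -V picks the highest version number, which may or may not be the one you want, so it is worth looking at the list before passing it to dkms.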

Build and install the NVIDIA driver modules with dkms

root@localhost:~# dkms install -m nvidia -v 550.25.65
/bin/bash: /usr/local/anaconda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Creating symlink /var/lib/dkms/nvidia/550.25.65/source -> /usr/src/nvidia-550.25.65
Kernel preparation unnecessary for this kernel. Skipping...
Building module:
cleaning build area...
'make' -j8 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.15.0-131-generic modules.....................
cleaning build area...
nvidia.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.15.0-131-generic/updates/dkms/
nvidia-uvm.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.15.0-131-generic/updates/dkms/
nvidia-modeset.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.15.0-131-generic/updates/dkms/
nvidia-drm.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.15.0-131-generic/updates/dkms/
nvidia-peermem.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.15.0-131-generic/updates/dkms/
depmod....
root@localhost:~#
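
The log above says the modules were installed to /lib/modules/5.15.0-131-generic/updates/dkms/. A quick way to double-check that they exist for the kernel that is actually running is sketched below. The function name list_dkms_nvidia_modules is made up for illustration; the nvidia*.ko* pattern is an assumption meant to also match compressed module files (for example .ko.zst) on systems that compress kernel modules.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the NVIDIA module files found in a kernel's
# updates/dkms directory. Both arguments are optional; they exist so the
# function can be pointed at a test tree instead of the real /lib/modules.
list_dkms_nvidia_modules() {
    local kernel="${1:-$(uname -r)}"
    local modules_root="${2:-/lib/modules}"
    local f
    for f in "$modules_root/$kernel/updates/dkms"/nvidia*.ko*; do
        [ -e "$f" ] || continue     # nothing matched, or the directory is missing
        printf '%s\n' "${f##*/}"    # print the file name without its path
    done
}

# Possible use on the machine from this post:
#   list_dkms_nvidia_modules            # modules for the running kernel
#   list_dkms_nvidia_modules 5.15.0-131-generic
```

An empty result for the running kernel would suggest the build targeted a different kernel version than the one currently booted.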

Check the NVIDIA driver status again

root@localhost:~# nvidia-smi
Thu Feb 20 15:11:42 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.25.65              Driver Version: 550.25.65      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:0B:00.0 Off |                    0 |
| N/A   56C    P0             26W /   70W |       1MiB /  15360MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Everything is back to normal. Perfect!