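The Ubuntu Server already had the NVIDIA GPU driver installed, and nvidia-smi reported a normal status. After I then installed the CUDA driver and ran nvidia-smi again to check, this message appeared:
root@localhost:~# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
At first I assumed the system was no longer detecting the GPU, so I checked the PCI information:
root@localhost:~# lspci | grep -i nvidia
0b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
The GPU device is still there, so the problem lies with the driver. The fix is to build and install the NVIDIA driver with dkms.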
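Before rebuilding, it can help to confirm that the kernel module really is missing from the running kernel. The following is a minimal sketch, not part of the original steps: the function name nvidia_module_loaded is hypothetical, and it assumes the usual /proc/modules format in which each line starts with the module name followed by a space.

```shell
# Hypothetical helper: succeed if the core "nvidia" kernel module is loaded.
# $1: module listing to read (defaults to /proc/modules; overridable for testing)
nvidia_module_loaded() {
    # Match the module named exactly "nvidia", not nvidia_uvm or nvidia_drm.
    grep -q '^nvidia ' "${1:-/proc/modules}" 2>/dev/null
}

if nvidia_module_loaded; then
    echo "nvidia module is loaded"
else
    echo "nvidia module is not loaded for kernel $(uname -r)"
fi
```

If the module is reported as not loaded while lspci still lists the card, that is consistent with the situation described here, where the device is present but the driver module is unavailable.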
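DKMS (Dynamic Kernel Module Support) is a framework for building Linux kernel modules whose source code generally lives outside the Linux kernel source tree. When a new kernel is installed, the kernel device drivers managed by DKMS are rebuilt automatically. DKMS works in two directions: it automatically compiles all registered modules when a new kernel version is installed, and it installs new modules (drivers) on the existing system without any manual compilation or precompiled packages.
(From Wikipedia)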
Install dkms:
root@localhost:~# apt-get install dkms
Check the NVIDIA driver version:
root@localhost:~# ls /usr/src | grep nvidia
nvidia-550.25.65
Build and install the NVIDIA driver module with dkms, using that version:
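The version passed to -v can also be read from the source directory name by a script instead of being typed by hand. This is a minimal sketch under stated assumptions: the function name nvidia_src_version is hypothetical, the driver sources are assumed to sit in a directory named nvidia-<version> as shown above, and sort -V assumes GNU coreutils. It only prints the dkms command rather than running it.

```shell
# Hypothetical helper: print the newest NVIDIA driver version found under a
# source root, based on directories named nvidia-<version>.
# $1: source root (defaults to /usr/src; overridable for testing)
nvidia_src_version() {
    src_root="${1:-/usr/src}"
    for dir in "$src_root"/nvidia-[0-9]*; do
        [ -d "$dir" ] || continue      # skip the unexpanded glob when nothing matches
        name="${dir##*/}"              # strip the leading path
        echo "${name#nvidia-}"         # strip the "nvidia-" prefix
    done | sort -V | tail -n 1         # keep the highest version (GNU sort -V)
}

version="$(nvidia_src_version)"
if [ -n "$version" ]; then
    # Printed rather than executed, since the real command needs root.
    echo "dkms install -m nvidia -v $version"
else
    echo "no nvidia-<version> directory found under /usr/src"
fi
```

On the machine in this post the printed command would match the one run by hand in the next step.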
root@localhost:~# dkms install -m nvidia -v 550.25.65
/bin/bash: /usr/local/anaconda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Creating symlink /var/lib/dkms/nvidia/550.25.65/source -> /usr/src/nvidia-550.25.65

Kernel preparation unnecessary for this kernel. Skipping...

Building module:
cleaning build area...
'make' -j8 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.15.0-131-generic modules.....................
cleaning build area...

nvidia.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-131-generic/updates/dkms/

nvidia-uvm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-131-generic/updates/dkms/

nvidia-modeset.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-131-generic/updates/dkms/

nvidia-drm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-131-generic/updates/dkms/

nvidia-peermem.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-131-generic/updates/dkms/

depmod....
root@localhost:~#
Check the NVIDIA driver information:
root@localhost:~# nvidia-smi
Thu Feb 20 15:11:42 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.25.65              Driver Version: 550.25.65      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:0B:00.0 Off |                    0 |
| N/A   56C    P0             26W /   70W |       1MiB /  15360MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
The output is normal again. Perfect!
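Besides nvidia-smi, the result can be checked on disk: the dkms output above shows the modules being installed to /lib/modules/5.15.0-131-generic/updates/dkms/. The sketch below looks for nvidia.ko in that location for a given kernel. The function name nvidia_dkms_module_present is hypothetical, and the trailing wildcard is an assumption to also cover compressed module files on systems that use them.

```shell
# Hypothetical helper: succeed if a DKMS-installed nvidia module exists for a kernel.
# $1: kernel release (defaults to the running kernel)
# $2: modules root (defaults to /lib/modules; overridable for testing)
nvidia_dkms_module_present() {
    kernel="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    # Matches nvidia.ko and, by assumption, compressed variants like nvidia.ko.zst.
    for f in "$root/$kernel/updates/dkms"/nvidia.ko*; do
        [ -e "$f" ] && return 0
    done
    return 1
}

if nvidia_dkms_module_present; then
    echo "nvidia module is installed for $(uname -r)"
else
    echo "no DKMS nvidia module found for $(uname -r)"
fi
```

Running this check after a kernel upgrade shows whether the module was rebuilt for the new kernel or whether dkms install needs to be run again.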
nvidia-smi reports that it cannot communicate with the NVIDIA driver
https://huoshen.pages.dev/cn/p/dc535881/