
NVIDIA GPU driver not found after rebooting Ubuntu

Recently one machine (environment: Ubuntu + NVIDIA 384.130) could no longer find its GPU driver after a reboot:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

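The usual cause is that Ubuntu's kernel has been updated while the GPU driver module was built against the older kernel, so the module no longer matches the kernel that boots. The fix I used in the past was to change the kernel version Ubuntu boots by default: find the kernel that was in use before (the list of installed kernels showed several, and I no longer remembered which one the driver had been installed under), edit the grub boot configuration, then remove the unused kernels and block kernel updates (I remember having done that last step before, yet the kernel was updated again this time?).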
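
Before changing anything, it may help to confirm that this mismatch really is the cause. The following is a minimal sketch, assuming dkms is already installed and that dkms status prints one line per module and kernel pair ending in ": installed" (the exact line format differs between DKMS versions). The function name module_built_for is made up for illustration.

```shell
# Hypothetical helper: reads `dkms status` output on stdin and succeeds only
# if a line starting with the module name mentions the given kernel release
# and is marked as installed.
module_built_for() {
  awk -v mod="$1" -v kern="$2" '
    index($0, mod) == 1 && index($0, kern) && /: installed/ { found = 1 }
    END { exit !found }   # exit status 0 when a matching line was seen
  '
}
```

One way to use it:

    dkms status | module_built_for nvidia "$(uname -r)" && echo "module present" || echo "module missing for this kernel"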

Since that approach is too complicated, this time I used a different one: rebuild the GPU driver module against the new kernel.

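  1. Install DKMS
    DKMS stands for Dynamic Kernel Module Support. It maintains driver modules that live outside the kernel tree and can rebuild them automatically after the kernel version changes.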

    sudo apt-get install dkms
  2. Check the installed NVIDIA GPU driver version (look for a directory named like nvidia-384.130)

    ls /usr/src
  3. Rebuild the driver module

    sudo dkms install -m nvidia -v 384.130
  4. Verify

    nvidia-smi
  5. Block kernel updates again

    uname -r  # show the running kernel release, e.g. 4.15.0-88-generic
    sudo apt-mark hold linux-image-4.15.0-88-generic
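
A possible explanation for the kernel updating despite an earlier hold: apt-mark hold pins only the named package, while a newer kernel arrives as a differently named package, typically pulled in by a metapackage such as linux-image-generic. The sketch below only prints candidate package names to hold. The function name and the metapackage naming pattern are assumptions (HWE installs, for instance, use other metapackage names), so check what is actually installed with dpkg -l 'linux-*' first.

```shell
# Hypothetical helper: print package names worth holding for a kernel release.
# The metapackage names (linux-image-<flavour>, linux-headers-<flavour>) are
# an assumption and may not match every installation.
kernel_hold_candidates() {
  release="$1"              # e.g. 4.15.0-88-generic
  flavour="${release##*-}"  # text after the last dash, e.g. generic
  printf '%s\n' \
    "linux-image-${release}" \
    "linux-headers-${release}" \
    "linux-image-${flavour}" \
    "linux-headers-${flavour}"
}
```

After removing any names that are not installed, the list can be passed on:

    kernel_hold_candidates "$(uname -r)" | xargs sudo apt-mark hold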

If the rebuild worked, nvidia-smi prints the GPU information.
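
Steps 2 and 3 can be combined so that the version does not have to be typed by hand. This is a minimal sketch, assuming the driver source sits in a directory named nvidia-<version> under /usr/src, as on this machine. Distribution packages may use a different directory name, in which case the output needs adjusting. The function name is hypothetical.

```shell
# Hypothetical helper: print the version part of every nvidia-<version>
# directory found under the given source root (default: /usr/src).
nvidia_dkms_versions() {
  src_root="${1:-/usr/src}"
  for dir in "$src_root"/nvidia-*; do
    [ -d "$dir" ] || continue        # skips the unexpanded glob when nothing matches
    name="${dir##*/}"                # e.g. nvidia-384.130
    printf '%s\n' "${name#nvidia-}"  # e.g. 384.130
  done
}
```

With a single driver version present, the rebuild for the running kernel could then look like this:

    ver="$(nvidia_dkms_versions | head -n 1)"
    sudo dkms install -m nvidia -v "$ver" -k "$(uname -r)"
    nvidia-smi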


Side notes

Two small problems came up while installing dkms:

1. The package could not be found in the current sources

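1) Edit the mirror list with sudo vim /etc/apt/sources.list (the entries below are for trusty; replace trusty with your release's codename, as printed by lsb_release -cs)
2) Then run sudo apt-get update to refresh the package index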

## Aliyun mirror
deb http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ trusty-backports main restricted universe multiverse
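
Because the entries above hard-code one release codename, generating them for whatever release the machine runs avoids a mismatch. This is a minimal sketch that reproduces the same ten lines for a given codename; the function name is made up for illustration.

```shell
# Hypothetical helper: print Aliyun mirror entries for an Ubuntu codename,
# in the same order as the list above (deb lines first, then deb-src).
aliyun_sources() {
  codename="$1"    # e.g. the output of `lsb_release -cs`
  for kind in deb deb-src; do
    for suite in "" -security -updates -proposed -backports; do
      printf '%s http://mirrors.aliyun.com/ubuntu/ %s%s main restricted universe multiverse\n' \
        "$kind" "$codename" "$suite"
    done
  done
}
```

After backing up the existing file, the output could be written out like this:

    sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
    aliyun_sources "$(lsb_release -cs)" | sudo tee /etc/apt/sources.list
    sudo apt-get update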

2. Error messages from the wretched samba service.

dpkg: error processing package samba (--configure):
dependency problems - leaving unconfigured
Errors were encountered while processing:
samba-common
samba-common-bin
samba
E: Sub-process /usr/bin/dpkg returned an error code (1)

## Fix:
$ sudo mv /var/lib/dpkg/info /var/lib/dpkg/info_old  # first rename the info directory
$ sudo mkdir /var/lib/dpkg/info  # then create a new, empty info directory
$ sudo apt-get update
$ sudo apt-get -f install
# apt-get -f install generates some files in the new info directory; move all of them into info_old
$ sudo mv /var/lib/dpkg/info/* /var/lib/dpkg/info_old
$ sudo rm -rf /var/lib/dpkg/info  # delete the info directory created above
$ sudo mv /var/lib/dpkg/info_old /var/lib/dpkg/info  # rename the old info directory back
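
To see at a glance which packages dpkg left unconfigured, the list can be pulled out of saved apt output. This is a minimal sketch, assuming English-language output in the shape shown above, with the package names listed between the "Errors were encountered while processing:" line and the final "E:" line. The function name is made up for illustration.

```shell
# Hypothetical helper: reads apt/dpkg output on stdin and prints the package
# names listed after "Errors were encountered while processing:".
failed_packages() {
  awk '
    /^Errors were encountered while processing:/ { in_list = 1; next }
    in_list && /^E: / { in_list = 0 }   # the closing "E:" line ends the list
    in_list && NF     { print $1 }      # one package name per line
  '
}
```

For example:

    sudo apt-get -f install 2>&1 | tee /tmp/apt.log
    failed_packages < /tmp/apt.log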

Ref:

CASE SOLVED: NVIDIA-SMI has failed because it couldnt communicate with the NVIDIA driver (Ops, Felaim's blog, CSDN)
https://blog.csdn.net/Felaim/article/details/100516282

Troubleshooting "NVIDIA-SMI has failed because it couldnt communicate with the NVIDIA driver" (Ops, u014447845's blog, CSDN)
https://blog.csdn.net/u014447845/article/details/103012088

Blocking kernel updates on Ubuntu (天道酬勤、, cnblogs)
https://www.cnblogs.com/zxj9487/p/11386227.html

Failure to map a network drive on Ubuntu, and the samba service (Jianshu)
https://www.jianshu.com/p/89b7831181ab