Preface

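On a personal development machine, using Docker to manage different PyTorch and CUDA versions keeps things clean: you can switch between software environments for testing and development at any time. This post records the setup process.

The machine runs Ubuntu 18.04 and already has the NVIDIA GPU driver installed.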

Installing Docker and the NVIDIA Container Toolkit

Installing Docker

Docker can be installed directly with apt:

sudo apt update
sudo apt install docker.io

Alternatively, install it with Docker's convenience script:

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

The following command verifies that Docker was installed successfully:

sudo docker --version
# On my machine this returns
# Docker version 20.10.2, build 20.10.2-0ubuntu1~18.04.3
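The --gpus flag used later in this post was introduced in Docker 19.03, so it can be worth comparing the reported version against that minimum. The following is a minimal sketch; docker_major_minor and version_ge are helper names made up for illustration and are not part of Docker.

```shell
# Hypothetical helpers (not part of Docker) for checking the installed version.

# Print "MAJOR.MINOR" from a `docker --version` line, e.g.
# "Docker version 20.10.2, build ..." gives "20.10".
docker_major_minor() {
    printf '%s\n' "$1" |
        sed -n 's/^Docker version \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1.\2/p'
}

# Succeed if version $1 is at least version $2 (both "MAJOR.MINOR").
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "[.]"); split(b, y, "[.]")
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
        exit !(x[2] + 0 >= y[2] + 0)
    }'
}

# Usage on a machine with Docker installed:
#   v=$(docker_major_minor "$(docker --version)")
#   version_ge "$v" 19.03 && echo "--gpus is supported"
```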

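Using Docker as a non-root user

After installation, Docker can only be operated with root privileges, which is inconvenient in many ways. With a little configuration, regular users can use Docker as well.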

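# Create the docker group
sudo groupadd docker
# Add the current user to the docker group
sudo usermod -aG docker $USER
# Start a shell with docker as the active group so the change takes effect without logging out
newgrp docker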

After this configuration, Docker can be used without root.
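Group membership added by usermod only shows up in sessions started afterwards (or in a shell entered through newgrp), which is a common source of confusion. As a small sketch, the check below tests whether a list of group names contains docker; has_group is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the space-separated group list $1
# contains the group name $2 as a whole word.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   has_group "$(id -nG)" docker          # groups of the current shell session
#   has_group "$(id -nG "$USER")" docker  # groups recorded in the group database
```

If the second check succeeds but the first does not, the group was added but the current session has not picked it up yet.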

Configuring Docker registry mirrors in China

Edit /etc/docker/daemon.json and add a registry-mirrors entry, leaving the other settings unchanged:

    "registry-mirrors": [
            "http://hub-mirror.c.163.com",
            "https://docker.mirrors.ustc.edu.cn"
    ]

For example, on my PC the file ends up like this:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirrors": [
            "http://hub-mirror.c.163.com",
            "https://docker.mirrors.ustc.edu.cn"
    ]
}
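A syntax slip in daemon.json (a missing comma after the runtimes block, for instance) keeps the Docker daemon from starting, so it can help to confirm that the file parses before restarting. One possible check, assuming python3 is installed on the host; is_valid_json is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the file at $1 parses as JSON.
# Assumes python3 is available; json.tool ships with its standard library.
is_valid_json() {
    python3 -m json.tool "$1" > /dev/null 2>&1
}

# Usage:
#   is_valid_json /etc/docker/daemon.json && echo "daemon.json parses"
```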

Restart Docker:

sudo service docker restart

Installing and configuring the NVIDIA Container Toolkit

# Add the stable apt repository and set up its GPG key
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
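# Update the apt package index
sudo apt update
# Install the NVIDIA Container Toolkit
sudo apt install -y nvidia-docker2
# Restart Docker so that it picks up the nvidia runtime
sudo systemctl restart docker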
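The first command above builds the repository path from two fields of /etc/os-release. To illustrate what $distribution ends up holding, here is a small sketch that applies the same expression to an arbitrary file; distribution_of is a hypothetical helper, and the subshell keeps the sourced variables out of the calling shell.

```shell
# Hypothetical helper: print ID followed by VERSION_ID from an
# os-release style file whose path is given in $1.
distribution_of() {
    (
        . "$1"
        printf '%s%s\n' "$ID" "$VERSION_ID"
    )
}

# Usage on the host:
#   distribution_of /etc/os-release
```

On Ubuntu 18.04 this prints ubuntu18.04, which is the directory name substituted into the nvidia-docker.list URL.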

Pulling the PyTorch Docker image

Docker images for different CUDA and cuDNN versions are listed at https://hub.docker.com/r/pytorch/pytorch/tags?page=1&ordering=last_updated. Here I pull pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime:

docker pull pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime

The output looks like this:

$ sudo docker pull pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
1.7.1-cuda11.0-cudnn8-runtime: Pulling from pytorch/pytorch
f22ccc0b8772: Pull complete 
3cf8fb62ba5f: Pull complete 
e80c964ece6a: Pull complete 
20dbc2116049: Pull complete 
7178fad0656f: Pull complete 
8573fc1d93aa: Pull complete 
Digest: sha256:db6086be92f439b918c96dc002f4cf40239e247f0b1b6c32e3fb36de70032bf9
Status: Downloaded newer image for pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
docker.io/pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime

This means the pull succeeded. Run docker run --gpus=all pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime nvidia-smi to execute nvidia-smi inside the PyTorch container; output similar to the following indicates success:

Sun Aug  1 14:03:59 2021   
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    Off  | 00000000:2B:00.0  On |                  N/A |
|  0%   41C    P8    14W / 240W |    423MiB /  7979MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                   
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
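nvidia-smi only shows that the GPU is visible inside the container. To confirm that PyTorch itself can use it, a similar one-off container can call torch.cuda.is_available(). The wrapper below is a sketch: torch_cuda_check and the DOCKER override variable are hypothetical names introduced here so the command line can be inspected without running it.

```shell
# Hypothetical wrapper: run a one-off container from image $1 and ask
# PyTorch whether CUDA is usable. DOCKER defaults to the real docker CLI;
# setting DOCKER=echo prints the command instead of executing it.
torch_cuda_check() {
    ${DOCKER:-docker} run --rm --gpus=all "$1" \
        python -c 'import torch; print(torch.cuda.is_available())'
}

# Usage:
#   torch_cuda_check pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
```

On a working setup this should print True. The --rm flag removes the container once the command exits, so repeated checks do not leave stopped containers behind.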

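Miscellaneous

Docker completion in zsh

Edit ~/.zshrc and append docker docker-compose to the plugins already listed on the plugins=(...) line, so that it becomes plugins=(... docker docker-compose) (where ... stands for the plugins already configured).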

Error: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

If an error like this appears:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

the NVIDIA Container Toolkit is either not installed or not configured correctly. Follow the instructions above to set it up.

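Error: nvcc fatal : Unsupported gpu architecture 'compute_86'

My machine has an RTX 3070 GPU, and some compilation steps failed with nvcc fatal : Unsupported gpu architecture 'compute_86'.

Running export TORCH_CUDA_ARCH_LIST="7.5" worked around it. I have not pinned down the cause; a likely explanation is that nvcc only accepts compute_86 from CUDA 11.1 onward, so a CUDA 11.0 toolchain rejects the architecture detected for this card (see https://github.com/NVIDIA/apex/issues/1023).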
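The number in compute_86 is the GPU's compute capability (8.6 for this card) with the dot removed, while TORCH_CUDA_ARCH_LIST takes the dotted form. A small sketch of that mapping, with nvcc_arch_name as a hypothetical helper name:

```shell
# Hypothetical helper: convert a compute capability such as "8.6"
# into the architecture name nvcc reports, "compute_86".
nvcc_arch_name() {
    printf 'compute_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Usage inside the container (requires a visible GPU):
#   cap=$(python -c "import torch; print('%d.%d' % torch.cuda.get_device_capability(0))")
#   nvcc_arch_name "$cap"
```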

Switching conda to the Tsinghua mirror in China

Change the contents of ~/.condarc to:

channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud

After making the change, run

conda clean -i

to clear conda's index cache.

References

  1. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
  2. https://docs.docker.com/engine/install/linux-postinstall/
  3. https://docs.docker.com/compose/completion/
  4. https://ngc.nvidia.com/catalog/containers/nvidia:pytorch
  5. https://hub.docker.com/r/pytorch/pytorch/tags?page=1&ordering=last_updated
  6. https://github.com/NVIDIA/apex/issues/1023

Last modification: August 27, 2021