# Preface

On my own development machine, using `docker` to manage different `pytorch` and `cuda` versions keeps things clean: I can switch between software environments for testing and development at any time. This post records the setup process.

The machine runs `ubuntu 18.04` and already has the `NVIDIA` GPU driver installed.

# Installing `Docker` and the NVIDIA Container Toolkit

## Installing `Docker`

It can be installed directly with `apt`:

```bash
sudo apt update
sudo apt install docker.io
```

Alternatively, install it with the following command:

```bash
curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker
```

Verify that `docker` was installed successfully with:

```bash
sudo docker --version
# On my machine this returns
# Docker version 20.10.2, build 20.10.2-0ubuntu1~18.04.3
```

## Using `Docker` as a non-`root` user

A freshly installed `docker` can only be operated with `root` privileges, which is inconvenient in many ways. The following configuration lets a regular user run `docker` as well:

```bash
# Create the docker group
sudo groupadd docker
# Add the current user to the docker group
sudo usermod -aG docker $USER
# Log in to the docker group so the change takes effect
newgrp docker
```

After this, `Docker` can be used as a non-`root` user.

## Configuring `docker` registry mirrors in China

Edit `/etc/docker/daemon.json` and add a `registry-mirrors` entry, leaving the other settings unchanged:

```json
"registry-mirrors": [
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn"
]
```

For example, on my PC the file ends up like this:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirrors": [
        "http://hub-mirror.c.163.com",
        "https://docker.mirrors.ustc.edu.cn"
    ]
}
```

Restart `docker`:

```bash
sudo service docker restart
```

## Installing and configuring the NVIDIA Container Toolkit

```bash
# Add the stable apt source and configure the GPG key
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Update apt
sudo apt update
# Install the NVIDIA Container Toolkit
sudo apt install -y nvidia-docker2
```

# Pulling the `pytorch` `docker` image

`docker` images for different `cuda` and `cudnn` versions are listed at https://hub.docker.com/r/pytorch/pytorch/tags?page=1&ordering=last_updated. Here I pull `pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime`:

```bash
docker pull pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
```

The output looks like this:

```
$ sudo docker pull pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
1.7.1-cuda11.0-cudnn8-runtime: Pulling from pytorch/pytorch
f22ccc0b8772: Pull complete
3cf8fb62ba5f: Pull complete
e80c964ece6a: Pull complete
20dbc2116049: Pull complete
7178fad0656f: Pull complete
8573fc1d93aa: Pull complete
Digest: sha256:db6086be92f439b918c96dc002f4cf40239e247f0b1b6c32e3fb36de70032bf9
Status: Downloaded newer image for pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
docker.io/pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
```

This means the pull succeeded. Run `docker run --gpus=all pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime nvidia-smi` to execute `nvidia-smi` inside the `pytorch` container; output similar to the following means the setup works:

```
Sun Aug  1 14:03:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    Off  | 00000000:2B:00.0  On |                  N/A |
|  0%   41C    P8    14W / 240W |    423MiB /  7979MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

# Miscellaneous

## `docker` completion in `zsh`

Edit `~/.zshrc` and append `docker docker-compose` to the plugins already listed on the `plugins=(...)` line, so that it becomes `plugins=(... docker docker-compose)` (where `...` stands for the plugins already configured in the file).

## Error: `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].`

If you see an error like this:

```bash
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
```

the NVIDIA Container Toolkit is either not installed or not configured correctly. Follow the instructions above to set it up.

## Error: `nvcc fatal : Unsupported gpu architecture 'compute_86'`

My machine has an `RTX3070` GPU, and I ran into the error `nvcc fatal : Unsupported gpu architecture 'compute_86'` while compiling some code.

Running `export TORCH_CUDA_ARCH_LIST="7.5"` resolved it, although I have not yet figured out why (https://github.com/NVIDIA/apex/issues/1023).

## Switching `conda` to the Tsinghua mirror in China

Change the contents of `~/.condarc` to:

```
channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
```

After saving the file, run

```bash
conda clean -i
```

to clear the `conda` index cache.

# References

1. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
2. https://docs.docker.com/engine/install/linux-postinstall/
3. https://docs.docker.com/compose/completion/
4. https://ngc.nvidia.com/catalog/containers/nvidia:pytorch
5. https://hub.docker.com/r/pytorch/pytorch/tags?page=1&ordering=last_updated
6. https://github.com/NVIDIA/apex/issues/1023

Last modification: August 27th, 2021 at 04:30 pm

© Reposting permitted with proper attribution