
Configuring Nvidia Graphics Card on Ubuntu: Games, CUDA Programming, Deep Learning, Docker Containers, etc.

Configuring Nvidia Graphics Card on Ubuntu for games, CUDA programming, deep learning, Docker containers, etc.

Motivation

Recently, I bought an Nvidia RTX 4060 Ti (8GB) graphics card. To make full use of it, I want to use it on Ubuntu for games, CUDA programming, deep learning, and more. However, using Nvidia graphics cards on Ubuntu is not straightforward and requires some configuration. Here I record the steps I took to set up the Nvidia graphics card on Ubuntu.

Installing Nvidia Graphics Card Driver on Ubuntu

Check Graphics Card Information

First, we need to check our graphics card information. Open the terminal and enter the following command:

lspci | grep VGA

If you have an Nvidia graphics card on your computer, you will see output similar to the following:

01:00.0 VGA compatible controller: NVIDIA Corporation Device 2803 (rev a1)

For some reason, my computer displays NVIDIA Corporation Device 2803 instead of RTX 4060 Ti, but it doesn’t matter. We just need to know that this is an Nvidia graphics card.
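This is usually just because the local PCI ID database does not yet know the new device. A quick way to fix the displayed name (assuming the pciutils package, which provides update-pciids, is installed) is to refresh the database and query again:

sudo update-pciids
lspci | grep VGA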

Install Nvidia Graphics Card Driver

  1. First, remove any Nvidia graphics card drivers that may have been installed:
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'
  2. Install any missing dependencies:
sudo apt-get install linux-headers-$(uname -r)
  3. Add the Nvidia graphics card driver PPA repository and update:
sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update
  4. Install the Nvidia graphics card driver:
sudo ubuntu-drivers autoinstall
  5. Restart the computer

  6. Confirm that the graphics card driver is installed successfully:

nvidia-smi

If you see output similar to the following, congratulations, your Nvidia graphics card driver is installed successfully:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02    Driver Version: 555.58.02    CUDA Version: 12.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 4060 Ti  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8    10W /  N/A |      0MiB /  7611MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Disable Nouveau Graphics Card Driver

Nouveau is an open-source driver for Nvidia graphics cards, but its performance is not as good as Nvidia's official closed-source driver, so we need to disable it.

  1. Check if the Nouveau graphics card driver is loaded:
lsmod | grep nouveau

If you see output similar to the following, it means that the Nouveau graphics card driver is loaded:

nouveau              2457600  1
mxm_wmi                16384  1 nouveau
ttm                   106496  1 nouveau
drm_kms_helper        217088  1 nouveau
drm                   552960  3 drm_kms_helper,nouveau,ttm
wmi                    36864  2 mxm_wmi,nouveau
video                  49152  1 nouveau
  2. Disable the Nouveau graphics card driver:
cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
  3. Update initramfs:

Then we need to update the kernel modules loaded at boot time:

sudo update-initramfs -u
  4. Restart the computer

  5. Confirm that the Nouveau graphics card driver is disabled:

lsmod | grep nouveau

If there is no output, it means that the Nouveau graphics card driver has been disabled.

Test Graphics Card

We can use glmark2 to test the graphics card performance.

  1. Install glmark2:
sudo apt-get install glmark2
  2. Run glmark2:
glmark2

If you see output similar to the following in the terminal, it means that the graphics card performance test is successful:

=======================================================
    glmark2 2021.02
=======================================================
    OpenGL Information
    GL_VENDOR:     NVIDIA Corporation
    GL_RENDERER:   NVIDIA GeForce RTX 4060 Ti/PCIe/SSE2
    GL_VERSION:    4.6.0 NVIDIA 555.58.02
=======================================================

A window will also pop up showing the scenes being rendered, and after the test completes, the terminal displays the test score.

glmark2
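As an additional sanity check, you can confirm that the OpenGL renderer in use is the Nvidia GPU rather than a software or Nouveau renderer. This relies on glxinfo from the mesa-utils package:

sudo apt-get install mesa-utils
glxinfo | grep "OpenGL renderer"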

Games

Linux has traditionally not been a great gaming platform, but thanks to Valve's push with Steam, more and more games can now run on Linux through Proton. Proton is a compatibility layer developed by Valve on top of Wine that allows Windows games to run on Linux.

Install Steam

  1. Download the Steam installation package:
wget https://cdn.cloudflare.steamstatic.com/client/installer/steam.deb
  2. Install Steam:
sudo dpkg -i steam.deb
  3. Install any missing dependencies:
sudo apt-get install -f
  4. Run Steam:
steam
  5. Log in to your Steam account

Steam

Install Proton

In Steam, open Settings, go to the Steam Play tab, check Enable Steam Play for supported titles and Enable Steam Play for all other titles, select a Proton version from the drop-down menu, and click OK.

After Proton is installed, you can run Windows games on Linux.

Install Games

If you are using Steam for the first time and have not purchased any games, you can choose some free games to test, such as “Dota 2”, “Counter-Strike: Global Offensive”, etc.

Dota 2

CUDA Programming

CUDA is a parallel computing platform and programming model developed by Nvidia, which can use the parallel computing power of the GPU to accelerate compute-intensive applications. CUDA programming requires the installation of the Nvidia graphics card driver and the CUDA toolkit. The CUDA version and the Nvidia graphics card driver version have a certain correspondence, and you need to choose the appropriate CUDA version according to your graphics card driver version.

Install CUDA

  1. Check the CUDA version supported by the Nvidia graphics card driver:
nvidia-smi

The CUDA Version field shows the highest CUDA version supported by the installed driver, for example, CUDA Version: 12.5. That means we can install CUDA 12.5.

  2. Download the CUDA installation package:

Download the appropriate CUDA installation package from the Nvidia website, selecting the operating system, architecture, distribution, version, etc.

CUDA

Note that we want to install CUDA 12.5, but the official website defaults to 12.6. This is not a problem: choose the network installation, and you will see an installation guide like the following:

CUDA

We only need to change the last line of the command to the CUDA version we need.

  3. Install CUDA:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-5
  4. Configure environment variables:

If the installation is successful, you will see the CUDA installation files in the /usr/local/cuda-12.5 directory. We need to configure environment variables so that CUDA can be found.

If you are using bash, you can add the following content to the ~/.bashrc file:

echo 'export PATH=/usr/local/cuda-12.5/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.5/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
source ~/.bashrc

If you use another shell, you can add the above content to the corresponding configuration file.

  5. Test CUDA:
nvcc --version

If you see output similar to the following, it means that the CUDA programming environment is set up successfully:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

Compile CUDA Program

You can use the CUDA sample programs provided by Nvidia to test the CUDA programming environment.

  1. Download the CUDA sample program:
git clone https://github.com/NVIDIA/cuda-samples.git
  2. Compile the CUDA sample program:
cd cuda-samples
make
  3. Run the CUDA sample program:
cd Samples/1_Utilities/deviceQuery
./deviceQuery

If you see output similar to the following, it means that the CUDA programming environment is set up successfully:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4060 Ti"
  CUDA Driver Version / Runtime Version          12.5 / 12.5
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 7810 MBytes (8188919808 bytes)
  (034) Multiprocessors, (128) CUDA Cores/MP:    4352 CUDA Cores
  GPU Max Clock rate:                            2565 MHz (2.57 GHz)
  Memory Clock rate:                             9001 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 33554432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.5, CUDA Runtime Version = 12.5, NumDevs = 1
Result = PASS

Here you can see that the graphics card on my computer is the NVIDIA GeForce RTX 4060 Ti and the CUDA version is 12.5, along with other information such as the memory size, CUDA core count, and GPU clock frequency. If you see similar output, the CUDA programming environment is set up successfully.

Now you can start writing your own CUDA program.
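As a starting point, here is a minimal vector-addition example (a generic sketch of my own, not part of the CUDA samples): it writes a vector_add.cu file, compiles it with nvcc, and runs it.

cat > vector_add.cu <<'EOF'
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize host data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory and copy the inputs to the GPU.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and check one value.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
EOF
nvcc vector_add.cu -o vector_add
./vector_add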

Deep Learning

Deep learning is one of the most commonly used machine learning methods, which can be used to solve problems such as image recognition, natural language processing, recommendation systems, etc. Deep learning usually requires a large amount of data and computing resources, so using GPUs to accelerate deep learning training is very common.

Install Deep Learning Framework

Currently, popular deep learning frameworks include TensorFlow, PyTorch, Keras, etc., all of which support GPU acceleration. Before installing a deep learning framework, we need to install CUDA and cuDNN.

  1. Install cuDNN:

cuDNN is a GPU-accelerated library of deep learning primitives provided by Nvidia that speeds up common operations in deep learning frameworks. Download the cuDNN package matching your CUDA version from the Nvidia website and install it.
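If you prefer installing through apt, one possible route (my assumption: it reuses the cuda-keyring repository configured in the CUDA section, and the exact package name can vary between cuDNN releases) looks like this:

sudo apt-get update
sudo apt-get install cudnn     # package name may differ; check Nvidia's cuDNN installation guide for your CUDA version
dpkg -l | grep cudnn           # verify that the cuDNN packages were installed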

  2. Install the deep learning framework:

For example, PyTorch:

pip install torch torchvision torchaudio
  3. Test the deep learning framework:
import torch
print(torch.cuda.is_available())

If you see True, PyTorch is installed successfully and can use the GPU for acceleration.
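To also confirm which GPU PyTorch sees, you can print the device name (torch.cuda.get_device_name is part of the standard PyTorch API):

python -c "import torch; print(torch.cuda.get_device_name(0))"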

Train Model

Now you can use the GPU to accelerate the training of deep learning models. For example, we can download the examples from the official PyTorch repository to test.

git clone https://github.com/pytorch/examples.git
cd examples/time_sequence_prediction
python generate_sine_wave.py
python train.py

This is an example of using a Long Short-Term Memory (LSTM) network to predict a sine wave. You can modify the model and data according to your needs.
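While the training script runs, you can watch the GPU utilization in another terminal to confirm that the GPU is actually being used:

watch -n 1 nvidia-smi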

Docker Containers

I have run many applications using Docker before, but they were all running on the CPU. Now that I have an Nvidia graphics card, I want to use GPU acceleration in Docker containers, such as migrating the large language model I previously ran to the GPU.

Install Nvidia Container Toolkit

Nvidia Container Toolkit is a tool provided by Nvidia that allows Docker containers to access Nvidia graphics cards.

  1. Add the Nvidia Container Toolkit apt repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
  2. Install Nvidia Container Toolkit:
sudo apt-get install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
  3. Restart the Docker service:
sudo systemctl restart docker

Use GPU Acceleration in Docker Containers

Now we can use GPU acceleration in Docker containers. For example, run nvidia-smi inside an official CUDA base image (the exact tag depends on which images Nvidia currently publishes):

docker run --gpus all nvidia/cuda:12.5.0-base-ubuntu22.04 nvidia-smi

I usually use docker-compose to manage Docker containers, and you can add GPU configuration in the docker-compose.yml file:

version: '3'

services:
  service-name:
    container_name: container-name
    image: image-source/image-name:tag
    environment:
      - SOME_ENV_VAR=some_value
    ports:
      - "host_port:container_port"
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            capabilities: ["gpu"]
            count: all
    volumes:
      - /path/on/host:/path/in/container
    restart: always

After defining the docker-compose.yml file, you can start the container using the docker-compose command:

docker-compose up -d
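To verify that the container started this way can actually see the GPU (service-name is the placeholder service defined in the docker-compose.yml above), you can run nvidia-smi inside it:

docker-compose exec service-name nvidia-smi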

Troubleshooting

The first time I installed the Nvidia driver and rebooted, the computer could not reach the desktop: it was stuck at a command-line interface that would not accept any input. I suspect the Nouveau driver had not been disabled, which prevented the Nvidia driver from working properly.

I apparently had not set up a recovery mode, so I could not boot into recovery mode to fix the problem, and I did not want to reinstall the system because it held a lot of important data.

In the end, I booted a live Ubuntu system from a USB flash drive, mounted the original system partition, entered the original system with chroot, uninstalled the Nvidia driver, and then rebooted.
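A rough sketch of that rescue procedure, where /dev/nvme0n1p2 is only a placeholder for the root partition of the installed system:

# Run from the live USB session; replace the partition with your own root partition.
sudo mount /dev/nvme0n1p2 /mnt
sudo mount --bind /dev /mnt/dev
sudo mount --bind /proc /mnt/proc
sudo mount --bind /sys /mnt/sys
sudo chroot /mnt
apt-get remove --purge '^nvidia-.*'   # remove the broken driver packages inside the chroot
exit                                  # leave the chroot, unmount, and reboot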

There were some twists and turns in this process, though. The Ubuntu system on the USB flash drive was 24.04 while the original system was 22.04, so after chrooting into the original system and running apt-get update, some wrong drivers seem to have been installed. After rebooting, the computer had no network at all, neither wireless nor wired, and the kernel had been updated to a new version, which apparently caused some compatibility issues.

So I had to create a new 22.04 live USB, chroot into the original system again, run apt-get update, check and install the extra kernel modules (dpkg -s linux-modules-extra-$(uname -r) | grep Status), and finally update the initramfs (update-initramfs -u) and GRUB (update-grub). After rebooting, everything was back to normal: apart from the kernel having been updated to a newer version, everything else was restored to its original state.

Finally, I reinstalled the Nvidia graphics card driver, but this time I did not reboot right away: I first disabled the Nouveau driver, then updated the initramfs and GRUB, and only then restarted the computer. This time the Nvidia driver was installed successfully.
