Linux distribution choices for GPU development

2020-10-31 Tags: linux gpu nvidia ubuntu debian redhat

As of late 2020, Nvidia have the highest performing consumer GPUs and the best support for GPU compute development. GPU compute development includes both direct GPU programming with Nvidia’s CUDA tools and machine learning applications that use ML frameworks such as TensorFlow and PyTorch. However, Nvidia release their Linux drivers under a proprietary license, rather than open sourcing the drivers and having them included directly in the Linux kernel. There are open source drivers for Nvidia hardware from the Nouveau project, but these don’t support compute use of the GPU. As the Nvidia drivers are not open source, they need to be installed as a binary kernel module, and the installation process is more complex because the Linux kernel freely changes its in-kernel interfaces. These in-kernel interfaces are considered internal implementation details and aren’t covered by Linux’s “don’t break userspace” policy.

For example, there’s currently a banner on the Nvidia download page warning that the drivers are not compatible with Linux kernel 5.9+ and to wait for an updated driver version expected in mid-November 2020.
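A check like this can be scripted before accepting a kernel upgrade. The following is a minimal sketch only: the 5.9 threshold comes from the banner mentioned above, while the function names and the idea of comparing just the major and minor version are my own assumptions, not anything Nvidia provide.

```python
import platform
import re

# Assumption: the first kernel series the current driver does not support,
# taken from the banner described above (5.9+). Update it when Nvidia do.
FIRST_UNSUPPORTED = (5, 9)


def parse_kernel_release(release):
    """Return (major, minor) from a release string such as '5.8.0-25-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def driver_supports_kernel(release, first_unsupported=FIRST_UNSUPPORTED):
    """True if the kernel is older than the first unsupported series."""
    return parse_kernel_release(release) < first_unsupported


def check_running_kernel():
    """Apply the check to the kernel that is currently running."""
    return driver_supports_kernel(platform.release())
```

Calling check_running_kernel() on the machine in question gives a quick yes/no, but it only reflects the threshold hard-coded above, so it’s no substitute for reading the driver release notes.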

The Nvidia graphics driver installer for Linux is distributed by Nvidia as an all-in-one self-unpacking shell script, which is intended to be generic across all Linux distributions. Linux distributions generally use a package manager, such as Red Hat rpm or Debian dpkg, which tracks what software is installed on the system and its dependencies on other software packages. Package managers also provide the ability to uninstall packages and upgrade them to new versions.
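To see which package format a system uses, a script can look for the package manager binaries. This is a rough sketch under an obvious assumption: that finding rpm or dpkg on the PATH identifies the system’s package manager. That isn’t always true (a Debian system can have rpm installed as an extra tool), and the names below are illustrative.

```python
import shutil

# Package format -> binary to look for. dpkg is checked first because
# Debian-based systems sometimes also have rpm installed as a utility.
# Assumption: the binary being on PATH indicates the system's package manager.
CANDIDATES = {"deb": "dpkg", "rpm": "rpm"}


def detect_package_format(which=shutil.which):
    """Return 'deb', 'rpm' or None depending on which binary is found."""
    for package_format, binary in CANDIDATES.items():
        if which(binary) is not None:
            return package_format
    return None
```

The which parameter defaults to shutil.which and is only there so the lookup can be replaced when testing.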

Installing the Nvidia binary drivers is a significant change to the system, and if the Nvidia shell script installer gets used, that change is made outside of the distribution’s package manager. There’s a significant risk that the driver will no longer work after applying kernel upgrades provided by the distribution.

Some Linux distributions such as Ubuntu have distribution specific re-packs of the Nvidia binary drivers. These are generally a much safer choice than the shell script self installer, since the distribution’s package manager is able to track the drivers, manage dependencies and build updated kernel modules when needed.

Installation methods

The Nvidia developer libraries are used for GPU computing development - this is branded as the “CUDA Toolkit”. GPU accelerated machine learning frameworks usually depend on having the developer libraries installed. The developer library packages come with the drivers included - this is often the easiest way to install the drivers, even if the developer libraries aren’t needed.

Nvidia provide three options for installing the developer libraries:

runfile, an all-in-one shell script based unpacker that installs outside of the distribution package manager, similar to the one for the graphics drivers. The single file contains all the developer tools, libraries and examples (about 2GB).
local package, a distribution specific package (.rpm or .deb file) with everything included (about 2GB).
network package, a distribution specific package which is a small download (about 100KB). This only contains package manager instructions. The package manager downloads and installs the larger CUDA packages itself.

The network package is the neatest installation method:

it only downloads what is needed
it’s managed by the Linux distribution package manager
it’s officially provided by Nvidia, rather than being repackaged by a third party

However, Nvidia only provide these packages for a few Linux distributions and versions at a time.

The local package option is useful for systems with slow or no internet connection, since the large package can be predownloaded elsewhere, but once installed, everything is managed by the package manager and can be automatically updated. The runfile option has the same problem as before with working outside the package manager, so I don’t recommend it, except for systems that have no other way to install the drivers.
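The reasoning above boils down to a small decision rule. Here’s a sketch of it, where the function name and the two boolean inputs are my own simplification of the trade-offs rather than anything from Nvidia’s documentation:

```python
def choose_cuda_install_method(nvidia_packages_for_distro, has_fast_internet):
    """Pick an installation method following the preferences described above.

    nvidia_packages_for_distro: True if Nvidia publish local/network packages
        for this distribution and version.
    has_fast_internet: True if the target system can download packages itself.
    """
    if not nvidia_packages_for_distro:
        # Last resort: installs outside of the package manager.
        return "runfile"
    if has_fast_internet:
        # Small download; the package manager fetches only what is needed.
        return "network package"
    # Large download that can be fetched elsewhere and copied across.
    return "local package"
```

For example, choose_cuda_install_method(True, False) picks the local package for a supported distribution on a machine with a poor connection.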

The CUDA developer drivers generally get a major release 1-2 times per year. There are sometimes minor releases in between, usually to add support for new GPU models.

Distribution choices

What I’m looking for in a Linux distribution for this purpose (highest priority first):

Local and network CUDA packages provided directly by Nvidia, so the drivers install easily and aren’t dependent on volunteers to repackage for the distribution and keep them up to date.
The distribution version being in its supported lifetime, so that all the other software on the system gets security updates.
The distribution being recently released, so that the software on the system is up to date with upstream development.
The distribution being easy to upgrade (and this working in practice on a messy system that’s been used for many months), so that it doesn’t need to be installed from scratch after a new version of the distribution is released.

The other differences between Linux distributions have been narrowing in the last few years:

Ubuntu, Red Hat/CentOS, Fedora and Debian all use the GNOME desktop by default. SUSE is the only major distribution with a non-GNOME default (KDE).
All the major Linux distributions can install and run the user’s choice of desktop environment (GNOME, KDE, XFCE and so on) rather than their default desktop environment.
All the major Linux distributions have adopted systemd as the init process to manage system boot and login. Only a few less widely used Linux distributions, such as Gentoo, Alpine and Slackware, use another init and service management system.
All the major desktop Linux distributions are using NetworkManager to control network interfaces, WiFi network login and VPNs.

With the other differences diminishing, the significant differences in ease of setting up and maintaining Nvidia drivers make this an important point in choosing which Linux distribution to use.

On the CUDA developer download page, there are a range of packages for various Linux distributions.

CentOS

Nvidia currently provide CUDA developer packages for CentOS 7 and 8.

CentOS 7 is kind of old, but still modern enough to be usable. It was originally released in 2014 and the command line tools (eg. coreutils, compilers) are from that era. There have been some major desktop upgrades provided in the point releases, which make it more up to date. Installing a recent Python 3 version into the system directories (eg. /usr/bin/python3) requires some fiddling, as it’s not available from the default package repositories or from add-on repositories such as EPEL. The usual solution is to use a separate Python installation or environment manager such as conda, pipenv or poetry.

CentOS 8 was released in September 2019 and is currently fairly up to date. CentOS releases are supported with security updates for 7-10 years, so this is a good option for situations that favour long support periods over having the latest software available. Upgrades between CentOS major versions are generally not supported, so choosing a major version of CentOS will mean staying on it for a long time.

CentOS kernels keep their original release version number, but are extensively patched by Red Hat to add new features where possible while retaining compatibility.

Red Hat

Nvidia currently provide packages for Red Hat Enterprise Linux (RHEL) 7 and 8.

RHEL is practically the same as CentOS, except for licensing and support. Red Hat will provide support such as security updates for even longer lifecycles than CentOS if you pay for the extended support, but it’s usually best to move to a newer distribution version.

Red Hat generally won’t provide support for the Nvidia drivers since it’s a large binary driver from a third party (Nvidia) rather than from Red Hat.

Fedora

Nvidia currently provide packages for Fedora 32.

Fedora 33 was recently released. As Fedora release and support lifecycles are short, there’s some risk that Nvidia won’t update to target a new Fedora version before Fedora 32 falls out of support - this has happened multiple times in the past.

Fedora generally put out a new release roughly every 6 months and provide updates for release N until shortly after release N+2 is available. Fedora update the Linux kernel version quite often within each release, rather than stabilising on a particular kernel version. These rapid kernel updates increase the risk that an incompatible kernel update will break the binary drivers. Either you find out ahead of time and avoid updating, or you find out when the drivers break and the system won’t boot properly, and you are then stuck in the messy situation of trying to roll back to earlier package versions.

Ubuntu

Nvidia currently provide packages for Ubuntu 16.04, 18.04 and 20.04.

16.04 is an older long term support (LTS) release, and is still in support until April 2021. 18.04 is a recently superseded LTS release and will have support until April 2023. 20.04 was released in April 2020 and is the current LTS release with support until April 2025.
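To put those dates in perspective, here’s a small sketch that works out how many calendar months of support each LTS release has left. The end-of-support months are the ones stated above; counting in whole calendar months and the function name are my own choices.

```python
from datetime import date

# End of support as (year, month), from the dates given above.
LTS_SUPPORT_END = {
    "16.04": (2021, 4),
    "18.04": (2023, 4),
    "20.04": (2025, 4),
}


def months_of_support_remaining(release, today):
    """Whole calendar months from today's month to the end-of-support month."""
    end_year, end_month = LTS_SUPPORT_END[release]
    months = (end_year - today.year) * 12 + (end_month - today.month)
    return max(months, 0)  # Never negative once support has ended


if __name__ == "__main__":
    for release in LTS_SUPPORT_END:
        print(release, months_of_support_remaining(release, date.today()))
```

At the time of writing (October 2020) this gives 6 months for 16.04, 30 for 18.04 and 54 for 20.04.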

Ubuntu release optional kernel updates for their LTS releases, called “hardware enablement (HWE)” kernels, a few months after the corresponding general release. My experience has been that these HWE kernel updates aren’t as well tested with the Nvidia drivers and tend to be at higher risk of breakage. It’s best to stay on the original kernel version from the LTS release, if there’s no need for the newer hardware support (eg. CPU or motherboard) provided by the HWE kernel updates.

In the past, Nvidia have released packages for some non-LTS releases, but by the time the Nvidia packages are available, there’s often only a short period of distribution support remaining. This isn’t a good option.

Other Ubuntu flavours such as Kubuntu and Xubuntu get the same kernel updates and are equivalent to Ubuntu as far as the Nvidia drivers are concerned. Ubuntu seems to be the most widely used and best tested of the Linuxes for Nvidia development, which means bugs will tend to get noticed by users, reported and fixed.

SUSE

Nvidia currently provide packages for OpenSUSE and SLES 15.

OpenSUSE 15 is already out of support. SUSE Linux Enterprise Server (SLES) 15 will be supported until 2028, but this is a commercial/paid distribution similar to Red Hat.

Nvidia’s intention in providing packages for this distribution is likely to support HPC users where SLES is used as the basis for Cray Linux.
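The package availability described in the sections above can be collected into a lookup and checked against a system’s /etc/os-release file. This is a sketch only: the version table is just what this post lists as of late 2020, the os-release ID strings are what I’d expect each distribution to report (an assumption worth verifying), and matching on major version for everything except Ubuntu is my own simplification.

```python
# Distribution versions with CUDA packages from Nvidia, as listed in this post.
# Assumption: keys match the ID field each distribution puts in os-release.
NVIDIA_CUDA_TARGETS = {
    "centos": {"7", "8"},
    "rhel": {"7", "8"},
    "fedora": {"32"},
    "ubuntu": {"16.04", "18.04", "20.04"},
    "opensuse-leap": {"15"},
    "sles": {"15"},
}


def parse_os_release(text):
    """Parse KEY=value lines from os-release content into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def has_nvidia_cuda_packages(os_release_text):
    """True if the distribution and version appear in the table above."""
    fields = parse_os_release(os_release_text)
    targets = NVIDIA_CUDA_TARGETS.get(fields.get("ID", ""), set())
    version = fields.get("VERSION_ID", "")
    # Match the full version (Ubuntu's '20.04') or just the major part ('8.2' -> '8').
    return version in targets or version.split(".")[0] in targets
```

To try it on a real system, pass in the file contents, for example has_nvidia_cuda_packages(open("/etc/os-release").read()). The table will go stale as Nvidia add and drop distribution versions, so the download page remains the thing to check.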

Summary

It’s annoying that binary drivers are needed, but the benefits of using Nvidia, such as compute performance and development tool support, make it more attractive than other GPU brands (such as AMD) or compute accelerators (such as Xeon Phi).

Having a smooth experience using the Nvidia drivers and development tools is a significant part of choosing which Linux distribution to use, particularly when other differences between distributions are much less significant than they used to be (such as low level plumbing) or can be customised after install (such as choice of desktop environment).