NVMe disks on Linux

2020-10-26 Tags: linux nvme ssd hardware

NVMe solid state drives (SSDs) have become much more common in the last few years. NVMe offers the potential for much better performance than older SATA SSDs, thanks to the faster PCI Express hardware connection to the system and the NVMe protocol that the system uses to communicate with the storage.

The upcoming generation of game consoles, Xbox Series X and PlayStation 5, both include NVMe SSDs. The faster storage has been promoted as a major feature of this console generation, and is said to provide 100-1000x faster IO than the mechanical hard disks in the current generation Xbox One and PlayStation 4 consoles. The major game consoles featuring this hardware will provide a strong incentive for home gaming PCs to contain similar high performance storage. IO performance may well become a requirement for PC games in a few years’ time, similar to how games currently list CPU, memory and graphics hardware requirements. High performance NVMe SSDs will become an expectation on desktops and laptops, rather than an optional high end feature that only specialist applications take advantage of.

Most of these NVMe drives for laptops and desktops are in the M.2 format, which is a small board-only format (usually 22mm x 80mm) that can either sit flat against the motherboard, or be stacked into a PCI Express slot adaptor card. Some high performance and enterprise drives are in PCI Express card format, which is larger, but provides more circuit board space, more power and better cooling. The M.2 format is much more compact than older 2.5 inch SATA SSDs, which is great for making laptops smaller. For desktops the connection is more reliable because the drive is attached directly to the board and screwed in, rather than relying on plugged-in power and data cables that can become loose.

NVMe and M.2 don’t inherently make SSDs faster than a SATA SSD, but they do provide a much faster connection and interface to the system. There are cheaper and slower SSDs available in the M.2 format as well, some of which connect using SATA, but via the M.2 connector rather than the traditional SATA plugs.

High performance M.2 drives can do around 5GB/sec sequential transfer and around 1 million small IOs per second, compared to 550MB/sec and 100,000 small IOs per second for a high performance SATA drive.
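To put the sequential figures above in perspective, here is a rough back-of-envelope sketch. The 100 GB transfer size is a hypothetical value chosen for illustration, and the calculation assumes the drive sustains its peak rate for the whole transfer, which real workloads often will not.

```python
# Sequential throughput figures quoted above, in bytes per second.
NVME_BYTES_PER_SEC = 5 * 10**9    # around 5GB/sec, high performance M.2 NVMe
SATA_BYTES_PER_SEC = 550 * 10**6  # around 550MB/sec, high performance SATA


def transfer_seconds(size_bytes: int, bytes_per_sec: int) -> float:
    """Ideal time to move size_bytes at a sustained sequential rate."""
    return size_bytes / bytes_per_sec


if __name__ == "__main__":
    size = 100 * 10**9  # hypothetical 100 GB transfer, for illustration only
    nvme = transfer_seconds(size, NVME_BYTES_PER_SEC)
    sata = transfer_seconds(size, SATA_BYTES_PER_SEC)
    print(f"NVMe: {nvme:.0f} s, SATA: {sata:.0f} s, ratio: {sata / nvme:.1f}x")
```

Under those assumptions the NVMe drive finishes in about 20 seconds against roughly three minutes for SATA, a ratio of around 9x.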

Linux kernel support

All major Linux distributions have good support for NVMe in their current releases.

NVMe support was added to the Linux kernel in version 3.3, which was released in March 2012. There were major changes in 3.13 (January 2014) which improved performance - see the section about disk schedulers below.

Ubuntu 14.04 supports NVMe, but some of the tools are early versions and incomplete. Ubuntu 16.04 (released April 2016) or newer works well.

Debian 7 (wheezy) doesn’t support NVMe, because the included kernel version 3.2 is too old. Debian 8 (jessie, released April 2015) or newer works well.

RHEL and CentOS 6 didn’t originally support NVMe, but Red Hat added it as a backported feature (on Linux kernel version 2.6.32) in RHEL 6.5. RHEL and CentOS 7 (released June 2014) or newer works well.
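As a rough illustration of the version thresholds above, the following sketch compares a kernel release string against the mainline milestones. The function names are made up for this example. It only looks at the version number, so it will wrongly report "no NVMe" for distribution kernels with backported drivers, such as the 2.6.32 kernel in RHEL 6.5.

```python
import platform
import re
from typing import Tuple

# Mainline kernel milestones mentioned in the text above.
NVME_ADDED = (3, 3)     # first mainline release with NVMe support
BLK_MQ_ADDED = (3, 13)  # release with the major performance changes


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '5.4.0-42-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def mainline_has_nvme(release: str) -> bool:
    """True if a mainline kernel of this version includes NVMe support."""
    return kernel_version(release) >= NVME_ADDED


if __name__ == "__main__":
    # platform.release() returns the running kernel release (as uname -r does).
    release = platform.release()
    print(release, "NVMe in mainline:", mainline_has_nvme(release))
```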

Paths in /dev

Device paths are /dev/nvme0n1 instead of /dev/sda.

The first number is the device identifier, the second number is the namespace. Linux uses numbers to identify NVMe devices rather than letters.

One hardware device can have multiple namespaces identified by numbers - these are kind of like hardware level partitions that can be allocated various slices of the hardware’s storage capacity.

The main use for namespaces is to allocate them to virtual machines at a hardware level. There are potentially some performance benefits as each VM has its own hardware level IO queues, rather than the VM host needing to centralise IO from multiple VMs before passing it on to the hardware.

Paths to partitions are /dev/nvme0n1p1 instead of /dev/sda1. Partition numbers have the “p” prefix to separate them from the namespace number.
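The naming scheme described above can be sketched as a small parser. This is an illustrative helper rather than anything shipped with Linux; the names `NvmePath` and `parse_nvme_path` are invented for the example.

```python
import re
from typing import NamedTuple, Optional

# /dev/nvme<device>n<namespace>, optionally followed by p<partition>
_NVME_PATH = re.compile(r"^/dev/nvme(\d+)n(\d+)(?:p(\d+))?$")


class NvmePath(NamedTuple):
    device: int               # first number: device identifier
    namespace: int            # second number: namespace
    partition: Optional[int]  # number after "p", or None for the whole namespace


def parse_nvme_path(path: str) -> NvmePath:
    """Split an NVMe block device path into its numeric components."""
    match = _NVME_PATH.match(path)
    if match is None:
        raise ValueError(f"not an NVMe block device path: {path!r}")
    device, namespace, partition = match.groups()
    return NvmePath(
        int(device),
        int(namespace),
        int(partition) if partition is not None else None,
    )


if __name__ == "__main__":
    print(parse_nvme_path("/dev/nvme0n1p1"))
```

For `/dev/nvme0n1p1` this gives device 0, namespace 1, partition 1, while a SATA style path such as `/dev/sda1` is rejected with a `ValueError`.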

Partitioning tools

Most partitioning tools will work, but newer versions usually add explicit support and nicer display for NVMe disks. There’s also a new nvme-cli tool (run as the nvme command) which is NVMe specific, and controls NVMe specific features that didn’t exist or are significantly different on other drive types like SATA and SAS.

Partition table types are the same as for other drive types - GPT as the modern standard, and MBR (msdos) for backwards compatibility where needed. MBR only supports up to 2TB size. Generally, use GPT unless there’s some particular reason not to.
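The 2TB figure follows from MBR storing partition start and length as 32-bit sector counts. A quick sketch of the arithmetic, assuming the common 512-byte logical sector size (the function name is made up for this example):

```python
SECTOR_COUNT_BITS = 32  # MBR partition entries use 32-bit sector numbers


def mbr_max_bytes(sector_size: int = 512) -> int:
    """Largest size addressable by an MBR partition table entry."""
    return (2 ** SECTOR_COUNT_BITS) * sector_size


if __name__ == "__main__":
    tib = 1024 ** 4
    print(mbr_max_bytes() / tib, "TiB with 512-byte sectors")
    print(mbr_max_bytes(4096) / tib, "TiB with 4096-byte sectors")
```

With 512-byte sectors the limit works out to 2 TiB. Drives that expose 4096-byte logical sectors raise the limit, but GPT is still the simpler choice.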

Common partitioning tools:

Smartctl added support in version 6.5, released May 2016.
GNU Parted added support in June 2016, but there wasn’t a release including these changes until 3.3 in October 2019. Some Linux distributions use patched versions of parted, so that NVMe support is included in the distribution packages. Earlier versions do work ok, but they show “(unknown)” when displaying the hardware type.
GParted (the GUI wrapper for GNU Parted) added support in version 0.24, released October 2015. Earlier versions didn’t include NVMe devices in the list of available disks, so were difficult to use.

`nvme-cli` commands

These will all need to be run as root or using sudo, since they’re low level hardware control commands.

/dev/nvme0n1 is used as an example here - insert the appropriate device path instead.

nvme help lists the commands available.

nvme list displays the installed NVMe drives and basic information such as device path, model and serial number.

nvme list-ns /dev/nvme0n1 lists namespaces.

nvme id-ctrl /dev/nvme0n1 displays detailed information about the controller chip.

nvme smart-log /dev/nvme0n1 displays disk health information such as read/write counts, temperature, power cycles, error counts and amount of spare flash area available.
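For scripting, the same commands can be called from Python. The sketch below assumes an nvme-cli version that accepts `--output-format=json` for the subcommand in question (`list`, `id-ctrl` and `smart-log` do in the versions I am aware of). The helper names are invented for this example, and the JSON field names are deliberately not relied on because they can differ between nvme-cli versions.

```python
import json
import subprocess
from typing import Any, List, Optional


def nvme_command(subcommand: str, device: Optional[str] = None) -> List[str]:
    """Build an nvme-cli argument list that requests JSON output."""
    args = ["nvme", subcommand]
    if device is not None:
        args.append(device)
    args.append("--output-format=json")
    return args


def run_nvme(subcommand: str, device: Optional[str] = None) -> Any:
    """Run the command and decode its JSON output.

    Needs root (or sudo) and nvme-cli installed. Raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        nvme_command(subcommand, device),
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

For example, `run_nvme("smart-log", "/dev/nvme0n1")` would return the health information as a Python dictionary, and `run_nvme("list")` the installed drives.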

Filesystems

Linux filesystems should all “just work” and nothing specific is needed for NVMe compared to other drive types.

ext4 and XFS both work well. XFS may be slightly faster on high performance NVMe drives as it generally handles parallel workloads better, but there’s not a lot of difference between the two filesystems.

BTRFS has an “ssd mode” which automatically configures the filesystem to be more appropriate for use on SSD rather than mechanical hard disk. This “ssd mode” works the same on NVMe as with other SSDs such as SATA.
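One way to check whether “ssd mode” is active is to look for the `ssd` mount option on mounted Btrfs filesystems. The sketch below parses text in the `/proc/mounts` format; the function name is made up for this example, and it only checks for the plain `ssd` option, not related options such as `ssd_spread`.

```python
from typing import Dict


def btrfs_ssd_mode(mounts_text: str) -> Dict[str, bool]:
    """Map each Btrfs mount point to whether the 'ssd' option is set.

    mounts_text is expected in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "btrfs":
            continue  # skip malformed lines and other filesystems
        mount_point = fields[1]
        options = fields[3].split(",")
        result[mount_point] = "ssd" in options
    return result


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        print(btrfs_ssd_mode(mounts.read()))
```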

Disk schedulers

Short story: use a recent Linux version for best performance, leave the rest to your distribution’s defaults unless you have a special need.

Older disk interfaces, both SATA (AHCI) and SAS, rely on the CPU centrally managing disk requests. This originated in systems with only a single CPU core, and on multi-core systems it came to be managed by locks and queueing, so that there’s a consistent whole-system disk queue state. The disk scheduler gathers all read and write requests into a single location, then carries out some optimisations on the queue to increase performance. These optimisations were generally targeted at mechanical hard disks: re-ordering requests to minimise the number of disk head movements needed, and prioritisation to share disk access time fairly between multiple processes using the disk. They were helpful when disks were relatively slow compared to the CPU, but with a high performance SSD, the CPU time spent on scheduler optimisations becomes a performance limitation - it’s better to just pass the work to the hardware as-is than to spend time trying to optimise it.

When NVMe was introduced in Linux 3.3 (March 2012), it used the existing disk scheduler infrastructure, but this led to not being able to take full advantage of the hardware. It was previously possible to reach these limitations, but only with a large RAID array of SATA or SAS SSDs, which was rare at the time. As high performance storage hardware (including NVMe) became more common, Linux 3.13 (January 2014) introduced a new request layer called blk-mq - block multiqueue. The name refers to its handling of multiple parallel queues and the option to pass these multiple queues direct to NVMe hardware, rather than having a single centralised request queue. Initially blk-mq was only for NVMe and some other specialised storage hardware, but over time all the previous functionality was moved across to blk-mq. Linux 5.0 (March 2019) removed the traditional single queue disk schedulers and everything (even mechanical hard disks) now uses blk-mq.

There are four blk-mq scheduler options:

none - no scheduler, pass everything through to hardware as-is.
mq-deadline - operations that exceed a deadline time get re-allocated higher priority. Minimal processing time, default for fast storage.
bfq - “budget fair queuing”, targeted at slower disks, prioritising fairness between processes and response time. More CPU processing time, but the processing is worthwhile for mechanical hard disks and slow SSDs.
kyber - reuses latency/queue size concepts from network interfaces. Minimal processing time, an alternative option for fast storage.
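To see which of these schedulers a disk is using, read `/sys/block/<device>/queue/scheduler`. The kernel lists the available schedulers on one line and marks the active one with square brackets. A minimal sketch of parsing that line follows; the function names are invented for this example, and `nvme0n1` is just the example device used earlier.

```python
from pathlib import Path
from typing import List, Tuple


def parse_scheduler(text: str) -> Tuple[str, List[str]]:
    """Return (active, available) from a line like '[none] mq-deadline kyber bfq'."""
    active = ""
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets marking the active one
            active = token
        available.append(token)
    return active, available


def read_scheduler(device: str = "nvme0n1") -> Tuple[str, List[str]]:
    """Read the scheduler setting for a block device from sysfs."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return parse_scheduler(path.read_text())


if __name__ == "__main__":
    active, available = read_scheduler()
    print(f"active: {active}, available: {', '.join(available)}")
```

Which schedulers appear in the list depends on what the running kernel has built in or loaded as modules, so not all four are guaranteed to be shown.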