Modern command line tools

2020-11-10  Tags: unix, linux, software, shell

The basic unix style command line tools, such as cat, cut and ls, have been around since the 1970s. Various additional features have been added to them over the years, such as BSD extensions and GNU style --long-options, but the basic tools still function the same way they did in the early 1970s. Evolution and improvement in this area is generally slow compared to many other fields of computing, but there have been some useful new tools created in the last 10-20 years.

These tools mostly (but not always) work with unix pipes and command lines. All are commonly available in Linux distribution packages.

pv - pipe viewer

pv displays the flow of data through a pipe, usually for interactive viewing on a terminal. It shows the current speed, how long the transfer has been running for, and (if possible) a progress bar and completion ETA.

pv is handy for long running transfers where a progress indication is useful, such as to check on network or disk speed. It can handle multiple gigabytes per second of data piped through it, so generally won’t be a performance limitation, except in very extreme cases.

The progress bar and completion ETA depend on knowing ahead of time how much data will be transferred. In some situations, such as piping a normal file, pv is able to detect the size. In other situations, such as piped output from another program, the user needs to provide the expected size as a command line option, otherwise no progress bar and ETA will be shown, just the transfer time and current speed.


# Raw disk to image file copy
sudo pv /dev/nvme0n1p1 > ~/disk.img

# tar a directory on the local machine, use pv to show progress of expected 100GB size
# then pipe the tar over SSH to another machine and unpack the tar
tar -cf - data | pv -s 100g | ssh otherbox 'cd dest-dir; tar -xf -'

mbuffer - measuring buffer

mbuffer inserts a buffer in a pipe, which can smooth out or limit the transfer rate through the pipe.

It can help join fast sender/slow receiver programs (and vice versa). This situation often happens when writing to slow USB sticks on Linux: the in-memory write buffer can be bigger than the whole stick’s capacity, so data appears to copy quickly, but then takes a long time to synchronise at the end of the copy. mbuffer can also be used for network transfers, to avoid completely saturating a network that has bad latency behaviour, such as wifi.


# Copy disk to image file with an 8MB memory buffer, limiting read speed to 4MB/second
sudo pv /dev/nvme0n1p1 | mbuffer -m 8M -r 4M > /mnt/slowusb/disk.img

sponge - soak up output of pipe

sponge reads its input until the end is reached, before passing it all to its output stream. The intermediate storage is kept in a temporary file (default: in /tmp). sponge is particularly useful when writing output back to the same file used as input, when this would usually cause the input file to be truncated before it is all processed.


# Sort a text file, with the output going in the original file name
sort words.txt | sponge words.txt
# Longer pipelines work too
grep animal words.txt | sort | sponge words.txt
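The failure mode that sponge avoids can be seen with plain redirection: the shell opens and truncates the output file before the first command ever reads it, so the data is lost. A small sketch using a throwaway file:

```shell
# Without sponge: the shell truncates the target before sort runs,
# so sort reads an already-empty file and the original data is lost
printf 'banana\napple\n' > /tmp/sponge_demo.txt
sort /tmp/sponge_demo.txt > /tmp/sponge_demo.txt
wc -c < /tmp/sponge_demo.txt   # the file is now empty
```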

parallel - run tasks in parallel

There are a number of “run commands in parallel” tools, but the most common of them is GNU parallel. These tools manage parallel jobs, and can even distribute running commands across machines via SSH.

The command options are complex and can be fiddly to get right. However, the examples section of the man page is really good, and usually one of the examples can be adapted for the intended usage. For basic usage, the -P parallel jobs option of xargs often provides enough functionality and can be simpler to use.


# Sort many text files, replace .txt extension with .sorted
parallel sort -o {.}.sorted {} ::: *.txt
# Use quotes to enclose commands with pipes
parallel "cut -f 2 -d ',' {} > {.}.cut2" ::: *.csv
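The xargs -P alternative mentioned above can handle the same kind of parallel sort; a minimal sketch using a scratch directory and made-up file contents:

```shell
# Scratch files to sort (made-up contents)
mkdir -p /tmp/xargs_demo && cd /tmp/xargs_demo
printf 'b\na\n' > one.txt
printf 'd\nc\n' > two.txt
# Run up to 4 sorts at once; -I {} substitutes each filename into the command
printf '%s\n' *.txt | xargs -P 4 -I {} sort -o {}.sorted {}
cat /tmp/xargs_demo/one.txt.sorted
```

Note that unlike parallel's {.}, xargs -I has no way to strip the extension, so this appends .sorted to the full name instead.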

wdiff - word diff

Identify differences at the word level, rather than line by line as is usual for the diff command. This makes small changes (such as a few words amongst a paragraph) much more visible. By default, changes are displayed as [-before-]{+after+}. Combine with colordiff (below) for coloured output.

Similar functionality is built into git, such as git diff --word-diff


# basic usage is the same as standard diff
wdiff a.txt b.txt

colordiff - add colour to diff output

Adding colour to diffs makes it easier to distinguish added and removed sections. Recent versions of GNU diff include colour functionality (diff --color) rather than needing to use this separate tool.


# Use colordiff to wrap diff
colordiff file1 file2
# Pipe diff output to colordiff
diff -u file1 file2 | colordiff
# Combine colordiff with wdiff (see above)
wdiff -n file1 file2 | colordiff

fish - friendly interactive shell

Fish shell is targeted at interactive use. It comes with built in functionality that’s often added onto other shells, such as history based autosuggestion, completion based on man pages, syntax highlighting and git status prompt.

Fish shell has a different syntax to POSIX sh shells like bash and zsh - this syntax is nicer for short one-liner type scripts. Variables are not further split after substitution, so there’s no need to put double quotes around all variables to make scripts handle filenames containing spaces safely.


# POSIX shell
for i in *.txt; do
    if [ -r "$i" ]; then
        cat "$i"
    fi
done
# fish shell
for i in *.txt
    if test -r $i
        cat $i
    end
end
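The word splitting that fish avoids can be demonstrated in any POSIX shell; a small sketch with a made-up filename containing a space:

```shell
# POSIX shells split unquoted variables on whitespace after substitution
f="my file.txt"
set -- $f   # $f expands to two words: "my" and "file.txt"
echo $#     # prints 2 - the single filename became two arguments
```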

jq - json query

jq is like sed for JSON data. It provides filtering and rewriting capability, but based on the JSON object/array structure rather than being line based like traditional UNIX tools. It has its own filter syntax, rather than regular expressions.


# Get some JSON data from a HTTP API (a weather.gov forecast endpoint)
curl 'https://api.weather.gov/gridpoints/…,75/forecast' > wx.json
# Pretty print the JSON API response (the . filter passes data through unchanged)
jq . wx.json
# Also works via pipe
cat wx.json | jq .
# Select from the object hierarchy
jq '.properties.periods[0].detailedForecast' wx.json
# Filter and rename object names/values
jq '.properties.periods[] | {start: .startTime, wind: .windDirection}' wx.json
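The rewriting side of jq mentioned above uses the |= update operator, which applies a filter to an existing value in place. A minimal sketch on inline made-up JSON (not the weather response):

```shell
# |= replaces a value with the result of running a filter on it:
# here, the array under "temps" is replaced by its sorted form
echo '{"temps":[3,1,2]}' | jq '.temps |= sort'
```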

ack - beyond grep

ack is targeted at searching source code recursively, rather than the generic text searching provided by grep. By default, it skips repository directories (eg. .git) and binary files. Ack is perl based, so the regular expression syntax matches programming languages that use PCRE style regexes.


# search for python 'from x import y' style imports
ack 'from [\w\.]+ import \w+'
# search for literal string rather than regex
ack -Q 'printf('
# search only files with C++ like extensions
ack --cpp iostream

rg - ripgrep, improved grep

Ripgrep is a source code search tool similar to ack, but with a particular focus on quickly searching large numbers of files, handling unicode text, and the ability to pre-filter input data. rg searches files in parallel, which takes advantage of multiple CPU cores when the files have already been read from storage and are in the memory cache. When the files are not in cache, the parallel operation helps keep SSDs busy reading rather than waiting for processing between each read.


# default recursive search through files and subdirectories
rg M_PI
# search for duplicated words in text files using PCRE regex
rg -P '(\w+) \1' *.txt
# search inside compressed files (gzip, bzip2, lz4, xz)
rg -z total

exa - enhanced file listing

exa provides similar functionality to ls, but targeted at interactive viewing. The main advantage over the standard ls is git integration, which displays the git status of files as part of the listing. There’s an option to show column headers, which is handy for reading the output, but will cause problems for programs that process the output, such as later stages of a pipe.

Replacing ls with exa will break some shell scripts, but using it in convenience aliases such as ll and la is no problem.


# long listing with git status
exa --long --git
# alias to save typing
alias ll='exa --long --header --git --time-style=long-iso'

xsv - csv processor

xsv is a tool for processing CSV files. Much of the functionality can be obtained from traditional UNIX text processing tools such as cut and grep, but those need extra options to handle CSV rather than that being the default (for example, cut -d ','), and they still trip over quoted fields containing the delimiter. There’s an option to pre-index CSV files, which makes operations that seek to particular lines/rows significantly faster. xsv can also pretty-print CSV, which makes it much easier to read in the terminal compared to mentally handling varying column widths and alignment.


# pretty print a CSV file as an aligned table
xsv table statistics.csv
# cut out columns based on headers
xsv select id,total statistics.csv
# randomly choose 10 example lines and pretty print
xsv sample 10 statistics.csv | xsv table

entr - event notify test runner

entr monitors a group of files and automatically runs a command when any of the files is changed.

The command line interface is a bit unusual in that the list of files to monitor is provided via stdin and the command to run is provided as part of the command line.


# watch for changes to python source files and run pytest on change
# note: not smart enough to skip version control directories such as .git and .svn
find . -name '*.py' | entr pytest
# use ack file type listing to identify C++ source files
# then entr with a shell interpreter
ack -t cpp -f | entr -s 'make && make test'