PBS 156 of X: Extracting Data with jq (jq)
In the previous instalment we got a glimpse of what jq can do, and we looked at some examples of jq
in action, but we didn’t explain the code in any of the filters. We did draw attention to how dense the language was, and how much opportunity there is for confusion. In an attempt to avoid confusion, we’re going to learn the jq language in a slow and incremental way, building on our knowledge as we go.
We’re going to start our jq journey today by exploring how we can use jq
to extract specific pieces of information from JSON files.
Matching Podcast Episodes
Listen along to this instalment on episode 779 of the Chit Chat Across the Pond Podcast.
You can also Download the MP3
Read an unedited, auto-generated transcript: CCATP_2023_11_25
Episode Resources
- The instalment ZIP file — pbs156.zip
It Starts with .
In jq, a leading period (.
) represents the item currently being processed.
So, the jq filter '.'
simply means the entire input.
Generally speaking, a JSON data structure has a top-level element that is a dictionary or an array, so to extract a specific piece of data, we need to descend into the current item, which we do by adding extra syntax to the right of the leading dot.
Descending into Dictionaries
To access a given property in a dictionary we simply use its name.
In the instalment ZIP you’ll find an example NPM package information file (this-ti.me-package.json
), which has a dictionary as the top-level element. We can access the value for the key-value pair with the key name
with the jq filter '.name'
:
jq '.name' this-ti.me-package.json
If a dictionary contains more dictionaries we can keep descending down key by key by concatenating the keys with further dots. For example, in the sample NPM package file the top-level key bugs
has a value that’s also a dictionary which contains the key url
, we can access that key’s value with the jq filter '.bugs.url'
:
jq '.bugs.url' this-ti.me-package.json
This basic syntax works great when the keys are free from special characters, but what if the keys have spaces or other special characters in them? In that case, we need to double-quote the key name. As an example, the dependencies
key in the example NPM package file has a value that is another dictionary, and it has many key-value pairs with special characters in their key name, we can access them all by double-quoting the key names, e.g.:
jq '.dependencies."is-it-check"' this-ti.me-package.json
jq '.dependencies."@fontawesome/fontawesome-free"' this-ti.me-package.json
Descending into Arrays
To descend into an array, the syntax is very Javascript-like — append a numeric index wrapped by square brackets. Note that indexes start at zero as is the norm for programming languages.
In the instalment ZIP you’ll find a JSON data file holding information on all the Nobel prizes (NobelPrizes.json
). The file is structured as a dictionary defining a single key, prizes
which is an array of dictionaries, one for each prize. The array is sorted in reverse-chronological order, so the most recent prize is the first item in the array, we can access this most recent prize with the following command:
jq '.prizes[0]' NobelPrizes.json
A nice extra feature offered by jq is negative indexes to count from the end of the array, with -1
being the last element, -2
the second last etc..
Using our Nobel prizes database again, we can get the first ever prize with the command:
jq '.prizes[-1]' NobelPrizes.json
Slicing Arrays
As well as extracting a single value from an array, we can also extract a sub-set of the original array as a new array. This is equivalent to the .slice()
function many programming languages provide.
To specify a slice you use two array indexed separated by a colon (:
), but most annoyingly, the selection is not inclusive, instead, you specify the first index to include, and the index after the last one you want.
This is very confusing, to see how it works, let’s pipe a JSON representation of an array of the numbers from zero to five to jq
and extract different slices.
Let’s start by extracting the first three elements, to do that we specify the first index to include in our slice, 0
, and the index after the last element we want, so since we want indexes 0
, 1
& 2
, we specify 3
as the end of our slice:
echo '[0, 1, 2, 3, 4, 5]' | jq '.[0:3]'
This outputs:
[
0,
1,
2
]
You might think this means the second number is a length not an index, but no, and we can prove it by selecting the second to fourth elements instead with:
echo '[0, 1, 2, 3, 4, 5]' | jq '.[1:4]'
This outputs:
[
1,
2,
3
]
If the second number were a length rather than the index after the end, then this would have returned four numbers, not three!
Thankfully, you can get the last part of an array by simply omitting the second index, so we can get the elements from the third up to and including the end of the array with:
echo '[0, 1, 2, 3, 4, 5]' | jq '.[2:]'
Using a negative index on the end allows you to specify the number of elements to omit from the end, so all but the last two elements can be extracted with:
echo '[0, 1, 2, 3, 4, 5]' | jq '.[0:-2]'
This outputs:
[
0,
1,
2,
3
]
Finally, if you want to start from the start of the array, you can simply omit the first index, so the command above is equivalent to:
echo '[0, 1, 2, 3, 4, 5]' | jq '.[:-2]'
Note that unlike some slice functions, you can’t reverse the order, if the second index would be before the first in the array, an empty array is returned.
Output Un-Quoted Strings
By default, jq
outputs all values in JSON syntax. For data structures, i.e. arrays and dictionaries, it’s difficult to see what else jq
could do, but for single values the default behaviour can cause problems with strings.
The JSON for booleans and numbers are indistinguishable from strings from the shell’s point of view, but JSON strings are wrapped in double quotation marks, which are often unwanted when using the output from jq
in shell scripts or on the command line.
This is where the --raw-output
or -r
flag comes into play.
As an example, let’s switch back to our example NPM package file, and extract the author using jq
’s defaults:
jq '.author' this-ti.me-package.json
If we try to use this in another terminal command we soon realise those quotation marks are not what we want:
echo "Check out this cool tool by $(jq '.author' this-ti.me-package.json)"
This outputs the follow, which makes it look like I’m not the real author or something (ironic quotes):
Check out this cool tool by "Bart Busschots"
To remove the quotes we can add the -r
flag:
echo "Check out this cool tool by $(jq -r '.author' this-ti.me-package.json)"
Now we get the output we want:
Check out this cool tool by Bart Busschots
jq
Works in Parallel
If you pass jq
input with multiple top-level JSON items, whether that be from STDIN
, from a single file, or from multiple files, jq
runs its filter on each separately, and outputs all the answers on separate lines.
In the instalment folder you’ll find two JSON files containing IP information, one for bartb.ie
(ip-bartb.json
), and one for podfeet.com
(ip-podfeet.json
), both contain a single JSON dictionary.
If we pipe both of these files into jq
we’ll see that it outputs one dictionary after the other with nothing but a newline character separating the end of the first dictionary form the start of the second:
cat ip* | jq
If we add a jq filter to extract the continent code we’ll see we get two strings, one for each top-level JSON item in the input. The command:
cat ip* | jq '.continentCode'
Produces the output:
"EU"
"AM"
Sometimes it’s useful to automatically combine all the inputs into a single top-level JSON object so the filter only gets applied once, you can use the --slurp
or -s
flag to do that.
We can see that if we slurp our two IP details JSON files into jq
we now get a single output which is an array containing the objects from both files:
cat ip* | jq -s
It’s also possible for a jq filter to expand a single input into multiple outputs. When a filter expands one input into many, and when there are many inputs, all outputs from the first input will appear before those from the second etc..
There are many ways in which a jq filter can expand outputs, but we’ll look at just two in this instalment.
Extracting Multiple Values with ,
To extract multiple values from a single input, simply separate the values you want to extract with a comma (,
). For example, we can extract the name and version from our example NPM package file with:
jq '.name, .version' this-ti.me-package.json
We can similarly extract the city, continent name, and continent code from our two sample IP data files with:
cat ip* | jq '.cityName, .continent, .continentCode'
This outputs:
"Amsterdam"
"Europe"
"EU"
"San Francisco"
"Americas"
"AM"
As you can see, we first get all three outputs from the first input, then all three values from the second.
Exploding Arrays with []
Another way to get multiple outputs from a single input is extract multiple elements from a single array. This is done with variations on the syntax for extracting a single element.
Firstly, to extract all the elements simply use a completely blank array index, i.e. append []
. Switching back to our example NPM package file, we can explode the array of keywords into separate values with the command:
jq '.keywords[]' this-ti.me-package.json
This produces the output:
"JavaScript"
"timezones"
We can also extract multiple specific values by separating indexes with a comma. Switching back to our Nobel Prizes data file, we can get the first and last prizes with:
jq '.prizes[0,-1]' NobelPrizes.json
Suppressing Errors with ?
Before we start triggering errors, note that when you ask jq to extract an element that does not exist you get null
as the result. Let's use our example NPM package file for these examples — if we ask for the value from the non-existent top-level key waffles
we get null
:
jq '.waffles' this-ti.me-package.json
We also get null if we treat this non-existent key as a dictionary or even an array:
jq '.waffles.pancakes' this-ti.me-package.json
jq '.waffles[1]' this-ti.me-package.json
Where we start to get errors by default is when we try to explode a non-existent array into its values. For example, when we use the command below to try explode the non-existent array waffles
we get the error Cannot iterate over null:
jq '.waffles[]' this-ti.me-package.json
We can append a ?
suppress this error, in which case the filter will return no output at all rather than throwing an error:
jq '.waffles[]?' this-ti.me-package.json
Final Thoughts
We’ve now seen how to use jq filters to reach into JSON data structures and extract specific pieces of information. We’ve also seen that the jq
command can process multiple inputs in parallel, and produce multiple outputs. More than that, we’ve seen that jq filters can expand single inputs into multiple outputs.
This support for parallelism is critical to the next conceptual leaps — jq filter chaining, and jq functions. These are the next concepts we’ll be exploring in the series.