
Programming by Stealth

A blog and podcast series by Bart Busschots & Allison Sheridan.

PBS 156 of X: Extracting Data with jq (jq)

25 Nov 2023

In the previous instalment we got a glimpse of what jq can do, and we looked at some examples of jq in action, but we didn’t explain the code in any of the filters. We did draw attention to how dense the language was, and how much opportunity there is for confusion. In an attempt to avoid confusion, we’re going to learn the jq language in a slow and incremental way, building on our knowledge as we go.

We’re going to start our jq journey today by exploring how we can use jq to extract specific pieces of information from JSON files.

Matching Podcast Episodes

Listen along to this instalment on episode 779 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript: CCATP_2023_11_25

Episode Resources

It Starts with .

In jq, a leading period (.) represents the item currently being processed.

So, the jq filter '.' simply means the entire input.

Generally speaking, a JSON data structure has a top-level element that is a dictionary or an array, so to extract a specific piece of data, we need to descend into the current item, which we do by adding extra syntax to the right of the leading dot.
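As a minimal sketch of the identity filter, we can pipe a small piece of JSON straight into jq. The JSON here is made up purely for illustration, and the -c (compact output) flag is only used to keep the output on one line:

```shell
# Hypothetical inline JSON, used only for illustration.
echo '{"name": "demo", "tags": ["a", "b"]}' | jq -c '.'
# prints: {"name":"demo","tags":["a","b"]}
```

The filter '.' hands back the whole input unchanged, which is why everything else we write starts from that leading dot.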

Descending into Dictionaries

To access a given property in a dictionary we simply use its name.

In the instalment ZIP you’ll find an example NPM package information file (this-ti.me-package.json), which has a dictionary as the top-level element. We can access the value stored under the key name with the jq filter '.name':

jq '.name' this-ti.me-package.json

If a dictionary contains more dictionaries, we can keep descending key by key by chaining the keys together with further dots. For example, in the sample NPM package file the top-level key bugs has a value that’s also a dictionary, and that dictionary contains the key url. We can access that key’s value with the jq filter '.bugs.url':

jq '.bugs.url' this-ti.me-package.json

This basic syntax works great when the keys are free from special characters, but what if the keys have spaces or other special characters in them? In that case, we need to double-quote the key name. As an example, the dependencies key in the example NPM package file has a value that is another dictionary, and many of its keys contain special characters. We can access them by double-quoting the key names, e.g.:

jq '.dependencies."is-it-check"' this-ti.me-package.json
jq '.dependencies."@fontawesome/fontawesome-free"' this-ti.me-package.json
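If you don’t have the instalment ZIP to hand, the same ideas can be tried against a small stand-in. The JSON below is a hypothetical, cut-down package file, not the real this-ti.me-package.json. The last line uses jq’s bracket form for keys, which should behave the same as the dotted, double-quoted form:

```shell
# Hypothetical stand-in for an NPM package file (not the real sample file).
json='{"name": "demo", "bugs": {"url": "https://example.com/issues"}, "dependencies": {"is-it-check": "^1.0.0"}}'

echo "$json" | jq '.name'                          # "demo"
echo "$json" | jq '.bugs.url'                      # "https://example.com/issues"
echo "$json" | jq '.dependencies."is-it-check"'    # "^1.0.0"
echo "$json" | jq '.dependencies["is-it-check"]'   # "^1.0.0" (bracket form)
```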

Descending into Arrays

To descend into an array, the syntax is very JavaScript-like: append a numeric index wrapped in square brackets. Note that indexes start at zero, as is the norm in most programming languages.

In the instalment ZIP you’ll find a JSON data file holding information on all the Nobel prizes (NobelPrizes.json). The file is structured as a dictionary defining a single key, prizes, which is an array of dictionaries, one for each prize. The array is sorted in reverse-chronological order, so the most recent prize is the first item in the array. We can access this most recent prize with the following command:

jq '.prizes[0]' NobelPrizes.json

A nice extra feature offered by jq is negative indexes to count from the end of the array, with -1 being the last element, -2 the second last, etc.

Using our Nobel prizes database again, we can get the first ever prize with the command:

jq '.prizes[-1]' NobelPrizes.json
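To see positive and negative indexes side by side without the full data file, here is a small sketch using made-up data. The prizes array below is a hypothetical two-item stand-in, not the real NobelPrizes.json:

```shell
# Hypothetical two-item stand-in for the prizes data.
json='{"prizes": [{"year": "2023"}, {"year": "1901"}]}'

echo "$json" | jq '.prizes[0].year'    # "2023" (first item)
echo "$json" | jq '.prizes[-1].year'   # "1901" (last item)
echo "$json" | jq '.prizes[5]'         # null   (no such index)
```

Notice that index and key lookups can be chained freely, and that asking for an index past the end of the array gives null rather than an error.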

Slicing Arrays

As well as extracting a single value from an array, we can also extract a sub-set of the original array as a new array. This is equivalent to the .slice() function many programming languages provide.

To specify a slice you use two array indexes separated by a colon (:). Annoyingly, the range is not inclusive at both ends: you specify the first index to include, and the index after the last one you want.

This can be very confusing, so to see how it works, let’s pipe a JSON representation of an array of the numbers from zero to five to jq and extract different slices.

Let’s start by extracting the first three elements. To do that we specify the first index to include in our slice, 0, and the index after the last element we want. Since we want indexes 0, 1 & 2, we specify 3 as the end of our slice:

echo '[0, 1, 2, 3, 4, 5]' | jq '.[0:3]'

This outputs:

[
  0,
  1,
  2
]

You might think this means the second number is a length not an index, but no, and we can prove it by selecting the second to fourth elements instead with:

echo '[0, 1, 2, 3, 4, 5]' | jq '.[1:4]'

This outputs:

[
  1,
  2,
  3
]

If the second number were a length rather than the index after the end, then this would have returned four numbers, not three!

Thankfully, you can get the last part of an array by simply omitting the second index, so we can get the elements from the third up to and including the end of the array with:

echo '[0, 1, 2, 3, 4, 5]' | jq '.[2:]'

Using a negative index on the end allows you to specify the number of elements to omit from the end, so all but the last two elements can be extracted with:

echo '[0, 1, 2, 3, 4, 5]' | jq '.[0:-2]'

This outputs:

[
  0,
  1,
  2,
  3
]

Finally, if you want to start from the start of the array, you can simply omit the first index, so the command above is equivalent to:

echo '[0, 1, 2, 3, 4, 5]' | jq '.[:-2]'

Note that unlike some slice functions, you can’t reverse the order. If the second index would be before the first in the array, an empty array is returned.
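The slicing rules can be summarised in a short sketch using the same zero-to-five array. The -c flag is only there to print each result on a single line:

```shell
nums='[0, 1, 2, 3, 4, 5]'

echo "$nums" | jq -c '.[1:4]'   # [1,2,3]  start included, end excluded
echo "$nums" | jq -c '.[-2:]'   # [4,5]    negative start counts from the end
echo "$nums" | jq -c '.[4:2]'   # []       end before start gives an empty array
```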

Output Un-Quoted Strings

By default, jq outputs all values in JSON syntax. For data structures, i.e. arrays and dictionaries, it’s difficult to see what else jq could do, but for single values the default behaviour can cause problems with strings.

From the shell’s point of view, the JSON representations of booleans and numbers are indistinguishable from plain strings, but JSON strings are wrapped in double quotation marks, which are often unwanted when using the output from jq in shell scripts or on the command line.

This is where the --raw-output or -r flag comes into play.

As an example, let’s switch back to our example NPM package file, and extract the author using jq’s defaults:

jq '.author' this-ti.me-package.json

If we try to use this in another terminal command we soon realise those quotation marks are not what we want:

echo "Check out this cool tool by $(jq '.author' this-ti.me-package.json)"

This outputs the following, which makes it look like I’m not the real author or something (ironic quotes):

Check out this cool tool by "Bart Busschots"

To remove the quotes we can add the -r flag:

echo "Check out this cool tool by $(jq -r '.author' this-ti.me-package.json)"

Now we get the output we want:

Check out this cool tool by Bart Busschots
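A common pattern in scripts is to capture a raw value in a shell variable. The sketch below uses a hypothetical inline JSON snippet rather than the sample file, and the variable names are just illustrative:

```shell
# Hypothetical inline JSON standing in for a package file.
json='{"author": "Bart Busschots", "downloads": 42}'

author=$(echo "$json" | jq -r '.author')
echo "Author: $author"                 # Author: Bart Busschots

# Numbers are unaffected by -r because they were never quoted to begin with.
echo "$json" | jq -r '.downloads'      # 42
```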

jq Works in Parallel

If you pass jq input with multiple top-level JSON items, whether that be from STDIN, from a single file, or from multiple files, jq runs its filter on each separately, and outputs all the answers on separate lines.

In the instalment folder you’ll find two JSON files containing IP information, one for bartb.ie (ip-bartb.json), and one for podfeet.com (ip-podfeet.json). Both contain a single JSON dictionary.

If we pipe both of these files into jq we’ll see that it outputs one dictionary after the other with nothing but a newline character separating the end of the first dictionary from the start of the second:

cat ip* | jq

If we add a jq filter to extract the continent code we’ll see we get two strings, one for each top-level JSON item in the input. The command:

cat ip* | jq '.continentCode'

Produces the output:

"EU"
"AM"

Sometimes it’s useful to automatically combine all the inputs into a single top-level JSON array so the filter only gets applied once. You can use the --slurp or -s flag to do that.

We can see that if we slurp our two IP details JSON files into jq we now get a single output which is an array containing the objects from both files:

cat ip* | jq -s
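To make the difference between the default behaviour and slurping concrete, here is a sketch that feeds jq two tiny, made-up inputs with printf instead of the real IP files:

```shell
# Two hypothetical top-level JSON items, one per line.
inputs() { printf '%s\n' '{"continentCode": "EU"}' '{"continentCode": "AM"}'; }

# Default: the filter runs once per input, giving two outputs.
inputs | jq '.continentCode'
# "EU"
# "AM"

# Slurped: both inputs are wrapped in one array, so the filter runs once.
inputs | jq -c -s '.'         # [{"continentCode":"EU"},{"continentCode":"AM"}]
inputs | jq -s 'length'       # 2
```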

It’s also possible for a jq filter to expand a single input into multiple outputs. When a filter expands one input into many, and when there are many inputs, all outputs from the first input will appear before those from the second, and so on.

There are many ways in which a jq filter can expand outputs, but we’ll look at just two in this instalment.

Extracting Multiple Values with ,

To extract multiple values from a single input, simply separate the values you want to extract with a comma (,). For example, we can extract the name and version from our example NPM package file with:

jq '.name, .version' this-ti.me-package.json

We can similarly extract the city, continent name, and continent code from our two sample IP data files with:

cat ip* | jq '.cityName, .continent, .continentCode'

This outputs:

"Amsterdam"
"Europe"
"EU"
"San Francisco"
"Americas"
"AM"

As you can see, we first get all three outputs from the first input, then all three values from the second.
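The ordering is easy to verify with a pair of tiny, made-up inputs. The keys city and code here are hypothetical and chosen only to keep the example short:

```shell
# Two hypothetical inputs, one per line.
printf '%s\n' '{"city": "A", "code": 1}' '{"city": "B", "code": 2}' \
  | jq '.city, .code'
# "A"
# 1
# "B"
# 2
```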

Exploding Arrays with []

Another way to get multiple outputs from a single input is to extract multiple elements from a single array. This is done with variations on the syntax for extracting a single element.

Firstly, to extract all the elements simply use a completely blank array index, i.e. append []. Switching back to our example NPM package file, we can explode the array of keywords into separate values with the command:

jq '.keywords[]' this-ti.me-package.json

This produces the output:

"JavaScript"
"timezones"

We can also extract multiple specific values by separating indexes with a comma. Switching back to our Nobel Prizes data file, we can get the first and last prizes with:

jq '.prizes[0,-1]' NobelPrizes.json
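Both forms can be tried on a small, made-up array. The keywords below are hypothetical and are not the ones in the sample package file:

```shell
# Hypothetical keywords array.
json='{"keywords": ["x", "y", "z"]}'

# Explode the whole array into separate outputs.
echo "$json" | jq '.keywords[]'
# "x"
# "y"
# "z"

# Pick out just the first and last elements.
echo "$json" | jq '.keywords[0,-1]'
# "x"
# "z"
```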

Suppressing Errors with ?

Before we start triggering errors, note that when you ask jq to extract an element that does not exist you get null as the result. Let's use our example NPM package file for these examples — if we ask for the value from the non-existent top-level key waffles we get null:

jq '.waffles' this-ti.me-package.json

We also get null if we treat this non-existent key as a dictionary or even an array:

jq '.waffles.pancakes' this-ti.me-package.json
jq '.waffles[1]' this-ti.me-package.json

Where we start to get errors by default is when we try to explode a non-existent array into its values. For example, when we use the command below to try to explode the non-existent array waffles, we get the error Cannot iterate over null:

jq '.waffles[]' this-ti.me-package.json

We can append a ? to suppress this error, in which case the filter will return no output at all rather than throwing an error:

jq '.waffles[]?' this-ti.me-package.json
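In a script, the difference shows up in jq’s exit status as well as its output. The sketch below uses a hypothetical one-key JSON snippet and hides jq’s error message with 2>/dev/null so that only our own messages are printed:

```shell
# Hypothetical input with no waffles key.
json='{"name": "demo"}'

# Without ?, exploding the missing array makes jq exit with a non-zero status.
if ! echo "$json" | jq '.waffles[]' 2>/dev/null; then
  echo "jq reported an error"
fi

# With ?, the error is suppressed and there is simply no output.
result=$(echo "$json" | jq '.waffles[]?')
if [ -z "$result" ]; then
  echo "no output, no error"
fi
```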

Final Thoughts

We’ve now seen how to use jq filters to reach into JSON data structures and extract specific pieces of information. We’ve also seen that the jq command can process multiple inputs in parallel, and produce multiple outputs. More than that, we’ve seen that jq filters can expand single inputs into multiple outputs.

This support for parallelism is critical to the next conceptual leaps — jq filter chaining, and jq functions. These are the next concepts we’ll be exploring in the series.

Join the Community

Find us in the PBS channel on the Podfeet Slack.
