PBS 157 of X: Querying JSON with jq (jq)

09 Dec 2023

So far in this mini-series we’ve looked at how jq can be used to pretty-print JSON and to extract specific pieces of information. In this instalment we’re taking things to the next level by working our way towards querying JSON data structures for information as if they were a database.

By the end of this instalment we’ll be able to answer questions like ‘who won the 2000 Nobel Prize for Medicine?’ and ‘which prizes were won by people with the surname Curie?’

To get to our desired queries we’re going to need to explore three important jq concepts:

Filter chaining
Operators
Functions

Matching Podcast Episode

Listen along to this instalment on episode 781 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript: CCATP_2023_12_09

Episode Resources

The instalment ZIP file — pbs157.zip

Filter Chaining

The key to doing powerful things with jq is not writing complex filters, but combining many simple filters to build powerful flows. This is the same philosophy underpinning shell programming and terminal commands. Perhaps because it’s so heavily inspired by shell programming, jq has chosen to use very similar syntax. This is both a blessing and a curse: it makes jq filter chains intuitive to those comfortable on the terminal, but also very prone to breaking when you forget to quote your filters correctly and it’s the shell that sees your pipes, not jq!

So, a timely reminder: when using the jq command, always single-quote your filters!

In the previous instalment we learned that jq filters are designed to work in parallel on arbitrarily many inputs, and each input can be dropped, or, exploded into multiple outputs. We can use the | symbol to route the output of one jq filter to become the input to another.

As in our previous jq instalments, we’ll be using the file NobelPrizes.json (in the Instalment’s ZIP file) for our examples. It contains a JSON data structure that stores information about all Nobel Prizes up to and including those announced in 2023. At the top level there is a dictionary with one key, prizes which maps to an array of dictionaries, one for each prize.

To get started with filter chaining, let’s combine a filter to explode the array of prizes into separate values, and then chain that with a pair of very simple filters to extract the year and category of each prize:

jq '.prizes[] | .year, .category' NobelPrizes.json

In the previous instalment we saw that empty square brackets explode an array into separate values, and that we can use the comma (,) to run multiple filters against each input.

As you start to combine more and more filters, you’ll inevitably reach a point where you need to group filters so they get applied in the right order. jq uses regular parentheses for this (AKA round brackets, i.e. ()).

If we want to expand our query above to also list the surname of each recipient of each prize, we need to add another filter after .category. Since the laureates are in an array, we actually need two more filters, one to explode the laureates and one to extract the surnames. If we just add .laureates[] | .surname we have a problem: the pipe binds more loosely than the comma, so the new pipe is seen as ending the triplet of filters (.year, .category & .laureates[]) and starting a new top-level filter that gets applied to every output of that triplet, not as a sub-filter of our new third filter. We need to group these last two filters together with parentheses.

Finally, the prizes were suspended for some war years, so there are entries in the data structure without a laureates array. We need to use the ? symbol to indicate that it’s OK for .laureates[] to produce zero outputs rather than an error. Putting it all together we get:

jq '.prizes[] | .year, .category, (.laureates[]? | .surname)' NobelPrizes.json
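To see the difference the parentheses make without needing the full data file, here is a minimal sketch using a made-up two-prize sample. The sample variable and its contents are illustrative assumptions, trimmed down to the keys we care about, not a copy of NobelPrizes.json:

```shell
# Hypothetical miniature stand-in for NobelPrizes.json (one prize has no laureates)
sample='{"prizes":[{"year":"1903","category":"physics","laureates":[{"surname":"Becquerel"},{"surname":"Curie"}]},{"year":"1916","category":"physics"}]}'

# Without parentheses the second pipe applies .surname to every output of the
# comma group, including the year and category strings, so jq reports an error
echo "$sample" | jq '.prizes[] | .year, .category, .laureates[]? | .surname' || echo "jq reported an error, as expected"

# With parentheses only the exploded laureates are piped to .surname
echo "$sample" | jq '.prizes[] | .year, .category, (.laureates[]? | .surname)'
```

The second command should list the year and category of each prize, with the surnames appearing only for the prize that has laureates.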

Operators in jq

Like other programming languages, jq supports operators, i.e. symbols or keywords that apply some kind of operation to the values to their left and right to produce a new value. We’ll meet more operators later in the series, but we’ll start with those most relevant for querying data — comparison and boolean operators.

Literal Values in JSON & jq

Operators work with values, and we’re going to be extracting some of those values from the input JSON, but we’re also going to need to express literal values in our jq filters. Let’s start by refreshing our memories of the types of data JSON can represent, and how it represents them:

Type	JSON syntax
Null (a value that means ‘no value’)	`null`
Booleans	`true` and `false`
Numbers	Unquoted numeric values, e.g. `1`, `3.14`, and `-23`
Strings	Zero or more characters in double quotes, e.g. `"a string"`
Arrays	Zero or more comma-separated values of any type enclosed in square brackets, e.g. `[null, false, 11, "some String"]`
Dictionaries	Zero or more comma-separated key-value pairs wrapped in curly braces, with colons separating the keys from the values, the keys being strings, and the values being values of any type, e.g. `{"key1": "some value", "anotherKey": 42}`

Thankfully, the designers of the jq language chose to inherit the syntax for literal values directly from JSON, so we use the JSON syntax to represent the special null value, booleans, numbers, and strings within our jq filters, specifically:

Type	jq Syntax
Null	`null`
Booleans	`true` and `false`
Numbers	Unquoted numeric values, e.g. `1`, `3.14`, and `-23`
Strings	Zero or more characters in double quotes, e.g. `"a string"`

Invoking the `jq` Command Without Input

To help us experiment with the various operators, it’s useful to learn how to execute the jq command with no input. If you try to run the command without passing it any files or piping something into it, jq tries to read from the keyboard, which is not always what you want. To explicitly tell jq not to expect any input from anywhere, use the --null-input or -n flag. You’ll see this flag used extensively in the examples in this instalment.
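As a quick sanity check of what the flag does, here is a small sketch. With no input to read, the current item (.) is simply null, and literal values can be emitted directly:

```shell
# With -n there is no input, so . is null
jq -n '.' # null

# Literal values work without any file or pipe
jq -n '"waffles"' # "waffles"
jq -n '42' # 42
```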

Comparison Operators

The comparison operators all generate boolean values (true or false), and the jq language supports the usual selection of operators you’d expect.

We can check for equality or non-equality with == or !=, we can check if one number is less than another, or one string is alphabetically before another, with <, and the opposite with >, and there are the or equal to variations you’d expect too, i.e. <= and >=.

We can demonstrate these with some simple filters that don’t take any input (by calling jq with the -n flag described above), and by piping the JSON syntax for more complex data structures to the jq command:

# is equal to
jq -n '"waffles" == "waffles"' # true
jq -n '"waffles" == "pancakes"' # false

# is not equal to
jq -n '"waffles" != "waffles"' # false
jq -n '"waffles" != "pancakes"' # true

# is less than
jq -n '42 < 2' # false
jq -n '42 < 42' # false
jq -n '42 < 442' # true

# is less than or equal to
jq -n '42 <= 2' # false
jq -n '42 <= 42' # true
jq -n '42 <= 442' # true

# is greater than
jq -n '42 > 2' # true
jq -n '42 > 42' # false
jq -n '42 > 442' # false

# is greater than or equal to
jq -n '42 >= 2' # true
jq -n '42 >= 42' # true
jq -n '42 >= 442' # false

One thing to watch out for is that the jq equality operators behave like the strict equality operators in JavaScript (i.e. like === & !==) — in other words, to be considered equal, the values and types must be the same. That means that jq does not consider the number 42 and the string "42" to be equal:

jq -n '42 == 2' # false
jq -n '42 == 42' # true
jq -n '42 == "42"' # false
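The same attention to types matters for the ordering operators. As a hedged illustration (the values are made up for the example), strings are compared character by character rather than numerically, and as far as I know the ordering is by Unicode code point, so upper-case letters sort before lower-case ones:

```shell
# As strings, "10" sorts before "9" because "1" comes before "9"
jq -n '"10" < "9"' # true

# As numbers, the comparison is numeric
jq -n '10 < 9' # false

# Upper-case letters come before lower-case letters
jq -n '"Zebra" < "apple"' # true
```

This is worth remembering whenever a data file stores numbers as strings.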

Boolean Operators

The jq language allows you to combine boolean values using the and and or operators. These will cast the values they process to booleans. Every language gets to define its own set of rules for casting values of one type to values of another type, and when it comes to converting any value that’s not already a boolean to a boolean in the jq language, there is just one rule: the boolean false and the special value null are treated as false, and every other value is treated as true.

This simplicity makes jq both unusual and potentially counterintuitive. We’ve met this concept of casting values to true or false in our exploration of JavaScript, and the rules there were much more complex. Given our experience with JavaScript, I want to highlight some of the differences. In jq, the number 0, empty strings (""), empty arrays, and empty dictionaries all convert to true. In JavaScript, empty arrays and empty objects also convert to true, but the number 0 and the empty string convert to false.

Let’s prove this to ourselves by forcing jq to do boolean conversions on a collection of example values. We need to do this a little indirectly by making use of how the Boolean and operator works. If you apply the and operator to the value true and some test value, then if the test value gets converted to true by jq, the output will be true, and if jq converts the test value to false the output will be false.

OK, let’s test some values and see what we get:

# Boolean values
jq -n 'true and true' # true
jq -n 'true and false' # false

# the special value null
jq -n 'true and null' # false

# numbers, including zero
jq -n 'true and 42' # true
jq -n 'true and 0' # true

# strings, including the string "false" and the empty string
jq -n 'true and "waffles"' # true
jq -n 'true and "false"' # true
jq -n 'true and ""' # true

# arrays, including the empty array
# (the JSON syntax for the array is piped to jq as the current input, i.e. as .)
echo '[false, 0, "no"]' | jq 'true and .' # true
echo '[]' | jq 'true and .' # true

# dictionaries, including the empty dictionary
# (the JSON syntax for the dictionaries is piped to jq)
echo '{"breakfast": "pancakes", "dessert": "waffles"}' | jq 'true and .' # true
echo '{}' | jq 'true and .' # true
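The or operator follows the same casting rule, so a short sketch along the same lines may help. Only false and null count as false, so or returns true as soon as either side is anything else:

```shell
# both sides are false-like, so the result is false
jq -n 'false or false' # false
jq -n 'false or null' # false

# zero and the empty string count as true in jq
jq -n 'null or 0' # true
jq -n 'false or ""' # true
```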

Note that there isn’t a not operator in the jq language, but there is a not function, which brings us nicely to functions 🙂

Functions in jq

In jq, functions are used within filters, so they have implicit access to the filter’s input (i.e. .), which means many functions don’t need any arguments at all. To call a function without arguments, simply use its name.

As you would expect, jq functions can take optional additional inputs in the form of arguments. To call a function with arguments, append them to the function name in regular round brackets, separated by semi-colons (;). So, in jq function calls take forms like:

# no arguments
functionName

# one argument
functionName(firstArgument)

# two arguments
functionName(firstArgument; secondArgument)

When it comes to return values, jq functions behave like jq filters, producing zero, one, or more outputs.

Something that might surprise you, but will prove extremely useful, is that functions can take filters as arguments.
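To make those three call forms concrete, here is a small sketch using built-in functions that are described in more detail later in this instalment. The input array is just an illustrative value:

```shell
# no arguments: length works on the current input (.)
echo '[1, 2, 3]' | jq 'length' # 3

# one argument: a filter that gets applied to each element of the input array
echo '[1, 2, 3]' | jq 'all(. > 0)' # true

# two arguments separated by a semi-colon: a filter producing values, and a condition
echo '[1, 2, 3]' | jq 'any(.[]; . > 2)' # true
```

Notice that in the last two calls the arguments are themselves filters, not plain values.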

The `not` Function

As mentioned before, there is no operator for boolean inversion in jq. Instead, we have a function named not which takes no arguments and returns true if . evaluates to false, and false if it evaluates to true. To use not, simply add it in a filter pipeline after the boolean value to be inverted has been calculated. Here’s a rather contrived example:

jq -n 'true and true | not' # false
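Because not sits after a pipe, it inverts everything to the left of that pipe unless parentheses say otherwise. A minimal sketch of the difference, and of how not applies the same casting rule as and and or:

```shell
# the pipe is applied last, so this inverts the result of (false and false)
jq -n 'false and false | not' # true

# parentheses limit not to the second operand, giving false and true
jq -n 'false and (false | not)' # false

# null is cast to false, zero is cast to true
jq -n 'null | not' # true
jq -n '0 | not' # false
```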

Introducing the `any` and `all` Functions

not isn’t the only boolean function jq provides. There is also the very useful pair of any and all. These two functions both come in three variants, one of which we’ll ignore for now.

The first form is the simplest: the functions expect to be passed an array as the current item to be processed (.), and no arguments. Both functions will cast every value in the input array to a boolean. If any of them are true the any function will return true, if all of them are true the all function will return true, and in all other cases both functions will return false. That’s a complex way of saying “they do what their name implies they would”!

# NOTE - for efficiency, the jq command uses the comma to run two filters, one after the
#        other, each filter being just the name of the function to be called on the input
echo '[false, false, false]' | jq 'any, all' # false & false
echo '[false, true, false]' | jq 'any, all' # true & false
echo '[true, true, true]' | jq 'any, all' # true & true

While this zero-argument version of these boolean functions is useful, the one-argument version really takes their usefulness to the next level. Instead of simply checking if at least one element in an array is true, or if all elements are true, we can pass a filter as the only argument, and then these functions will first apply the filter to each element in the array they are passed, and then do their boolean logic check.

That’s more difficult to say than it is to show, so let’s use all to verify that all numbers in an array are greater than or equal to zero:

echo '[42, 3.1415, 11]' | jq 'all(. >= 0)' # true
echo '[42, 3.1415, -42]' | jq 'all(. >= 0)' # false

Remember that . always represents the item being processed by a filter, so in this case the filter passed to all gets applied three times, and the first time . is 42, the second time it’s 3.1415, and the third time it’s 11 for the first command and -42 for the second command.
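The any function accepts a filter in exactly the same way. As a small sketch with the same illustrative arrays, we can check for negative numbers, and also see what both functions do with an empty array, which is an edge case worth knowing about:

```shell
# is at least one number negative?
echo '[42, 3.1415, -42]' | jq 'any(. < 0)' # true
echo '[42, 3.1415, 11]' | jq 'any(. < 0)' # false

# with nothing to test, any is false and all is true
echo '[]' | jq 'any, all' # false & true
```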

The `length` Function

Another function you’ll often find yourself using is the length function. It can count the number of characters in a string, the number of elements in an array, or the number of key-value pairs in a dictionary:

echo '"pancakes"' | jq 'length' # 8
echo '"passé"' | jq 'length' # 5
echo '"🥞"' | jq 'length' # 1
jq -n '"e\u0301" | length' # 2 (an e followed by a combining acute accent)
echo '["pancake", "waffle", "cookie"]' | jq 'length' # 3
echo '{"day": 25, "month": 12}' | jq 'length' # 2

Note that when calculating the length of a string, jq counts Unicode code points, not bytes, so the pancake emoji has a length of 1, as does an accented character stored as a single precomposed code point. It does not count visible graphemes though, so a character built from several code points, like an e followed by a combining accent or an emoji with a skin-tone modifier, has a length greater than 1.

Querying Data with the `select` Function

We’ve been building up to this for the entire episode — we’re finally ready to start querying data!

The key to doing this is the select function, which takes a filter that produces a boolean as its one required argument and applies that filter to the current input. If the filter returns true the select function returns the input unchanged, otherwise it returns nothing at all. In effect, select lets us filter many inputs down to just those that meet criteria of our choosing.

Let’s use our Nobel Laureate data structure to demonstrate the concept. Remember, at the top level this file contains a dictionary with a single key-value pair, prizes, which is an array of dictionaries, one for each Nobel prize. To get all the prizes awarded in 2000 we can use:

jq '.prizes[] | select(.year == "2000")' NobelPrizes.json

Let’s break this down — the first thing we do is explode the array of prizes into separate values, then pipe those values to the select function in parallel. The select function runs once for each prize, and checks the year against our desired value using the equality operator, and if the year matches, that prize gets returned, otherwise, it effectively gets disappeared into oblivion. The end result is that we get only the prizes awarded in 2000. Note that the data file incorrectly encodes the years as strings, so we needed to do the same or the == operator would not have worked.
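The point about years being strings is easy to trip over, so here is a minimal sketch using a made-up two-prize sample (the sample variable is an illustrative assumption, not the real data file). It also uses the built-in tonumber function, which we have not covered in this instalment, as one possible way to compare the years numerically:

```shell
# Hypothetical miniature stand-in for NobelPrizes.json
sample='{"prizes":[{"year":"2000","category":"medicine"},{"year":"2001","category":"peace"}]}'

# comparing against the number 2000 matches nothing, because the years are strings
echo "$sample" | jq -c '.prizes[] | select(.year == 2000)'

# comparing against the string "2000" matches the first prize
echo "$sample" | jq -c '.prizes[] | select(.year == "2000")'

# converting the year to a number allows numeric comparisons
echo "$sample" | jq -c '.prizes[] | select((.year | tonumber) >= 2001)'
```

The first command prints nothing at all, the second prints only the 2000 prize, and the third prints only the 2001 prize.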

If we just want the medicine prize for 2000 we could add another select into the pipeline:

jq '.prizes[] | select(.year == "2000") | select(.category == "medicine")' NobelPrizes.json

Or, we could be a little more clever, and use the and operator to combine our two conditions into one call to select:

jq '.prizes[] | select(.year == "2000" and .category == "medicine")' NobelPrizes.json

If we only want to see the laureates, and not the rest of the detail we can add another filter to the chain to extract just those:

jq '.prizes[] | select(.year == "2000" and .category == "medicine") | .laureates[]' NobelPrizes.json
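Conditions can mix and and or too. As a hedged sketch on a made-up sample (again, the sample variable is an assumption for illustration only), parentheses make it explicit which conditions belong together:

```shell
# Hypothetical miniature stand-in for NobelPrizes.json
sample='{"prizes":[
  {"year":"2000","category":"medicine"},
  {"year":"2000","category":"peace"},
  {"year":"2000","category":"physics"},
  {"year":"2001","category":"peace"}]}'

# prizes from 2000 that are for either medicine or peace
echo "$sample" | jq -c '.prizes[] | select(.year == "2000" and (.category == "medicine" or .category == "peace"))'
```

This should print the 2000 medicine and peace prizes, but neither the 2000 physics prize nor the 2001 peace prize.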

Descending into Arrays with the Two-Argument form of `any` (or `all`)

As you can see, it’s quite straightforward to apply a condition to the top level of a dictionary and then return the entire dictionary if that condition is met, but what if you want to return the entire dictionary if any one element of an array within it meets some criteria?

It may not sound like it, but that is a very common scenario, and it’s important to understand how to do these kinds of deeper searches. As an example, let’s try to extract the full details for every Nobel prize where one of the winners was a Curie.

To illustrate why this problem is different, let’s try get the answer with just our current knowledge:

jq '.prizes[] | .laureates[] | select(.surname == "Curie")' NobelPrizes.json

This results in the following error:

jq: error (at NobelPrizes.json:0): Cannot iterate over null (null)

Why?

The errors in jq are not always as clear as they could be, but the most common cause of this particular error is an attempt to explode something that’s not an array or does not exist. We know the top-level prizes array exists, so that implies there are some prizes which don’t have any winners, is that possible?

We can use jq to answer that question for us by searching for prizes with no laureates:

jq '.prizes[] | select((.laureates | length) == 0)' NobelPrizes.json

This returns a surprising number of results, each of which has no laureates key at all, but instead has an overallMotivation key with an explanation that there was no prize awarded, and a description of what was done with the money instead.
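The reason this query can find prizes that have no laureates key at all is that the length function treats null as having a length of zero. A small sketch with made-up prize dictionaries shows this:

```shell
# the length of null is zero
jq -n 'null | length' # 0

# a missing key evaluates to null, so its length is also zero
echo '{"year": "1916", "category": "physics"}' | jq '.laureates | length' # 0

# when the array exists, we get the number of elements
echo '{"year": "1903", "laureates": [{}, {}, {}]}' | jq '.laureates | length' # 3
```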

This is a good example of why we need the ? operator we learned about in the previous instalment. If there are no laureates, then .laureates evaluates to null and .laureates[] throws an error, but .laureates[]? silently produces no outputs, so we can fix our Curie query like so:

jq '.prizes[] | .laureates[]? | select(.surname == "Curie")' NobelPrizes.json

Our query now returns three results:

{
  "id": "6",
  "firstname": "Marie",
  "surname": "Curie",
  "motivation": "\"in recognition of her services to the advancement of chemistry by the discovery of the elements radium and polonium, by the isolation of radium and the study of the nature and compounds of this remarkable element\"",
  "share": "1"
}
{
  "id": "5",
  "firstname": "Pierre",
  "surname": "Curie",
  "motivation": "\"in recognition of the extraordinary services they have rendered by their joint researches on the radiation phenomena discovered by Professor Henri Becquerel\"",
  "share": "4"
}
{
  "id": "6",
  "firstname": "Marie",
  "surname": "Curie",
  "motivation": "\"in recognition of the extraordinary services they have rendered by their joint researches on the radiation phenomena discovered by Professor Henri Becquerel\"",
  "share": "4"
}

Clearly, our logic is correct in that it has found all laureates in any prize with the surname Curie, but we’re none the wiser as to which prizes they were!

This is where the two-argument version of any comes to our rescue.

When you pass either any or all two arguments, the first one is interpreted as a filter that will produce multiple values, AKA the generator, and the second argument as a filter to apply to each value produced by the generator to produce the booleans to apply the any or all logic to.

That extra level of indirection can be a little challenging, and it may take a few attempts for the proverbial penny to drop, but let’s try shake it loose with a practical example — the correct solution to our Curie question:

jq '.prizes[] | select(any(.laureates[]?; .surname == "Curie"))' NobelPrizes.json

This query returns two results — the 1911 prize for Chemistry won by Marie Curie alone, and the 1903 prize for Physics shared by Marie Curie, Pierre Curie, and Henri Becquerel.

OK, so it does work, but why?

We are now calling the select function just once for each Nobel prize in the dataset. When we call it, we pass a call to the any function with the filter .laureates[]? as the generator, which will explode the array of winners into separate values, each of those values is then tested against the second argument to any, the filter .surname == "Curie", to produce one boolean for each laureate depending on whether or not their surname matched. If any one surname matches, then any will return true, so select will output the entire input it received, i.e., the entire prize, not just the laureate.
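To watch the generator and the condition at work on something small, here is a minimal sketch using a made-up three-prize sample (the sample variable is an illustrative assumption that keeps only the keys we need). It also shows the two-argument form of all, along with a subtlety: when the generator produces no values, all returns true:

```shell
# Hypothetical miniature stand-in for NobelPrizes.json (the 1916 prize has no laureates)
sample='{"prizes":[
  {"year":"1903","category":"physics","laureates":[{"surname":"Becquerel"},{"surname":"Curie"},{"surname":"Curie"}]},
  {"year":"1911","category":"chemistry","laureates":[{"surname":"Curie"}]},
  {"year":"1916","category":"physics"}]}'

# any: prizes where at least one laureate is a Curie
echo "$sample" | jq '.prizes[] | select(any(.laureates[]?; .surname == "Curie")) | .year' # "1903" & "1911"

# all: prizes where every laureate is a Curie, which also matches the prize
# with no laureates, because there is nothing to fail the test
echo "$sample" | jq '.prizes[] | select(all(.laureates[]?; .surname == "Curie")) | .year' # "1911" & "1916"
```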

Some Optional Challenges

Can you develop jq commands to answer the following questions:

What prize did friend of the NosillaCast podcast Dr. Andrea Ghez win? List the year, category, and motivation.
How many laureates were there for each prize? List the year, category, and number of winners for each.
Which prizes were won outright, i.e. not shared? List the year, category, first name, last name, and motivation for each.

Final Thoughts

We’ve been introduced to a lot of concepts in this instalment (filter chaining, operators, and functions), and we’ve learned how they can all be combined to search JSON data. There was a lot to absorb in this instalment, so we’re going to take things a little easier in the next instalment and work our way through some of the many very useful functions jq provides. While we will learn new techniques, we won’t be learning any new concepts next time.

