PBS 158 of X: More Queries (jq)

20 Dec 2023

In the previous instalment we learned about jq literal values, operators, and functions so we could learn to query JSON files for data matching given criteria using the select() function. In this instalment we’ll mirror the structure of the previous instalment somewhat, starting with some functions related to data types, then we’ll learn about one new operator, and then we’ll move on to looking at some of jq’s more advanced searching functions, specifically the contains function and regular expression searches.

Matching Podcast Episodes

This installment is broken up into two podcast episodes:

PBS 158A Podcast Episode:

Listen along to this instalment on episode 782 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript: CCATP_2023_12_20

PBS 158B Podcast Episode

Listen along to this instalment on episode 783 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript: CCATP_2023_12_30

Installment Resources

The instalment ZIP file — pbs158.zip

PBS 157 Challenge Solutions

At the end of the previous instalment I offered some optional challenges to get some practice querying JSON with the jq command. All the challenges involved querying the Nobel Prises data set we’ve been using in our examples throughout this mini series.

Question 1 — Dr. Ghez’s Prize

“What prize did friend of the NosillaCast podcast Dr. Andrea Ghez win? List the year, category, and motivation.”

jq '.prizes[] | select(any(.laureates[]?; .surname == "Ghez")) | .year, .category, (.laureates[] | select(.surname == "Ghez") | .motivation)' NobelPrizes.json

The first thing I want to draw your attention to is the use of the combination of the select() and any() functions to query within the laureates arrays for each prize without exploding them, ensuring its the entire dictionary for each matching prize that is passed to the next filter in the pipeline by select.

Next, note the use of the , (or as we call it the and also) operator to separate the multiple filters needed to extract the three requested pieces of data for the matching prize. Finally, note the need to a nested filter chain to extract the final value from within the matching element of the laureates array.

Question 2 — Laureates Per Prize

“How many laureates were there for each prize? List the year, category, and number of winners for each.”

jq '.prizes[] | .year, .category, (.laureates | length )' NobelPrizes.json

Note that we don’t explode the laureates array when calculating its length because the length function needs to receive the entire array as a single input.

Question 3 — Single Winner Prizes

“Which prizes were won outright, i.e. not shared? List the year, category, first name, last name, and motivation for each.”

jq '.prizes[] | select((.laureates? | length) == 1) | .year, .category, .laureates[0].firstname, .laureates[0].surname, .laureates[0].motivation' NobelPrizes.json

Note that because we know there is exactly one winner, we can easily access the details for the laureate using laureates[0] to access the first element in the laureates array.

As we learned in the previous instalment, jq users the same basic data types as JSON.

Determining a Value’s Type

We can get the type of a value as a string with the type function. The possible return values from type are "null", "boolean", "number", "string", "array", and "object" (for dictionaries).

Converting Between Types

We know that our example Nobel Prizes data set wrongly encodes years as strings, and we can now verify that programatically:

jq '.prizes[0].year | type' NobelPrizes.json # "string"

Note that our example data set is by no means alone in this — there is a lot of poorly encoded JSON out there!

We also learned that the equality operator (==) uses strict type matching, and that the comparison operators (<, >, <= & >=) treat strings and numbers differently, sorting the former alphabetically and the latter numerically.

Let’s remind ourselves what these facts mean in practical terms:

# strings of digits are not the same as numbers
jq -n '2001 == "2001"' # false

# strings get compared alphabetically, even if they are strings of digits
jq -n '"2001" > "23"' # false

Given all this it should come as no surprise that jq provides functions for converting between numbers and strings — tonumber & tostring, and neither take any arguments.

Note that tonumber gives an error if it’s passed anything but a numeric string.

Let’s quickly see these functions in action:

jq -n '(2001 | tostring) == "2001"' # true
jq -n '2001 == ("2001" | tonumber)' # true

jq -n '("2001" | tonumber) > ("23" | tonumber)' # true

jq -n '"waffles" | tonumber' # throws 'Invalid numeric literal' error

Filtering by Type

To make it easier to skip over items of the wrong type, jq provides a collection of select-like functions that apply a type criteria and either pass the value through, or silently drop it. This pass or drop behaviour is just like what we saw in the previous instalment with select.

These functions don’t require any arguments since they’re basically ‘does exactly what it says on the tin’ functions 😉

It’s easier to show what these functions do with code snippets than to try describe them. To help us do that we’ll use the file sampleData.json which contains a top-level array containing entries of different data types:

[
  null,
  true,
  -1,
  0,
  11,
  3.1415,
  "42",
  "waffles",
  [],
  ["dogs", "cats"],
  {},
  {"apples": 12, "pears": 3}
]

Let’s now use this file to demonstrate the select-like functions nulls, values, booleans, numbers, strings, arrays, objects, iterables & scalars in action:

# select all nulls (only really useful if you want to check if there are any or count them)
jq '.[] | nulls' sampleData.json
# null

# select everything that's not null
jq '.[] | values' sampleData.json
# true
# -1
# 0
# 11
# 3.1415
# "42"
# "waffles"
# []
# [
#   "dogs",
#   "cats"
# ]
# {}
# {
#   "apples": 12,
#   "pears": 3
# }

# select all booleans
jq '.[] | booleans' sampleData.json
# true

# select all numbers
jq '.[] | numbers' sampleData.json
# -1
# 0
# 11
# 3.1415
# Note: the string "42" is not selected because strict types are used

# select all strings
jq '.[] | strings' sampleData.json
# "42"
# "waffles"

# select all arrays
jq '.[] | arrays' sampleData.json
# []
# [
#   "dogs",
#   "cats"
# ]
# Note: the empty array gets returned

# select all dictionaries
jq '.[] | objects' sampleData.json
# {}
# {
#   "apples": 12,
#   "pears": 3
# }
# Note: the empty dictionary gets returned

# select all things you can loop over (arrays or dictionaries)
jq '.[] | iterables' sampleData.json
# []
# [
#   "dogs",
#   "cats"
# ]
# {}
# {
#   "apples": 12,
#   "pears": 3
# }

# select everything that's a single value (i.e. that you can't loop over)
jq '.[] | scalars' sampleData.json
# null
# true
# -1
# 0
# 11
# 3.1415
# "42"
# "waffles"

The Alternate Operator (`//`)

One of the advantages JSON offers over more rigid data structures like CSV files and traditional relational databases is the flexibility for each piece of data to have a different structure to the others. You can have an array of dictionaries each defining different key-value pairs in JSON, or, you can have the same key map to data of a completely different type from one dictionary to the next.

A very common example is how APIs treat data fields that can have zero, one, or more values. The API may decide to omit the key completely if there are no values for it, add it as a string if there is one value, or an array of strings if there are more values. This is easy for a human to read, but difficult to query with jq because the you need to accommodate the different possibilities without triggering errors.

We have an example of this in the Nobel Prizes dataset we’ve been using in our examples throughout this series — when a Nobel Prize is awarded, the dictionary representing the prize has a key named laureates which is an array of dictionaries, each representing a winner, but on the years there was no prize the laureates key is completely missing, being replaced instead by a different key named overallMotivation which maps to a string explaining what was done with the prize money that was not awarded.

Because alternative structures are so common in JSON, jq provides an operator for handling alternative possibilities, the alternative operator //. This operator confuses a lot of C, Java, and JavaScript developers because it looks like a comment, but it’s not.

The way the operator works is that the filter to the left of the operator is evaluated first, and if it produces any value other than null or false that value is returned, and the filter to the right of the operator is never even executed. If the filter on the left does produce null or false, then the filter to the right is executed and its result returned.

That probably sounds more complicated than it is, the way I think of it is that if the filter to the left doesn’t produce a real value, then the filter to the right is used instead.

As an example, let’s print out the overall motivation for each prize that doesn’t have any winners, or, the number of winners for each prize in our Nobel Prizes data set:

jq '.prizes[] | .overallMotivation // (.laureates | length )' NobelPrizes.json

More Advanced Searching – Beginning of PBS 158B

In the previous instalment we learned how to use comparison operations with the select function to filter our results by basic criteria like equality checks and numeric comparisons. Let’s now look at a few functions that let us add a little more power into the filters we pass to the select function.

Testing for Containment (the `contains` & `inside` Functions)

We can check if one item contains another with the contains function, and this function is surprisingly powerful, supporting strings, arrays, and even dictionaries!

The contains function always searches the input currently being processed for some value passed as the first and only argument, and it always returns a boolean. The type of the input and the argument mush match, otherwise an error is thrown.

Finally, and most importantly, the rules for evaluating a match depend on the input type.

String Containment

When the input being processed and the argument are strings, contains will return true if the input string contains the entire argument string (contiguously), otherwise it will return false:

echo '"I love waffles"' | jq 'contains("waffles")' # true
echo '"I do enjoy the odd waffle"' | jq 'contains("waffles")' # false (missing final s)
echo '"did you say pan cake?"' | jq 'contains("pancake")' # false (not contiguous)

Array Containment

When the input and argument are arrays, contains will return true if every element in the argument array is contained in any element of the input array, otherwise, it will return false. Note that the order doesn’t matter, and that the test is for containment not equality.

echo '["waffles", "pancakes", "apples"]' | jq 'contains(["pancakes"])' # true
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["pancake"])' # true (containment not equality!)
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["waffles", "pancakes"])' # true
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["pancakes", "waffles"])' # true (order irrelevant)
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["waffles", "apples"])' # true (do not have to be contiguous)

echo '["waffles", "pancakes", "apples"]' | jq 'contains(["popcorn"])' # false
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["popcorn", "waffles"])' # false (all must be contained)

Dictionary Containment

When the input and argument are dictionaries, contains will return true if the input dictionary’s value for every key in the argument dictionary contains the value in the argument dictionary, and false otherwise. Again, notice that the check is containment not equality.

To demonstrate this without using overly large commands, we’ll use the file menu.json from the instalment zip which has the following contents:

{
  "breakfast": ["bacon", "eggs", "toast", "waffles", "pancakes"],
  "lunch": ["sandwiches", "rolls", "baps", "wraps"],
  "dinner": ["pizza", "pasta", "burgers"]
}

We can now see how contains handles dictionaries:

jq 'contains({"breakfast": ["bacon"]})' menu.json # true

This returns true because out input data does define a key named breakfast, and, its value is an array which contains the array ["bacon"].

jq 'contains({"breakfast": ["bacon", "waffle"]})' menu.json # true

This is still true because the rules for array containment allow substrings.

But, as soon as one key is missing or one value is missing, false is returned:

jq 'contains({"breakfast": ["bacon", "waffle", "popcorn"]})' menu.json # false (missing value)
jq 'contains({"breakfast": ["bacon"], "dessert": ["cake"]})' menu.json # false (missing key)

Default Containment

When the input and argument are both booleans, numbers, or null, the contains function falls back to a simple equality check, so the number 420 does not contain the number 42, but the number 42 does ‘contain’ the number 42:

echo '420' | jq 'contains(42)' # false
echo '342' | jq 'contains(42)' # false (equality check)
echo '42' | jq 'contains(42)' # true
echo 'false' | jq 'contains(false)' # true (equality check)
echo 'null' | jq 'contains(null)' # true (equality check)

Reverse Containment with the `inside` Function

With the contains function we can test if the input currently being processed contains a given value, but what is we want to test if the input is entirely contained within a given value? That’s what the inside function is for. The official documentation on this function is very sparse, describing inside as:

“essentially, an inversed version of contains.”

Like contains, inside throws an error when the input and argument types don’t match, and it similarly falls back to an equality check when the input is not a string, array, or dictionary.

When the input is a string, array, or dictionary, inside behaves like contains but with the input and argument values swapped:

# subset as input, superset as argument
echo '"waffles"' | jq 'inside("I like waffles")' # true
echo '["waffles"]' | jq 'inside(["waffles", "pancakes", "popcorn"])' # true
echo '{"breakfast": ["pancakes"]}' | jq 'inside({"breakfast": ["pancakes", "mueseli"], "snacks": ["popcorn", "waffles"]})' # true

# equality fall-back
echo '42' | jq 'inside(42)' # true
echo '42' | jq 'inside(342)' #false
echo 'false' | jq 'inside(false)' # true
echo 'null' | jq 'inside(null)' # true

Searching with Regular Expressions (the `test` Function)

Before we look at the function for matching strings to regular expressions, let’s briefly explore Jq’s support for regular expressions in general.

First and foremost, there are no regular expression literals in jq, so we must represent regular expressions as strings. This is annoying, because it means we sometimes have to double-escape the \ character which has meaning within both strings and regular expressions 🙁

While the lack of literals is annoying, we can take comfort in the fact that jq uses Perl Compatible Regular Expressions (PCRE), just like JavaScript.

To use regular expressions when searching we can use the test function. test requires one argument, a string representing a regular expression, and it applies that regular expression to the input currently being processed and returns true or false depending on whether or not the input matches the regular expression. The input must be a string or jq will throw an error. PCRE flags like i for case insensitive can be specified by passing a single string as an optional second argument.

# match an IP address
echo '"37.139.7.12"' | jq 'test("[0-9]{1,3}([.][0-9]{1,3}){3}")' # true

# match for words starting with w
echo '"waffles"' | jq 'test("^w")' # true
echo '"Waffles"' | jq 'test("^w")' # false (case)

# case-insensitive match for words starting with w
echo '"waffles"' | jq 'test("^w"; "i")' # true
echo '"Waffles"' | jq 'test("^w"; "i")' # true

Putting it all together, let’s search for Nobel Prize winners with surnames starting with a vowel.

In our regular expression we can use the character class [aeiou] to match any vowel, combine that with the starts with operator ^, and make it case-insensitive with the i flag. Putting it all together that gives us the following as a reasonable first attempt:

jq '.prizes[] | .laureates[]? | select(.surname | test("^[aeiou]"; "i")) | (.firstname, .surname)' NobelPrizes.json

This starts by producing some sane output, but then we hit an error:

null (null) cannot be matched, as it is not a string

That’s unexpected! Perhaps some laureates don’t have surnames? We test that theory with the command:

jq '.prizes[] | .laureates[]? | select(.surname == null)' NobelPrizes.json

That returns entries like:

{
  "id": "818",
  "motivation": "\"for their efforts to build up and disseminate greater knowledge about man-made climate change, and to lay the foundations for the measures that are needed to counteract such change\"",
  "share": "2",
  "firstname": "Intergovernmental Panel on Climate Change"
}

So, when a prize is awarded to an organisation rather than a person the organisation’s name is stored in the firstname field, and there is no surname field. This means that when there is no surname we should match against the first name instead. This gives us an excellent opportunity to use the alternate operator (//), both to handle the input to the test function, and to provide nicer output when there is no surname:

jq '.prizes[] | .laureates[]? | select(.surname // .firstname | test("^[aeiou]"; "i")) | (.firstname, .surname // "*none*")' NobelPrizes.json

An Optional Challenge

Find all the laureates awarded their prize for something to do with quantum physics, i.e. the first name, surname, and motivation for each winner where the motivation contains a word that starts with ‘quantum’ in any case.

Hint: there is a PCRE operator for specifying word boundaries, and be aware of the potential need for double-escaping when passing regular expressions as strings!

Final Thoughts

When introducing jq we described it has having three major use-cases:

Pretty-printing JSON
Searching JSON
Transformatting JSON

We covered the pretty-printing use case in the very first instalment of this miniseries, and while we’ve not covered every possible mechanism for searching JSON with jq, we’ve covered the most most useful techniques, operators & functions, so we’ll consider this second use-case completed too. That means that in the next instalment we’ll move on to using jq to transform JSON data.

Querying JSON with jq jq Miniseries Building Data Structures

157. Querying JSON with jq Programming by Stealth 159. Building Data Structures

Join the Community

Find us in the PBS channel on the Podfeet Slack.

Podfeet Slack