PBS 158 of X: More Queries (jq)
In the previous instalment we learned about jq literal values, operators, and functions so we could learn to query JSON files for data matching given criteria using the select()
function. In this instalment we’ll mirror the structure of the previous instalment somewhat, starting with some functions related to data types, then we’ll learn about one new operator, and then we’ll move on to looking at some of jq’s more advanced searching functions, specifically the contains
function and regular expression searches.
Matching Podcast Episodes
This installment is broken up into two podcast episodes:
PBS 158A Podcast Episode:
Listen along to this instalment on episode 782 of the Chit Chat Across the Pond Podcast.
You can also Download the MP3
Read an unedited, auto-generated transcript: CCATP_2023_12_20
PBS 158B Podcast Episode
Listen along to this instalment on episode 783 of the Chit Chat Across the Pond Podcast.
You can also Download the MP3
Read an unedited, auto-generated transcript: CCATP_2023_12_30
Installment Resources
- The instalment ZIP file — pbs158.zip
PBS 157 Challenge Solutions
At the end of the previous instalment I offered some optional challenges to get some practice querying JSON with the jq
command. All the challenges involved querying the Nobel Prises data set we’ve been using in our examples throughout this mini series.
Question 1 — Dr. Ghez’s Prize
“What prize did friend of the NosillaCast podcast Dr. Andrea Ghez win? List the year, category, and motivation.”
jq '.prizes[] | select(any(.laureates[]?; .surname == "Ghez")) | .year, .category, (.laureates[] | select(.surname == "Ghez") | .motivation)' NobelPrizes.json
The first thing I want to draw your attention to is the use of the combination of the select()
and any()
functions to query within the laureates
arrays for each prize without exploding them, ensuring its the entire dictionary for each matching prize that is passed to the next filter in the pipeline by select
.
Next, note the use of the ,
(or as we call it the and also) operator to separate the multiple filters needed to extract the three requested pieces of data for the matching prize. Finally, note the need to a nested filter chain to extract the final value from within the matching element of the laureates
array.
Question 2 — Laureates Per Prize
“How many laureates were there for each prize? List the year, category, and number of winners for each.”
jq '.prizes[] | .year, .category, (.laureates | length )' NobelPrizes.json
Note that we don’t explode the laureates
array when calculating its length because the length
function needs to receive the entire array as a single input.
Question 3 — Single Winner Prizes
“Which prizes were won outright, i.e. not shared? List the year, category, first name, last name, and motivation for each.”
jq '.prizes[] | select((.laureates? | length) == 1) | .year, .category, .laureates[0].firstname, .laureates[0].surname, .laureates[0].motivation' NobelPrizes.json
Note that because we know there is exactly one winner, we can easily access the details for the laureate using laureates[0]
to access the first element in the laureates
array.
Type-Related jq Functions
As we learned in the previous instalment, jq users the same basic data types as JSON.
Determining a Value’s Type
We can get the type of a value as a string with the type
function. The possible return values from type
are "null"
, "boolean"
, "number"
, "string"
, "array"
, and "object"
(for dictionaries).
Converting Between Types
We know that our example Nobel Prizes data set wrongly encodes years as strings, and we can now verify that programatically:
jq '.prizes[0].year | type' NobelPrizes.json # "string"
Note that our example data set is by no means alone in this — there is a lot of poorly encoded JSON out there!
We also learned that the equality operator (==
) uses strict type matching, and that the comparison operators (<
, >
, <=
& >=
) treat strings and numbers differently, sorting the former alphabetically and the latter numerically.
Let’s remind ourselves what these facts mean in practical terms:
# strings of digits are not the same as numbers
jq -n '2001 == "2001"' # false
# strings get compared alphabetically, even if they are strings of digits
jq -n '"2001" > "23"' # false
Given all this it should come as no surprise that jq provides functions for converting between numbers and strings — tonumber
& tostring
, and neither take any arguments.
Note that tonumber
gives an error if it’s passed anything but a numeric string.
Let’s quickly see these functions in action:
jq -n '(2001 | tostring) == "2001"' # true
jq -n '2001 == ("2001" | tonumber)' # true
jq -n '("2001" | tonumber) > ("23" | tonumber)' # true
jq -n '"waffles" | tonumber' # throws 'Invalid numeric literal' error
Filtering by Type
To make it easier to skip over items of the wrong type, jq provides a collection of select-like functions that apply a type criteria and either pass the value through, or silently drop it. This pass or drop behaviour is just like what we saw in the previous instalment with select
.
These functions don’t require any arguments since they’re basically ‘does exactly what it says on the tin’ functions 😉
It’s easier to show what these functions do with code snippets than to try describe them. To help us do that we’ll use the file sampleData.json
which contains a top-level array containing entries of different data types:
[
null,
true,
-1,
0,
11,
3.1415,
"42",
"waffles",
[],
["dogs", "cats"],
{},
{"apples": 12, "pears": 3}
]
Let’s now use this file to demonstrate the select-like functions nulls
, values
, booleans
, numbers
, strings
, arrays
, objects
, iterables
& scalars
in action:
# select all nulls (only really useful if you want to check if there are any or count them)
jq '.[] | nulls' sampleData.json
# null
# select everything that's not null
jq '.[] | values' sampleData.json
# true
# -1
# 0
# 11
# 3.1415
# "42"
# "waffles"
# []
# [
# "dogs",
# "cats"
# ]
# {}
# {
# "apples": 12,
# "pears": 3
# }
# select all booleans
jq '.[] | booleans' sampleData.json
# true
# select all numbers
jq '.[] | numbers' sampleData.json
# -1
# 0
# 11
# 3.1415
# Note: the string "42" is not selected because strict types are used
# select all strings
jq '.[] | strings' sampleData.json
# "42"
# "waffles"
# select all arrays
jq '.[] | arrays' sampleData.json
# []
# [
# "dogs",
# "cats"
# ]
# Note: the empty array gets returned
# select all dictionaries
jq '.[] | objects' sampleData.json
# {}
# {
# "apples": 12,
# "pears": 3
# }
# Note: the empty dictionary gets returned
# select all things you can loop over (arrays or dictionaries)
jq '.[] | iterables' sampleData.json
# []
# [
# "dogs",
# "cats"
# ]
# {}
# {
# "apples": 12,
# "pears": 3
# }
# select everything that's a single value (i.e. that you can't loop over)
jq '.[] | scalars' sampleData.json
# null
# true
# -1
# 0
# 11
# 3.1415
# "42"
# "waffles"
The Alternate Operator (//
)
One of the advantages JSON offers over more rigid data structures like CSV files and traditional relational databases is the flexibility for each piece of data to have a different structure to the others. You can have an array of dictionaries each defining different key-value pairs in JSON, or, you can have the same key map to data of a completely different type from one dictionary to the next.
A very common example is how APIs treat data fields that can have zero, one, or more values. The API may decide to omit the key completely if there are no values for it, add it as a string if there is one value, or an array of strings if there are more values. This is easy for a human to read, but difficult to query with jq because the you need to accommodate the different possibilities without triggering errors.
We have an example of this in the Nobel Prizes dataset we’ve been using in our examples throughout this series — when a Nobel Prize is awarded, the dictionary representing the prize has a key named laureates
which is an array of dictionaries, each representing a winner, but on the years there was no prize the laureates
key is completely missing, being replaced instead by a different key named overallMotivation
which maps to a string explaining what was done with the prize money that was not awarded.
Because alternative structures are so common in JSON, jq provides an operator for handling alternative possibilities, the alternative operator //
. This operator confuses a lot of C, Java, and JavaScript developers because it looks like a comment, but it’s not.
The way the operator works is that the filter to the left of the operator is evaluated first, and if it produces any value other than null
or false
that value is returned, and the filter to the right of the operator is never even executed. If the filter on the left does produce null
or false
, then the filter to the right is executed and its result returned.
That probably sounds more complicated than it is, the way I think of it is that if the filter to the left doesn’t produce a real value, then the filter to the right is used instead.
As an example, let’s print out the overall motivation for each prize that doesn’t have any winners, or, the number of winners for each prize in our Nobel Prizes data set:
jq '.prizes[] | .overallMotivation // (.laureates | length )' NobelPrizes.json
More Advanced Searching – Beginning of PBS 158B
In the previous instalment we learned how to use comparison operations with the select
function to filter our results by basic criteria like equality checks and numeric comparisons. Let’s now look at a few functions that let us add a little more power into the filters we pass to the select
function.
Testing for Containment (the contains
& inside
Functions)
We can check if one item contains another with the contains
function, and this function is surprisingly powerful, supporting strings, arrays, and even dictionaries!
The contains
function always searches the input currently being processed for some value passed as the first and only argument, and it always returns a boolean. The type of the input and the argument mush match, otherwise an error is thrown.
Finally, and most importantly, the rules for evaluating a match depend on the input type.
String Containment
When the input being processed and the argument are strings, contains
will return true
if the input string contains the entire argument string (contiguously), otherwise it will return false
:
echo '"I love waffles"' | jq 'contains("waffles")' # true
echo '"I do enjoy the odd waffle"' | jq 'contains("waffles")' # false (missing final s)
echo '"did you say pan cake?"' | jq 'contains("pancake")' # false (not contiguous)
Array Containment
When the input and argument are arrays, contains
will return true
if every element in the argument array is contained in any element of the input array, otherwise, it will return false
. Note that the order doesn’t matter, and that the test is for containment not equality.
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["pancakes"])' # true
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["pancake"])' # true (containment not equality!)
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["waffles", "pancakes"])' # true
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["pancakes", "waffles"])' # true (order irrelevant)
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["waffles", "apples"])' # true (do not have to be contiguous)
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["popcorn"])' # false
echo '["waffles", "pancakes", "apples"]' | jq 'contains(["popcorn", "waffles"])' # false (all must be contained)
Dictionary Containment
When the input and argument are dictionaries, contains
will return true
if the input dictionary’s value for every key in the argument dictionary contains the value in the argument dictionary, and false
otherwise. Again, notice that the check is containment not equality.
To demonstrate this without using overly large commands, we’ll use the file menu.json
from the instalment zip which has the following contents:
{
"breakfast": ["bacon", "eggs", "toast", "waffles", "pancakes"],
"lunch": ["sandwiches", "rolls", "baps", "wraps"],
"dinner": ["pizza", "pasta", "burgers"]
}
We can now see how contains
handles dictionaries:
jq 'contains({"breakfast": ["bacon"]})' menu.json # true
This returns true
because out input data does define a key named breakfast
, and, its value is an array which contains the array ["bacon"]
.
jq 'contains({"breakfast": ["bacon", "waffle"]})' menu.json # true
This is still true
because the rules for array containment allow substrings.
But, as soon as one key is missing or one value is missing, false
is returned:
jq 'contains({"breakfast": ["bacon", "waffle", "popcorn"]})' menu.json # false (missing value)
jq 'contains({"breakfast": ["bacon"], "dessert": ["cake"]})' menu.json # false (missing key)
Default Containment
When the input and argument are both booleans, numbers, or null
, the contains
function falls back to a simple equality check, so the number 420
does not contain the number 42
, but the number 42
does ‘contain’ the number 42
:
echo '420' | jq 'contains(42)' # false
echo '342' | jq 'contains(42)' # false (equality check)
echo '42' | jq 'contains(42)' # true
echo 'false' | jq 'contains(false)' # true (equality check)
echo 'null' | jq 'contains(null)' # true (equality check)
Reverse Containment with the inside
Function
With the contains
function we can test if the input currently being processed contains a given value, but what is we want to test if the input is entirely contained within a given value? That’s what the inside
function is for. The official documentation on this function is very sparse, describing inside
as:
“essentially, an inversed version of
contains
.”
Like contains
, inside
throws an error when the input and argument types don’t match, and it similarly falls back to an equality check when the input is not a string, array, or dictionary.
When the input is a string, array, or dictionary, inside
behaves like contains
but with the input and argument values swapped:
# subset as input, superset as argument
echo '"waffles"' | jq 'inside("I like waffles")' # true
echo '["waffles"]' | jq 'inside(["waffles", "pancakes", "popcorn"])' # true
echo '{"breakfast": ["pancakes"]}' | jq 'inside({"breakfast": ["pancakes", "mueseli"], "snacks": ["popcorn", "waffles"]})' # true
# equality fall-back
echo '42' | jq 'inside(42)' # true
echo '42' | jq 'inside(342)' #false
echo 'false' | jq 'inside(false)' # true
echo 'null' | jq 'inside(null)' # true
Searching with Regular Expressions (the test
Function)
Before we look at the function for matching strings to regular expressions, let’s briefly explore Jq’s support for regular expressions in general.
First and foremost, there are no regular expression literals in jq, so we must represent regular expressions as strings. This is annoying, because it means we sometimes have to double-escape the \
character which has meaning within both strings and regular expressions 🙁
While the lack of literals is annoying, we can take comfort in the fact that jq uses Perl Compatible Regular Expressions (PCRE), just like JavaScript.
To use regular expressions when searching we can use the test
function. test
requires one argument, a string representing a regular expression, and it applies that regular expression to the input currently being processed and returns true
or false
depending on whether or not the input matches the regular expression. The input must be a string or jq will throw an error. PCRE flags like i
for case insensitive can be specified by passing a single string as an optional second argument.
# match an IP address
echo '"37.139.7.12"' | jq 'test("[0-9]{1,3}([.][0-9]{1,3}){3}")' # true
# match for words starting with w
echo '"waffles"' | jq 'test("^w")' # true
echo '"Waffles"' | jq 'test("^w")' # false (case)
# case-insensitive match for words starting with w
echo '"waffles"' | jq 'test("^w"; "i")' # true
echo '"Waffles"' | jq 'test("^w"; "i")' # true
Putting it all together, let’s search for Nobel Prize winners with surnames starting with a vowel.
In our regular expression we can use the character class [aeiou]
to match any vowel, combine that with the starts with operator ^
, and make it case-insensitive with the i
flag. Putting it all together that gives us the following as a reasonable first attempt:
jq '.prizes[] | .laureates[]? | select(.surname | test("^[aeiou]"; "i")) | (.firstname, .surname)' NobelPrizes.json
This starts by producing some sane output, but then we hit an error:
null (null) cannot be matched, as it is not a string
That’s unexpected! Perhaps some laureates don’t have surnames? We test that theory with the command:
jq '.prizes[] | .laureates[]? | select(.surname == null)' NobelPrizes.json
That returns entries like:
{
"id": "818",
"motivation": "\"for their efforts to build up and disseminate greater knowledge about man-made climate change, and to lay the foundations for the measures that are needed to counteract such change\"",
"share": "2",
"firstname": "Intergovernmental Panel on Climate Change"
}
So, when a prize is awarded to an organisation rather than a person the organisation’s name is stored in the firstname
field, and there is no surname
field. This means that when there is no surname we should match against the first name instead. This gives us an excellent opportunity to use the alternate operator (//
), both to handle the input to the test
function, and to provide nicer output when there is no surname:
jq '.prizes[] | .laureates[]? | select(.surname // .firstname | test("^[aeiou]"; "i")) | (.firstname, .surname // "*none*")' NobelPrizes.json
An Optional Challenge
Find all the laureates awarded their prize for something to do with quantum physics, i.e. the first name, surname, and motivation for each winner where the motivation contains a word that starts with ‘quantum’ in any case.
Hint: there is a PCRE operator for specifying word boundaries, and be aware of the potential need for double-escaping when passing regular expressions as strings!
Final Thoughts
When introducing jq we described it has having three major use-cases:
- Pretty-printing JSON
- Searching JSON
- Transformatting JSON
We covered the pretty-printing use case in the very first instalment of this miniseries, and while we’ve not covered every possible mechanism for searching JSON with jq, we’ve covered the most most useful techniques, operators & functions, so we’ll consider this second use-case completed too. That means that in the next instalment we’ll move on to using jq to transform JSON data.