PBS 159 of X: Building Data Structures (jq)
When we started our exploration of the jq
command and the jq language, we described them as solving three distinct problems:
- Pretty printing JSON
- Searching JSON
- Transforming JSON
We covered pretty printing right at the start of this mini series, and we’ve spent the rest of our time learning how to filter down the input JSON to just the information we are interested in. Now it’s time to start making better use of that filtered information by transforming it into the form we want. Those transformations can be thought of as taking two distinct forms — altering existing values by applying some kind of rule to them, and assembling the information into new data structures that fit our needs.
The jq language offers a wonderfully wide array of functions and operators for altering existing data, but we’re going to start our exploration of jq data transformation by looking at jq’s syntax for building new data structures, specifically assembling strings, building arrays, and building dictionaries.
Matching Podcast Episode
Listen along to this instalment on episode 784 of the Chit Chat Across the Pond Podcast.
You can also Download the MP3
Read an unedited, auto-generated transcript: CCATP_2024_01_20
Installment Resources
- The instalment ZIP file — pbs159.zip
PBS 158 Challenge Solution
The challenge set at the end of the previous instalment was to find all the laureates awarded their prize for something to do with quantum physics, i.e. the first name, surname, and motivation for each winner, where the motivation contains a word that starts with ‘quantum’ in any case. There was also a hint that there is a PCRE (Perl-Compatible Regular Expression) escape sequence for matching word boundaries.
Overall this is a very similar problem to the final worked example in the previous instalment which listed all laureates with surnames starting with a vowel. The difference is the regular expression to used to filter down the laureates within the select
function.
In other words, the overall structure of the filter will be:
.prizes[] | .laureates[]? | select(SOME_TEST)
We just need to figure out what SOME_TEST
should be.
We need to apply a regular expression to the motivation
field in each laureate dictionary. The overall structure of the test will be:
.motivation | test("SOME_RE"; "SOME_FLAGS")
Which brings us to the key of the qestion — what’s the regular expression for finding words starting with ‘quantum’? The escape sequence for word boundary is \b
, and we need to double-escape regular expression stings in jq. So, the regular expression string we need is "\\bquantum"
, and since we need to perform a case-insensitive search, the flag we need is simply, "i"
.
Putting it all togehter we get the following jq
command:
jq '.prizes[] | .laureates[]? | select(.motivation | test("\\bquantum"; "i"))' NobelPrizes.json
This returns 23 prizes, with the oldest (last on the list) being to Werner Heisenberg for the creation of quantum mechanics, and the most recent (first on the list) being to Moungi Bawendi, Louis Brus & Aleksey Yekimov for the quantum dots that are making our most modern screens so nice!
Building Strings with Interpolation
When working with JavaScript, we learned that we could use so-called template strings to build strings that contained both directly specified characters, and values read from variables. This type of string construction is known as string interpolation, and the jq language does support it, even if it does so in a somewhat unusual way.
We’ve already seen that we can use double quotes to create simple strings within our filters. Those strings support the standard espcae sequences you would expect from our experience with JavaScript and Bash. For example; \"
to include a double-quote within a string, \n
for a new line character, and \t
for a tab character.
The jq lanaguge adds a non-standard escape sequence for injecting the result of a filter into a string — \()
, with the filter specified between the parentheses.
For example, using our Nobel Prizes data set (NobelPrizes.json
in the instalment ZIP file) we can use our existing search experience to find the laureates dictionary for friend of the NosillaCast podcast Dr. Andrea Ghez, and we can then use string interpolation to build a nice human-friendly description of her citation:
# the -r flag to output the text directly instead of as a JSON string
jq -r '.prizes[] | .laureates[]? | select(.surname == "Ghez") | "\(.firstname) \(.surname) was awarded her prize for \(.motivation)"' NobelPrizes.json
This outputs the text:
Andrea Ghez was awarded her prize for “for the discovery of a supermassive compact object at the centre of our galaxy”
Breaking the example down, the first few filters in the jq pipeline select out just Andrea’s entry in the laureates
array in the prize that contains her as a laureate.
.prizes[] | .laureates[]? | select(.surname == "Ghez")
If we run the command with just this much of the filter chain we can see the dictionary that acts as the input to the final filter in the chain. That is the one that creates a new string using string interpolation:
# no -r flag since we want pretty printed JSON output
jq '.prizes[] | .laureates[]? | select(.surname == "Ghez")' NobelPrizes.json
This produces the dictionary:
{
"id": "990",
"firstname": "Andrea",
"surname": "Ghez",
"motivation": "\"for the discovery of a supermassive compact object at the centre of our galaxy\"",
"share": "4"
}
Now, let’s look at the final filter, the one that builds the output string using interpolation:
"\(.firstname) \(.surname) was awarded her prize for \(.motivation)"
You can see the filter contains a single string definition, and that string definition contains three inserted values because there are three \()
escapes — \(.firstname)
to inject Dr. Andrea’s first name from the input to the filter (i.e. the dictionary above), \(.surname)
to inject Dr. Andrea’s surname, and \(.motivation)
to inject the reason she was awarded the prize.
Building Arrays
In our various examples, we’ve already seen that we can create arrays containing basic values by wrapping those values in square brackets ([]
). This same syntax can actually be used to add the results of executing filters to an array by simply including a filter as an array element. If a filter wrapped in square brackets produces multiple values, each value returned by the filter will become an element in the constructed array.
The most common use for wrapping filters in square brackets is to recombine exploded arrays back into a single array.
If we go back to our Nobel Prizes data set, we know that we can explode the top-level array of prizes with .prizes[]
. We also know we can filter those exploded values down using the select
function, for example, we can get all prizes awarded after 2020 with:
jq '.prizes[] | select((.year | tonumber) > 2020)' NobelPrizes.json
(Note the user of the tonumber
function introduced in the previous instalment to ensure the greater than operator is making numeric rather than alphabetic comparisons as explained in instalment 157.)
This will produce a list of many individual dictionaries as outputs — note there is no opening square brace at the top of the output, and no commas separating each dictionary. If we want to combine all those separate outputs back into a single array, we can wrap the entire jq filter with square braces:
jq '[.prizes[] | select((.year | tonumber) > 2020)]' NobelPrizes.json
This time, note that there are opening and closing square braces at the top and bottom of the output, and, that each dictionary is now separated from the next with a comma. In other words, we now have one single output that is an array of dictionaries.
Converting Between Strings & Arrays with split
& join
Some jq fuctions only accept arrays or strings. Some always output arrays or strings. Sometimes those requirements don’t line up with the inputs you need to send to the jq
command for processing, or, that you need out of the jq
command for further use. In those situations, it can be very helpful to convert strings with some kind of separator into arrays, or arrays into strings with some kind of separator. As their names suggest, that’s what the built-in split
& join
functions do.
split
requires a string as the input, and a string representing the characters to split on as the first argument, and returns an array. For example:
# the -n flag to tell jq not to expect any input
jq -n '"1,2,3" | split(",")'
# [
# "1",
# "2",
# "3"
# ]
join
is effectively the inverse and requires an array as the input, a separator string to place between the array elements as the first argument, and returns a string. For example:
echo '[1, 2, 3]' | jq 'join(",")'
# "1,2,3"
As well as splitting on a string separator, jq can also split on a regular expression. When the split
function receives just one argument, it’s interpreted as a string, but when it recieves two, the first is interpreted as a regular expression, and the second as PCRE flags. This means that when you don’t need PCRE flags but you do need a regular expression, you have to remember to pass an empty string as a second argument.
As a quick example we can split on a comma followed by an optional space by using the regular expression [ ]?
with the two-argument form of the split
function:
jq -n '"1,2,3" | split(",[ ]?"; "")'
# [
# "1",
# "2",
# "3"
# ]
jq -n '"1, 2, 3" | split(",[ ]?"; "")'
# [
# "1",
# "2",
# "3"
# ]
Building Dictionaries
The jq
command is often used to process some input JSON data structure into a new output JSON data structure for further processing somewhere else. To do this effectively we need to be able to construct custom dictionaries with just the keys of our choosing, and values we have computed.
The syntax to construct dictionaries in jq is very like the syntax for constructing dictionaries in JavaScript — the entire dictionary is wrapped in curly braces ({}
), and keys and values are separated by colons (:
). Like in JavaScript, and unlike in JSON, key names don’t need to be quoted unless they contain spaces or special characters. Finally, the values can be literal values, or, filters. If a value is specified with a filter, the filter will be executed, and the resuting value stored.
Worked Example — Building a Custom Dictionary to Describe a Specific Nobel Prize
As an example, let’s use the jq
command to process the Nobel Prizes data set to produce a single dictionary describing friend of the Nosillacast Dr. Andrea Ghez’s Nobel prize. We’d like the command we built up to produce the following final dictionary:
{
"year": 2020,
"prize": "physics",
"name": "Andrea Ghez",
"citation": "for the discovery of a supermassive compact object at the centre of our galaxy"
}
Note that with the exception of the key named year
, this final dictionary does not use any of the same names as the input data set, that the year is correctly represented as a number rather than a string, and that the first and surnames have been collapsed into a single name
field.
Let’s start simple and build the command to output a dictionary with just the year and the category.
Based on our previous experience, we know we can use the jq filter chain below to extract just the dictionary describing the prize Dr. Andrea won:
.prizes[] | select(any(.laureates[]?; .surname == "Ghez"))
Let’s use the following jq
command to view the dictionary this filter chain produces:
jq '.prizes[] | select(any(.laureates[]?; .surname == "Ghez"))' NobelPrizes.json
Here’s the resulting dictionary:
{
"year": "2020",
"category": "physics",
"laureates": [
{
"id": "988",
"firstname": "Roger",
"surname": "Penrose",
"motivation": "\"for the discovery that black hole formation is a robust prediction of the general theory of relativity\"",
"share": "2"
},
{
"id": "989",
"firstname": "Reinhard",
"surname": "Genzel",
"motivation": "\"for the discovery of a supermassive compact object at the centre of our galaxy\"",
"share": "4"
},
{
"id": "990",
"firstname": "Andrea",
"surname": "Ghez",
"motivation": "\"for the discovery of a supermassive compact object at the centre of our galaxy\"",
"share": "4"
}
]
}
We can see that the two pieces of information we’re interested in are contained in this dictionary, so we can build a custom dictionary with the following jq filter:
{year: .year, prize: .category}
Note that we are defining a new dictionary with two keys: year
, and prize
. The value for each key is specified as a very simple filter — a key path.
We can add this filter to the filter chain in our jq
command like so:
jq '.prizes[] | select(any(.laureates[]?; .surname == "Ghez")) | {year: .year, prize: .category}' NobelPrizes.json
This will produce the following dictionary:
{
"year": "2020",
"prize": "physics"
}
As it stands, this dictionary is wrongly storing the year as a string, so let’s fix that by calling the tonumber
function:
jq '.prizes[] | select(any(.laureates[]?; .surname == "Ghez")) | {year: .year | tonumber, prize: .category}' NobelPrizes.json
Now we get the more correct dictionary:
{
"year": 2020,
"prize": "physics"
}
Next, let’s expand the dictionary to include Dr. Andrea’s name as a single key with the name name
.
We’ll be adding this key into the dictionary we’re assembling in the last filter in the chain, but unlike the filters for the year and the prize, the value for this key can’t just be read with a key name. We’ll need to use a more complex filter instead.
To get at the two keys containing Dr. Andrea’s names we need to explode the laureates
array and filter it down to just the dictionary with her details. We can use our knowledge from previous instalments to develop that filter:
.laureates[] | select(.surname == "Ghez")
To see the dictionary this filter would produce we can run the following jq
command:
jq '.prizes[] | select(any(.laureates[]?; .surname == "Ghez")) | .laureates[] | select(.surname == "Ghez")' NobelPrizes.json
This show us the keys we have at our disposal to assemble the name:
{
"id": "990",
"firstname": "Andrea",
"surname": "Ghez",
"motivation": "\"for the discovery of a supermassive compact object at the centre of our galaxy\"",
"share": "4"
}
Based on those keys, the following simple filter will build the name using string interpolation:
"\(.firstname) \(.surname)"
Putting all this back together we can now update our jq
command to add a key named name
to the outputted dictionary:
jq '.prizes[] | select(any(.laureates[]?; .surname == "Ghez")) | {year: .year | tonumber, prize: .category, name: .laureates[] | select(.surname == "Ghez") | "\(.firstname) \(.surname)"}' NobelPrizes.json
Here’s where we stand now:
{
"year": 2020,
"prize": "physics",
"name": "Andrea Ghez"
}
The final step is to add in the citation for Dr. Andrea’s award. We can re-use the filter we used to assemble the name, and simply use the value from the motivation
key as the value for our citation
key, in other words, we can get the needed value with the following filter:
.laureates[] | select(.surname == "Ghez") | .motivation
This almost gives us what we need, but the original data set has wrapped this value in a set of quotation marks. We don’t want those! This gives us an excuse to preview two very useful string manipulation functions we’ll look at in more detail in a future instalment — ltrimstr
and rtrimstr
which trim given characters from the left and right of a string. In this case we want to trim any quotation marks present on left and right of the string, so we add calls to both functions into the chain, this gives us the following filter for generating the value for our citation
key:
.laureates[] | select(.surname == "Ghez") | .motivation | ltrimstr("\"") | rtrimstr("\"")
Putting it all together we can now build this final jq
command:
jq '.prizes[] | select(any(.laureates[]?; .surname == "Ghez")) | {year: .year | tonumber, prize: .category, name: .laureates[] | select(.surname == "Ghez") | "\(.firstname) \(.surname)", citation: .laureates[] | select(.surname == "Ghez") | .motivation | ltrimstr("\"") | rtrimstr("\"")}' NobelPrizes.json
Which produces our desired dictionary:
{
"year": 2020,
"prize": "physics",
"name": "Andrea Ghez",
"citation": "for the discovery of a supermassive compact object at the centre of our galaxy"
}
Dictionaries from Strings with capture
Not only does the jq language support standard Perl-Compatible-Regular-Expressions (PCRE), it also supports the named-capture-groups extension to the standard. The built-in capture
function makes use of this feature to use regular expressions to extract named pieces of data from strings.
Note that PCRE regular expressions are explained and described in detail in sister-series Taming the Terminal instalments 17 & 18. A regular expression expresses a pattern, and a capture group is a subset of a pattern. Capture groups are useful when a pattern consists of multiple separate parts, e.g. an email address has a username part before the @
and a domain after the @
, so capture groups could be used to specify those two parts of email addresses.
In regular PCRE, each opening round bracket ((
) within a regular expression starts a new numbered capture group. With named capture groups you assign a name to each group instead of relying on automatic numbering. The name is specified within angled brackets (<>
) after a question mark (?
) at the start of the capture group.
That sounds more complicated than it really is. To capture the hours, minutes, and seconds of a time with capture groups you would use a regular expression like:
(?<hours>[0-9]{1,2}):(?<minutes>[0-9]{2}):(?<seconds>[0-9]{2})
This RE has three named capture groups with colons between them. The first capture group is named hours
and matches one or two digits. The second is named minutes
and matches exactly 2 digits. The third is named seconds
and also matches exactly two digits. Remember that the character class [0-9]
matches a digit, and the cardinality {1,2}
means at least one and at most two, and the cardinarlity {2}
means exactly two.
The built-in jq function capture
requires the input be the string to be processed, and the first argument to be a string containing a regular expression with named capture groups. The output will be a dictionary with key-value pairs for each named capture group in the regular expression. An optional second argument can be passed to specify PCRE flags like i
for case-insensitibve as a single string.
Using our example RE above with jq we can parse times into their components like so:
jq -n '"9:00:00" | capture("(?<hours>[0-9]{1,2}):(?<minutes>[0-9]{2}):(?<seconds>[0-9]{2})")'
# {
# "hours": "9",
# "minutes": "00",
# "seconds": "00"
# }
String Formatting & Escaping with @
When using the jq
command to prepare strings for use in other applications, you may need to escape or format the strings in some way. The jq language offers built-in support for both popular escaping schemes and common data formats.
Regardless of whether you need to escape special characters or to format data as a string, the encoding operator @
, is used. The syntax is to simply prefix a supported encoding name with the @
symbol, e.g. @uri
for URL escaping, and @csv
for CSV formatting. The encoding operator can be used in two ways:
- As a stand-alone filter, usually as the last filter in a pipeline
- As a prefix for an interpolated string, in which case the formatting is only applied to the values injected into the string with the
\()
escape sequence.
Different encodings support different input types, but all encodings support output strings.
When outputting to a file, be sure to specify raw output (-r
or --raw-output
) or your output will unexpectedly wrapped in quotation marks, and hence corrupted!
Worked Example 1 — CSV Formatted Menu
Let’s convert our JSON formatted menu (menu.json
from the instalment ZIP) to a CSV file. As a reminder, this is the contents of the file — a top-level array containing three dictionaries:
[
{
"name": "hotdogs",
"price": 5.99,
"stock": 143
},
{
"name": "pancakes",
"price": 3.10,
"stock": 43
},
{
"name": "waffles",
"price": 7.50,
"stock": 14
}
]
The @csv
filter requires an array as input, and will produce one line of CSV-formatted output. So, to get our CVS menu we’ll need multiple outputs, one for a header row followed by one for each item on the menu. Before we assemble all this into a single command, let’s build the individual filters we’ll need.
Firstly, to get a header row, we need to create an array with the column headers as strings, and then pipe it to the encoding operator as a filter, i.e.:
["Name", "Price", "Stock"] | @csv
Secondly, to format each row we need to explode the top-level array and create a new array with just the values in the right order for each entry:
.[] | [.name, .price, .stock] | @csv
We can now combine these two filters into a single jq
command using the ‘and also’ operator, i.e. ,
:
jq -r '(["Name", "Price", "Stock"] | @csv), (.[] | [.name, .price, .stock] | @csv)' menu.json > menu.csv
If you open the newly created menu.csv
file in your favourite spreadsheet app you’ll see it has been correctly formatted as a CSV file. The raw CSV formatted data looks like this:
"Name","Price","Stock"
"hotdogs",5.99,143
"pancakes",3.10,43
"waffles",7.50,14
Worked Example 2 — Building a Search URL
When including a value in a URL query string, special characters, including spaces, need to be encoded using the URI encoding scheme which maps each special character to an escape sequence starting with the %
symbol, e.g. %20
is a space. To assemble a URL from a string that may include special characters, you need to escape all those characters.
As an example, let’s build the URL to search Google for the name of the first ever Nobel prize winner.
The base search URL is https://www.google.com/search?q=QUERY
, where QUERY
is the URI-encoded search string.
Let’s start by building a jq filter to get the first ever prize. Since our data file contains the prizes in reverse-chronological order, that means the first prize is the last item in the top-level prizes
array, which we can access with .prizes[-1]
. We can get the first laureate by descending into the zero-th element of the laureates
array, i.e. .prizes[-1].laureates[0]
. We can then assemble the name with string interpolation, i.e. .prizes[-1].laureates[0] | "\(.firstname) \(.surname)"
. We can check this works as expected with the following command:
jq '.prizes[-1].laureates[0] | "\(.firstname) \(.surname)"' NobelPrizes.json
This does indeed return "Emil von Behring"
.
We can now insert this entire long filter into another string interpolation URL like so:
jq '"https://www.google.com/search?q=\(.prizes[-1].laureates[0] | "\(.firstname) \(.surname)")"' NobelPrizes.json
This produces the invalid URL "https://www.google.com/search?q=Emil von Behring"
. We need to URI encode the querty string. As a first try, let’s do it in the way we used the @
in the previous example — as a filter entirely onto itself at the end of the pipeline:
jq -r '"https://www.google.com/search?q=\(.prizes[-1].laureates[0] | "\(.firstname) \(.surname)")" | @uri' NobelPrizes.json
That produces a different invalid URL: https%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3DEmil%20von%20Behring
What’s happened is that we have URI encoded the entire URL, not just the query string, what we need is to apply URI encoding to just the part inserted by interpolation, and that’s what the second syntax for the @
operator allos us to do. Rather than using the operator as a stand-alone filter, we insert the operator into the filter that does the string interpolation, pre-fixed before the start of the string:
jq -r '@uri "https://www.google.com/search?q=\(.prizes[-1].laureates[0] | "\(.firstname) \(.surname)")"' NobelPrizes.json
This generates the valid and correctURL https://www.google.com/search?q=Emil%20von%20Behring
, which you can test in your favourite browser.
Note that none of the special characters directly specified in the URL string got escaped, only the spaces in the injected name were escaped.
Supported Formats & Escapes
The following data formats are supported by jq:
Format | Description |
---|---|
@text |
The input can be any type, and this format is effectively a shortcut for calling the tostring function. |
@json |
The input can be any type, and the output will be a JSON string representing the input. |
@csv |
The input must be an array, and the output will be one line in CSV (Coma-Separated Value) format. |
@tsv |
The input must be am array, and the output will be one line in TSV (Tab-Separated Value) format. |
The string escaping formats all require strings as inputs, and the following encodings are supported:
Encoding | Description | Examples |
---|---|---|
@html |
Apply HTML escape sequences to the special characters in the input string. | echo '"this & that"' | jq '@html' outputs "this & that" |
@uri |
Apply URL/URI percent-encodings to the special characters in the input string. | echo '"this & that"' | jq '@uri' outputs "this%20%26%20that" |
@base64 |
Apply Base64 encoding to the input string. | echo '"this & that"' | jq '@base64' outputs "dGhpcyAmIHRoYXQ=" |
@base64d |
Apply Base64 decoding to the input string, i.e. the inverse of @base64 . |
echo '"dGhpcyAmIHRoYXQ="' |jq '@base64d' outputs "this & that" |
@sh |
Apply POSIX Shell string escaping to the input string. It’s important to use raw output with this format, i.e. use jq -r . |
echo '"this & that"' | jq -r '@sh' outputs 'this & that' |
An Optional Challenge
Using the Nobel Prizes data set as the input, build a simpler data structure to represent just the most important information.
The output data structure should be an array of dictionaries, one for each prize that was actually awarded, indexed by the following fields:
Field | Type | Description |
---|---|---|
year |
Number | The year the prize was awarded. |
prize |
String | The category the prize was awarded in. |
numWinners |
Number | The number of winners. |
winners |
Array of Strings | The names of all the winners. |
Years where there were no prizes awarded should be skipped.
As an example, when pretty printed, the entry for the 1907 peace prize should look like this:
{
"year": 1907,
"prize": "peace",
"numWinners": 2,
"winners": [
"Ernesto Teodoro Moneta",
"Louis Renault"
]
}
The final output should be in JSON format (all on one line not pretty printed) and should be sent to a file named NobelPrizeList.json
.
A Warning and a Tip
The prizes awarded to organisations rather than specific people are likely to trigger a subtle bug in your output — a trailing space at the end of the name due to the fact that the dictionaries representing laureates that are organisations rather than people have no surnames.
To get full credit you should remove this trailing space using the same technique used to remove the leading and trailing quotation marks in one of the examples in this instalment.
A good test for your logic is the 1904 peace prize, the dictionary for which should look like:
{
"year": 1904,
"prize": "peace",
"numWinners": 1,
"winners": [
"Institute of International Law"
]
}
Purely for bonus credit, you can avoid the need to trim the space from the end of organisational winners by ensuring it never gets added. One way to achieve this is to combine the following jq functions and operators:
- The alternate operator (
//
) - The
empty
function — we’ve not seen it yet, but it takes no arguments and returns absolute nothingness - The
join
function
Note that ["Bob", "Dylan"] | join(" ")
results in "Bob Dylan"
, but ["Bob"] | join(" ")
results in just "Bob"
.
Final Thoughts
We’ve now seen how we can construct our own data structures by cherry picking information from the input data structure. In this instalment, that data was un-changed. In the next instalment we’ll take data transformation to the next level by looking at some of the many operators and functions the jq language provides for processing data, including arithmetic operators, string transformations, and manipulations of larger data structures like arrays and dictionaries.