PBS 162 of X: Altering Arrays & Dictionaries (jq)
In the previous instalment we made a good start on exploring jq’s data manipulation capabilities — we learned how to do math, how to assign values to elements within arrays and dictionaries, and we learned about some functions for working with strings.
In this instalment we’re going to explore some of jq’s operators and functions for manipulating arrays and dictionaries.
Matching Podcast Episode
Listen along to this instalment on episode 788 of the Chit Chat Across the Pond Podcast.
You can also Download the MP3
Read an unedited, auto-generated transcript with chapter marks: CCATP_2024_03_02
Installment Resources
- The instalment ZIP file — pbs162.zip
PBS 161 Challenge Solution
The Challenge set at the end of the previous instalment was to clean up the Nobel Prizes data set we have been using in this series in the following ways:
- Add a boolean key named
awarded
to every prize dictionary to indicate whether or not it was awarded. - Ensure all prize dictionaries have a
laureates
array. It should be empty for prizes that were not awarded. - Add a boolean key named
organisation
to each laureate dictionary, indicating whether or not the laureate is an organisation rather than a person. - Add a key named
displayName
to each laureate dictionary. For people, it should contain their first & last names, and for organisations, just the organisation name.
You’ll find the sample solution in the file pbs161-challengeSolution.jq
in the instalment ZIP:
# Sanitise the the Nobel Prizes data set as published by the Nobel Prize
# Committee to make it easier to process by normalising some existing
# fields and adding some additional ones.
# Input: JSON as published by the Nobel Committee
# Output: a dictionary indexed by a single key prizes containing an
# array of dictionaries, one for each Nobel Prize
.prizes |= [
# explode the prizes
.[]
# add an 'awarded' key to each prize
| .awarded = (has("laureates"))
# ensure all prizes have a laureates array
| .laureates //= []
# descend into the laureates arrays
| .laureates |= [
# explode the laureates
.[]
# add an organisation key
| .organisation = (has("surname") | not)
# add a displayName key
| .displayName = ([.firstname, .surname // empty] | join(" "))
]
]
You can run the filter with the command:
jq -f pbs161-challengeSolution.jq NobelPrizes.json
The first thing I want to draw your attention to is the pattern of using the basic update assignment operator |=
to update the value of the prizes
array within the top-level dictionary, and the laureates
arrays in each prize dictionary.
Because we’re doing an update assignment, the value for .
on the right-hand side of the operator is the current array, since that’s what’s being updated. We need to replace the array with a new array of updated dictionaries, so we start the filter chain for calculating those new values by exploding the current array, and we wrap the entire filter chain inside square braces to reassemble the updated pieces back into an array.
Secondly, note that we use the simple assignment operator =
to add the entirely new keys (awarded
in the prize dictionaries, and organisation
and displayName
in the laureate dictionaries).
Next, note the use of the alternate assignment operator //=
to default the laureates
array to an empty array when it’s not present while leaving it intact when it is.
Finally, note the importance of making these changes in the correct order — we could not use the absence of a laureates
array to detect an unawarded prize after we default it to an empty array!
Altering Arrays
Now that we’ve seen how to alter strings, let’s take a look at way of transforming arrays.
Reordering Arrays (sort
& reverse
)
Let’s start with the simplest type of array transformation — reorderings.
The simplest reordering is a straightforward reversing, and jq provides just that with the reverse
function:
jq -nc '[1, 2, 3] | reverse' # outputs [3,2,1]
Then of course we have array sorting.
When you have an array with simple values the built-in sort
function will generally do what you want.
The function requires an array as input, and it will first group elements by type, starting with any null
elements, then any booleans, then numbers, then strings, then arrays, and finally dictionaries. Within each grouping it will then perform the most sensible sorting, false
before true
for Booleans, numeric sorting for numbers, and alphabetic (lexical) sorting for strings. (Note that the rules for sorting arrays and dictionaries exist, but are out of the scope of this series.)
Let’s see the sort
function in action:
# numbers
jq -nc '[1, 4, 3] | sort' # outputs [1,3,4]
# strings
jq -nc '["popcorn", "waffles", "pancakes"] | sort'
# outputs ["pancakes","popcorn","waffles"]
# mixed types
jq -nc '[42, true, 11, "waffles", false, "pancakes"] | sort'
# outputs [false,true,11,42,"pancakes","waffles"]
When dealing with arrays containing more complicated elements like other arrays or dictionaries, you probably want to specify your own rule for sorting. You can do this with the sort_by
function. It takes as an argument a filter, and that filter will be applied to each array element, and the elements will be sorted based on the results of applying the filter. Usually the filter is simply a dictionary key.
For example, we can reorder our menu (from the menu.json
file in the instalment ZIP), with:
jq 'sort_by(.price)' menu.json
This outputs:
[
{
"name": "pancakes",
"price": 3.10,
"stock": 43
},
{
"name": "hotdogs",
"price": 5.99,
"stock": 143
},
{
"name": "waffles",
"price": 7.50,
"stock": 14
}
]
As you can see, the array is now sorted by price.
To see a more complex example, let’s introduce a new set of sample data — company profiles of some of the big tech companies from the free JSON API offered by www.alphavantage.co, specifically, from their Company Overview end point. You’ll find the data in techStocks.json
in the instalment ZIP.
We can use this data set to sort based on a calculated value, specifically, the product of the Price/Earnings ratio (a measure of how undervalued the stock is), and the earnings per share. To see how we expect the sorting to go, let’s output the relevant facts about each company first:
# -r for raw output
jq -r '.[] | "\(.Name) (\(.Symbol)): PERatio=\(.PERatio), DividendPerShare=\(.DividendPerShare) (product \((.PERatio | tonumber) * (.DividendPerShare | tonumber)))"' techStocks.json
This shows us:
Apple Inc (AAPL): PERatio=27.94, DividendPerShare=0.95 (product 26.543)
International Business Machines (IBM): PERatio=23.09, DividendPerShare=6.63 (product 153.0867)
Microsoft Corporation (MSFT): PERatio=37.6, DividendPerShare=2.86 (product 107.536)
This means we would expect the output sorted on this product to be AAPL, then MSFT, and finally, IBM.
Let’s construct a jq filter to do that sorting:
sort_by((.PERatio | tonumber) * (.DividendPerShare | tonumber))
We can test this filter with the command (using -c
for compact output):
jq -c 'sort_by((.PERatio | tonumber) * (.DividendPerShare | tonumber)) | [.[] | .Symbol]' techStocks.json
Note the the addition of the filter [.[] | .Symbol]
to the end of the filter chain to reduce the results to just the ticker symbols. This gives us the expected output:
["AAPL","MSFT","IBM"]
If you’re curious how the data set was assembled — the shell script in
buildStocksDataset.sh
shows how it was done. To use the script, you’ll need to get your own free API key and export it as a shell variable with a command likeexport key=YOURAPIKEY
before running the script.Note that this script makes use of the jq function
inputs
which we haven’t met yet — this function simply returns the contents of each input file as a separate output, so[inputs]
wraps the single dictionary in each of our input files in an array. Unless you intentionally need some duplication, you should always useinputs
in conjunction with the-n
flag so the input does not appear at the start of the pipeline as well as at the point you use theinputs
function!
Adding and Removing Elements with the +
& -
Operators
As mentioned earlier, the +
operator has been overloaded to perform a useful action when both inputs are arrays — it merges them into a new bigger array:
# Note: -n for no input & -c for compact output
jq -nc '[1, 2] + [3, 4]' # outputs [1,2,3,4]
While the -
operator is not overloaded for strings, it is overloaded for arrays, and can be used create a new array with elements removed. The array to the left of the -
is treated as the original array, and the elements in the array to the right of the -
are removed from the original array, if present, to create the output array.
jq -nc '[1, 2, 3, 4] - [4, 5]' # outputs [1,2,3]
Array Deduplication
When you start adding arrays together it’s easy to end up with duplicates that you may well not want. The jq language provides two useful functions for removing them.
Firstly, the simple unique
function returns the input array sorted with any duplicated elements removed. The input must be an array. For example:
jq -nc '[4, 1, 4, 3, 2] | unique' # outputs [1,2,3,4]
If you need a more complex definition of ‘uniqueness’ you can use the unique_by
function to supply your own filter as an argument. The output array will only contain one element which the filter evaluates to a given value. If multiple elements evaluate to the same value, one will be kept, but there’s no guarantee as to which one it will be. Also note that the elements in the output array will be sorted based on the result of the filter.
As a somewhat contrived example, we can make our menu (from menu.json
above) unique by length of name. Before we do, let’s compute the lengths see what results we expect, we can do that with:
jq '.[] | .name | length' menu.json
This shows that hotdogs and waffles have 7 letters, and pancakes 8. We would expect to always get pancakes in our answer, but only one of waffles or hotdogs when we make the menu unique by name. Let’s try:
jq 'unique_by(.name | length)' menu.json
This gives us the output:
[
{
"name": "hotdogs",
"price": 5.99,
"stock": 143
},
{
"name": "pancakes",
"price": 3.10,
"stock": 43
}
]
As expected, we get one item with a name of length 7, the one with a name of length 8, because the items are sorted on the result of the filter as well as deduplicated based on it.
Flattening Nested Arrays
When assembling an array from multiple sources you may end up with an array of arrays when you actually wanted a single unified array. This is where the built-in flatten
function comes in.
The flatten
function takes an array as an input, and if it contains other arrays, it replaces them with their entries in the output array. The function applies this logic recursively, so if your array contains arrays which contain arrays, it will still flatten them all out to a single array of all the values.
jq -nc '[1, [2, 3], [4, [5, 6]]] | flatten'
# outputs [1,2,3,4,5,6]
By default, the flatten function will keep recursing down into all nested arrays, but you can pass an optional argument to limit how deep it goes. To see what this means, let’s repeat our above example with limits of 2 and 1:
# limit to a depth of 2
# arrays in the input array, and arrays in arrays in the input array
jq -nc '[1, [2, 3], [4, [5, 6]]] | flatten(2)'
# outputs [1,2,3,4,5,6]
# limit to a depth of 1
# arrays in the input array only
jq -nc '[1, [2, 3], [4, [5, 6]]] | flatten(1)'
# outputs [1,2,3,4,[5,6]]
Dictionary Manipulation
Adding and Removing Keys
As mentioned earlier, the +
operator is overloaded for handling dictionaries. When two dictionaries are added together, a new dictionary is created containing the keys and values from both. If both input dictionaries define a value for the same key, the value from the dictionary on the right of the +
operator is used.
If the dictionaries contain dictionaries, a recursive merge can be done by using the *
operator instead of the +
operator.
You might imagine a key could somehow be removed with an overloaded subtraction operator, but alas not, we need to use the built-in del
function for that. This function requires a dictionary as the input and a key path as an argument. For example, we could remove the stock key from each item in our menu (menu.json
from the instalment ZIP) with:
jq '[.[] | del(.stock)]' menu.json
Which returns:
[
{
"name": "hotdogs",
"price": 5.99
},
{
"name": "pancakes",
"price": 3.10
},
{
"name": "waffles",
"price": 7.50
}
]
Note that the entire filter chain is wrapped in square braces to reassemble the exploded array
An Optional Challenge
Build an alphabetical sorted list with the names of all laureates. The list should not contain duplicates, and it should use sensible display names as per the previous challenge.
For bonus credit, can you sort the list so humans get sorted on surname rather than first name, but without affecting how each name is displayed?
Final Thoughts
We’re making good progress learning about jq’s data manipulation operators and functions. We’ve now covered the basics — math, assignment, string manipulation, array manipulation, and dictionary manipulation. The next two instalments will each be dedicated to just a single concept each. The next instalment will be dedicated to working with a special subset of dictionaries we’ll be referring to as lookups, focusing on jq’s two functions for building and disassembling them. That will be followed by an instalment dedicated to in-place array and dictionary manipulation — no more exploding and then reassembling our arrays!