PBS 166 of X: Processing Arrays & Dictionaries sans Explosions (jq)
We’re nearing the end of our exploration of jq, and we have just one more big idea to cover — the fact that you don’t need to explode arrays or deconstruct dictionaries to process their contents. Up to now we have been explicitly deconstructing and reconstructing complex data types to process them. We explode and recombine arrays, we deconstruct dictionaries into arrays of entries, process them, and then reassemble them, and so on. This approach has the advantage of being very explicit, and hence a great way to learn, but it has the disadvantage of being verbose. We’re now ready to explore the very powerful shortcuts jq provides for greatly shortening our code by allowing us to manipulate entire arrays and dictionaries in a single step.
Matching Podcast Episode
Listen along to this instalment on episode 794 of the Chit Chat Across the Pond Podcast.
You can also Download the MP3
Read an unedited, auto-generated transcript with chapter marks: CCATP_2024_05_25
Installment Resources
- The instalment ZIP file — pbs166.zip
PBS 165 Challenge Solution
The challenge set at the end of the previous installment was to write a jq script that searches a JSON formatted domain monitoring export from the Have-I-Been-Pwned (HIBP) service for all users caught up in a breach that meets two criteria:
- The name of the breach matches a given search string (ignoring case)
- The breach contained passwords
The script should return an array of dictionaries, one for each user-to-breach match that contains the following keys:
AccountName
: the user’s email account name (the part to the left of the@
)BreachName
: the name of the matching breach in the HIBP databaseBreachTitle
: the title of the matching breachBreachedDataClasses
: the list of breached data classes
In order to determine if a breach contained passwords, the data in the domain monitoring export needs to be enriched with the full database of breaches known to the HIBP service. As we learned in instalment 164, this can be downloaded for free from the HIBP API, but for convenience, the instalment ZIP includes a copy of the database as it was on the 18th of May 2024 in the file hibp-breaches-20240518.json
.
The instruction was to use a solution to the bonus challenge from instalment 164 as a starting point.
You’ll find a sample solution in the instalment ZIP as pbs165-challengeSolution-Basic.jq
:
# Find users caught up in any breach that leaked passwords that matches a given
# search string.
# Input: JSON as downloaded from the HIBP service
# Output: An array of account dictionaries describing each matching breach
# event. Each dictionary will be indexed by:
# - AccountName: The part of the email address to the left of the @
# - BreachName: The breach's unique name in the HIBP DB
# - BreachTitle: The human-friendly title of the breach
# - BreachedDataClasses: An array of compromised data types as strings
# Variables:
# - $breachSearch The string to search breach names by un a case-insensitive way
# - $breachDetails An array containing a single entry, the JSON for the
# latest lookup table of HIBP breaches indexed by breach
# name
# transform the lookup of breaches by AccountName into a list of entries:
# - keys will be account names
# - values will be arrays of breach IDs
.Breaches | to_entries
# filter down to just the users caught up in breaches matching the given search string that leaked passwords
| [
# explode the list of entries
.[]
# save the current account name to a variable
| .key as $accountName
# explode the breach names for the current entry
| .value[]
# keep only the names for breaches that meet the search criteria
| select(
# breaches that match the search string
(ascii_downcase | contains($breachSearch | ascii_downcase))
# breaches with passwords
and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
)
# build the dictionary to return
| {
AccountName: $accountName,
BreachName: .,
BreachTitle: $breachDetails[0][.].Title,
BreachedDataClasses: $breachDetails[0][.].DataClasses
}
]
Before we explore this code, let’s see it in action, and before we do that, let’s remind ourselves of the data we’re processing.
The file hibp-pbs.demo.json
contains a fictionalised breach export for the imagined domain demo.bartificer.net
which was generated by anonymising a genuine export for a different domain. The .Breaches
key in the export is a lookup table that maps the username part of email addresses on the domain (the part to the left of the @
) to lists of breach names as strings:
{
"Breaches": {
"josullivan": [
"OnlinerSpambot"
],
"egreen": [
"Dropbox"
],
"mwkelly": [
"Dropbox",
"KayoMoe",
"LinkedIn",
"LinkedInScrape",
"PDL"
],
"ahawkins": [
"iMesh",
"OnlinerSpambot"
],
"ptraynor": [
"Collection1"
]
},
"Pastes": {}
}
The breach names in the .Breaches
lookup table are the unique names for the breaches in the HIBP database, and they are the keys in the lookup table of all breaches downloaded from the HIBP API and saved as hibp-breaches-20240518.json
. Each entry in this lookup has the following relevant keys; Name
, Title
& DataClasses
, e.g.:
{
…
"Name": "Dropbox",
"Title": "Dropbox",
"DataClasses": [
"Email addresses",
"Passwords"
]
…
}
Let’s now prove that the sample solution works by searching for all users caught up in any breach that matches the search string 'linkedin'
and contains passwords:
jq -f pbs165-challengeSolution-Basic.jq --slurpfile breachDetails hibp-breaches-20240518.json --arg breachSearch linkedin hibp-pbs.demo.json
# Output:
# [
# {
# "AccountName": "mwkelly",
# "BreachName": "LinkedIn",
# "BreachTitle": "LinkedIn",
# "BreachedDataClasses": [
# "Email addresses",
# "Passwords"
# ]
# }
# ]
Why did this only return one result when mwkelly
is caught up in two breaches that match our search string (LinkedIn
& LinkedInScrape
)? Well, if we query the breach DB directly, we can see that the ‘missing’ breach did not contain passwords:
jq '.LinkedInScrape.DataClasses' hibp-breaches-20240518.json
# Output:
# [
# "Education levels",
# "Email addresses",
# "Genders",
# "Geographic locations",
# "Job titles",
# "Names",
# "Social media profiles"
# ]
To see what happens when a single user is caught up in multiple matching breaches, let’s search for the (artifically simple) string 'o'
, this will match DropBox
, KayoMoe
& OnlinerSpambot
, all of which contain passwords:
jq -f pbs165-challengeSolution-Basic.jq --slurpfile breachDetails hibp-breaches-20240518.json --arg breachSearch o hibp-pbs.demo.json
# Output (truncated):
# [
# …
# {
# "AccountName": "mwkelly",
# "BreachName": "Dropbox",
# "BreachTitle": "Dropbox",
# "BreachedDataClasses": [
# "Email addresses",
# "Passwords"
# ]
# },
# {
# "AccountName": "mwkelly",
# "BreachName": "KayoMoe",
# "BreachTitle": "Kayo.moe Credential Stuffing List",
# "BreachedDataClasses": [
# "Email addresses",
# "Passwords"
# ]
# },
# …
# ]
Because the search term was so generic, many breaches matched, so the output shown above is truncated to show only the two matching entries for mwkelly
.
Now that we know the script works as desired, let’s look at how it works.
Big picture-wise the script converts the .breaches
lookup table to an array of entries and then explodes it and captures the remaining transformed pieces. The intersting part of the script is how each entry is processed once it’s exploded.
Because we want to capture each matching user-breach combination, we will need to explode the list of breaches (.value
in the entries), but we also need to remember the user the breaches belong to, so we need to save the username (.key
in the entries) to a variable before we explode the breaches:
# explode the list of entries
.[]
# save the current account name to a variable
.key as $accountName
# explode the breach names for the current entry
.value[]
Next, we need to filter the breaches down to just those that match our criteria. Because we are processing one breach at a time now, we can use a simple call to select
without needing to use any
like we did in our starting point. Note that since we exploded the list of breaches in the previous filter in the chain, the value currently being processed (.
) is a breach name. So, we need to filter on two conditions — a name match, and whether or not the breach contains a password (the logic is retained from the starting point):
# keep only the names for breaches that meet the search criteria
| select(
# breaches that match the search string
(ascii_downcase | contains($breachSearch | ascii_downcase))
# breaches with passwords
and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
)
Note the use of data enrichment via $breachDetails[0]
which gets passed on the commandline with --slurpfile
.
Finally, we need to build the required dictionary for the matching breaches:
# build the dictionary to return
| {
AccountName: $accountName,
BreachName: .,
BreachTitle: $breachDetails[0][.].Title,
BreachedDataClasses: $breachDetails[0][.].DataClasses
}
Note the use of both our saved variable, and yet more data enrichment.
For bonus credit, the additional challenge was to search both the name and title for a match. This is because the name visible on the HIBP website is actually the Title
, and unlike the Name
it can contain spaces and characters other than just letters and numbers. You’ll find the full solution in the file pbs165-challengeSolution-Bonus.jq
in the instalment ZIP.
Before we look at the one change needed to get the bonus credit, let’s see why this functionality was worth adding with an example.
On the HIBP website there is a breach labled ‘Kayo.moe Credential Stuffing List’, but if you search for that with 'kayo.moe'
using our basic solution nothing is matched, even though mwkelly
is caught up in that breach:
jq -f pbs165-challengeSolution-Basic.jq --slurpfile breachDetails hibp-breaches-20240518.json --arg breachSearch 'kayo.moe' hibp-pbs.demo.json
# Output: []
But, if we use our bonus solution we do see that mwkelly
was caught up in that breach:
jq -f pbs165-challengeSolution-Bonus.jq --slurpfile breachDetails hibp-breaches-20240518.json --arg breachSearch 'kayo.moe' hibp-pbs.demo.json
# Output:
# [
# {
# "AccountName": "mwkelly",
# "BreachName": "KayoMoe",
# "BreachTitle": "Kayo.moe Credential Stuffing List",
# "BreachedDataClasses": [
# "Email addresses",
# "Passwords"
# ]
# }
# ]
There’s only one change between the basic and bonus sollutions — an extra clause in the select
:
# the basic solution
| select(
# breaches that match the search string
(ascii_downcase | contains($breachSearch | ascii_downcase))
# breaches with passwords
and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
)
# the bonus solution
| select(
# breaches that match the search string
(
# check the name
(ascii_downcase | contains($breachSearch | ascii_downcase))
# check the title
or ($breachDetails[0][.].Title | ascii_downcase | contains($breachSearch | ascii_downcase))
)
# breaches with passwords
and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
)
Manipulating without Exploding
Fundamentally there are three big-picture tasks we’ll be exploring today, with a few little bonus extras along the way:
- Distilling multiple items down to a single value using an accumulator with the
reduce
operator. - Applying the same edit to every element in an array in one step with the
map
function. - Applying the same edit to every value in a dictionay in one step with the
map_values
function.
Distilling Data with reduce
The reduce
operator is extremely powerful, so powerful that if you peep under the hood you’ll find many standard jq functions are just convenient wrappers around reduce
.
The reduce
operator allows you to loop over multiple items and build up the output value as you go.
The syntax is reduce GENERATOR_EXPRESSION as $ITEM_VARIABLE (INITIALISER_EXPRESSION; UPDATE_EXPRESSION)
, where:
GENERATOR_EXPRESSION
— a filter that produces zero or more results. Often, this will simply be an array explosion. The value of.
within this part of thereduce
statement is the full input to the filter containing thereduce
keyword.$ITEM_VARIABLE
— a variable that will store the genereated value currently being processed on each pass through the loop.$INITIALISER_EXPRESSION
— an expression to generate the initial value for what will become the output. This value will be updated by theUPDATE_EXPRESSION
on each pass through the loop as it builds the outputted value. This expression is often as simple as just0
or1
. The value of.
within this part of thereduce
statement is the full input to the filter containing thereduce
statement.UPDATE_EXPRESSION
— an expression to generate the next current value for what will become the output based on the current future output value, available as.
, and the value currently being processed, which is available as$ITEM_VARIABLE
.
That sounds a lot more complicated than it really is, and to prove that, let’s reimplement some built-in functions with reduce
.
Finding the Length of an Array with reduce
The filter to build our own implementation of the built-in length
function is simply:
reduce .[] as $item (0; . + 1)
The way this works is that we loop over all values in the array passed to this filter by using .[]
as the generator, we set the initial value of what will become the output to 0
, then for each value produced by exploding the array (stored as the un-used $item
), we increment the output-in-the-making by one. When all the values have been processed, the final accumulated value is returned as the output.
We can see this in action with:
# -n tells jq not to expect any input
jq -n '[2, 4, 6] | reduce .[] as $item (0; . + 1)'
# outputs 3
Adding all the Numbers in an Array with reduce
The filter to build our own version of the add
function is simply:
reduce .[] as $item (0; . + $item)
Again, this works by exploding the input array to generate the items to loop over, then initialising the output to zero, and adding the current item being processed to the accumulating output each time.
We can see this filter in action with:
jq -n '[2, 4, 6] | reduce .[] as $item (0; . + $item)'
# outputs 12
Multiplying all the Numbers in an Array with reduce
As it happens, jq does not provide a built-in function to multiply all the numbers in an array together, but if it did, it could be implemented with the following simple filter:
reduce .[] as $item (1; . * $item)
We can see this filter in action with:
jq -n '[2, 4, 6] | reduce .[] as $item (1; . * $item)'
# outputs 48
Implementing Factorial with reduce
The generator expression does not have to be a simple array explosion, it can be anything that produces zero or more values. To prove the point, let’s use the range()
function as a generator to implement factorial. As a reminder, the factorial of a number is the result of multiplying every whole number from 1 to that number together. I.e. the factorial of 3
is 1 * 2 * 3
, i.e. 6
, and the factorial of 4
is 1 * 2 * 3 * 4
, i.e. 24
.
We can implement this in jq with the following filter:
reduce range(1; . + 1) as $i (1; . * $i)
Remember that the two-argument version of range
starts with the first argument, and then counts up until the number before the second argument, so range(1; 4)
will produce 1
, 2
& 3
. We need to loop over 1
to the number passed to the filter, so we need the range from 1
to . + 1
.
Let’s see this filter in action with the number 5:
jq -n '5 | reduce range(1; . + 1) as $i (1; . * $i)'
# outputs 120
Processing Arrays with the map
Function
The map
function applies a filter to every element in an array. The array to be manipulated should be passed as the input to map
, and the filter to apply is passed as the only argument. Within the passed filter .
represents ‘the array element being processed’. The function’s output is the updated array.
In effect, map
lets you loop over an aray in a single function call, and process each element without exploding the array.
For example, we can convert all the numbers in an array to their absolute values with:
# -c tells jq to render the output in a compact format
jq -nc '[1, -2, 3, -42] | map(abs)'
# outputs [1,2,3,42]
Simpler Array Filtering with map(select())
One of the most common uses of map
is in conjunction with select
, allowing you to filter an array down to only the elements that meet a particular criteria without needing to explode and re-assemble it.
As an example, let’s filter the Nobel Prizes data set down to an array with just the physics prizes, with their order reversed so the oldest prize come first, and the newest last. Here’s how we would do that without using map
:
jq '[.prizes[] | select(.category == "physics")] | reverse' NobelPrizes.json
Note the need to explode the prizes with .prizes[]
, and then to re-assemble the array by wrapping the explosion and the select
filters in square brackets ( [
& ]
).
With map
, we can simplify things:
jq '.prizes | map(select(.category == "physics")) | reverse' NobelPrizes.json
We now have a simple filter chain without the need for nesting part of it within square brackets, making it easier to read and to maintain.
Note that this technique works because the select
function returns empty
rather than null
when called with a value that does not match the specified criteria, transforming un-matched entries into absolute nothingless, hence deleting them!
Converting Dictionaries to Arrays with map
One final point to note — map
will not throw an error when the input is a dictionary — it will simply treat the values within the key-value pairs as array elements, and output an array of their processed results. In other words, even when you pass it a dictionary, map
will always return an array.
At first glance this might seem useless, how often do you have a dictionary who’s keys you don’t need? But actually, it is occassionaly useful. Imagine you have weekly sales data in a dictionary indexed by day-of-the-week, like so (weeklySales.json
in the instalment ZIP):
{
"mon": 2343,
"tue": 4324,
"wed": 4121,
"thur": 8762,
"fri": 11452,
"sat": 32394,
"sun": 0
}
To convert this to an array of numbers we can call map
with a filter that passes the values through un-changed, i.e. simply .
:
jq -c 'map(.)' weeklySales.json
# outputs [2343,4324,4121,8762,11452,32394,0]
Notice that the values in the array are in the order the keys were added to the dictionary, not in any kind of sorted order.
Processing Dictionaries with the map_values
Function
In many ways the map_values
function is very similar to the map
function, but while map
is primarily designed for processing arrays, map_values
is primarily designed for processing dictionaries.
It’s structure is similar — the input should be a dictionary, and it takes a filter to apply to each value in the dictionary as the only argument. Like with map
, within the filter passed as the argument, .
is the value currently being processed. Note that only the dictionary’s values are altered, and there is no way to access the matching keys from within the passed filter.
So, to double all the sales data in our weekly sales dictionary, we would run the command:
jq 'map_values(. * 2)' weeklySales.json
This produces the following dictionary:
{
"mon": 4686,
"tue": 8648,
"wed": 8242,
"thur": 17524,
"fri": 22904,
"sat": 64788,
"sun": 0
}
Similar to how map
doesn’t throw an error when passed a dictionary, map_values
won’t throw an error when pased an array — it will simply process the array, and return an updated array (not a dictionary).
Beware of a Subtle Difference in how map
& map_values
Handle Multiple Filter Outputs
With map
, when the filter passed as the argument produces more than one value for a given input, all those values get added to the output array, but map_values
will only include the first output produced from each input. In effect, map_values
behaves as if the result from the filter passed as the argument is piped through first
before being assigned to the output dictionary or array.
As a somewhat contrived example to illustrate this difference, let’s pass an array of arrays to both functions and see what happens when we pass a filter that explodes these sub-arrays to both map
and map_values
:
# use map to explode sub-arrays
jq -nc '[[1], [2, 2], [3, 3, 3]] | map(.[])'
# outputs [1,2,2,3,3,3]
# try use map_values to explode sub-arrays
jq -nc '[[1], [2, 2], [3, 3, 3]] | map_values(.[])'
# outputs [1,2,3]
As you can see, map
added all the values the passed filter produced to the output, but map_values
ignored all but the first value produced.
This same difference is also present when working with dictionaries:
# use map to explode array values
jq -nc '{a: [1], b: [2, 2], c: [3, 3, 3]} | map(.[])'
# outputs [1,2,2,3,3,3]
# try use map_values to explode array values
jq -nc '{a: [1], b: [2, 2], c: [3, 3, 3]} | map_values(.[])'
# outputs {"a":1,"b":2,"c":3}
Again, as you can see, map
added all the values the passed filter produced to the output, but map_values
ignored all but the first value. Also note that, as stated above, map
always outputs an array, even when passed a dictionary.
An Optional Challenge
Use map
, map_values
, and reduce
to enrich a HIBP export like hibp-pbs.dmo.json
in two ways:
- Remove the top-level
.Breaches
and.Pastes
keys, and have the contents of the.Breaches
lookup table become the top-level element. - Use
map_values
to replace the value of each key in the new top-level lookup table (initially a simple array of breach names) with a dictionary that contains two keys:Breaches
— an array of dictionaries indexed byName
,Title
&DataClasses
. Use themap
function to process the entire array in one step.ExposureScore
— a value calculated from the enriched breach details that starts at 0, and adds 1 for each breach the user is caught up in that does not contain passwords, and 10 for each breach that does. Use thereduce
function to perform this calculation.
Final Thoughts
If our exploration of the jq were a book, this would be the page that says The End, and the next page would have the heading Epilogue. It’s difficult to estimate these things, but I think we’ve now covered all the jq 90% of jq users will ever need, and then some but I’d say we’ve only covered about two-thirds, or maybe three-quarters, of the language’s full suite of features. The remainder is extremely useful to a small number of people in a limited set of scenarios. When you need these additional features you’ll be delighted they exist, but most of us, most of the time, will never need them. So, if your brain is full at this stage, that’s not a problem, because you should have all the jq you need to get your work done.
But, some of the features we haven’t covered strike me a too useful to completely omit from the series, so I’ve been collecitng them up, and I’ll combine them in one final instalment, which is coming up next. I expect there’s a good chance those of you who’ve really caught the jq bug will find at least some of those feature extremely interesting, but I doubt many people will fine all of them useful. The problem of course is that I think the subset people care about will vary widely 🙂
Consider the next instalment a kind of tasting menu — more courses than you’d normally eat, but they’ll all just be little tasters of what’s possible!