PBS 166 of X: Processing Arrays & Dictionaries sans Explosions (jq)

25 May 2024

We’re nearing the end of our exploration of jq, and we have just one more big idea to cover — the fact that you don’t need to explode arrays or deconstruct dictionaries to process their contents. Up to now we have been explicitly deconstructing and reconstructing complex data types to process them. We explode and recombine arrays, we deconstruct dictionaries into arrays of entries, process them, and then reassemble them, and so on. This approach has the advantage of being very explicit, and hence a great way to learn, but it has the disadvantage of being verbose. We’re now ready to explore the very powerful shortcuts jq provides for greatly shortening our code by allowing us to manipulate entire arrays and dictionaries in a single step.

Matching Podcast Episode

Listen along to this instalment on episode 794 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript with chapter marks: CCATP_2024_05_25

Installment Resources

The instalment ZIP file — pbs166.zip

PBS 165 Challenge Solution

The challenge set at the end of the previous instalment was to write a jq script that searches a JSON formatted domain monitoring export from the Have-I-Been-Pwned (HIBP) service for all users caught up in a breach that meets two criteria:

The name of the breach matches a given search string (ignoring case)
The breach contained passwords

The script should return an array of dictionaries, one for each user-to-breach match that contains the following keys:

AccountName: the user’s email account name (the part to the left of the @)
BreachName: the name of the matching breach in the HIBP database
BreachTitle: the title of the matching breach
BreachedDataClasses: the list of breached data classes

In order to determine if a breach contained passwords, the data in the domain monitoring export needs to be enriched with the full database of breaches known to the HIBP service. As we learned in instalment 164, this can be downloaded for free from the HIBP API, but for convenience, the instalment ZIP includes a copy of the database as it was on the 18th of May 2024 in the file hibp-breaches-20240518.json.

The instruction was to use a solution to the bonus challenge from instalment 164 as a starting point.

You’ll find a sample solution in the instalment ZIP as pbs165-challengeSolution-Basic.jq:

# Find users caught up in any breach that leaked passwords that matches a given
# search string.
# Input:    JSON as downloaded from the HIBP service
# Output:   An array of account dictionaries describing each matching breach 
#   event. Each dictionary will be indexed by:
#   - AccountName:          The part of the email address to the left of the @
#   - BreachName:			The breach's unique name in the HIBP DB
#   - BreachTitle:          The human-friendly title of the breach
#   - BreachedDataClasses:	An array of compromised data types as strings
# Variables:
# - $breachSearch   The string to search breach names by in a case-insensitive way
# - $breachDetails	An array containing a single entry, the JSON for the
#                   latest lookup table of HIBP breaches indexed by breach
#                   name

# transform the lookup of breaches by AccountName into a list of entries:
# - keys will be account names
# - values will be arrays of breach IDs
.Breaches | to_entries

# filter down to just the users caught up in breaches matching the given search string that leaked passwords
| [
	# explode the list of entries
	.[]

	# save the current account name to a variable
	| .key as $accountName

	# explode the breach names for the current entry
	| .value[]

	# keep only the names for breaches that meet the search criteria
	| select(
		# breaches that match the search string
		(ascii_downcase | contains($breachSearch | ascii_downcase))

		# breaches with passwords
		and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
	)

	# build the dictionary to return
    | {
		AccountName: $accountName,
		BreachName: .,
		BreachTitle: $breachDetails[0][.].Title,
		BreachedDataClasses: $breachDetails[0][.].DataClasses
	}
]

Before we explore this code, let’s see it in action, and before we do that, let’s remind ourselves of the data we’re processing.

The file hibp-pbs.demo.json contains a fictionalised breach export for the imagined domain demo.bartificer.net which was generated by anonymising a genuine export for a different domain. The .Breaches key in the export is a lookup table that maps the username part of email addresses on the domain (the part to the left of the @) to lists of breach names as strings:

{
  "Breaches": {
    "josullivan": [
      "OnlinerSpambot"
    ],
    "egreen": [
      "Dropbox"
    ],
    "mwkelly": [
      "Dropbox",
      "KayoMoe",
      "LinkedIn",
      "LinkedInScrape",
      "PDL"
    ],
    "ahawkins": [
      "iMesh",
      "OnlinerSpambot"
    ],
    "ptraynor": [
      "Collection1"
    ]
  },
  "Pastes": {}
}

The breach names in the .Breaches lookup table are the unique names for the breaches in the HIBP database, and they are the keys in the lookup table of all breaches downloaded from the HIBP API and saved as hibp-breaches-20240518.json. Each entry in this lookup has the following relevant keys: Name, Title & DataClasses, e.g.:

{
  …
  "Name": "Dropbox",
  "Title": "Dropbox",
  "DataClasses": [
    "Email addresses",
    "Passwords"
  ]
  …
}

Let’s now prove that the sample solution works by searching for all users caught up in any breach that matches the search string 'linkedin' and contains passwords:

jq -f pbs165-challengeSolution-Basic.jq --slurpfile breachDetails hibp-breaches-20240518.json --arg breachSearch linkedin hibp-pbs.demo.json
# Output:
# [
#   {
#     "AccountName": "mwkelly",
#     "BreachName": "LinkedIn",
#     "BreachTitle": "LinkedIn",
#     "BreachedDataClasses": [
#       "Email addresses",
#       "Passwords"
#     ]
#   }
# ]

Why did this only return one result when mwkelly is caught up in two breaches that match our search string (LinkedIn & LinkedInScrape)? Well, if we query the breach DB directly, we can see that the ‘missing’ breach did not contain passwords:

jq '.LinkedInScrape.DataClasses' hibp-breaches-20240518.json
# Output:
# [
#   "Education levels",
#   "Email addresses",
#   "Genders",
#   "Geographic locations",
#   "Job titles",
#   "Names",
#   "Social media profiles"
# ]

To see what happens when a single user is caught up in multiple matching breaches, let’s search for the (artificially simple) string 'o'. This will match Dropbox, KayoMoe & OnlinerSpambot, all of which contain passwords:

jq -f pbs165-challengeSolution-Basic.jq --slurpfile breachDetails hibp-breaches-20240518.json --arg breachSearch o hibp-pbs.demo.json
# Output (truncated):
# [
#   …
#   {
#     "AccountName": "mwkelly",
#     "BreachName": "Dropbox",
#     "BreachTitle": "Dropbox",
#     "BreachedDataClasses": [
#       "Email addresses",
#       "Passwords"
#     ]
#   },
#   {
#     "AccountName": "mwkelly",
#     "BreachName": "KayoMoe",
#     "BreachTitle": "Kayo.moe Credential Stuffing List",
#     "BreachedDataClasses": [
#       "Email addresses",
#       "Passwords"
#     ]
#   },
#   …
# ]

Because the search term was so generic, many breaches matched, so the output shown above is truncated to show only the two matching entries for mwkelly.

Now that we know the script works as desired, let’s look at how it works.

Looking at the big picture, the script converts the .Breaches lookup table to an array of entries, explodes it, and captures the transformed pieces in a new array. The interesting part of the script is how each entry is processed once it’s exploded.

Because we want to capture each matching user-breach combination, we will need to explode the list of breaches (.value in the entries), but we also need to remember the user the breaches belong to, so we need to save the username (.key in the entries) to a variable before we explode the breaches:

# explode the list of entries
.[]

# save the current account name to a variable
| .key as $accountName

# explode the breach names for the current entry
| .value[]

Next, we need to filter the breaches down to just those that match our criteria. Because we are now processing one breach at a time, we can use a simple call to select without needing the any function we used in our starting point. Note that since we exploded the list of breaches in the previous filter in the chain, the value currently being processed (.) is a breach name. So, we need to filter on two conditions: a name match, and whether or not the breach contains passwords (the logic is retained from the starting point):

# keep only the names for breaches that meet the search criteria
| select(
  # breaches that match the search string
  (ascii_downcase | contains($breachSearch | ascii_downcase))

  # breaches with passwords
  and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
)

Note the use of data enrichment via $breachDetails[0], which gets passed on the command line with --slurpfile.
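To see why the script indexes into $breachDetails[0] rather than $breachDetails directly, here is a minimal sketch using a throwaway file. The breach name Alpha and its title are invented purely for illustration. The --slurpfile flag reads every JSON value in the file into an array, so a file holding a single lookup table arrives as an array with one element:

```shell
# Create a temporary file holding a single JSON value (hypothetical data)
tmpFile=$(mktemp)
echo '{"Alpha": {"Title": "Alpha Corp"}}' > "$tmpFile"

# The variable is an array wrapping the contents of the file
slurpedType=$(jq -n --slurpfile breachDetails "$tmpFile" '$breachDetails | type')
echo "$slurpedType"   # "array"

# The lookup table itself is the first (and only) element of that array
title=$(jq -n --slurpfile breachDetails "$tmpFile" '$breachDetails[0].Alpha.Title')
echo "$title"         # "Alpha Corp"

# Clean up the temporary file
rm "$tmpFile"
```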

Finally, we need to build the required dictionary for the matching breaches:

# build the dictionary to return
| {
  AccountName: $accountName,
  BreachName: .,
  BreachTitle: $breachDetails[0][.].Title,
  BreachedDataClasses: $breachDetails[0][.].DataClasses
}

Note the use of both our saved variable, and yet more data enrichment.
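If you’d like to experiment with the three steps without the files from the instalment ZIP, the following is a cut-down sketch of the same pipeline. The accounts (alice and bob), the breaches (Alpha and BetaScrape), and the shell variable names are all invented for illustration, and --argjson with a hand-built one-element array stands in for what --slurpfile would provide:

```shell
# Hypothetical stand-ins for the HIBP export and the breach database
exportJson='{"Breaches": {"alice": ["Alpha", "BetaScrape"], "bob": ["Alpha"]}}'
detailsJson='{
  "Alpha": {"Title": "Alpha Corp", "DataClasses": ["Email addresses", "Passwords"]},
  "BetaScrape": {"Title": "Beta Scrape", "DataClasses": ["Names"]}
}'

# Wrapping the details in [ ] mimics the array --slurpfile would produce
matches=$(printf '%s' "$exportJson" | jq -c \
  --argjson breachDetails "[$detailsJson]" \
  --arg breachSearch 'a' '
  .Breaches | to_entries
  | [
      # explode the entries, remember the account, then explode its breaches
      .[]
      | .key as $accountName
      | .value[]

      # keep only breach names that match the search and leaked passwords
      | select(
          (ascii_downcase | contains($breachSearch | ascii_downcase))
          and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
        )

      # build one dictionary per user-to-breach match
      | {AccountName: $accountName, BreachName: ., BreachTitle: $breachDetails[0][.].Title}
    ]')
echo "$matches"
```

Both breach names contain the letter a, but only Alpha includes passwords, so the result should contain one entry for alice and one for bob, and nothing for BetaScrape.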

For bonus credit, the additional challenge was to search both the name and title for a match. This is because the name visible on the HIBP website is actually the Title, and unlike the Name it can contain spaces and characters other than just letters and numbers. You’ll find the full solution in the file pbs165-challengeSolution-Bonus.jq in the instalment ZIP.

Before we look at the one change needed to get the bonus credit, let’s see why this functionality was worth adding with an example.

On the HIBP website there is a breach labelled ‘Kayo.moe Credential Stuffing List’, but if you search for that with 'kayo.moe' using our basic solution nothing is matched, even though mwkelly is caught up in that breach:

jq -f pbs165-challengeSolution-Basic.jq --slurpfile breachDetails hibp-breaches-20240518.json --arg breachSearch 'kayo.moe' hibp-pbs.demo.json
# Output: []

But, if we use our bonus solution we do see that mwkelly was caught up in that breach:

jq -f pbs165-challengeSolution-Bonus.jq --slurpfile breachDetails hibp-breaches-20240518.json --arg breachSearch 'kayo.moe' hibp-pbs.demo.json
# Output:
# [
#  {
#    "AccountName": "mwkelly",
#    "BreachName": "KayoMoe",
#    "BreachTitle": "Kayo.moe Credential Stuffing List",
#    "BreachedDataClasses": [
#      "Email addresses",
#      "Passwords"
#    ]
#  }
# ]

There’s only one change between the basic and bonus solutions, an extra clause in the select:

# the basic solution
| select(
  # breaches that match the search string
  (ascii_downcase | contains($breachSearch | ascii_downcase))

  # breaches with passwords
  and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
)

# the bonus solution
| select(
  # breaches that match the search string
  (
    # check the name
    (ascii_downcase | contains($breachSearch | ascii_downcase))

    # check the title
    or ($breachDetails[0][.].Title | ascii_downcase | contains($breachSearch | ascii_downcase))
  )

  # breaches with passwords
  and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
)
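To isolate the effect of the extra or clause, here is a small sketch that applies both versions of the name test to the same input. The breach AlphaCorp and its title are hypothetical, and the password check is left out to keep the focus on the name-versus-title matching:

```shell
# Hypothetical breach whose title contains a dot that its name does not
detailsJson='{"AlphaCorp": {"Title": "Alpha.corp Credential List"}}'

# Basic test: the search string is only compared against the breach name
nameOnly=$(jq -nc --arg breachSearch 'alpha.corp' '
  ["AlphaCorp"]
  | [ .[] | select(ascii_downcase | contains($breachSearch | ascii_downcase)) ]')
echo "$nameOnly"      # []

# Bonus test: the search string is compared against the name or the title
nameOrTitle=$(jq -nc --argjson breachDetails "[$detailsJson]" --arg breachSearch 'alpha.corp' '
  ["AlphaCorp"]
  | [ .[] | select(
      (ascii_downcase | contains($breachSearch | ascii_downcase))
      or ($breachDetails[0][.].Title | ascii_downcase | contains($breachSearch | ascii_downcase))
    ) ]')
echo "$nameOrTitle"   # ["AlphaCorp"]
```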

Manipulating without Exploding

Fundamentally there are three big-picture tasks we’ll be exploring today, with a few little bonus extras along the way:

Distilling multiple items down to a single value using an accumulator with the reduce operator.
Applying the same edit to every element in an array in one step with the map function.
Applying the same edit to every value in a dictionary in one step with the map_values function.

Distilling Data with `reduce`

The reduce operator is extremely powerful, so powerful that if you peep under the hood you’ll find many standard jq functions are just convenient wrappers around reduce.

The reduce operator allows you to loop over multiple items and build up the output value as you go.

The syntax is reduce GENERATOR_EXPRESSION as $ITEM_VARIABLE (INITIALISER_EXPRESSION; UPDATE_EXPRESSION), where:

GENERATOR_EXPRESSION: a filter that produces zero or more results. Often, this will simply be an array explosion. The value of . within this part of the reduce statement is the full input to the filter containing the reduce keyword.
$ITEM_VARIABLE: a variable that will store the generated value currently being processed on each pass through the loop.
INITIALISER_EXPRESSION: an expression to generate the initial value for what will become the output. This value will be updated by the UPDATE_EXPRESSION on each pass through the loop as it builds the outputted value. This expression is often as simple as just 0 or 1. The value of . within this part of the reduce statement is the full input to the filter containing the reduce statement.
UPDATE_EXPRESSION: an expression that generates the next value of the output-in-the-making from its current value, available as ., and the item currently being processed, available as $ITEM_VARIABLE.
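As a rough sketch of how the four parts fit together, the two commands below use made-up filters that exist only to make the moving parts visible. The first builds a string so you can see the initial value and the order in which the items are folded in, and the second uses length as the initialiser to show that . in that position really is the full input:

```shell
# Start from the string "start" and append each item as it is processed
# (-r prints the final string without surrounding quotes)
trace=$(jq -nr '[2, 4, 6] | reduce .[] as $item ("start"; . + " -> " + ($item | tostring))')
echo "$trace"     # start -> 2 -> 4 -> 6

# In the initialiser, . is the whole input array, so length is 3
# and the final value is 3 + 2 + 4 + 6
seeded=$(jq -n '[2, 4, 6] | reduce .[] as $item (length; . + $item)')
echo "$seeded"    # 15
```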

That sounds a lot more complicated than it really is, and to prove that, let’s reimplement some built-in functions with reduce.

Finding the Length of an Array with `reduce`

The filter to build our own implementation of the built-in length function is simply:

reduce .[] as $item (0; . + 1)

The way this works is that we loop over all values in the array passed to this filter by using .[] as the generator, we set the initial value of what will become the output to 0, then for each value produced by exploding the array (stored as the un-used $item), we increment the output-in-the-making by one. When all the values have been processed, the final accumulated value is returned as the output.

We can see this in action with:

# -n tells jq not to expect any input
jq -n '[2, 4, 6] | reduce .[] as $item (0; . + 1)'
# outputs 3
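A quick way to check our home-made version is to run it side by side with the built-in length function. This sketch also tries an empty array, where the generator produces nothing and the initial value should come back unchanged:

```shell
# Our reduce-based implementation
viaReduce=$(jq -n '[2, 4, 6] | reduce .[] as $item (0; . + 1)')

# The built-in function it imitates
viaBuiltin=$(jq -n '[2, 4, 6] | length')

# With nothing to loop over, the initial value is the output
emptyCase=$(jq -n '[] | reduce .[] as $item (0; . + 1)')

echo "$viaReduce $viaBuiltin $emptyCase"   # 3 3 0
```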

Adding all the Numbers in an Array with `reduce`

The filter to build our own version of the add function is simply:

reduce .[] as $item (0; . + $item)

Again, this works by exploding the input array to generate the items to loop over, then initialising the output to zero, and adding the current item being processed to the accumulating output each time.

We can see this filter in action with:

jq -n '[2, 4, 6] | reduce .[] as $item (0; . + $item)'
# outputs 12
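Our version is not a perfect clone of the built-in add function, and the following sketch highlights one difference worth knowing about. For an empty array the built-in returns null, while our filter returns its initial value of 0:

```shell
# For a non-empty array of numbers the two agree
viaReduce=$(jq -n '[2, 4, 6] | reduce .[] as $item (0; . + $item)')
viaBuiltin=$(jq -n '[2, 4, 6] | add')
echo "$viaReduce $viaBuiltin"       # 12 12

# For an empty array they differ
emptyReduce=$(jq -n '[] | reduce .[] as $item (0; . + $item)')
emptyBuiltin=$(jq -n '[] | add')
echo "$emptyReduce $emptyBuiltin"   # 0 null
```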

Multiplying all the Numbers in an Array with `reduce`

As it happens, jq does not provide a built-in function to multiply all the numbers in an array together, but if it did, it could be implemented with the following simple filter:

reduce .[] as $item (1; . * $item)

We can see this filter in action with:

jq -n '[2, 4, 6] | reduce .[] as $item (1; . * $item)'
# outputs 48
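The choice of initial value matters here. As a small illustration, starting the accumulator at 0 instead of 1 means every multiplication produces 0, no matter what the array contains:

```shell
# Initial value of 1: the product of the items
goodStart=$(jq -n '[2, 4, 6] | reduce .[] as $item (1; . * $item)')
echo "$goodStart"   # 48

# Initial value of 0: zero times anything is still zero
badStart=$(jq -n '[2, 4, 6] | reduce .[] as $item (0; . * $item)')
echo "$badStart"    # 0
```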

Implementing Factorial with `reduce`

The generator expression does not have to be a simple array explosion; it can be anything that produces zero or more values. To prove the point, let’s use the range() function as a generator to implement factorial. As a reminder, the factorial of a number is the result of multiplying every whole number from 1 to that number together. I.e. the factorial of 3 is 1 * 2 * 3, i.e. 6, and the factorial of 4 is 1 * 2 * 3 * 4, i.e. 24.

We can implement this in jq with the following filter:

reduce range(1; . + 1) as $i (1; . * $i)

Remember that the two-argument version of range starts with the first argument, and then counts up until the number before the second argument, so range(1; 4) will produce 1, 2 & 3. We need to loop over 1 to the number passed to the filter, so we need the range from 1 to . + 1.

Let’s see this filter in action with the number 5:

jq -n '5 | reduce range(1; . + 1) as $i (1; . * $i)'
# outputs 120
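To check the filter against the worked examples above, this sketch feeds it several numbers in one go and collects the results into an array. It includes 0, where range(1; 1) produces no values at all, so the initial value of 1 is returned untouched:

```shell
# Run the factorial filter once for each of 0, 1, 3 & 4
factorials=$(jq -nc '[(0, 1, 3, 4) | reduce range(1; . + 1) as $i (1; . * $i)]')
echo "$factorials"   # [1,1,6,24]
```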

Processing Arrays with the `map` Function

The map function applies a filter to every element in an array. The array to be manipulated should be passed as the input to map, and the filter to apply is passed as the only argument. Within the passed filter . represents ‘the array element being processed’. The function’s output is the updated array.

In effect, map lets you loop over an array in a single function call, and process each element without exploding the array.

For example, we can convert all the numbers in an array to their absolute values with:

# -c tells jq to render the output in a compact format
jq -nc '[1, -2, 3, -42] | map(abs)'
# outputs [1,2,3,42]
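One way to think about map is as shorthand for the explode-and-reassemble pattern we have used throughout the series. As a quick sketch with a made-up filter that multiplies by ten, both of the following should produce the same array:

```shell
# The explicit approach: explode, process, and re-assemble
explicit=$(jq -nc '[1, 2, 3] | [.[] | . * 10]')

# The same thing in a single step with map
viaMap=$(jq -nc '[1, 2, 3] | map(. * 10)')

echo "$explicit $viaMap"   # [10,20,30] [10,20,30]
```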

Simpler Array Filtering with `map(select())`

One of the most common uses of map is in conjunction with select, allowing you to filter an array down to only the elements that meet particular criteria without needing to explode and re-assemble it.

As an example, let’s filter the Nobel Prizes data set down to an array with just the physics prizes, with their order reversed so the oldest prize comes first, and the newest last. Here’s how we would do that without using map:

jq '[.prizes[] | select(.category == "physics")] | reverse' NobelPrizes.json

Note the need to explode the prizes with .prizes[], and then to re-assemble the array by wrapping the explosion and the select filters in square brackets ([ & ]).

With map, we can simplify things:

jq '.prizes | map(select(.category == "physics")) | reverse' NobelPrizes.json

We now have a simple filter chain without the need for nesting part of it within square brackets, making it easier to read and to maintain.

Note that this technique works because the select function returns empty rather than null when called with a value that does not match the specified criteria, transforming un-matched entries into absolute nothingness, hence deleting them!
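If you don’t have the Nobel Prizes data set to hand, the following sketch runs both approaches against a tiny invented stand-in with the same shape (a top-level prizes array of dictionaries with a category key). The records are made up for illustration and are not real prize data:

```shell
# A hypothetical three-entry stand-in for NobelPrizes.json
prizes='{"prizes": [
  {"year": "2021", "category": "physics"},
  {"year": "2021", "category": "peace"},
  {"year": "2020", "category": "physics"}
]}'

# Explode, select, and re-assemble
explicit=$(printf '%s' "$prizes" | jq -c '[.prizes[] | select(.category == "physics")] | reverse')

# The same filtering with map(select())
viaMap=$(printf '%s' "$prizes" | jq -c '.prizes | map(select(.category == "physics")) | reverse')

echo "$explicit"
echo "$viaMap"
```

Both commands should print the two physics entries with the 2020 one first, and the peace entry should be gone from both.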

Converting Dictionaries to Arrays with `map`

One final point to note — map will not throw an error when the input is a dictionary — it will simply treat the values within the key-value pairs as array elements, and output an array of their processed results. In other words, even when you pass it a dictionary, map will always return an array.

At first glance this might seem useless, how often do you have a dictionary whose keys you don’t need? But actually, it is occasionally useful. Imagine you have weekly sales data in a dictionary indexed by day-of-the-week, like so (weeklySales.json in the instalment ZIP):

{
  "mon": 2343,
  "tue": 4324,
  "wed": 4121,
  "thur": 8762,
  "fri": 11452,
  "sat": 32394,
  "sun": 0
}

To convert this to an array of numbers we can call map with a filter that passes the values through un-changed, i.e. simply .:

jq -c 'map(.)' weeklySales.json
# outputs [2343,4324,4121,8762,11452,32394,0]

Notice that the values in the array are in the order the keys were added to the dictionary, not in any kind of sorted order.
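As a sketch of why throwing away the keys can be handy, the example below uses a shortened, made-up version of the sales dictionary. Once the values are in an array they can be handed to any array function, such as add to get a total:

```shell
# Hypothetical three-day sales dictionary
sales='{"mon": 10, "tue": 20, "wed": 30}'

# map(.) keeps the values and drops the keys
values=$(printf '%s' "$sales" | jq -c 'map(.)')
echo "$values"   # [10,20,30]

# The resulting array can be fed straight into array functions
total=$(printf '%s' "$sales" | jq 'map(.) | add')
echo "$total"    # 60
```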

Processing Dictionaries with the `map_values` Function

In many ways the map_values function is very similar to the map function, but while map is primarily designed for processing arrays, map_values is primarily designed for processing dictionaries.

Its structure is similar: the input should be a dictionary, and it takes a filter to apply to each value in the dictionary as the only argument. As with map, within the filter passed as the argument, . is the value currently being processed. Note that only the dictionary’s values are altered, and there is no way to access the matching keys from within the passed filter.

So, to double all the sales data in our weekly sales dictionary, we would run the command:

jq 'map_values(. * 2)' weeklySales.json

This produces the following dictionary:

{
  "mon": 4686,
  "tue": 8648,
  "wed": 8242,
  "thur": 17524,
  "fri": 22904,
  "sat": 64788,
  "sun": 0
}

Similar to how map doesn’t throw an error when passed a dictionary, map_values won’t throw an error when passed an array. It will simply process the array, and return an updated array (not a dictionary).
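To make the saving concrete, here is a sketch that doubles the values in a small made-up dictionary twice, first by deconstructing it into entries and rebuilding it, and then with map_values. A third command shows map_values being handed an array:

```shell
# Hypothetical two-day sales dictionary
sales='{"mon": 10, "tue": 20}'

# The explicit approach: deconstruct into entries, edit, and rebuild
longWay=$(printf '%s' "$sales" | jq -c 'to_entries | map({key: .key, value: (.value * 2)}) | from_entries')

# The same edit in one step
shortWay=$(printf '%s' "$sales" | jq -c 'map_values(. * 2)')
echo "$longWay $shortWay"   # {"mon":20,"tue":40} {"mon":20,"tue":40}

# An array goes in, an array comes out
arrayCase=$(jq -nc '[1, 2, 3] | map_values(. + 1)')
echo "$arrayCase"           # [2,3,4]
```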

Beware of a Subtle Difference in how `map` & `map_values` Handle Multiple Filter Outputs

With map, when the filter passed as the argument produces more than one value for a given input, all those values get added to the output array, but map_values will only include the first output produced from each input. In effect, map_values behaves as if the filter passed as the argument were wrapped in a call to the first function before its result is assigned to the output dictionary or array.

As a somewhat contrived example to illustrate this difference, let’s pass an array of arrays to both functions and see what happens when we pass a filter that explodes these sub-arrays to both map and map_values:

# use map to explode sub-arrays
jq -nc '[[1], [2, 2], [3, 3, 3]] | map(.[])'
# outputs [1,2,2,3,3,3]

# try to use map_values to explode sub-arrays
jq -nc '[[1], [2, 2], [3, 3, 3]] | map_values(.[])'
# outputs [1,2,3]

As you can see, map added all the values the passed filter produced to the output, but map_values ignored all but the first value produced.

This same difference is also present when working with dictionaries:

# use map to explode array values
jq -nc '{a: [1], b: [2, 2], c: [3, 3, 3]} | map(.[])'
# outputs [1,2,2,3,3,3]

# try to use map_values to explode array values
jq -nc '{a: [1], b: [2, 2], c: [3, 3, 3]} | map_values(.[])'
# outputs {"a":1,"b":2,"c":3}

Again, as you can see, map added all the values the passed filter produced to the output, but map_values ignored all but the first value. Also note that, as stated above, map always outputs an array, even when passed a dictionary.
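Because the sub-arrays above repeat the same number, it’s hard to tell which of the produced values map_values actually kept. The sketch below uses distinct values to make that visible, and also shows how wrapping the filter in first gives map the same keep-only-the-first behaviour. It assumes a reasonably recent jq (version 1.6 or later), since older releases may handle multiple outputs differently:

```shell
# map keeps every value the filter produces
allValues=$(jq -nc '[[1], [2, 20], [3, 30, 300]] | map(.[])')
echo "$allValues"     # [1,2,20,3,30,300]

# map_values keeps only the first value produced for each element
# (assumes jq 1.6 or later)
firstOnly=$(jq -nc '[[1], [2, 20], [3, 30, 300]] | map_values(.[])')
echo "$firstOnly"     # [1,2,3]

# wrapping the filter in first makes map behave the same way
mapWithFirst=$(jq -nc '[[1], [2, 20], [3, 30, 300]] | map(first(.[]))')
echo "$mapWithFirst"  # [1,2,3]
```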

An Optional Challenge

Use map, map_values, and reduce to enrich a HIBP export like hibp-pbs.demo.json in two ways:

Remove the top-level .Breaches and .Pastes keys, and have the contents of the .Breaches lookup table become the top-level element.
Use map_values to replace the value of each key in the new top-level lookup table (initially a simple array of breach names) with a dictionary that contains two keys:
1. Breaches: an array of dictionaries indexed by Name, Title & DataClasses. Use the map function to process the entire array in one step.
2. ExposureScore: a value calculated from the enriched breach details that starts at 0, and adds 1 for each breach the user is caught up in that does not contain passwords, and 10 for each breach that does. Use the reduce operator to perform this calculation.

Final Thoughts

If our exploration of jq were a book, this would be the page that says The End, and the next page would have the heading Epilogue. It’s difficult to estimate these things, but I think we’ve now covered all the jq 90% of jq users will ever need, and then some, but I’d say we’ve only covered about two-thirds, or maybe three-quarters, of the language’s full suite of features. The remainder is extremely useful to a small number of people in a limited set of scenarios. When you need these additional features you’ll be delighted they exist, but most of us, most of the time, will never need them. So, if your brain is full at this stage, that’s not a problem, because you should have all the jq you need to get your work done.

But, some of the features we haven’t covered strike me as too useful to completely omit from the series, so I’ve been collecting them up, and I’ll combine them in one final instalment, which is coming up next. I expect there’s a good chance those of you who’ve really caught the jq bug will find at least some of those features extremely interesting, but I doubt many people will find all of them useful. The problem of course is that I think the subset people care about will vary widely 🙂

Consider the next instalment a kind of tasting menu — more courses than you’d normally eat, but they’ll all just be little tasters of what’s possible!


Join the Community

Find us in the PBS channel on the Podfeet Slack.

