PBS 165 of X: Variables (jq)

11 May 2024

In recent instalments we’ve been focusing on the philosophy of lookup tables in general and the practicalities of working with them in jq specifically. It had been my intention to leave variables in jq to the very end of the series because in jq, variables really are a rarely needed advanced feature. However, while preparing the previous instalment I discovered that transforming a lookup table representing a many-to-many relationship from one key to another is impossible without variables, so I changed my plans 🙂

In this instalment we’ll learn how to bind values to variable names in jq, and we’ll use that knowledge to finish our exploration of lookup tables.

Matching Podcast Episode

Listen along to this instalment on episode 793 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript with chapter marks: CCATP_2024_05_11

Installment Resources

The instalment ZIP file — pbs165.zip

PBS 164 Challenge Solution

The challenge set at the end of the previous instalment was to create a jq script that finds all users caught in a breach that matches a given search pattern case-insensitively.

This challenge encompassed five key concepts we’ve covered thus far in the series:

Converting a lookup-table-style dictionary into an array of entry-type dictionaries with to_entries.
The design pattern of exploding an array, processing its contents in some way, and then re-collecting the pieces using array construction — i.e. code of the form: [ .[] | … ].
The use of the two-argument version of any within a select to filter items that contain an array based on the value of any one element within the array.
The use of ascii_downcase on both strings being compared to make the comparison case-insensitive.
The use of the --arg command line option to pass a variable into jq.

With all that in mind, you’ll find a sample solution in the instalment ZIP as pbs164-challengeSolution-Basic.jq:

# Find users caught up in any breach that matches a given search string.
# Input:    JSON as downloaded from the HIBP service
# Output:   An array of account names (the parts of email addresses to the left of the @).
# Variables:
# - $breachSearch		The string to search breach names by un a case-insensitive way

# transform the lookup of breaches by AccountName into a list of entries:
# - keys will be account names
# - values will be arrays of breach IDs
.Breaches | to_entries

# select only the entries for users caught up in a breach with a matching ID
| [
  # explode the list of entries
  .[]

  # select only the entries for users caught up in the given breach
  | select(any(.value[]; . | ascii_downcase | contains($breachSearch | ascii_downcase)))

  # keep just the account name
  | .key
]

To run this script against the demo Have-I-Been-Pwnd domain breach export in the file hibp-pbs.demo.json to find all users caught up in a breach that contains drop in its breach name use the command:

jq -f ./pbs164-challengeSolution-Basic.jq --arg breachSearch drop hibp-pbs.demo.json

This produces the output:

[
  "egreen",
  "mwkelly"
]

With the comments removed, there is actually very little code in the script:

.Breaches | to_entries
| [
	.[]
	| select(any(.value[]; . | ascii_downcase | contains($breachSearch | ascii_downcase)))
  | .key
]

For context, the sample breach report used as input looks like this:

{
    "Breaches": {
      "josullivan": [
        "OnlinerSpambot"
      ],
      "egreen": [
        "Dropbox"
      ],
      "mwkelly": [
        "Dropbox",
        "KayoMoe",
        "LinkedIn",
        "LinkedInScrape",
        "PDL"
      ],
      "ahawkins": [
        "iMesh",
        "OnlinerSpambot"
      ],
      "ptraynor": [
        "Collection1"
      ]
    },
    "Pastes": {}
  }  

We are only interested in .Breaches, which is a lookup-table-style dictionary index by the username part of email addresses (the bit to the right of the @), mapping to arrays of breach names in the HIBP database.

We start by converting this lookup table into an array of entries with to_entries, where each entry will have a form like:

{
  "key": "mwkelly",
  "value": ["Dropbox","KayoMoe","LinkedIn","LinkedInScrape","PDL"]
}

We then need to keep only the entries that contain at least one breach matching the search string. To keep or discard an entire entry we need to use the select function with a condition as its only argument. The condition needs to check if any string in the .value array matches our string, so to do that we use the two-argument version of the any function to evaluate to true if any item in our array meets our criteria.

As a reminder, the first argument in the two-argument form of .any needs to be a filter that produces one or more values to test, and the second argument is a filter that performs the test.

In this case, we need to check every string in the .value array, so the first argument is simply that array exploded (.value[]). The second argument needs to be a filter that checks the current value being tested (.) against the passed search string in a case-insensitive way. The key here is to start by converting the current value to lowercase by piping the current value to ascii_downcase before piping it to the contains() function to check if it contains the lowercase version of our search string ($breachSearch).

At this stage, only entries matching our search string remain, and we only want to keep the email usernames, which are in the .key key of each entry, so we pipe the remaining entries to a simple filter that keeps just that key’s value.

All our processing is wrapped in square braches so the remaining email usernames get grouped into an array.

For bonus credit, there was a second challenge, to expand this solution to ignore any breach that does not contain passwords.

The key to this bonus challenge is to use the --slurpfile command line flag to import a downloaded version of the HIBP breach database and use it for data enrichment. You’ll find a copy of this database in the file hibp-breaches-202040329.json in the installment ZIP, and you’ll find my sample solution for bonus credit in pbs164-challengeSolution-Bonus.jq:

# Find users caught up in any breach that leaked passwords that matches a given search string.
# Input:    JSON as downloaded from the HIBP service
# Output:   An array of account names (the parts of email addresses to the left of the @).
# Variables:
# - $breachSearch   The string to search breach names by un a case-insensitive way
# - $breachDetails	An array containing a single entry, the JSON for the
#                   latest lookup table of HIBP breaches indexed by breach
#                   name

# transform the lookup of breaches by AccountName into a list of entries:
# - keys will be account names
# - values will be arrays of breach IDs
.Breaches | to_entries

# filter down to just the users caught up in breaches matching the given search string that leaked passwords
| [
	# explode the list of entries
	.[]

	# select only the entries for users caught up in a breach with a matching ID
	| select(any(.value[];
		# breaches that match the search string
		(. | ascii_downcase | contains($breachSearch | ascii_downcase))

		# breaches with passwords
		and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
	))

	# keep just the account name
  | .key
]

We can use this script to find any users caught up in a breach with an ID that contains ‘onliner’ that contained passwords with the command:

jq -f ./pbs164-challengeSolution-Bonus.jq --slurpfile breachDetails hibp-breaches-202040329.json --arg breachSearch onliner hibp-pbs.demo.json

This will return two users caught up in the OnlinerSpambot breach which did contain passwords:

[
  "josullivan",
  "ahawkins"
]

Notice that this script contains just two changes from the basic version.

Firstly, the --slurpfile command line argument is used to pass the database of all breaches to the script as the variable $breachDetails[0]. Remember that --slurpfile always wraps the contents loaded from the JSON file in an array in case the JSON file contains more than one element. In this case, the JSON file contains a single dictionary of breach data indexed by breach name. For each breach, the following keys are defined:

["AddedDate","BreachDate","DataClasses","Description","Domain","IsFabricated","IsMalware","IsRetired","IsSensitive","IsSpamList","IsSubscriptionFree","IsVerified","LogoPath","ModifiedDate","Name","PwnCount","Title"]

The one we’re interested in is DataClasses, which is an array of strings defining the types of data caught up in the breach, and for our purposes, we only need to know that any breach which contained passwords will have a .DataClasses array that contains the string "Passwords".

You can see this for yourself by extracting the data classes for the OnlinerSpambot breach with the command:

jq '.OnlinerSpambot.DataClasses' hibp-breaches-202040329.json

This shows that that breach contained the following:

[
  "Email addresses",
  "Passwords"
]

We can also see that the PDL breach that the user mwkelly was caught up on does not contain passwords:

jq '.PDL.DataClasses' hibp-breaches-202040329.json
#[
#  "Email addresses",
#  "Employers",
#  "Geographic locations",
#  "Job titles",
#  "Names",
#  "Phone numbers",
#  "Social media profiles"
#]

This lets us verify that the bonus script works as intended. If we use the basic script to search for pdl we get one user:

jq -f ./pbs164-challengeSolution-Basic.jq --arg breachSearch pdl hibp-pbs.demo.json
#[
#  "mwkelly"
#]

But if we use the bonus solution we get none:

jq -f ./pbs164-challengeSolution-Bonus.jq --slurpfile breachDetails hibp-breaches-202040329.json --arg breachSearch pdl hibp-pbs.demo.json
# []

Now, let’s look at the second change, the part of the script that makes use of the external data file to filter the results — the test within the call to select. In the basic version, this call was simply:

| select(any(.value[]; . | ascii_downcase | contains($breachSearch | ascii_downcase)))

But in the bonus version there is much more logic embedded within it:

| select(any(.value[];
  # breaches that match the search string
  (. | ascii_downcase | contains($breachSearch | ascii_downcase))

  # breaches with passwords
  and ($breachDetails[0][.].DataClasses | contains(["Passwords"]))
))

The single condition in the basic version has been replaced with two conditions, each grouped within parentheses, and separated from each other by the and operator. The first condition is the original one, it’s the second condition that’s new:

$breachDetails[0][.].DataClasses | contains(["Passwords"])

The first thing to note is that all this is happening inside the call to any() which is checking each breach ID in the list of breaches for a user one by one, so the value of the item currently being processed, i.e. ., is a breach name like "Dropbox", "LinkedIn", or "PDL".

This means the details of the breach currently being checked are in $breachDetails[0][.], and the list of data types the breach contained is in $breachDetails[0][.].DataClasses.

Remember that when the input and the argument to the contains function are both strings, it checks if the argument is a sub-string of the input, but when the input and the argument are both arrays, it checks if the input array contains every element in the argument array. We want to check the array of breached data types for the single string "Passwords", so we pipe the entire array of breached data types to the contains() function and give it the argument ["Passwords"].

How jq Thinks About Variables

As mentioned previously, jq has been designed to very rarely need variables — jq filter inputs and outputs are automatically plumbed together, and sub-expressions within the same filter see the same input, so they effectively share data automatically. This means that most of the time, you simply don’t need variables, and if you’re using them, you’re likely overcomplicating your code! Or, as the documentation puts it:

“there’s generally a cleaner way to solve most problems in jq than defining variables”

The example the documentation gives is the simple task of averaging an array of numbers.

In most languages, this simple task requires at least one, and usually two, variables. To illustrate this point, let’s blow the dust off our JavaScript skills and write a little commandline JavaScript script to average an array of numbers in JSON format. We’ll use the NodeJS JavaScript runtime to do this.

A few things to know before we start:

NodeJS scripts can read the contents of any stream, including STDIN, using the readFileSync() function in NodeJS’s standard file system module fs
NodeJS represents STDIN with the process.stdin object, whose fd (file descriptor) property can be passed to the readFileSync() function to instruct it to read from STDIN
By default, NodeJS sends the output from the console.log function to STDOUT

With that in mind, you’ll find the code for our simple NodeJS averaging script in the instalment ZIP as pbs165-a-average.js:

// A script to read the JSON for an array of numbers from STDIN and write their
// average to STDOUT

// read and parse STDIN
const numArray = JSON.parse(require('fs').readFileSync(process.stdin.fd, 'utf-8'));

//  calculate and output the average
let total = 0;
for(const num of numArray){
    total += num;
}
console.log(total/numArray.length);

We can use this script to average the numbers 1, 2, and 3 like so:

echo "[1, 2, 3]" | node pbs165-a-average.js
# 2

This outputs the correct average, 2.

Note that our code contains two variables:

A variable to hold the running total as we loop over the nubers
A variable to hold the array of numbers so we can both loop over it and access its length property later

We definitely can’t avoid the first variable, but what happens if we try avoid the second? What would happen is we made that second array anonymous by reading, parsing, and looping over it all at once without ever storing it in a variable?

We can calculate the total this way:

let total = 0;
for(const num of JSON.parse(require('fs').readFileSync(process.stdin.fd, 'utf-8'))){
    total += num;
}

But now we can’t calculate the average because we have no way of knowing how many numbers there were, so we’re forced to add another variable to count the numbers. You can see this alternative code in pbs165-a-average-alt.js:

// A script to read the JSON for an array of numbers from STDIN and write their
// average to STDOUT

// read, parse, and process STDIN
let count = 0;
let total = 0;
for(const num of JSON.parse(require('fs').readFileSync(process.stdin.fd, 'utf-8'))){
    count++;
    total += num;
}

// calculate and output the average
console.log(total/count);

The key point is that while we can change which two variables we need, but we still need two!

Look Mom — no Variables!

How many does jq need to accomplish the same task? None!

As proof, you’ll find working code to average an array of numbers in pbs165-a-average.jq in the instalment ZIP:

# output the average of an array of numbers passed as input
add/length

Before we look at why such simple code works, let’s just prove to ourselves that it really does work with:

echo "[1, 2, 3]" | jq -f pbs165-a-average.jq
# 2

There we have it — an average with just two function calls and one operator, and not a single variable needed!

This works because both the add and length functions receive the filter’s input, i.e. they both get the full array anonymously.

Reminder — We Already Undestand the Usefullness of Externally Defined Variables

Before we go on, let’s remind ourselves of something we already know — it can be very useful to access variables defined elsewhere from within our jq code. We’ve already seen two common uses of such externally defined variables:

Using the command line arguments --arg & --argjson to pass values as named variables into our filters/scripts to make them more reusable
Using the command line argument (--slurpfile) to load JSON data from external files into our filters/script as named variables for data enrichment

Rarely is not Never!

With all that said, we also know that there are times when you really do need a variable to store a value in one filter in your chain for reuse in another filter later in the chain. The reason we are learning about variables at this point in the series is that it is not possible to transform a lookup table indexed on one one-to-many key to an equivalent lookup table indexed on a different one-to-many key. Specifically, without variables, we can’t transform a Have-I-Been-Pwned export in its default format, mapping breach names to arrays of user names, to a lookup table mapping usernames to arrays of breach names.

So, while the need for internal jq variable definitions is rare, it is real, and as we’re about to learn, it can be done using the as keyword. But, my advice is still to always stop and think before using as — ask yourself ‘do I really need a variable, or am I not thinking in a jq-way?’.

Defining Variables Within jq Code With `as`

Before we look at how as works I want to take a moment to clarify some jargon, otherwise you’ll have a hard time finding the relevant sections in the jq manual.

Most languages use the term assignment operator to refer to the language’s operator(s) for assigning values to variables, but jq is different. We’ve already met jq’s suite of assignment operators, including =, |=, +=, and many more, but they have nothing to do with variables — they manipulate the data flowing through your filters instead.

So if jq does not use the term assignment for assigning values to variables, what term does it use? In jq you bind values to variable names using the symbolic binding operator as.

The as operator behaves similarly to the debug function in that it behaves as a passthrough — the input it receives passes unchanged to the output it returns. But, it has a desired side effect. For the debug function the side effect is that a message gets written to STDOUT, and for as the side effect is that a new variable comes into existence for all subsequent filters in the chain.

The syntax for as is quite simple. To the left is an expression that generates a value, and to the right is a variable name:

EXPRESSION as $variableName

Remember that in jq all variable names must start with a dollar symbol ($).

As a basic example, let’s redo our averaging script so it outputs a message of the form ‘The average of those N numbers is X’ rather than just outputting the number. Without variables, we have no way to remember the length of the original array at the end of our filter chain, but now that we know about variables we can keep track of it for later. You’ll find the code in the file pbs165-b-averageMessage.jq in the instalment ZIP:

# output the average of an array of numbers passed as input as a message of the
# form 'The average of those N numbers is X'

# store the length for later use
# NOTE - the original input is passed through
length as $numNumbers

# calculate the average
# NOTE — the input is the full array
| add/length

# assemble the message
# NOTE — the input is the average as a number
| "The average of those \($numNumbers) numbers is \(.)"

We can test this script like so:

echo "[1, 2, 3]" | jq -f pbs165-b-averageMessage.jq
# "The average of those 3 numbers is 2"

The key point to note is that the output of the first filter in the chain is not the length of the array; that has been stored in the variable $numNumbers, the output is the original input unchanged, i.e. the full array.

The full array is then used as the input for both the add and length functions in the filter that calculates the average, and that filter outputs the average as a number to the final filter. The final filter assembles the desired string by incorporating both its input, i.e. the average as a number, and the saved length from $numNumbers.

The `as` Operator Loops

When the expression to the left of the as operator generates one value the input gets passed through unchanged and the rest of the filter chain runs with the variable bound to that one value.

But, when the expression returns multiple values, it behaves similarly to a filter that explodes an array — the remaining filters in the chain run multiple times.

When you explode an array it’s the input passed to the next filter in the chain that changes each time, but as does something subtly and importantly different.

When the expression to the left of the as returns multiple values, the remaining filters in the chain run once for each value. Each time the input is the same, but the value bound to the variable changes.

To illustrate this behaviour, let’s write a script that takes a single number as the input and then produces the n-times tables for that number.

Before we look at the code, I need to introduce a very useful jq function, range, which can be used to generate sequences of numbers. It can take one, two, or three arguments:

range(n) returns the values 0 to n-1 in increments of 1, e.g. jq -cn '[range(3)]' outputs [0,1,2] (-n for no input & -c for compact output).
range(n; m) returns the values n to m-1 in increments of 1, e.g. jq -cn '[range(3; 5)]' outputs [3,4].
range(n; m; s) returns the values n to m-s in increments of s, e.g. jq -cn '[range(5; 25; 5)]' outputs [5,10,15,20].

OK, now that we know how to generate sequences of numbers, let’s generate our tables. You’ll find the code in pbs165-c-timesTables.jq:

# Given a single number as input, output the multiplication table for that
# number as a series of strings.

# Loop over the numbers from 1 to 10
# NOTE: the output for each iteration is the original input, 
#       i.e. the number received by the script
range(1; 11) as $n

# generate the appriate row of the table
# NOTE: the input is the number received by the script
| "\(.) x \($n) = \(. * $n)"

We can use this to print the 5-times tables with:

# -r for raw output, i.e. strings not wrapped in ""
echo '5' | jq -f pbs165-c-timesTables.jq -r
# 5 x 1 = 5
# 5 x 2 = 10
# 5 x 3 = 15
# 5 x 4 = 20
# 5 x 5 = 25
# 5 x 6 = 30
# 5 x 7 = 35
# 5 x 8 = 40
# 5 x 9 = 45
# 5 x 10 = 50

If we wanted the output as an array rather than as a series of strings we could wrap the loop in square braces like we do to recollect exploded arrays. You’ll find the code for this in the file pbs165-c-timesTables-array.jq:

# Given a single number as input, output the multiplication table for that
# number as an array of strings.

# start an array
[
    # Loop over the numbers from 1 to 10
    # NOTE: the output for each iteration is the original input, 
    #       i.e. the number received by the script
    range(1; 11) as $n

    # generate the appriate row of the table
    # NOTE: the input is the number received by the script
    | "\(.) x \($n) = \(. * $n)"
]

We can see this in action with:

echo '5' | jq -f pbs165-c-timesTables-array.jq
# [
#   "5 x 1 = 5",
#   "5 x 2 = 10",
#   "5 x 3 = 15",
#   "5 x 4 = 20",
#   "5 x 5 = 25",
#   "5 x 6 = 30",
#   "5 x 7 = 35",
#   "5 x 8 = 40",
#   "5 x 9 = 45",
#   "5 x 10 = 50"
# ]

We can of course combine this kind of looping with array exploding to generate multiple tables at once, as demonstrated in pbs165-c-timesTables-multiple.jq

# Given an array of numbers as input, output the multiplication tables for
# those numbers as an array of arrays of strings.

# start an outer array
[
    # explode the array of input numbers
    .[]

    # start an inner array for each input number
    | [
        # Loop over the numbers from 1 to 10
        # NOTE: the output for each iteration is the input number currently
        #       being processed
        range(1; 11) as $n

        # generate the appriate row of the current table
        | "\(.) x \($n) = \(. * $n)"
    ]
]

We can see that this works as desired with:

echo '[1, 2, 3]' | jq -f pbs165-c-timesTables-multiple.jq
# [
#   [
#     "1 x 1 = 1",
#     "1 x 2 = 2",
#     "1 x 3 = 3",
#     "1 x 4 = 4",
#     "1 x 5 = 5",
#     "1 x 6 = 6",
#     "1 x 7 = 7",
#     "1 x 8 = 8",
#     "1 x 9 = 9",
#     "1 x 10 = 10"
#   ],
#   [
#     "2 x 1 = 2",
#     "2 x 2 = 4",
#     "2 x 3 = 6",
#     "2 x 4 = 8",
#     "2 x 5 = 10",
#     "2 x 6 = 12",
#     "2 x 7 = 14",
#     "2 x 8 = 16",
#     "2 x 9 = 18",
#     "2 x 10 = 20"
#   ],
#   [
#     "3 x 1 = 3",
#     "3 x 2 = 6",
#     "3 x 3 = 9",
#     "3 x 4 = 12",
#     "3 x 5 = 15",
#     "3 x 6 = 18",
#     "3 x 7 = 21",
#     "3 x 8 = 24",
#     "3 x 9 = 27",
#     "3 x 10 = 30"
#   ]
# ]

This is a good example in that it uses both of the most common loop-like features in jq — .[] for looping over an array, and range(;) as for looping over a series of numbers.

A Practical Example — Transforming a HIBP Export

As mentioned previously, when you export a breach report for one of your domains in Troy Hunt’s Have-I-Been-Pwnd service in JSON format the .Breaches key is a lookup-type dictionary that indexes an array of breach names by the username part of the breached email address, i.e. the part to the left of the @.

This default output is perfect for answering questions like ‘What breaches has Bob been snared in?’, but it’s terrible for answering questions of the form ‘Who all is caught up in the big breach discovered last week?’.

The needed information is all there, but it’s in a very inconvenient shape for these kinds of questions. So, let’s combine what we learned in previous instalments with what we’ve just learned to transform our demo HIBP report to a lookup table indexing arrays of usernames by breach name.

You’ll find the full code in pbs165-d-hibp-namesByBreach.jq in the instalment ZIP:

# Build a lookup of usernames by breach name.
# Input:    A JSON export from HIBP (.Breaches is a lookup of breach names by username)
# Output:   A lookup table of usernames by breach name

# convert the breaches lookup to and array of entries
.Breaches | to_entries

# explode and re-collect the entries, expanding and reversing each
| [
    .[]

    # save the username for the current entry
    | .key as $username

    # expand the array of breaches in the current entry into multiple reversed entries
    | .value[] | {
        key: .,
        value: $username
    }
]

# group the entries by breach name, i.e. .key
| group_by(.key)

# explode and re-assemble into new grouped entries
| [
    .[]

    | {
        key: .[0].key,
        value: [ .[] | .value ]
    }
]

# build the final lookup
| from_entries

There’s a lot to digest in this code, so before we do, let’s see what it does. We’ll be starting with our sample HIBP breach export (hibp-pbs.demo.json):

{
    "Breaches": {
      "josullivan": [
        "OnlinerSpambot"
      ],
      "egreen": [
        "Dropbox"
      ],
      "mwkelly": [
        "Dropbox",
        "KayoMoe",
        "LinkedIn",
        "LinkedInScrape",
        "PDL"
      ],
      "ahawkins": [
        "iMesh",
        "OnlinerSpambot"
      ],
      "ptraynor": [
        "Collection1"
      ]
    },
    "Pastes": {}
}

As you can see, .Breaches is a lookup table listing breach names by username. We can transform that into the inverse lookup table, listing usernames by breach name with our script:

jq -f pbs165-d-hibp-namesByBreach.jq hibp-pbs.demo.json

This produces the following JSON:

{
  "Collection1": [
    "ptraynor"
  ],
  "Dropbox": [
    "egreen",
    "mwkelly"
  ],
  "KayoMoe": [
    "mwkelly"
  ],
  "LinkedIn": [
    "mwkelly"
  ],
  "LinkedInScrape": [
    "mwkelly"
  ],
  "OnlinerSpambot": [
    "josullivan",
    "ahawkins"
  ],
  "PDL": [
    "mwkelly"
  ],
  "iMesh": [
    "ahawkins"
  ]
}

To understand how the script works, let’s zoom right in and notice that two users in the original file have Dropbox in their list of breaches, egreen & mwkelly. When we look at the entry for Dropbox in the output we see it has those two users:

…
"Dropbox": [
  "egreen",
  "mwkelly"
]
…

Let’s use these data points to understand what the script does.

Firstly, the script converts the original .Breaches lookup table into an array of entry-type dictionaries with to_entries. Included in the output at this stage in the script’s execution are our two entries that reference Dropbox:

[
  …
  {
    "key": "egreen",
    "value": [
      "Dropbox"
    ]
  },
  {
    "key": "mwkelly",
    "value": [
      "Dropbox",
      "KayoMoe",
      "LinkedIn",
      "LinkedInScrape",
      "PDL"
    ]
  },
  …
]

To get where we need to go, we need to transform these entries in two ways:

Expand these one-to-many entries into more one-to-one entries, an entry for each combination of user name and breach name.
Swap the .key and .value keys in these one-to-one entries so the breach names become the keys, and the usernames the values.

Zooming in to just the entry for the user mwkelly, we need to transform this:

{
  "key": "mwkelly",
  "value": [
    "Dropbox",
    "KayoMoe",
    "LinkedIn",
    "LinkedInScrape",
    "PDL"
  ]
}

Into these:

{
  "key": "Dropbox",
  "value": "mwkelly"
}
{
  "key": "KayoMoe",
  "value": "mwkelly"
}
{
  "key": "LinkedIn",
  "value": "mwkelly"
}
{
  "key": "LinkedInScrape",
  "value": "mwkelly"
}
{
  "key": "PDL",
  "value": "mwkelly"
}

The following lines of code implement these transformations:

# explode and recollect the entries, expanding and reversing each
| [
    .[]

    # save the username for the current entry
    | .key as $username

    # expand the array of breaches in the current entry into multiple reversed entries
    | .value[] | {
        key: .,
        value: $username
    }
]

Notice that we save the .key in the original single entries into a variable named $username before we explode the original .value arrays — this is the only way we can retain access to both pieces of information after we explode the original .value array.

At this stage in our pipeline, we have a single array with individual entries for every single unique breach name to user name mapping, including the entries related to Dropbox:

[
  …
  {
    "key": "Dropbox",
    "value": "egreen"
  },
  {
    "key": "Dropbox",
    "value": "mwkelly"
  },
  …
]

Next, we use the group_by function to convert our single flat array into an array of arrays, one child array for each breach:

# group the entries by breach name, i.e. .key
| group_by(.key)

The two entries for the Dropbox breach are now together in a single nested array:

[
  …
  [
    {
      "key": "Dropbox",
      "value": "egreen"
    },
    {
      "key": "Dropbox",
      "value": "mwkelly"
    }
  ],
  …
]

Next we need to convert this array of arrays of entries into just an array of single entries, one for each nested array. I.e. we want to convert the above into:

[
  …
  {
    "key": "Dropbox",
    "value": [
      "egreen",
      "mwkelly"
    ]
  },
  …
]

The code for doing that is:

# explode and re-assemble into new grouped entries
| [
    .[]

    | {
        key: .[0].key,
        value: [ .[] | .value ]
    }
]

Remember that what we are exploding is an array of arrays of entries, and the desired result is to turn each of those nested arrays into a single entry, where .key is the breach they all relate to, and .value is an array of all the values. Because the entries are grouped by their key, the key in each child entry is the same, so we just take it from the first one with .[0].key. For the new .value we need an array of all the child .value strings, so we explode and reassemble the array, keeping just .value.

We now have a single array of entries in the desired form, so the last thing to do is just to convert them into a lookup table with to_entries.

Another Practical Example — Nobel Prizes by Laureate Name

Another dataset we’ve worked with a lot is the list of Nobel Prize winners from the Nobel Committee’s website.

A logical lookup table to build from this dataset is one that maps laureate names to arrays of dictionaries describing the two key features of each prize — its category and year. This would allow us to quickly find the years and categories for Marie Curie’s prizes.

You’ll find the code to build this lookup in the file pbs165-e-nobelPrizesByName.jq in the instalment ZIP:

# Build a lookup of nobel prize years and catetories by recipient.
# Input:    The JSON data set from the Nobel Prize committee
# Output:   A lookup table mapping recipient names to arrays of dictionaries
#           indexed by year and category

# explode the prizes, transform them to entries, and re-assemble
[
    .prizes[]

    # remember the current year and category
    | .year as $year
    | .category as $category

    # explode the laureates, if there are any
    | .laureates[]?

    # build an entry for each laureate
    | {
	    	# Note: pretty name logic from PBS 161 challenge solution
        key: ([.firstname, .surname // empty] | join(" ")),
        value: {
            year: $year,
            category: $category
        }
    }
]

# group the entries by name
| group_by(.key)

# explode the array of groups and collapse each into a single combined entry
| [
    .[] | {
        key: .[0].key,
        value: [ .[] | .value ]
    }
]

# assemble the entries into a lookup table
| from_entries

Before we look at how it works, let’s use it to answer our question about Marie Curie:

jq -f pbs165-e-nobelPrizesByName.jq NobelPrizes.json | jq '."Marie Curie"'
# [
#   {
#     "year": "1911",
#     "category": "chemistry"
#   },
#   {
#     "year": "1903",
#     "category": "physics"
#   }
# ]

The logic is quite simple — start by building individual entries for each laureate for each prize, then group those by name, collapse the groups into single entries with arrays as values and finally assemble the entries into a lookup table.

The only reason we could not do this earlier in the series is that until now, once we exploded the laureates, we lost access to the details of the prize those laureates were awarded. Now, we can capture the year and category in variables before we explode the laureates.

An Optional Challenge

Starting with pbs164-challengeSolution-Bonus.jq (or your own bonus credit solution to the previous challenge), update the script to capture the details from every matched breach for every user. For each matching user+breach combination, return a dictionary indexed by:

AccountName: the user’s email account name
BreachName: the name of the matching breach
BreachTitle: the title of the matching breach
BreachedDataClasses: the list of breached data classes

Hint: you’ll need to explode the list of breaches for each user while retaining their account name in a variable.

For bonus credit, update the matching logic to check both the breach title and breach ID for a case-insensitive match against the search string. Hint: you’ll need more parentheses and the or operator.

Final Thoughts

We’re getting close to the end of our exploration if jq. In this instalment we’ve wrapped up both our exploration of lookup tables and variables in jq.

In the next instalment we’ll learn how to transform all elements in an array or dictionary without needing to disassemble and then reassemble them. This won’t open up any new capabilities, but it will let you greatly simplify your filters, making them easier to write and maintain.

After that, I’m planning one final instalment to collect together a few other useful advanced features that haven’t fit cleanly into our narrative.

Working with Lookup Tables jq Miniseries Processing Arrays & Dictionaries sans Explosions

164. Working with Lookup Tables Programming by Stealth 166. Processing Arrays & Dictionaries sans Explosions

Join the Community

Find us in the PBS channel on the Podfeet Slack.

Podfeet Slack