PBS 164 of X: Working with Lookup Tables (jq)

30 Mar 2024

We dedicated most of the previous instalment to exploring conceptual ideas around different classifications of dictionaries — starting with record-style dictionaries which represent rows in a conceptual table, and then lookup-type dictionaries, also known as lookup tables which are analogous to indexes on a database table or column. We also learned that jq makes use of an intermediary format it refers to as entries for transforming data into lookup tables, and as we’ll see today, for disassembling lookup tables into arrays. A jq entry is simply a dictionary with a key named key and another named value, and it represents a single conceptual entry in a lookup table.

In the previous instalment we saw how quick and easy it is to find the data we need when we have lookup tables indexed on the appropriate field. If we have a lookup table that indexed employee records by email address, and we have specific email addresses we need to find the matching records for, it works wonderfully! But, what if we’re not looking for the data that matches a specific indexed value, but that meets a broader criteria? What if we need all employees with email addresses starting with a b? Worse still, we have 500 surnames we need to find the records for, but all we have is a lookup table indexed by email address?

In these situations, we need to disassemble the lookup table to be able to process it, and that’s what we’ll spend most of this instalment doing.

Another very important use of lookup tables is to map data in one data set to data in another. We have a data set we’re trying to process, and we have one or more lookup tables containing extra information about values we may find in that data set. In these situations, we need to be able to pull the lookup tables into our script as a variable, so we can use them to enrich the data we find in our primary data set.

A great example of this from the real world is processing JSON-formatted log files that contain usernames. You’re likely to want to map those usernames to real identities, but that data is not in the logs, it’s in a lookup table indexed by username. If you can reach into that lookup table you can use its data to enrich the output from your log search — rather than saying bartb was working late and logged into the VPN at 3 AM, you can say Bart Buscshots was working late and logged into the VPN!

We’ll finish this instalment learning how to do that kind of data enrichment with jq.

Matching Podcast Episode

Listen along to this instalment on episode 790 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript with chapter marks: CCATP_2024_03_30

Installment Resources

The instalment ZIP file — pbs164.zip

PBS 163 Challenge Solution

The challenge set in the previous instalment was to build a lookup table indexing Nobel Prize records from the Nobel Prizes data set we’ve been using by the year they were awarded.

The code for my sample solution is available in the instalment ZIP as pbs163-challengeSolution-basic.jq:

# Build a lookup table of Nobel Prizes indexed by year
# Input:    JSON as published by the Nobel Committee
# Output:   a lookup-style dictionary of record-style dictionaries
#           describing Nobel Prizes indexed by years as strings

# start by grouping the prizes by year
.prizes | group_by(.year)

# build the entries list
| [
    # explode the array of arrays of prizes
    .[]

    # build the entries
    | {
        # use the year from the first record in the child array
        # note: aready a string in the input data
        key: .[0].year,

        # use the entire child array of records as the value
        value: .
    }
]

# build the lookup table
| from_entries

The first thing to note is that because there are multiple prizes awarded each year, each year in the lookup table will need to map to multiple prizes, so the prizes will need to be stored in an array. This is why the first step is to break the .prizes array into an array of arrays by their year, hence starting with .prizes | group_by(.year). We now have an array of arrays of prizes, and within each child array, all the prizes belong to the same year.

The next step is to build our array of entry-type dictionaries, where the keys for the entries will be years, and the values will be arrays of Nobel Prize record-type dictionaries. Following our now familiar design pattern, we wrap this entire step in square braces to re-collect the pieces back into a single array, and we start the step by exploding our array of arrays of Nobel Prize Records. With our array of arrays exploded, we pipe each top-level array to a filter that constructs a new dictionary with the following key-value pairs:

A key named key with the value of the year key in the first Nobel Prize record in the array of prize records being processed, i.e .[0].year. Why the first one? Because there is always a first element in an array created by group_by, and since we grouped by year, every element in the array has the same value for year.
A key named value with the value of the array currently being processed, i.e. ..

We now have an array of entry-type dictionaries, so all we have to do to build our lookup table is pipe that array to jq’s from_entries function.

We could save this lookup table to a file so we can re-use it many times with a command like:

jq -f pbs163-challengeSolution-basic.jq NobelPrizes.json > NobelPrizes-byYear.json

But, for simplicity, let’s use our lookup table directly by piping the output of the jq command for building the lookup straight to a jq command for using it — let’s look up the prizes for 1980:

jq -f pbs163-challengeSolution-basic.jq NobelPrizes.json | jq '."1980"'

Note that because the keys in the lookup table are strings we need to quote the year — if you try .1980 jq thinks you mean the number 0.198!

Anyway, the output will be a full list of the 1980 prizes.

The key thing to note is that armed with this lookup table, we can find the prizes for a given value of our chosen index, the year, with a trivially simple filter — simply the desired year as a key in the lookup table, i.e. the 1980 prizes are at ."1980", the 2023 prizes are at ."2023", and so on. Compare that to the filter needed to get the same data without the lookup table:

.prizes[] | select(.year == "1980")

This effect is magnified even more when you use multi-level indexes, hence my choice of bonus challenge — to build a two-level lookup table indexing the Nobel Prize records by year and category. This lookup table answers the most likely question to ask the data set — “show me the prize X awarded in Y year”!

The key to building this kind of two-level lookup table is to conceptualise it as a lookup table of lookup tables of records. The outer lookup table indexes the inner lookup tables by year, and the inner lookup tables index the records by category. You’ll find my sample solution in the instalment resources folder as pbs163-challengeSolution-bonusCredit.jq:

# Build a lookup table of Nobel Prizes by year
# Input:    JSON as published by the Nobel Committee
# Output:   a lookup-style dictionary of record-style dictionaries
#           describing Nobel Prizes indexed by years as strings

# start by grouping the prizes by year
.prizes | group_by(.year)

# build the top-level entries list
| [
    # explode the array of arrays of prizes
    .[]

    # build the top-level entries
    | {
        # use the year from the first record in the child array
        # note: aready a string in the input data
        key: .[0].year,

        # build the second-level lookup table to use as the value
        value: (
            # Note: because there is only one prize per category per year,
            # there is no need to group by anything

            # build the entries for the second-level lookup table (category)
            [
                # explode the array records
                .[]

                # build the entries
                | {
                    # use the category as the key
                    key: .category,

                    # use the entire record as the value 
                    value: .
                }
            ]

            # assemble the second-level lookup table (category)
            | from_entries
        )
    }
]

# build the top-level lookup table (year)
| from_entries

The first thing to note is that the outer layers of this solution are the same as that for the basic solution because the outer lookup table still indexes the prizes by year, the difference is in the value for the value key in the top-level array of entry-type dictionaries. Instead of simply using the array of records produced by group_by, we build an entire new lookup table, this time, of records indexed by category. Because there is exactly one prize per category per year, the building of the inner lookup table is simpler, with no need to group by anything.

Again, we could save this lookup table to a file with a command like:

jq -f pbs163-challengeSolution-bonusCredit.jq NobelPrizes.json > NobelPrizes-byYearByCategory.json

But, for simplicity, we’ll just use the lookup table directly by piping it to another jq command to find the physics prize for 1980:

jq -f pbs163-challengeSolution-bonusCredit.jq NobelPrizes.json | jq '."1980".physics'

This shows that in 1980, the physics prize went to James Cronin and Val Fitch for the discovery of violations of fundamental symmetry principles in the decay of neutral K-mesons.

Again, notice that armed with the lookup table to filter to get a specific prize for a specific year is as simple as using the year and the category as keys on the lookup table. So, the filter to get the Peace Prize for 2023 is simply ."2023".peace. Now, compare that to the filter to get that prize directly from the raw data set:

.prizes[] | select(.year == "2023") | select(.category == "peace")

BTW, If you’re wondering, the 2023 Peace Prize was awarded to Narges Mohammadi for her fight against the oppression of women in Iran and her fight to promote human rights and freedom for all, which gets 👍👍 from me 🙂

Converting Lookups to Records with `to_entries`

Before we start this section, I want to introduce a new data set we’ll be using — a semi-fictional list of breached email addresses in the format provided by the wonderful Have-I-Been-Pwnd (HIBP) service. If you own your own domain, you can sign up to have HIBP monitor all breaches of accounts on your domain and email you each time one of your addresses gets caught up on a newly discovered breach (this service is free for small domains). When you sign up and validate your domain, you can download the full list of breaches on your domain in JSON format at any time. The file hibp-pbs.demo.json contains a sample domain download from HIBP constructed from a much larger genuine download that has been randomly sampled, had all the account names replaced, and all details from leaks on paste-bin-like sites removed completely:

{
  "Breaches": {
    "josullivan": [
      "OnlinerSpambot"
    ],
    "egreen": [
      "Dropbox"
    ],
    "mwkelly": [
      "Dropbox",
      "KayoMoe",
      "LinkedIn",
      "LinkedInScrape",
      "PDL"
    ],
    "ahawkins": [
      "iMesh",
      "OnlinerSpambot"
    ],
    "ptraynor": [
      "Collection1"
    ]
  },
  "Pastes": {}
}

Note the structure of the file — it is a dictionary indexed by two top-level keys, one named Breaches, containing the details of the domain’s breached users, and one named Pastes which has been intentionally emptied.

Looking at the value of the top-level Breaches key we see it’s a lookup table mapping account names to arrays of breach names. Presumably, to make the file dumps safer from a privacy POV, the domain name is omitted from the email addresses — what josullivan represents is the part of the breached email address before the @ symbol, so, since our data is for the imaginary domain pbs.demo, josullivan is the entry for josullivan@pbs.demo. The values in the arrays are the names of breaches on the HIBP website, so josullivan@pbs.demo was caught up in just one breach, the OnlinerSpambot. If you search for that name on the list of all breaches you’ll discover it was a breach in 2017 that exposed over 700M email addresses and passwords. You’ll also see that each breach has a permalink of the form https://haveibeenpwned.com/PwnedWebsites#BREACH_NAME, so the permalink for the OnlinerSpambot breach is https://haveibeenpwned.com/PwnedWebsites#OnlinerSpambot.

For a large domain that has been in use for many years, this list of breached accounts can be very large, often with hundreds, or even thousands, of breached email addresses. This data is stored as a lookup table indexed by account name. What if you need to search the data by breach instead of by account? Try to write the jq filter that takes that data as it is, and finds all users caught up in the OnlinerSpambot breach — I tried, but did not succeed!

To use this data for anything but a lookup by account name, we need to transform it from a lookup table into a list of records, and to do that we need the inverse of the from_entries function, which, unsurprisingly, is named to_entries!

Jq’s to_entries function takes a lookup table as its input and outputs an array of entry-type dictionaries, i.e. dictionaries with the two keys key & value. We can see what to_entries does by piping the .Breaches lookup table to it with the command:

jq '.Breaches | to_entries' hibp-pbs.demo.json

This outputs the following JSON:

[
  {
    "key": "josullivan",
    "value": [
      "OnlinerSpambot"
    ]
  },
  {
    "key": "egreen",
    "value": [
      "Dropbox"
    ]
  },
  {
    "key": "mwkelly",
    "value": [
      "Dropbox",
      "KayoMoe",
      "LinkedIn",
      "LinkedInScrape",
      "PDL"
    ]
  },
  {
    "key": "ahawkins",
    "value": [
      "iMesh",
      "OnlinerSpambot"
    ]
  },
  {
    "key": "ptraynor",
    "value": [
      "Collection1"
    ]
  }
]

To get a traditional array of record-type dictionaries we need to transform each of those entries into dictionaries indexed by AccountName and BreachNames. We can do that with the following jq filter, which you’ll find in the instalment ZIP as hibp-toRecordList.jq:

# Convert a HIBP export to a list of records
# Input:    JSON as downloaded from the HIBP service
# Output:   an array of records, each indexed by AccountName and BreachNames

# start by disassembling the Breaches lookup table
.Breaches | to_entries

# transform the array of entries to an array of records
| [
    # explode the array of entries
    .[]

    # build the records
    | {
        # use the key from the entry as the account name
        AccountName: .key,

        # use the value from the entry as the list of breach names
        BreachNames: .value
    }
]

If you remove the comments you’ll see that the filter is actually very straightforward — convert the lookup table to a list of entries, then convert each entry in that list to a record-type dictionary and re-assemble those into the final array:

.Breaches | to_entries | [
    .[] | {
        AccountName: .key,
        BreachNames: .value
    }
]

This gives us our data in a structure we can easily process with jq:

[
  {
    "AccountName": "josullivan",
    "BreachNames": [
      "OnlinerSpambot"
    ]
  },
  {
    "AccountName": "egreen",
    "BreachNames": [
      "Dropbox"
    ]
  },
  {
    "AccountName": "mwkelly",
    "BreachNames": [
      "Dropbox",
      "KayoMoe",
      "LinkedIn",
      "LinkedInScrape",
      "PDL"
    ]
  },
  {
    "AccountName": "ahawkins",
    "BreachNames": [
      "iMesh",
      "OnlinerSpambot"
    ]
  },
  {
    "AccountName": "ptraynor",
    "BreachNames": [
      "Collection1"
    ]
  }
]

Let’s save this record list to a file named hibp-pbs.demo-records.json with the command:

jq -f hibp-toRecordList.jq hibp-pbs.demo.json > hibp-pbs.demo-records.json

We can now query it similarly to how we’ve been querying our list of Nobel Prize records, so let’s get that list of users caught up on the OnlinerSpambot breach:

# -r for raw output
jq -r '.[] | select(any(.BreachNames[]; . == "OnlinerSpambot")) | .AccountName' hibp-pbs.demo-records.json

This shows us we have two affected users:

josullivan
ahawkins

We’ve not used the two-argument version of the any function in some time. As a reminder, the first argument is the so-called generator which builds a list of values for the second argument, the so-called test to evaluate to a boolean. If any one of the generated values evaluates to true, then the any function also returns true, so the select will pass its entire input as its output. If none of the generated values evaluate to true then any will return false, and the select will output nothing at all, vanishing the record into oblivion.

By transforming the original lookup table and saving it as a list of records, we have assumed we’ll be querying these records many times, but that may of course not be true.

Querying a Lookup Table Without Saving the Records First

If you need to query a lookup table just once, you can simply transform it to entries and query those entries directly, for example, we can perform the same query on the original lookup table with the following command:

jq -r '.Breaches | to_entries | .[] | select(any(.value[]; . == "OnlinerSpambot")) | .key' hibp-pbs.demo.json

Filtering a Lookup Table

Something else you may want to do is filter a lookup table down to only entries that meet certain criteria. In other words, you want the output to be the original lookup table, but with fewer entries. The algorithm for achieving this task is simply to:

Convert the lookup table to a list of entries
Filter the list of entries
Convert the list of entries back to a lookup table

As an example, let’s filter our original lookup table from HIBP down to just the entries caught up in the infamous Dropbox breach which leaked over 68M email addresses and passwords in 2012.

Because the lookup table we want to filter is not the top-level file, but a key within it, we will use the update assignment operator to redefine the .Breaches key, then we’ll follow the standard algorithm to calculate the new value for the .Breaches key. You’ll find the jq filter in the file pbs164a.jq in the instalment ZIP:

# Filter a HIBP export down to just the accounts caught up in the
# Dropbox breach
# Input:    JSON as downloaded from the HIBP service
# Output:   The original input with the Breaches lookup table filtered down

# update the Breaches lookup table in place
.Breaches |= (
	# start by converting the lookup table to a list of entries
	to_entries
	
	# filter the entries down to just those caught up on the Dropbox breach
	| [
		.[] | select(any(.value[]; . == "Dropbox"))
	]
	
	# re-assemble the remaining entries into a lookup table
	| from_entries
)

We can now run this filter with the following command:

jq -f pbs164-a.jq hibp-pbs.demo.json

When we do, we see we get the original JSON, but with the Breaches lookup table reduced to just the users caught up on the Dropbox Breach:

{
  "Breaches": {
    "egreen": [
      "Dropbox"
    ],
    "mwkelly": [
      "Dropbox",
      "KayoMoe",
      "LinkedIn",
      "LinkedInScrape",
      "PDL"
    ]
  },
  "Pastes": {}
}

It seems silly to hard-code the name of the breach into this filter, so let’s use a variable to make this search generic. You’ll find this filter in the file hibp-filterByBreach.jq in the instalment ZIP:

# Filter a HIBP export down to just the accounts caught up a given breach
# Input:    JSON as downloaded from the HIBP service
# Output:   The original input with the Breaches lookup table filtered down
# Variables:
# - $breach:    The name of the breach to filter by

# update the Breaches lookup table in place
.Breaches |= (
	# start by converting the lookup table to a list of entries
	to_entries
	
	# filter the entries down to just those caught up on the Dropbox breach
	| [
		.[] | select(any(.value[]; . == $breach))
	]
	
	# re-assemble the remaining entries into a lookup table
	| from_entries
)

We can now use this filter to query for any breach, for example, the equally infamous LinkedIn breach which leaked over 161M email addresses and passwords in 2022:

jq -f hibp-filterByBreach.jq --arg breach LinkedIn hibp-pbs.demo.json

This shows only one of our users was caught up:

{
  "Breaches": {
    "mwkelly": [
      "Dropbox",
      "KayoMoe",
      "LinkedIn",
      "LinkedInScrape",
      "PDL"
    ]
  },
  "Pastes": {}
}

Reindexing Lookups

Note that because we have not yet learned how to create variables in jq, and because the values in the original HIBP lookup table are arrays, we cannot yet transform the HIBP lookup table from being indexed by account name to being indexed by breach name. Why? Because it would require us to explode the array of breach names, at which point we lose access to the account name.

Because we’re now well into our exploration of jq, and purely as a bonus extra, I’ve included the needed jq filter to do that transformation, with the help of a variable named $accountName, in the file hibp-transformToByAccountName.jq in the instalment ZIP. There are detailed comments in the file, and the syntax for creating a variable is quite intuitive, so you may be pleasantly surprised and find that you can follow the code, but there is absolutely no expectation that that will be the case.

To see what the result of transforming the lookup table in our HIBP sample data from being indexed by account name to being indexed by breach name looks like, you can run the command:

jq -f hibp-transformToByAccountName.jq hibp-pbs.demo.json

You’ll see that we get:

{
  "Breaches": {
    "Collection1": [
      "ptraynor"
    ],
    "Dropbox": [
      "egreen",
      "mwkelly"
    ],
    "KayoMoe": [
      "mwkelly"
    ],
    "LinkedIn": [
      "mwkelly"
    ],
    "LinkedInScrape": [
      "mwkelly"
    ],
    "OnlinerSpambot": [
      "josullivan",
      "ahawkins"
    ],
    "PDL": [
      "mwkelly"
    ],
    "iMesh": [
      "ahawkins"
    ]
  },
  "Pastes": {}
}

So, if we can’t learn to reindex lookup tables with our HIBP data, what can we use? Let’s use a nice simple lookup of PBS creators indexed by username (the contact details are entirely fictitious!), which you’ll find in the file pbsStaff-byUsername.json in the instalment ZIP:

{
    "allisons": {
        "Name": "Allison Sheridan",
        "Username": "allisons",
        "Email": "allison@pbs.demo",
        "Website": "https://www.podfeet.com/"
    },
    "bartb": {
        "Name": "Bart Busschots",
        "Username": "bartb",
        "Email": "bart@pbs.demo",
        "Website": "https://www.bartb.ie/"
    },
    "helmavdl": {
        "Name": "Helma van der Linden",
        "Username": "helmavdl",
        "Email": "helma@pbs.demo"
    }
}

Now, let’s transform this lookup table into a new one indexed by email address.

The algorithm for this kind of transformation is always the same:

Disassemble the original lookup table into an array of entries
Transform the entries into their desired new shape
Re-assemble the entries into the new lookup table

The jq filter to implement that algorithm for our example can be found in the file pbsStaff-transformToByEmail.jq in the instalment zip:

# Transform the PBS staff by username lookup table into a lookup table by email
# Input:    A lookup table of PBS staff member records by username
# Output:   A loopkup table of PBS staff member records by email

# start by converting the original lookup table to a list of entries
to_entries

# transform the entries to change the keys to email addresses
| [
    # explode the list of entries
    .[]

    # update the value of the key named key using plain assignment
    # Note: must use plain assignment so . represents the entire entry
    | .key = .value.Email
]

# assemble the updated entries into the new lookup table
| from_entries

As you can see, with the comments removed this is actually a very simple filter:

to_entries| [
    .[] | .key = .value.Email
] | from_entries

The important subtlety to notice is the use of the plain assignment operator = to update the value of the key named key. Remember that the difference between plain assignment and update assignment is the value of . on the right-hand side of the expression. With update assignment the value of . would be the current value of .key, which means we would have no way to access the email address, which is what we need to change the value of .key to. But, by using plain assignment, the value for . is the item currently being processed, which in this case is the full entry-type dictionary, so we can reach the email address inside the key named value with .value.Email.

We can run this filter against our JSON file with the command:

jq -f pbsStaff-transformToByEmail.jq pbsStaff-byUsername.json

Which produces the expected lookup table indexed by email address:

{
  "allison@pbs.demo": {
    "Name": "Allison Sheridan",
    "Username": "allisons",
    "Email": "allison@pbs.demo",
    "Website": "https://www.podfeet.com/"
  },
  "bart@pbs.demo": {
    "Name": "Bart Busschots",
    "Username": "bartb",
    "Email": "bart@pbs.demo",
    "Website": "https://www.bartb.ie/"
  },
  "helma@pbs.demo": {
    "Name": "Helma van der Linden",
    "Username": "helmavdl",
    "Email": "helma@pbs.demo"
  }
}

Loading Data Structures from External Files

One of the most common uses for lookup tables is to enrich data. To illustrate this point, let’s switch back to our example HIBP domain breach report, as it would be downloaded from HIBP (hibp-pbs.demo.json):

{
    "Breaches": {
      "josullivan": [
        "OnlinerSpambot"
      ],
      "egreen": [
        "Dropbox"
      ],
      "mwkelly": [
        "Dropbox",
        "KayoMoe",
        "LinkedIn",
        "LinkedInScrape",
        "PDL"
      ],
      "ahawkins": [
        "iMesh",
        "OnlinerSpambot"
      ],
      "ptraynor": [
        "Collection1"
      ]
    },
    "Pastes": {}
  }  

Now, imagine the Onliner Spambot breach had just happened, and we wanted to send all our affected users an email warning them about the breach, wouldn’t it be useful to be able to add more information than just the name of the breach?

To enrich this data what we really need is a lookup table with breach details indexed by breach name. As it happens, you can download a list of breach records from the HIBP site, and we can easily transform that into our desired lookup table.

You can download the up-to-date list of breaches from the URL https://haveibeenpwned.com/api/v3/breaches. This gives the JSON for an array of records with the names of the breaches stored within each record in the key named Name. To transform that into the desired lookup table we can use the command:

curl 'https://haveibeenpwned.com/api/v3/breaches' | jq '[.[] | {key: .Name, value: .} ] | from_entries'

You’ll find the result of running that command on the 29th of March 2024 in the file hibp-breaches-20240329.json in the instalment ZIP.

Each entry in this lookup takes the following form:

{
  "Name": "OnlinerSpambot",
  "Title": "Onliner Spambot",
  "Domain": "",
  "BreachDate": "2017-08-28",
  "AddedDate": "2017-08-29T19:25:56Z",
  "ModifiedDate": "2017-08-29T19:25:56Z",
  "PwnCount": 711477622,
  "Description": "In August 2017, a spambot by the name of <a href=\"https://benkowlab.blogspot.com.au/2017/08/from-onliner-spambot-to-millions-of.html\" target=\"_blank\" rel=\"noopener\">Onliner Spambot was identified by security researcher Benkow moʞuƎq</a>. The malicious software contained a server-based component located on an IP address in the Netherlands which exposed a large number of files containing personal information. In total, there were 711 million unique email addresses, many of which were also accompanied by corresponding passwords. A full write-up on what data was found is in the blog post titled <a href=\"https://www.troyhunt.com/inside-the-massive-711-million-record-onliner-spambot-dump\" target=\"_blank\" rel=\"noopener\">Inside the Massive 711 Million Record Onliner Spambot Dump</a>.",
  "LogoPath": "https://haveibeenpwned.com/Content/Images/PwnedLogos/Email.png",
  "DataClasses": [
    "Email addresses",
    "Passwords"
  ],
  "IsVerified": true,
  "IsFabricated": false,
  "IsSensitive": false,
  "IsRetired": false,
  "IsSpamList": true,
  "IsMalware": false,
  "IsSubscriptionFree": false
}

To make use of this lookup table in our filters, we can load it into a variable using the jq option --slurpfile. Similar to the --arg option, this option requires two values: a variable name, and a file path. To use this lookup table in our filters as $breachDetails we would use jq --slurpfile breachDetails hibp-breaches-20240329.json.

The --slurpfile option is designed to support files containing multiple JSON data definitions, separated by newline characters, so the variable will always be an array, even when the file only contains one JSON data definition.

We now have all the pieces we need to write a jq command to construct a CSV file that could be used as the input for a mail merge to all affected users. Let’s start by building the jq filer, assuming we will name our slurped lookup table $breachDetails. You’ll find the code in the file pbs1674-b.jq in the instalment ZIP:

# Build the CSV for a mail merge warning users caught up in the OnlinerSpambot
# breach that they have been breached
# Input:    JSON as downloaded from the HIBP service
# Output:   CSV data with the columns TBD
# Variables:
# - $breachDetails	An array containing a single entry, the JSON for the
#                   latest lookup table of HIBP breaches indexed by breach
#                   name               


# transform the lookup of breaches by AccountName into a list of entries
.Breaches | to_entries

# filter down to just the users caught up in the OnlinerSpambot breach
| [
	# explode the list of entries
	.[]

	# select only the entries for users caught up in the OnlinerSpambot breach
	| select(any(.value[]; . == "OnlinerSpambot"))

	# transform the remaining entries into a record for the mail merge
	| {
		# address the email to the account name (the entry key) at the domain name
		To: "\(.key)@pbs.demo",
		BreachTitle: $breachDetails[0].OnlinerSpambot.Title,
		BreachDate: $breachDetails[0].OnlinerSpambot.BreachDate,
		BreachDescriptionHTML: $breachDetails[0].OnlinerSpambot.Description,
		BreachedData: ($breachDetails[0].OnlinerSpambot.DataClasses | join(", "))
	}
]

# -- format the data as CSV --

# first output a header row
|  (["To", "BreachTitle", "BreachDate", "BreachDescriptionHTML", "BreachedData"] | @csv)

# then output the data
, (.[] | [.To, .BreachTitle, .BreachDate, .BreachDescriptionHTML, .BreachedData] | @csv)

This script starts by converting the Breaches lookup table to a list of entries and then filters that list down to just those whose list of breach names contains OnlinerSpambot. So far, nothing new. The next part is where the data enrichment happens.

The entries are piped to a filter that uses jq’s dictionary construction syntax to build new dictionaries containing the account name, and additional information about the breach from an external lookup table in $breachDetails. Note that because the --slurpfile flag always imports an array, our lookup table is actually at $breachDetails[0]. We then reach into this lookup table to get the details for our breach at $breachDetails[0].OnlinerSpambot.

Finally, we need to output these nice new dictionaries in CSV format. We do that by piping the list of dictionaries to one final filer that does two things — outputs a CSV header line, and then all the matching entries in CSV format. Notice the use of the and also operator (,) which we’ve not seen in a while. Since the @csv formatter only works on arrays, we need to build our headers as an array of strings before piping them to the formatter, and similarly, we need to explode our array of dictionaries, and then build arrays with the data before piping to the formatter for final output.

We can run this script with the command below to build a CSV file named hibp-mailmerge.csv:

jq -r -f pbs164-b.jq --slurpfile breachDetails hibp-breaches-20240329.json hibp-pbs.demo.json > hibp-mailmerge.csv

This example jq script is useful, but we’ve hard-coded the breach name into the script. Wouldn’t it be better to replace the hard coding with a variable so we could reuse our script for all future breaches? Let’s do that. You’ll find the improved code in the file hibp-buildMailMergeCSV.jq in the instalment ZIP:

# Build the CSV for a mail merge warning users caught up in a given breach
# Input:    JSON as downloaded from the HIBP service
# Output:   CSV data with the columns TBD
# Variables:
# - $breachName     The name of the breach to notify users about 
# - $breachDetails	An array containing a single entry, the JSON for the
#                   latest lookup table of HIBP breaches indexed by breach
#                   name

# transform the lookup of breaches by AccountName into a list of entries
.Breaches | to_entries

# filter down to just the users caught up in the given breach
| [
	# explode the list of entries
	.[]

	# select only the entries for users caught up in the given breach
	| select(any(.value[]; . == $breachName))

	# transform the remaining entries into a record for the mail merge
	| {
		# address the email to the account name (the entry key) at the domain name
		To: "\(.key)@pbs.demo",
		BreachTitle: $breachDetails[0].[$breachName].Title,
		BreachDate: $breachDetails[0].[$breachName].BreachDate,
		BreachDescriptionHTML: $breachDetails[0].[$breachName].Description,
		BreachedData: ($breachDetails[0].[$breachName].DataClasses | join(", "))
	}
]

# -- format the data as CSV --

# first output a header row
|  (["To", "BreachTitle", "BreachDate", "BreachDescriptionHTML", "BreachedData"] | @csv)

# then output the data
, (.[] | [.To, .BreachTitle, .BreachDate, .BreachDescriptionHTML, .BreachedData] | @csv)

The only difference in this generalised script is that everywhere we had hard-coded OnlinerSpambot we now use $breachName. One important subtle is that to use the value of a variable as the key in a lookup table you must wrap it in square braces, so lines like:

BreachTitle: $breachDetails[0].OnlinerSpambot.Title,

Need to be rewritten as lines like:

BreachTitle: $breachDetails[0].[$breachName].Title,

We can use this generic jq script to build a mail merge for any breach, for example, the command below re-builds the mail merge for the Dropbox breach:

jq -r -f hibp-buildMailMergeCSV.jq --arg breachName Dropbox --slurpfile breachDetails hibp-breaches-20240329.json hibp-pbs.demo.json > hibp-mailmerge.csv

An Optional Challenge

Write a jq filter that takes as an argument a search string, and filters a HIBP export down to just the users caught up in any breach that matches the search string. The search should be case-insensitive, so linkedin should match all three of the breaches at LinkedIn (from 2012, 2021 & 2023).

For bonus credit, update your filter to make use of the breaches data file from HIBP to ignore any breaches that did not expose passwords.

Final Thoughts

This exploration of lookup tables proved to be a deeper well than I had expected, and writing these instalments has triggered a slight rethink in terms of the order and focus of our remaining jq instalments. I’ve decided to move our exploration of variables forward, so that will be the topic for the next instalment. After that we’ll look at editing the contents of arrays and dictionaries in place, greatly reducing the need to explode arrays. We’ll then finish the series with a brief look at some advanced topics.

Lookup Tables & Records jq Miniseries Variables

163. Lookup Tables & Records Programming by Stealth 165. Variables

Join the Community

Find us in the PBS channel on the Podfeet Slack.

Podfeet Slack