
Programming by Stealth

A blog and podcast series by Bart Busschots & Allison Sheridan.

PBS 167 of X — jq: Recursion, Syntactic Sugar, Some old Friends and a Few Honourable Mentions

When we started this series within a series I expected it to be a short little interlude, but it proved to be a very rich vein with some great content. At the time I was just learning jq myself for processing JSON data from APIs in work, and I was well motivated to learn as much as I could as quickly as I could. There’s no better way to learn than to try to teach, so for this entire series I have been just a few instalments ahead of the most recent recorded episode!

This instalment is one final push to sweep up some parts of jq that are useful, but that most people won’t need most of the time. The vast bulk of this instalment is dedicated to more efficient ways of doing things we can already do, but we start with one entirely new topic — processing recursive data.

Most data is not recursive, that is to say arbitrarily deeply nested in self-similar layers, but when your data is recursive, you’ll need tools beyond what we’ve covered so far.

Once we have that last piece of heavy lifting out of the way, everything else is about doing what we can already do in potentially better ways. If you never learned any of it, you’d still be able to achieve the same outcomes, though perhaps with quite a bit more typing!

Matching Podcast Episode

Listen along to this instalment on episode 795 of the Chit Chat Across the Pond Podcast.

You can also Download the MP3

Read an unedited, auto-generated transcript with chapter marks: CCATP_2024_06_07

Instalment Resources

PBS 166 Challenge Solution

The final jq challenge in this series-within-a-series was to use the map, map_values, and reduce functions we learned about in the previous instalment to refactor and enrich a domain data breach report from the wonderful Have-I-Been-Pwned service (HIBP) in two ways:

  1. Remove the top-level .Breaches and .Pastes keys, and have the contents of the .Breaches lookup table become the top-level element.
  2. Use map_values to replace the value of each key in the new top-level lookup table (initially a simple array of breach names) with a dictionary that contains two keys:
    1. Breaches — an array of dictionaries indexed by Name, Title & DataClasses. Use the map function to process the entire array in one step.
    2. ExposureScore — a value calculated from the enriched breach details that starts at 0, and adds 1 for each breach the user is caught up in that does not contain passwords, and 10 for each breach that does. Use the reduce function to perform this calculation.

We’ll use the dummy HIBP export for the imaginary domain demo.bartificer.net to test our solution, and you’ll find that export in the instalment ZIP as hibp-pbs.demo.json. To enrich the data we’ll need an export of all the breaches HIBP knows, which we can download for free as described in instalment 164. For convenience, the instalment ZIP includes a copy of the database as it was on the 4th of June 2024 in the file hibp-breaches-20240604.json.

You’ll find a full sample solution in the file pbs166-challengeSolution.jq in the instalment zip:

# Enrich the breach data in a domain report from the Have-I-Been-Pwned service.
# Input:    JSON as downloaded from the HIBP service
# Output:   A lookup-style dictionary with breach data for domain users indexed
#   by email username (the part to the left of the @). For each user the
#   lookup table maps to a dictionary indexed by:
#   - Breaches:         An array of dictionaries describing the breaches the
#                       user was snared in.
#   - ExposureScore:    A measure of how exposed the user is.
# Variables:
# - $breachDetails	An array containing a single entry, the JSON for the
#                   latest lookup table of HIBP breaches indexed by breach
#                   name

# Keep just the Breach details
.Breaches

# re-build the values for every key
| map_values({
    # build the enriched breach data
    Breaches: map({
        Name: .,
        Title: $breachDetails[0][.].Title,
        DataClasses: $breachDetails[0][.].DataClasses
    }),

    # calculate the exposure score
    ExposureScore: (reduce .[] as $breachName (0; 
        . + (($breachDetails[0][$breachName].DataClasses | contains(["Passwords"]) // empty | 10 ) // 1 )
    ))
})

As you can see, promoting the breaches lookup table to the top-level element is as simple as filtering down to it by name, so that is simply .Breaches.

At the top level we now have a lookup table with arrays of breach names indexed by email usernames like this:

{
  "josullivan": [
    "OnlinerSpambot"
  ],
  "egreen": [
    "Dropbox"
  ],
  "mwkelly": [
    "Dropbox",
    "KayoMoe",
    "LinkedIn",
    "LinkedInScrape",
    "PDL"
  ],
  "ahawkins": [
    "iMesh",
    "OnlinerSpambot"
  ],
  "ptraynor": [
    "Collection1"
  ]
}

We need to transform the values in this lookup table from arrays of strings to arrays of dictionaries, so we’ll start by wrapping all our work in a call to map_values. The first thing we will need the filter passed as the only argument to map_values to do is construct a new dictionary with two keys, Breaches and ExposureScore, so big-picture wise our solution will be structured like this:

| map_values({
    # build the enriched breach data
    Breaches: SOMETHING,
    ExposureScore: SOMETHING
})

Let’s look at how to calculate each value in turn, starting with the new breaches list. The input to this filter will be a single current value from the top-level dictionary, so something like:

[
  "iMesh",
  "OnlinerSpambot"
]

That is to say, we have an array of strings, and we want an array of dictionaries. We transform the strings to dictionaries without exploding and re-capturing the array using the map function. In the filter passed as the only argument to map, the current value will be a single breach name as a string. That means that this part of our solution will have the following basic form:

map({
  Name: SOMETHING,
  Title: SOMETHING,
  DataClasses: SOMETHING
})

Since the original value being transformed is a breach name, Name can simply be given the value .. The name is also used to index our list of all known breaches from HIBP, so if we import that list into our script with --slurpfile and give it the name $breachDetails then we can get the title and data classes with $breachDetails[0][.].Title & $breachDetails[0][.].DataClasses. Remember, --slurpfile wraps the data it imports into an array in case there are multiple values, so our lookup table is not at $breachDetails, but at $breachDetails[0]. Putting all that together we can calculate our new enriched arrays of breach details with:

map({
  Name: .,
  Title: $breachDetails[0][.].Title,
  DataClasses: $breachDetails[0][.].DataClasses
})
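Incidentally, if you ever want to convince yourself of that wrapping behaviour, a quick sanity check like the following (a minimal sketch using the breach database from the instalment ZIP) shows the variable really is an array containing a single lookup table:

# check what --slurpfile actually binds to the variable
jq -n --slurpfile breachDetails hibp-breaches-20240604.json '$breachDetails | type, length'
# outputs:
# "array"
# 1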

The final piece of the puzzle is to calculate our exposure score. Here we are condensing a list of breaches into a single number, so we are reducing an array, hence we use the reduce operator. We start with an exposure score of 0, and for each breach that has no passwords we add 1, and for each breach that does have passwords we add 10. We can figure out whether or not a breach has passwords using our enrichment data in $breachDetails[0] — if the array $breachDetails[0][$breachName].DataClasses contains the string Passwords then the breach contained passwords. One small subtlety to remember is that to use the contains function to check array containment rather than substring containment we need to pass an array both as input and as the argument, so it needs to be contains(["Passwords"]) rather than simply contains("Passwords").

Next, we need to add the appropriate amount each time, and the key to that is the alternative operator, //. Remember that the right-hand side is only evaluated if the left-hand side evaluates to empty, false, or null. In effect, we need to implement the simple logic ‘if true, evaluate to 10, and if false evaluate to 1’. Because we’ve not yet learned about jq’s if operator we need to achieve this logic with the alternative operator (//), which is actually quite tricky!

The key to making this work is that if you pipe any value to a filter with a literal value the output is that literal value, e.g.:

# -n for no input
jq -n 'true | 10' # outputs 10

But, if you pipe nothing to anything, the pipe never happens, so the output is still nothing, i.e. empty. We can prove this to ourselves with the following commands which pipe nothing to a filter with a literal value — both output nothing at all rather than 10:

# create then explode an empty array
jq -n '[] | .[] | 10' # no output

# explicitly send no input
jq -n 'empty | 10' # no output

We can use this jq behaviour in conjunction with two alternative operators to get our 1 or 10 as appropriate with:

($breachDetails[0][$breachName].DataClasses | contains(["Passwords"]) // empty | 10) // 1

To understand why this works, let’s break it down for a breach which does contain passwords, and then again for one that doesn’t. Note that the output from the contains filter will be true or false.

So, when a breach does contain a password what we in effect have is:

(true // empty | 10 ) // 1

The first alternative operator has a value on the left that is not empty, null, or false, so the right-hand-side never happens, and true gets piped to the filter 10, producing a 10 on the left of the second alternative operator, so its right-hand side also does not happen, and the final output is 10.

Now, what happens when the breach does not contain a password? In that case we effectively have:

(false // empty | 10 ) // 1

In this case, the left-hand side of the first alternative operator is false, so the right-hand side does happen, and empty gets piped to 10, producing empty as the left-hand side of the second alternative operator. This means that its right-hand side happens too, so the final output is 1.
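We don’t need to take any of this on faith; both simplified cases can be run directly:

# the breach does contain passwords
jq -n '(true // empty | 10) // 1' # outputs 10

# the breach does not contain passwords
jq -n '(false // empty | 10) // 1' # outputs 1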

Finally, we have all the pieces we need to use the reduce operator:

(reduce .[] as $breachName (0; 
  . + (($breachDetails[0][$breachName].DataClasses | contains(["Passwords"]) // empty | 10 ) // 1 )
))

That’s all the pieces of our solution filled out, so we can now see it in action with:

jq --slurpfile breachDetails hibp-breaches-20240604.json -f pbs166-challengeSolution.jq hibp-pbs.demo.json

This gives us lots of output, including the following entry for mwkelly:

"mwkelly": {
  "Breaches": [
    {
      "Name": "Dropbox",
      "Title": "Dropbox",
      "DataClasses": [
        "Email addresses",
        "Passwords"
      ]
    },
    {
      "Name": "KayoMoe",
      "Title": "Kayo.moe Credential Stuffing List",
      "DataClasses": [
        "Email addresses",
        "Passwords"
      ]
    },
    {
      "Name": "LinkedIn",
      "Title": "LinkedIn",
      "DataClasses": [
        "Email addresses",
        "Passwords"
      ]
    },
    {
      "Name": "LinkedInScrape",
      "Title": "LinkedIn Scraped Data (2021)",
      "DataClasses": [
        "Education levels",
        "Email addresses",
        "Genders",
        "Geographic locations",
        "Job titles",
        "Names",
        "Social media profiles"
      ]
    },
    {
      "Name": "PDL",
      "Title": "Data Enrichment Exposure From PDL Customer",
      "DataClasses": [
        "Email addresses",
        "Employers",
        "Geographic locations",
        "Job titles",
        "Names",
        "Phone numbers",
        "Social media profiles"
      ]
    }
  ],
  "ExposureScore": 32
},

The first thing to note is just how much enrichment we have added; this was their original entry:

[
  "Dropbox",
  "KayoMoe",
  "LinkedIn",
  "LinkedInScrape",
  "PDL"
]

Secondly, notice that two of the five breaches had no passwords, and three did, so the exposure score was correctly calculated as 10+10+10+1+1 which gives 32.

Querying Nested Data Structures with recurse & ..

Let’s start with one last heavy lift — using jq to interrogate nested data structures. The most common example of a nested structure is a file system – folders contain files and folders which contain files and folders which contain files and folders …

You may well never need to process this kind of nested, or recursive, data with jq, but if you do, nothing we’ve learned so far will help you get the job done. You need one more concept — recursive descent.

The most generic form of recursive descent jq supports is the recurse() function which supports up to two arguments, both optional:

recurse(GENERATOR_FILTER; CONDITION)

The GENERATOR_FILTER generates one or more matching values for each level of the nested structure, and it will be applied over and over again until there are no more results. The default GENERATOR_FILTER is .[]?, but we’ll come back to that later.

In theory recursion can be infinite, so if you want to assert some kind of control you can pass an optional CONDITION. Recursion will only continue while this condition evaluates to true. The default CONDITION is true.

If you end up using recurse at all you’re most likely to use its single-argument form, i.e. recurse(GENERATOR_FILTER).
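That said, for completeness, here’s a small sketch of the two-argument form in action; the generator keeps doubling the current value, and the condition stops the recursion before values reach 100:

# -nc for no input and compact output
jq -nc '[1 | recurse(. * 2; . < 100)]'
# outputs: [1,2,4,8,16,32,64]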

There are two important subtleties to note — firstly, the recurse function behaves like the explode operator, creating many outputs, so you’ll need to wrap your use of it in square brackets [] if you want to collect the results into an array.

Secondly, the recurse function returns every element at every level, so it will give you more output than you may expect.

To illustrate this point, consider the very basic nested array [1, [2, 3]]; recursing over it will return 5 items:

  1. The entire top-level array: [1, [2, 3]]
  2. The first item in the top-level array: 1
  3. The second item in the top-level array: [2, 3]
  4. The first item in the nested array: 2
  5. The second item in the nested array: 3

We can prove this with the simple command:

# -nc for no input and compact output
jq -nc '[1, [2, 3]] | recurse(.[]?)'

Which outputs the following five values:

[1,[2,3]]
1
[2,3]
2
3

To properly experiment with recursion, I used the dir-to-json NodeJS module to save a JSON representation of the current state of the installmentResources folder in the copy of this series' Git repository currently checked out on my Mac to the file pbsInstallmentResourcesDir.json. This file contains a nested representation of every file and folder in that directory, in the following nested format:

{
  "parent": "pbs167",
  "path": "",
  "name": "..",
  "type": "directory",
  "children": [
    {
      "parent": "",
      "path": "pbs100",
      "name": "pbs100",
      "type": "directory",
      "children": [
        {
          "parent": "pbs100",
          "path": "pbs100/PBS96-SampleSolution-Allison",
          "name": "PBS96-SampleSolution-Allison",
          "type": "directory",
          "children": [
            {
              "parent": "pbs100/PBS96-SampleSolution-Allison",
              "path": "pbs100/PBS96-SampleSolution-Allison/index.html",
              "name": "index.html",
              "type": "file"
            },
            {
              "parent": "pbs100/PBS96-SampleSolution-Allison",
              "path": "pbs100/PBS96-SampleSolution-Allison/script.js",
              "name": "script.js",
              "type": "file"
            }
          ]
        },
        {
          "parent": "pbs100",
          "path": "pbs100/PBS96-SampleSolution-Bart",
          "name": "PBS96-SampleSolution-Bart",
          "type": "directory",
          "children": [
            {
              "parent": "pbs100/PBS96-SampleSolution-Bart",
              "path": "pbs100/PBS96-SampleSolution-Bart/index.html",
              "name": "index.html",
              "type": "file"
            }
          ]
        }
      ]
    },
    // ...
  ]
}

For completeness, I generated this file with the following (rather complex) shell command:

node -e 'require("dir-to-json")("..").then(dirTree => {console.log(JSON.stringify(dirTree));})' > pbsInstallmentResourcesDir.json

To get the details of every single file and folder we would recursively descend into the .children arrays for each nested entry. We can do that with the filter recurse(.children[]?). Note that because not all entries have a children key, we need to add the error suppression operator (?) or we’ll generate errors about not being able to recurse over null. Running this filter as-is would generate a lot of output, so let’s simply count the total number of files and folders with the following command:

jq '[recurse(.children[]?)] | length' pbsInstallmentResourcesDir.json
# outputs 2079

To filter this down to just the files we can add a select:

jq '[recurse(.children[]?) | select(.type == "file")] | length' pbsInstallmentResourcesDir.json
# outputs 1887

Finally, to see the paths to all index.html files at all levels in the hierarchy we can use the command:

jq '[recurse(.children[]?) | select(.type == "file" and .name == "index.html") | .path]' pbsInstallmentResourcesDir.json

This returns 51 paths, so here’s a snippet that shows we have indeed recursively descended into the structure:

[
  "pbs100/PBS96-SampleSolution-Allison/index.html",
  "pbs100/PBS96-SampleSolution-Bart/index.html",
  "pbs104/pbs104a/index.html",
  // ...
]

Recursively Descending into Arrays (the .. Operator)

When dealing with nested arrays, we can use the zero-argument version of the recurse function. We can demonstrate this by simplifying our first example command:

jq -nc '[1, [2, 3]] | recurse'

You’ll see it produces identical output to before.

Because descending into recursive arrays is a common thing to want to do, jq provides a convenient operator for this specific type of recursion, the recursive descent operator .., so we can re-write the above as simply:

jq -nc '[1, [2, 3]] | ..'

Finally, if you only want the bare values in the arrays you can use select(type != "array") to filter out the arrays. The following example command does this, and also collects the resulting bare values into an array:

jq -nc '[1, [2, 3]] | [.. | select(type != "array")]'
# outputs: [1,2,3]

Note that the .. operator is purely a convenience; it adds no new capabilities to the language. This makes it the perfect transition point to our next topic!

Some Syntactic Sugar

With the heavy lifting of recursion out of the way, let’s switch gears completely with a little so-called syntactic sugar. Basically, useful optional extras in a language’s syntax that don’t add any new capabilities, but make your code a little nicer!

Dictionary Construction Shortcuts

Something I’ve noticed as we worked our way through this series is that the names of dictionary keys tend to stay the same as you move through filter chains, so when you use dictionary construction you often end up writing very duplicative things like {Name: .Name, Title: .Title, DataClasses: .DataClasses}. Good news: you can avoid all that by simply specifying the name of the key without a value, so the previous example can be written as simply {Name, Title, DataClasses}. We can illustrate this with the following command to get the basic information from all LinkedIn breaches:

jq 'to_entries | map(select(.value.Name | contains("LinkedIn"))) | map(.value |= {Name, Title, DataClasses}) | from_entries' hibp-breaches-20240604.json

This outputs the following JSON:

{
  "LinkedIn": {
    "Name": "LinkedIn",
    "Title": "LinkedIn",
    "DataClasses": [
      "Email addresses",
      "Passwords"
    ]
  },
  "LinkedInScrape2023": {
    "Name": "LinkedInScrape2023",
    "Title": "LinkedIn Scraped and Faked Data (2023)",
    "DataClasses": [
      "Email addresses",
      "Genders",
      "Geographic locations",
      "Job titles",
      "Names",
      "Professional skills",
      "Social media profiles"
    ]
  },
  "LinkedInScrape": {
    "Name": "LinkedInScrape",
    "Title": "LinkedIn Scraped Data (2021)",
    "DataClasses": [
      "Education levels",
      "Email addresses",
      "Genders",
      "Geographic locations",
      "Job titles",
      "Names",
      "Social media profiles"
    ]
  }
}

We tend to get very similar duplication with variable names because a name that makes sense for a variable probably also makes sense for a dictionary key! Again, we can simplify duplicative dictionary key-value pair definitions like {category: $category, year: $year} with just {$category, $year}. We can combine this with our previous syntactic sugar cube to get the basic information for all Nobel Prize winners with the following command:

jq '[.prizes[] | (.year | tonumber) as $year | .category as $category | .laureates[]? | {$year, $category, firstname, surname}]' NobelPrizes.json

This produces an array of dictionaries like:

[
  // ...
  {
    "year": 1910,
    "category": "peace",
    "firstname": "Permanent International Peace Bureau",
    "surname": null
  },
  {
    "year": 1910,
    "category": "physics",
    "firstname": "Johannes Diderik",
    "surname": "van der Waals"
  },
  // ...
]

Complex Assignments (in Arrays & Dictionaries)

Remember — in jq the word assignment is used to refer to setting values within the data passing through a filter, specifically a value at an array index or a dictionary key. More importantly, assignment does not refer to binding values to variable names!

We’ve seen how we can use the simple and update assignment operators (=, |=, +=, -= etc.) to assign values to one array index or dictionary key at a time. We can use these operators repeatedly to assign values to as many indexes and keys as we need, but jq offers us some syntactic sugar to let us assign multiple values at once!

When we learned about the assignment operators we described there being a path to the left of the operator and a value to the right. That’s true, but it’s only part of the truth: you can have filters that produce multiple paths to the left of an assignment to update multiple values at once!
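Before we look at a real-world example, here’s a minimal sketch of the idea; the filter .[] on the left produces a path to every element of the array, so a single update assignment updates them all:

# -nc for no input and compact output
jq -nc '[1, 2, 3] | .[] |= . * 10'
# outputs: [10,20,30]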

For example, we can update all the keys in a HIBP domain export from email usernames to full usernames with a single assignment operation by converting the list of breaches to entries, and then setting a new value for every key with the single assignment filter .[].key |= . + "@demo.bartificer.net", and then converting the updated entries back to a lookup table with the command:

jq '.Breaches | to_entries | .[].key |= . + "@demo.bartificer.net" | from_entries' hibp-pbs.demo.json

This produces the following JSON:

{
  "josullivan@demo.bartificer.net": [
    "OnlinerSpambot"
  ],
  "egreen@demo.bartificer.net": [
    "Dropbox"
  ],
  "mwkelly@demo.bartificer.net": [
    "Dropbox",
    "KayoMoe",
    "LinkedIn",
    "LinkedInScrape",
    "PDL"
  ],
  "ahawkins@demo.bartificer.net": [
    "iMesh",
    "OnlinerSpambot"
  ],
  "ptraynor@demo.bartificer.net": [
    "Collection1"
  ]
}

We can even take things a step further and use a select within the left-hand-side to prefix the email usernames for all users caught up in more than one breach with the string NAUGHTY with the assignment filter (.[] | select((.value | length) >= 2) | .key) |= "NAUGHTY " + .. The following command converts the breaches lookup to entries, then uses this single update filter to update all the appropriate keys at once, then converts the entries back into a lookup table:

jq '.Breaches | to_entries | (.[] | select((.value | length) >= 2) | .key) |= "NAUGHTY " + . | from_entries' hibp-pbs.demo.json

This produces the following JSON:

{
  "josullivan": [
    "OnlinerSpambot"
  ],
  "egreen": [
    "Dropbox"
  ],
  "NAUGHTY mwkelly": [
    "Dropbox",
    "KayoMoe",
    "LinkedIn",
    "LinkedInScrape",
    "PDL"
  ],
  "NAUGHTY ahawkins": [
    "iMesh",
    "OnlinerSpambot"
  ],
  "ptraynor": [
    "Collection1"
  ]
}

While the possibilities are technically infinite, not every filter will work on the left-hand side of an assignment — jq needs to be able to map the results from the left-hand filter to a list of paths to update. Basically, your left-hand-side filters must only filter the input data structure; they can’t alter it in any way at all. If you get it wrong and use an inappropriate filter you’ll get an error that starts with ‘Invalid path expression near attempt to iterate through …’.
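As a contrived sketch of what not to do, the left-hand side below transforms the values rather than just navigating to them, so jq cannot map it to paths (note the exact message text varies with the filter and the jq version):

jq -nc '[1, 2, 3] | (.[] | . + 1) |= 0'
# fails with an 'Invalid path expression' error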

Destructuring (Advanced Variable Binding)

Again, a reminder that in jq you bind variable names to values, you do not assign values to variables!

We know we can bind values to single variables with the as operator, and we can use the as keyword in as many filters as we need to create all our desired variables, but jq provides us with the syntactic sugar to bind multiple variables with a single as!

When using as, the filter on the left-hand-side produces values, and the pattern on the right binds those values to a variable name. Normally, each value gets bound to a single variable name, and if there are multiple values, they are looped over. That looping behaviour is always true, but if each produced value is an array or dictionary, then you can bind multiple values within the array or dictionary to multiple variable names for each pass through the loop. These kinds of bindings are referred to as destructuring.

The key to getting destructuring to work is to be sure the structure of the values produced by the left-hand side of the as matches the pattern on the right-hand side.

The most common use of this technique is to process data that’s been converted to JSON from some other format. To show this in action I created a dummy log file in CSV format that has three un-named columns, a timestamp, a severity, and a message, and I added this file to the instalment ZIP as dummyLog.csv:

2023-12-25T23:30:03,INFO,"Carrot & Cookie placed on mantle piece"
2023-12-25T23:45:42,NOTICE,"Vibration detected by roof sensor"
2023-12-25T23:50:32,NOTICE,"Motion detected by chimney sensor"

Then, I converted this CSV to a 2D JSON array with the csvtojson NPM module, saved it to dummyLog-2dArray.json, and added it to the instalment ZIP too:

[
  [
    "2023-12-25T23:30:03",
    "INFO",
    "Carrot & Cookie placed on mantle piece"
  ],
  [
    "2023-12-25T23:45:42",
    "NOTICE",
    "Vibration detected by roof sensor"
  ],
  [
    "2023-12-25T23:50:32",
    "NOTICE",
    "Motion detected by chimney sensor"
  ]
]

For completeness, after installing csvtojson with the command npm install csvtojson I used the following (rather long) terminal command to do the conversion:

cat dummyLog.csv | node -e 'require("csvtojson")({noheader:true, output: "csv"}).fromString(require("fs").readFileSync(0, "utf-8")).then((csvRow)=>{console.log(JSON.stringify(csvRow))})' > dummyLog-2dArray.json

We can now loop over this log file one line at a time and create three sensibly named variables in the process with the single as filter:

.[] as [$timestamp, $severity, $message]

This works because each value in the exploded top-level array is itself an array with three values, and that shape is mirrored on the right of the as.

We can see this in action with the following command:

# -r for raw string output (no wrapping "")
jq -r '.[] as [$timestamp, $severity, $message] | "\($severity): \($message) (logged @ \($timestamp))"' dummyLog-2dArray.json

Which outputs:

INFO: Carrot & Cookie placed on mantle piece (logged @ 2023-12-25T23:30:03)
NOTICE: Vibration detected by roof sensor (logged @ 2023-12-25T23:45:42)
NOTICE: Motion detected by chimney sensor (logged @ 2023-12-25T23:50:32)

When converting something like CSV to JSON there’s a second, probably better, option — rather than an array of arrays, you can convert to an array of record-style dictionaries. I’ve re-converted the log CSV into this format and included it in the instalment ZIP as dummyLog-dictionaryArray.json:

[
  {
    "timestamp": "2023-12-25T23:30:03",
    "severity": "INFO",
    "message": "Carrot & Cookie placed on mantle piece"
  },
  {
    "timestamp": "2023-12-25T23:45:42",
    "severity": "NOTICE",
    "message": "Vibration detected by roof sensor"
  },
  {
    "timestamp": "2023-12-25T23:50:32",
    "severity": "NOTICE",
    "message": "Motion detected by chimney sensor"
  }
]

Again, just for completeness, this is the shell command I used to do the conversion:

cat dummyLog.csv | node -e 'require("csvtojson")({noheader:true, headers: ["timestamp", "severity", "message"], output: "json"}).fromString(require("fs").readFileSync(0, "utf-8")).then((csvRow)=>{console.log(JSON.stringify(csvRow))})' > dummyLog-dictionaryArray.json

We can destructure these log entries into variables too; we just need to update the pattern on the right of the as to reflect the new structure of each element in the array:

.[] as {$timestamp, $severity, $message}

This works because when we explode the array of log entries we now get dictionaries with keys named timestamp, severity & message which matches our updated pattern.

We can see this in action with the command:

jq -r '.[] as {$timestamp, $severity, $message} | "\($severity): \($message) (logged @ \($timestamp))"' dummyLog-dictionaryArray.json

This outputs exactly the same log text as before.

Usually, the names of fields in the dictionaries make sense as variable names, so the above works perfectly, but we don’t have to accept the names from the dictionary keys; we can rename them by using a more explicit pattern:

.[] as {timestamp: $ts, severity: $sev, message: $msg}

We can see this in action with the command:

jq -r '.[] as {timestamp: $ts, severity: $sev, message: $msg} | "\($sev): \($msg) (logged @ \($ts))"' dummyLog-dictionaryArray.json

Again, this outputs the same log text as the previous two commands.

The Destructuring Alternative Operator ?//

Basic destructuring is very powerful and very useful, but jq has one more superpower — the ability to specify multiple possible shapes to extract the data from in a single as statement. You can do this by separating the possible right-hand sides with the destructuring alternative operator ?//.

To illustrate this, let’s return to our CSV log example — there are two reasonable ways of converting the raw CSV to JSON, and in a large organisation with multiple log sources it seems reasonable that you’ll end up with some logs in each format. It would be great to be able to ingest the logs into jq regardless of which format they happen to be in. We can do that with the following filter:

.[] as [$timestamp, $severity, $message] ?// {$timestamp, $severity, $message}

You can actually specify as many possible patterns to the right of the as as you like, and jq will try them one by one starting on the left until one works. Under the hood jq is catching and suppressing errors until it runs out of possibilities, then it emits the error generated by attempting to use the right-most pattern.

To show this in action we can use the following command to send both JSON log files through the same jq filter:

jq -r '.[] as [$timestamp, $severity, $message] ?// {$timestamp, $severity, $message} | "\($severity): \($message) (logged @ \($timestamp))"' dummyLog-*Array.json

You’ll see it successfully outputs the log information twice, once from each source file.

More Powerful Conditionals with Traditional(ish) if-statements

Most of the time jq does not need traditional if statements; the alternative (//) and error suppression (?) operators are usually sufficient, and they’re much simpler than a traditional conditional statement. But, as our challenge solution illustrates perfectly, sometimes things get really messy when you limit yourself to those operators!

It’s fair to say the following implementation of ‘10 if there are passwords in the breach, otherwise 1’ is neither concise nor clear:

($breachDetails[0][$breachName].DataClasses | contains(["Passwords"]) // empty | 10) // 1

Clearly, the following is better in every way:

if $breachDetails[0][$breachName].DataClasses | contains(["Passwords"]) then 10 else 1 end

This is an example of the longest form of the if statement in jq, for which the syntax is as follows, with CONDITION being a filter that evaluates to a boolean, and TRUE_VALUE and FALSE_VALUE being filters called to generate the output when the condition evaluates to true or false respectively:

if CONDITION then TRUE_VALUE else FALSE_VALUE end

The file pbs166-challengeSolution-if.jq in the instalment ZIP contains a version of the PBS 166 challenge solution with the confusing double-alternative-operator line above replaced with the much clearer if-statement (also above). We can see that does exactly the same thing by getting mwkelly’s exposure score with both scripts:

# Note: a trailing \ at the end of a line allows a shell
#       command to be split over multiple lines

# the script with the alternative operator
jq --slurpfile breachDetails hibp-breaches-20240604.json -f pbs166-challengeSolution.jq hibp-pbs.demo.json \
| jq '.mwkelly.ExposureScore'
# outputs 32

# the script with the if statement
jq --slurpfile breachDetails hibp-breaches-20240604.json -f pbs166-challengeSolution-if.jq hibp-pbs.demo.json \
| jq '.mwkelly.ExposureScore'
# outputs 32

For a simpler example, the following command will square numbers but preserve their sign, i.e. 2 will become 4 and -2 will become -4:

# -nc for no input and compact output
jq -nc '[-2, -1, 0, 1, 2] | map(if . < 0 then 0 - (. * .) else . * . end)'
# output: [-4,-1,0,1,4]

Using the Implied else

Like in other languages, you can use if statements without a matching else in jq:

if CONDITION then TRUE_VALUE end

However, the implicit default else filter may catch you by surprise.

In most languages the implied else is ‘do nothing’; for example, the following JavaScript if statements do the same thing:

let x;

// implied else
x = 5;
if(x < 0){
  x += 10;
}
console.log(x); // outputs 5

// equivalent explicit else
x = 5;
if(x < 0){
  x += 10;
}else{
  ; // explicitly do nothing
}
console.log(x); // outputs 5

You’ll find this code in pbs167a-impliedElse.js in the instalment ZIP, and if you have the NodeJS JavaScript interpreter installed, you can run it with the command:

node ./pbs167a-impliedElse.js

In jq the implied else is ., i.e. the implied else passes the input through unchanged. In other words, when you omit the else and the condition is false, the output is the input! To help me remember this detail I think of if in jq as ‘alter if’.

The following two statements are identical:

# implied else
5 | if . < 0 then . + 10 end
# results in 5

# equivalent explicit else
5 | if . < 0 then . + 10 else . end
# results in 5

You’ll find this code in pbs167a-impliedElse.jq in the instalment ZIP, and you can run it with the command:

jq -nf pbs167a-impliedElse.jq

More Powerful Error Handling with try-catch

When your data is just a little variable it’s easy enough to work around using the error suppression operator (?), or by normalising the data, but when your data is very noisy that can become very tedious. That’s where jq’s try-catch operator comes in. You can wrap the filter(s) that handle the noisy data in a try statement, and then define your response to any error at all in the matching catch filter.

The syntax is simply as shown below, where ERROR_PRONE_FILTER and ERROR_HANDLING_FILTER are jq filters:

try ERROR_PRONE_FILTER catch ERROR_HANDLING_FILTER

The jq interpreter will try the ERROR_PRONE_FILTER, and if it runs without error, its output will be used. But if it triggers an error, that error will be prevented from stopping execution like it normally would; instead, the error string will be passed to the ERROR_HANDLING_FILTER as its input (.), and the output of that filter will be used. Note that the input to the catch is the error string, not the original input! Sadly, there’s no way I have found to access the original input from within the ERROR_HANDLING_FILTER.
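As a minimal sketch of that flow, the following command intentionally triggers an error by iterating over null, and shows the catch filter receiving the error message as its input:

jq -n 'try (null | .[]) catch "caught: \(.)"'
# outputs: "caught: Cannot iterate over null (null)"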

How you choose to deal with errors is up to you — you could output the error as a string of some kind, you could output some kind of default value, or you could remove the problem input from the processing chain by returning empty. Let’s use our Nobel Prizes data set to illustrate each approach.

We’re going to use a rather contrived example so we can intentionally trigger an error on some, but not all, of our data. The rather esoteric problem we’re going to solve is to output the number of letters in the names of all the winners of each Nobel physics prize in the 1940s as an array. This works as an example because we need to dive into the array of laureates within the prizes to get the lengths of their names, and there are years in the 1940s where no prizes were awarded, so those entries have no .laureates array at all.

Before we explore our options for handling errors, let’s remind ourselves what happens when we don’t handle our errors! The file pbs167b1-noErrorHandling.jq in the instalment ZIP does the needed calculations without any error handling:

# Output the number of letters in the names of the laureates who were awarded
# Nobel prizes in physics each year in the 1940s.
# Input:    The official list of Nobel Prizes in JSON format
# Output:   an array of numbers

# filter down to just the physics prizes in the appropriate years
.prizes | map(
    (.year | tonumber) as $year
    | select($year >= 1940 and $year < 1950 and .category == "physics")

    # calculate the number of letters in the laureate names without error handling
    | .laureates | map("\(.firstname)\(.surname)" | length) | add
)

When we try to run it we get an error message, and no output:

jq -f pbs167b1-noErrorHandling.jq NobelPrizes.json
# jq: error (at NobelPrizes.json:0): Cannot iterate over null (null)

The line that does the calculation and throws the error is:

.laureates | map("\(.firstname)\(.surname)" | length) | add

So, this is the filter we’ll add our error handling to in the examples below.

Outputting Errors as Strings

The simplest thing we can do is convert the error into output. Because the input to the catch filter is the error message as a string, we can simply output the error with catch .. You can see this in context in the file pbs167b2-rawError.jq:

# try calculate the number of letters in the laureate names
| try (
  .laureates | map("\(.firstname)\(.surname)" | length) | add
)
# if there is an error, output the error as-is
catch .

If we run this script we can see it now completes, and the errors appear in the output array as strings:

jq -f pbs167b2-rawError.jq NobelPrizes.json
# [
#   12,
#   20,
#   17,
#   16,
#   13,
#   16,
#   9,
#   "Cannot iterate over null (null)",
#   "Cannot iterate over null (null)",
#   "Cannot iterate over null (null)"
# ]

We can of course add some context around our error as illustrated in pbs167b3-enrichedError.jq:

catch "Failed to calculate total name length for \($year) with error: \(.)"

We can see this in action with the command:

jq -f pbs167b3-enrichedError.jq NobelPrizes.json
# [
#   12,
#   20,
#   17,
#   16,
#   13,
#   16,
#   9,
#   "Failed to calculate total name length for 1942 with error: Cannot iterate over null (null)",
#   "Failed to calculate total name length for 1941 with error: Cannot iterate over null (null)",
#   "Failed to calculate total name length for 1940 with error: Cannot iterate over null (null)"
# ]

Of course, there’s no need to include the cryptic error messages from jq itself at all, as demonstrated in pbs167b4-humanError.jq:

catch "No laureates in \($year)!"

This gives the following:

jq -f pbs167b4-humanError.jq NobelPrizes.json
# [
#   12,
#   20,
#   17,
#   16,
#   13,
#   16,
#   9,
#   "No laureates in 1942!",
#   "No laureates in 1941!",
#   "No laureates in 1940!"
# ]

Outputting Default Values

If your output is intended to be consumed by a human it makes sense to give a nice error message, but if your output is intended to be processed by another computer, then it’s better to output a sane default value. In our case, the length of all the names when there are no names is zero! The file pbs167b5-default.jq outputs 0 when there’s an error:

catch 0

We can see this in action with the command:

jq -f pbs167b5-default.jq NobelPrizes.json
# [
#   12,
#   20,
#   17,
#   16,
#   13,
#   16,
#   9,
#   0,
#   0,
#   0
# ]

Silently Swallowing Errors

Finally, we can choose to have errors simply vanish by outputting empty when something goes wrong, as per pbs167b6-swallowErrors.jq:

catch empty

Which we can see in action with the command:

jq -f pbs167b6-swallowErrors.jq NobelPrizes.json
# [
#   12,
#   20,
#   17,
#   16,
#   13,
#   16,
#   9
# ]

The Implied catch

Like with if statements, the catch is optional, and if you leave it out, there is an implied action, which is to silently swallow the error, so try ERROR_PRONE_FILTER is equivalent to try ERROR_PRONE_FILTER catch empty.

The file pbs167b7-impliedCatch.jq illustrates this:

# try calculate the number of letters in the laureate names
| try ( .laureates | map("\(.firstname)\(.surname)" | length) | add )

We can see this in action with the command:

jq -f pbs167b7-impliedCatch.jq NobelPrizes.json
# [
#   12,
#   20,
#   17,
#   16,
#   13,
#   16,
#   9
# ]

Honourable Mentions

Before finishing this series, there are a few other features I’d like to make you aware of, but that I won’t describe:

  1. If you need to apply some kind of processing to all the inputs at once, like reducing them, you can get them one at a time with the input function, or all at once with the inputs function. In both cases you need to be sure to use the -n command line flag or the first input will be lost. This is an extremely advanced technique, so you’re unlikely to need it, but if you do, you’ll really need it, so it’s worth remembering it’s possible.
  2. While you can probably achieve all the iteration you need with the as and reduce keywords, jq does offer some additional options:
    1. The foreach keyword is similar to reduce, but instead of outputting just a single final accumulated value, it outputs every intermediate value of the accumulator as well as the final value (see the examples after this list). I think of it as a version of reduce that shows its workings as if it were in a math exam!
    2. The repeat, while, and until functions allow you to repeat filters without binding variables.
  3. While we can import data into a variable from the command line with --slurpfile, we can load JSON data directly into variables from within jq scripts using the import keyword.
  4. It is possible to define your own functions, to group them into modules, and to import these modules into your scripts. This only makes sense if you’re building a large collection of jq scripts to work on a complex data set.
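To give you a little taste of those iteration options, here are some quick sketches contrasting reduce, foreach, and while:

# reduce outputs only the final accumulated value
jq -nc '[1, 2, 3] | reduce .[] as $n (0; . + $n)'
# outputs: 6

# foreach also outputs every intermediate value of the accumulator
jq -nc '[1, 2, 3] | [foreach .[] as $n (0; . + $n)]'
# outputs: [1,3,6]

# while repeatedly applies an update filter until a condition fails, no variables needed
jq -nc '[1 | while(. < 100; . * 2)]'
# outputs: [1,2,4,8,16,32,64]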

You’ll find the documentation for all these features, and for the many features we’ve not covered, in the official jq docs at jqlang.github.io/jq/manual/.

Final Thoughts

Believe it or not, even with those last few honourable mentions, there are still entire aspects of jq’s more esoteric functionality we’ve not mentioned at all! This has proven to be so much richer and more powerful a language than I had ever imagined. I’ve enjoyed this learning journey immensely, I hope you have too!

The next topic on the agenda should be substantially shorter and simpler — the human-friendly data markup language YAML. It’s used for specifying metadata in many Markdown-based authoring tools, and it’s becoming an ever more popular language for configuration files, so I’m meeting it more and more, and while it’s very simple to read, you do need some understanding of the rules to write it well.

Join the Community

Find us in the PBS channel on the Podfeet Slack.
